
PISCES: A Programmable, Protocol-Independent Software Switch

Muhammad Shahbaz*, Sean Choi†, Ben Pfaff‡, Changhoon Kim§, Nick Feamster*, Nick McKeown†, Jennifer Rexford*

*Princeton University  †Stanford University  ‡VMware, Inc.  §Barefoot Networks, Inc.
http://pisces.cs.princeton.edu

Abstract

Hypervisors use software switches to steer packets to and from virtual machines (VMs). These switches frequently need upgrading and customization—to support new protocol headers or encapsulations for tunneling and overlays, to improve measurement and debugging features, and even to add middlebox-like functions. Software switches are typically based on a large body of code, including kernel code, and changing the switch is a formidable undertaking requiring domain mastery of network protocol design and developing, testing, and maintaining a large, complex codebase. Changing how a software switch forwards packets should not require intimate knowledge of its implementation. Instead, it should be possible to specify how packets are processed and forwarded in a high-level domain-specific language (DSL) such as P4, and compiled to run on a software switch. We present PISCES, a software switch derived from Open vSwitch (OVS), a hard-wired hypervisor switch, whose behavior is customized using P4. PISCES is not hard-wired to specific protocols; this independence makes it easy to add new features. We also show how the compiler can analyze the high-level specification to optimize forwarding performance. Our evaluation shows that PISCES performs comparably to OVS and that PISCES programs are about 40 times shorter than equivalent changes to OVS source code.

Categories and Subject Descriptors: C.2.1 [Computer-Communication Networks] Network Architecture and Design; D.2.8 [Software Engineering] Metrics—Complexity Measures; Performance Measures

General Terms: Design; Languages; Performance

Keywords: Software-Defined Networks (SDN); Domain-Specific Languages (DSL); P4; Software Switch; OVS; Programmable Data Planes; PISCES; Compiler Optimizations

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. SIGCOMM'16, August 22–26, 2016, Florianópolis, Brazil. Copyright 2016 ACM. ISBN 978-1-4503-4193-6/16/08...$15.00. DOI: http://dx.doi.org/10.1145/2934872.2934886

1 Introduction

Software switches, such as Open vSwitch (OVS) [57], play a key role in modern data centers: with few exceptions, every packet that passes to or from a virtual machine (VM) passes through a software switch. In addition, servers greatly outnumber physical switches in this environment. Therefore, a data center full of servers running hypervisor software also contains far more software switches than hardware switches. Likewise, because each hypervisor hosts several VMs, such a data center has more virtual Ethernet ports than physical ones.

One of the main advantages of a software hypervisor switch is that it can be upgraded more easily than a hardware switch. As a result, hypervisor switches support new encapsulation headers, improved troubleshooting and debugging features, and middlebox-like functions such as load balancing, address virtualization, and encryption. In the future, as data center owners customize and optimize their infrastructure, they will continue to add features to hypervisor switches.

Each new feature requires customizing the hypervisor switch, yet making these customizations is more difficult than it may appear. First, most of the machinery that enables fast packet forwarding resides in the kernel. Writing kernel code requires domain expertise that most network operators lack, and thus introduces a significant barrier for developing and deploying new features. Recent technologies can accelerate packet forwarding in user space (e.g., DPDK [34] and Netmap [64]), but these technologies still require significant software development expertise and intimate familiarity with a large, intricate, and complex codebase. Furthermore, customization requires not only incorporating changes into switch code, but also maintaining these customizations as the underlying software evolves over time, which can require significant resources.

Changing how a software switch forwards packets should not require intimate knowledge of how the switch is implemented. Rather, it should be possible to specify custom network protocols in a domain-specific language (DSL) such as P4 [10], which is then compiled to custom code for the hypervisor switch. Such a DSL would support customizing the forwarding behavior of the switch without requiring changes to the underlying switch implementation. Decoupling custom protocol implementations from underlying switch code also makes it easier to maintain these customizations, since they remain independent of the underlying switch implementation. With a standardized DSL, customizations may also be ported to other hardware or software switches that support the same language.

A key insight, borrowed from a similar trend in hardware switches [11, 41], is that the underlying switch should be a substrate, well tuned to process packets at high speed but not tied to a specific protocol. In the extreme, the switch is said to be "protocol independent," meaning that before it receives instructions about how to process packets (via a DSL), it does not know what a protocol is. Put another way, protocols are represented by programs written in the DSL, which protocol authors create.

We apply a similar philosophy to software switches. We assume the program written in the DSL specifies which packet headers to parse and the structure of the match-action tables (i.e., which header fields to match and which actions to perform on matching headers). The underlying software substrate is a generic engine, optimized to parse, match, and act upon the packet headers in the form the program specifies.

Expressing these customizations in a DSL, however, entails compilation from the DSL to code that runs in the switch. Compared to a switch that is handwritten to implement fixed protocols, this protocol compilation process may reduce the efficiency of the underlying implementation and thus come at the cost of performance. The compilation process differs from that for hardware switches where, given limited resources, the objective is to optimize for metrics like area, latency, and power, while satisfying resource constraints [36]. Our goals in this paper are to (1) quantify the additional cost that expressing custom protocols in such a DSL produces; and (2) design and evaluate domain-specific compiler optimizations that reduce the performance overhead as much as possible. Ultimately, we demonstrate that, with the appropriate compiler optimizations, the performance of a protocol-independent software switch—a switch that supports custom protocol specification in a high-level DSL without direct modifications to the low-level source code—approaches parity with the native hypervisor software switch. Our results are promising, particularly given that OVS, our base code, was not designed to support protocol independence. Nevertheless, our results demonstrate that the "cost of programmability" in hypervisor switches is negligible. We expect our results will inspire the design of new protocol-independent software switches running at even higher speeds.

We make the following contributions:

• The design and implementation of PISCES, the first software switch that allows custom protocol specification in a high-level DSL, without requiring direct modifications to switch source code (Section 4).

• A public, open-source implementation of PISCES on GitHub [2]. The implementation is a protocol-independent software switch derived from OVS that is programmed from a high-level DSL, called P4.

• Domain-specific optimizations and a back-end optimizer to reduce the performance overhead of customizing OVS using P4. We also introduce two new annotations in P4 to aid in the optimizations (Section 4.3).

• An evaluation of the code complexity of PISCES programs and of PISCES's forwarding performance (Section 5). Our evaluation shows that PISCES programs are on average about 40 times shorter than equivalent changes to OVS source code and incur a forwarding performance (i.e., throughput) overhead of only about 2%.

We begin by motivating the need for a customizable hypervisor software switch with a description of real use cases from operational networks (Section 2) and present background information on both P4 and OVS (Section 3).

2 The Need for a Protocol-Independent Switch

We say that PISCES is a protocol-independent software switch because it does not know what a protocol is or how to process packets on behalf of a protocol until the programmer specifies it. For example, if we want PISCES to process IPv4 packets, then we need to describe how IPv4 packets are processed in a P4 program. In a P4 program (e.g., IPv4.p4), we need to describe the format and fields of the IPv4 header, including the IP addresses, protocol ID, TTL, checksum, flags, and so forth. We also need to specify that we use a lookup table to store IPv4 prefixes, and that we search for the longest matching prefix. We also need to describe how a TTL is decremented, a checksum is updated, and so on. The P4 program captures the entire packet processing pipeline, which is compiled to source code for OVS that specifies the switch's match, action, and parse capabilities.
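As an illustration, a fragment of such an IPv4.p4 program might look like the following. This is a minimal sketch in the 2016-era P4-14 syntax; the table and action names are ours, not taken from the paper, and the checksum logic is omitted (it is discussed in Section 4.2):

    header_type ipv4_t {
        fields {
            version : 4;
            ihl : 4;
            diffserv : 8;
            totalLen : 16;
            identification : 16;
            flags : 3;
            fragOffset : 13;
            ttl : 8;
            protocol : 8;
            hdrChecksum : 16;
            srcAddr : 32;
            dstAddr : 32;
        }
    }
    header ipv4_t ipv4;

    parser start {
        return parse_ipv4;                 // Ethernet parsing omitted for brevity
    }
    parser parse_ipv4 {
        extract(ipv4);                     // copy the header fields out of the packet
        return ingress;
    }

    action nexthop(port) {
        modify_field(standard_metadata.egress_spec, port);
        add_to_field(ipv4.ttl, -1);        // decrement the TTL
    }
    action _drop() {
        drop();
    }

    table ipv4_lpm {
        reads   { ipv4.dstAddr : lpm; }    // longest-prefix match on the destination address
        actions { nexthop; _drop; }
    }

    control ingress {
        apply(ipv4_lpm);
    }

At runtime, the controller fills ipv4_lpm with concrete prefixes and next hops; the program itself only defines the switch's parse, match, and action capabilities.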

A protocol-independent switch brings many benefits:

Adding new standard or private protocol headers. Vendors propose new protocol headers all the time, particularly for data centers. In recent years, VXLAN [47], NVGRE [73], and Geneve [29] have all been standardized, and STT [16] and NSH [60] are also being discussed as potential standards. Private, proprietary protocols are also added, to provide a competitive advantage by, for example, creating better isolation between applications, or by introducing novel congestion marking. In many cases, before new protocols can be deployed, all hardware and software switches must be upgraded to recognize the headers and process them correctly. For hardware switches, the data center owner must provide requirements to their chip vendor and wait three to four years for the new feature to arrive, if the vendor agrees to add the feature at all. In the case of software switches, they must wait for the next major revision, testing, and deployment cycle. Even modifying an open-source software switch is not a panacea: once the data center owner directly modifies the open-source software switch to add their own custom protocols, these modifications still need to be maintained and synchronized with the mainline codebase, introducing significant code-maintenance overhead as the original open-source switch continues to evolve. A data-center owner who could add new protocols to a P4 program could, instead, compile and deploy a new protocol more quickly.

Removing a standard protocol header. Data-center networks typically run fewer protocols than legacy campus and enterprise networks, in part because most of the traffic is machine-to-machine and many legacy protocols are not needed (e.g., multicast, RSVP, L2-learning). For example, Amazon Web Services (AWS) reportedly only forwards packets using IPv4 headers [55]. It therefore benefits the data-center owner to remove unused protocols entirely, thus eliminating any concern of interactions with dormant implementations of legacy protocols. It is bad enough to have to support many protocols; much worse to have to understand interactions with and implications of protocols that operators do not intend to use. Therefore, data-center owners frequently want to eliminate unused protocols from their switches, NICs, and operating systems. Removing protocols from conventional switches is difficult; for hardware, it means waiting for new silicon, and for software switches it means wrestling with a large codebase to extract a specific protocol. In PISCES, removing an unused protocol is as simple as removing the unused portions of a protocol specification and recompiling the switch source code. (Section 5.2.2 shows how this can even improve performance.)

Adding better visibility. As data centers get larger and are used by more applications, it becomes important to understand the network's behavior and operating conditions. Failures can lead to huge loss in revenue, exacerbated by long debugging times as the network gets bigger and more complicated. There is growing interest in making it easier to see what the network is doing. Improving network visibility might entail supporting new statistics, generating new probe packets, or adding new protocols and actions to collect switch state (as is enabled by in-band network telemetry [42, 43]). Users will want to see how queues are evolving, how latencies are varying, whether tunnels are correctly terminated, and whether links are still up. Often, during an emergency, users want to quickly add visibility features. Having them ready to deploy, or being able to modify forwarding and monitoring logic quickly, may reduce the time to diagnose and fix a network outage.

Adding entirely new features. If users and network owners can modify the forwarding behavior, they may even add entirely new features. For example, over time we can expect switches to take on more complex routing, such as path-utilization-aware routing [4, 40], new congestion control mechanisms [8, 19, 39], source-controlled routing [58], new load-balancing algorithms [26], new methods to mitigate DDoS [5, 25], and new virtual-to-physical gateway functions [17]. If a network owner can upgrade infrastructure to achieve greater utilization or more control, then they will know best how to do it. Given the means to upgrade a program written in a DSL like P4 to add new features to a switch, we can expect network owners to improve their networks much more rapidly.

Figure 1: P4 abstract forwarding model (ingress → packet parser → custom match-action tables → packet deparser → egress).

Figure 2: OVS forwarding model (a slow path with generic match-action tables, and a fast path with a microflow cache and a megaflow cache; cache misses pass up to the next layer, while cache hits execute actions directly).

3 Background

PISCES is a software switch whose forwarding behavior is specified using a domain-specific language. PISCES is based on the Open vSwitch (OVS) [57] software switch and is configured using the P4 domain-specific language [10]. We describe both P4 and OVS below.

Domain-Specific Language: P4. P4 is a domain-specific language that expresses how the pipeline of a network forwarding element should process packets, using the abstract forwarding model shown in Figure 1. In this model, each packet first passes through a programmable parser, which extracts headers. The P4 program specifies the structure of each possible header, as well as a parse graph that expresses ordering and dependencies. Then, the packet passes through a series of match-action tables (MATs). The P4 program specifies the fields that each of these MATs may match and the control flow among them, as well as the spectrum of permissible actions for each table. At "runtime" (i.e., while the switch is forwarding packets), controller software may add, remove, and modify table entries with particular match-action rules that conform to the P4 program's specification. Finally, a deparser writes the header fields back onto the packet before sending it out the appropriate port.
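For instance, the parse graph for an Ethernet-then-IPv4 pipeline can be expressed in P4-14 roughly as follows (a sketch; the header declarations are omitted):

    parser start {
        return parse_ethernet;
    }
    parser parse_ethernet {
        extract(ethernet);
        return select(latest.etherType) {   // branch on the EtherType just extracted
            0x0800  : parse_ipv4;
            default : ingress;              // unrecognized types go straight to the tables
        }
    }
    parser parse_ipv4 {
        extract(ipv4);
        return ingress;
    }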

We choose P4 because its abstract model of a switch is similar to that of OpenFlow, the language built into OVS, which allows us to make straightforward apples-to-apples comparisons of OVS with and without a P4 front end. We considered other alternative bases, such as Click [44]—used in the Berkeley Extensible Software Switch (BESS) [30]—that allow for richer computation than match-action processing. However, for our purposes, P4 is sufficient to make the intended comparisons. There is merit to having a common way to express forwarding across all "plumbing" switches in a network, and to have code that is portable from one to another. Therefore, using the same language makes sense for these experiments. As BESS shows, there are other, more extensible applications for software switches that are outside the scope of our work.

Software Switch: Open vSwitch. Open vSwitch (OVS) is widely used in data centers as a software switch running inside the hypervisor. In such an environment, OVS switches packets among virtual interfaces to VMs and physical interfaces. OVS implements common protocols such as Ethernet, GRE, and IPv4, as well as newer protocols found in data centers, such as the VXLAN Group Based Policy (GBP) extension [67], Geneve [29], NVGRE [73], and STT [16] for virtual network overlays.

The Open vSwitch virtual switch has two important pieces, called the slow path and the fast path (i.e., the datapath), as shown in Figure 2. The slow path is a userspace program; it supplies most of the intelligence of OVS. The fast path acts as a caching layer that contains only the code needed to achieve maximum performance. Notably, the fast path must pass any packet that results in a cache miss to the slow path to get instructions for further processing. OVS includes a single, portable slow path and multiple fast-path implementations for different environments: one based on a Linux kernel module, another based on a Windows kernel module, and another based on Intel DPDK [34] userspace forwarding. The DPDK fast path yields the highest performance, so we use it for our work; with additional effort, our work could be extended to the other fast paths.

As an SDN switch, OVS relies on instructions from a controller to determine its behavior, specifically using the OpenFlow protocol [50]. OpenFlow specifies behavior in terms of a collection of match-action tables, each of which contains a number of entries called flows. In turn, a flow consists of a match, in terms of packet headers and metadata; actions that instruct the switch what to do when the match evaluates to true; and a numerical priority. When a packet arrives at a particular match-action table, the switch finds a matching flow and executes its actions; if more than one flow matches the packet, then the flow with the highest priority takes precedence.
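As an example (an illustrative rule of our own, not one from the paper), such a flow can be installed with OVS's ovs-ofctl utility; here "priority=100,ip,nw_dst=10.0.0.0/24" is the match and "dec_ttl,output:2" is the action list:

    ovs-ofctl add-flow br0 "table=0,priority=100,ip,nw_dst=10.0.0.0/24,actions=dec_ttl,output:2"

If another flow in table 0 also matched such a packet, the flow with the higher priority value would take precedence.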

A software switch that implements the behavior exactly as described above cannot achieve high performance, because OpenFlow packets often pass through several match-action tables, each of which requires general-purpose packet classification. Thus, OVS relies on caches to achieve good forwarding performance. The primary OVS cache is its megaflow cache, which is structured much like an OpenFlow [50] table. The idea behind the megaflow cache is that one could, in theory, combine all of the match-action tables that a packet visits while traversing the OpenFlow pipeline into a single table by computing their cross-product. This is infeasible, however, because the cross-product of k tables with n1, ..., nk rules might have as many as n1 × ... × nk rules. The megaflow cache functions somewhat like a lazily computed cross-product: when a packet arrives that does not match any existing megaflow cache entry, the slow path computes a new entry, which corresponds to one row in the theoretical cross-product, and inserts it into the cache. OVS uses a number of techniques to improve megaflow cache performance and hit rate [57].

Figure 3: The P4-to-OVS Compiler in PISCES (the P4 compiler translates a P4 program into parse, match, and action C code that is compiled into the OVS executable, together with a flow-rule type checker that validates runtime flow rules).

When a packet hits in the megaflow cache, the switch can process it significantly faster than the round trip from the fast path to the slow path that a cache miss would require. As a general-purpose packet classification step, however, a megaflow cache lookup still has a significant cost. Thus, Open vSwitch fast-path implementations also include a microflow cache, a hash table that maps from a packet five-tuple to a megaflow cache entry. The result of the microflow cache lookup can only be a hint, because megaflows often match on more fields than just the five-tuple, so a microflow cache entry can at best point to the most likely match. Thus, the fast path must verify that the megaflow cache entry indeed matches the packet. If it does match, the lookup cost is just that of the single hash table lookup. This lookup cost is generally much cheaper than general packet classification, so it is a significant optimization for traffic patterns with relatively long, steady streams of packets. If it does not match, then the packet continues through the usual megaflow cache lookup process, skipping the entry that it has already checked.

4 PISCES Prototype

Our PISCES prototype is a modified version of OVS with the parse, match, and action code replaced by C code generated by our P4 compiler. The workflow is as follows: First, the programmer creates a P4 program and uses the PISCES version of the P4 compiler (Section 4.1) to generate new parse, match, and action code for OVS. Second, OVS is compiled (using the regular C compiler) to create a protocol-dependent switch that processes packets as described in the P4 program. To modify a protocol, a user modifies the P4 program, which compiles to a new hypervisor switch binary.

We use OVS as the basis for PISCES because it is widely used and contains some basic scaffolding for a programmable switch, thus allowing us to focus only on the parts of the switch that need to be customized (i.e., parse, match, and action). The code is well structured, lending itself to modification, and test environments already exist. It also allows for apples-to-apples comparisons: We can compare the number of lines of code in unmodified OVS to the P4 program for PISCES (Section 5.1), and we can also compare their performance (Section 5.2).


4.1 The P4-to-OVS Compiler in PISCES

P4 compilers have two parts: a front end that turns the P4 code into a target-independent intermediate representation (IR), and a back end that maps the IR to the target. In our case, the back end optimizes CPU time, latency, or other objectives by manipulating the IR, and then generates C code that replaces the parsing, match, and action code in OVS, as shown in Figure 3. The P4-to-OVS compiler outputs C source code that implements everything needed to compile the corresponding switch executable. The compilation process also generates an independent type-checking program that the executable uses to ensure that any runtime configuration directives from the controller (e.g., insertion of flow rules) conform to the protocol specified in the P4 program.

Parse. The C code that replaces the original OVS parser is created by extending struct flow, the C structure that OVS uses to track protocol header fields, to include a member for each field specified by the P4 program, and by generating code to extract header fields from a packet into struct flow.

Match. OVS uses a general-purpose classifier data structure, based on tuple-space search [69], to implement matching. To perform custom matches, we do not need to modify this data structure or the code that manages it. Rather, the control plane can simply populate the classifier with new packet header fields at runtime, thereby automatically making those fields available for packet matching.

Action. The back end of our compiler supports custom actions by automatically generating code that we statically compile into the OVS binary. Custom actions can execute either in the OVS slow path or the fast path; the compiler determines where a particular action will run to ensure that the switch performs the actions efficiently. Certain actions (e.g., set_field) can execute in either component. The programmer can offer hints to the compiler as to whether a slow-path or fast-path implementation of an action is most appropriate.
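The paper does not specify the syntax of these hints; purely as a hypothetical illustration, such a hint could take the form of a pragma attached to an action definition:

    @pragma fast_path                      // hypothetical hint: prefer a fast-path implementation
    action set_tos(tos) {
        modify_field(ipv4.diffserv, tos);  // a set-field action that could run in either path
    }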

Control flow. In a switch, a packet's control flow is the sequence of match-action tables that the packet traverses. Whereas in P4 control flow must be specified at the program's compile time, in OVS control flow is specified at runtime, via flow entries, which makes it more flexible. Therefore, our compiler back end can implement P4 control semantics without OVS changes.

Optimizing the IR. The compiler back end contains an optimizer to examine and modify the IR, so as to generate high-performance C code. For example, a P4 program may include a complete IP checksum, but the optimizer can turn this operation into an incremental IP checksum to make it faster. The compiler also performs data-flow analysis on the IR [3], allowing it to coalesce and specialize the C code. The optimizer also decides when and where in the packet processing pipeline to edit packet headers. Some hardware switches postpone editing until the end of the pipeline, whereas software switches typically edit headers at each stage in the pipeline. If necessary, the optimizer converts the IR for inline editing. We describe the optimizer in more detail in Section 4.3.

As is the case with other P4 compilers [10, 36], the P4-to-OVS compiler also generates an API for the match-action tables, and extends the OVS command-line tools to work with the new fields.

4.2 Modifications to OVS

We need to make three modifications to OVS to enable it to implement the forwarding behavior described in any P4 program.

Arbitrary encapsulation and decapsulation. OVS does not support arbitrary encapsulation and decapsulation, which a P4 program might require. Each OVS fast path provides custom support for various fixed forms of encapsulation. The Linux kernel fast path and DPDK fast path, for example, each separately implement GRE [22], VXLAN [47], STT [16], and other encapsulations. The metadata required to encapsulate and decapsulate a packet for a tunnel is statically configured. The switch uses a packet's ingress port to map it to the appropriate tunnel; on egress, the packet is encapsulated in the corresponding IP header based on this static tunnel configuration. We therefore added two new primitives to OVS, add_header() and remove_header(), to perform encapsulation and decapsulation, respectively, and to perform these operations in the fast path.
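These primitives mirror P4-14's own add_header and remove_header actions. For example, a VLAN encapsulation action might be written as follows (a sketch with illustrative header-instance names):

    action push_vlan(vid) {
        add_header(vlan_tag);                      // make the vlan_tag instance valid so it is deparsed
        modify_field(vlan_tag.etherType, ethernet.etherType);
        modify_field(vlan_tag.vid, vid);
        modify_field(ethernet.etherType, 0x8100);  // mark the frame as 802.1Q-tagged
    }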

Conditionals based on comparison of header fields. OpenFlow directly supports only bitwise equality tests against header fields. Relational tests such as < and > that compare a k-bit field against a constant can be expressed as at most k rules that use bitwise equality matches. A relational test between two k-bit fields, such as x < y, requires k(k+1)/2 such rules. To simultaneously test for two such conditions that individually take n1 and n2 rules, one needs n1 × n2 rules. P4 directly supports such tests, but implementing them in OpenFlow this way is too expensive, so we added direct support for them in OVS as conditional actions, a kind of "if" statement for OpenFlow actions. For example, our extension allows the P4 compiler to emit an action of the form "If x < y, go to table 2, otherwise go to table 3."
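On the P4 side, such a test is simply a conditional in the program's control flow, which the compiler can now map onto one of these conditional actions (a sketch with illustrative table names):

    control ingress {
        if (ipv4.ttl > 1) {
            apply(routing);          // forward normally
        } else {
            apply(handle_expired);   // e.g., drop the packet or punt it to the control plane
        }
    }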

General checksum verify/update. An IP router should verify the checksum at ingress and recompute it at egress, and most hardware switches do it this way. A software router often skips checksum verification on ingress to reduce CPU cycles. Instead, it just incrementally updates the checksum if it changes any fields (e.g., the TTL).[1] Currently, OVS only supports incremental checksums, but we want to support other uses of checksums in the way the programmer intended. We therefore added an incremental checksum optimization, described in Section 4.3. Whether this optimization is valid depends on whether the P4 switch is acting as a forwarding element or an end host for a given packet—if it is an end host, then it must verify the checksum—so it requires annotation by the P4 programmer.

[1] If the checksum was incorrect before the update, it is still incorrect afterward, and we rely on the ultimate end host to discard the packet.

Optimization                         CPU Cycles   Slow-Path Trips
Inline- vs. post-pipeline editing        X
Incremental checksum                     X
Parser specialization                    X
Action specialization                    X
Action coalescing                        X
Cached field modifications               X               X
Stage assignment                         X               X

Table 1: Back-end optimizations and how they improve performance.

4.3 The Compiler’s Back-end Optimizer

Two aspects of a software switch ultimately affect forwarding performance: (1) the per-packet cost of fast-path processing (adding 100 cycles to this cost reduces the switch's throughput by about 500 Mbps), and (2) the number of packets sent to the slow path, which takes 50+ times as many cycles as the fast path to process a packet. Table 1 lists the optimizations that we have implemented, as well as whether each optimization reduces trips to the slow path, fast-path CPU cycles, or both. The rest of the section details these optimizations.

Inline editing vs. post-pipeline editing. The OVS fast path performs inline editing, applying packet modifications immediately (the slow path does some simple optimization to avoid redundant or unnecessary modifications). If many header fields are modified, removed, or inserted, it can become costly to move and resize packet data on the fly. Instead, it can be more efficient to delay editing until the headers have been processed (as hardware switches typically do). The optimizer analyzes the IR to determine how many times a packet may need to be modified in the pipeline. If the value is below a certain threshold, then the optimizer performs inline editing; otherwise, it performs post-pipeline editing. We allow the programmer to override this heuristic using a pragma directive.

Incremental checksum. By expressing a checksum operation in terms of a high-level program description such as P4, a programmer can provide the compiler with the contextual information necessary to implement the checksum more efficiently. For example, the programmer can inform the compiler via annotations that the checksum for each packet can be computed incrementally [51]; the optimizer can then perform data-flow analysis to determine which packet header fields change, thus making re-computation of the checksum more efficient.
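The key contextual fact is the one's-complement structure of the Internet checksum: per RFC 1624, if a 16-bit word of the header changes from m to m', the checksum HC can be updated as HC' = ~(~HC + ~m + m'), touching only the changed field rather than the whole header. In P4-14, the checksum itself is declared as a calculated field over a field list; the paper does not show the exact syntax of its new annotation, so the pragma below is hypothetical:

    field_list ipv4_field_list {
        ipv4.version; ipv4.ihl; ipv4.diffserv; ipv4.totalLen;
        ipv4.identification; ipv4.flags; ipv4.fragOffset;
        ipv4.ttl; ipv4.protocol; ipv4.srcAddr; ipv4.dstAddr;
    }
    field_list_calculation ipv4_checksum {
        input { ipv4_field_list; }
        algorithm : csum16;          // 16-bit one's-complement sum
        output_width : 16;
    }
    @pragma incremental_checksum     // hypothetical annotation (see Section 4.2)
    calculated_field ipv4.hdrChecksum {
        update ipv4_checksum;
    }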

Parser specialization. Protocol-independent software switches can optimize the implementation of the packet parser, since a customized packet-processing pipeline (as specified in a high-level language such as P4) provides specific information about which fields in the packet are modified or used as the basis for forwarding decisions. For example, a layer-2 switch that does not make forwarding decisions based on information at other layers can avoid parsing packet header fields at those layers. Specifying the forwarding behavior in a high-level language provides the compiler with information that it can use to optimize the parser.

Action specialization. The inline editing actions in the OVS fast path group together related fields that are often set at the same time. For example, OVS implements a single fast-path action that sets the IPv4 source, destination, type of service, and TTL value. This is efficient when more than one of these fields is to be updated at the same time, with little marginal cost if only one is updated. IPv4 has many other fields, but the fast path cannot set any of them.

The design of this aspect of OVS required domain expertise: its designers knew which fields were important for the fast path to be able to change. A P4 compiler does not have this kind of expert knowledge of which fields to group together, yielding a possible cost for grouping too few or too many fields into a single action. Fortunately, the high-level P4 description of the match-action control flow allows the optimizer to identify and eliminate redundant checks in the fast-path set actions, using optimizations like dead-code elimination [3]. This way, the optimizer only checks those fields in the set actions that will actually be set in the match-action control flow.

Action coalescing. By analyzing the control flow and match-action processing in the P4 program, the compiler can discover which fields are actually modified and can generate an efficient, single action to directly update those fields. Thus, if a rule modifies two fields, the optimizer only installs one action in OVS.

Cached field modifications. Network protocol data planes rarely require arithmetic operations on header fields. TTL decrement operations are the most obvious counterexample; checksums, already addressed above, are another. Thus, OVS fast paths do not include general-purpose arithmetic operations. In fact, they do not include a special-purpose TTL decrement operation either. Instead, to implement the special-purpose OpenFlow action to decrement a TTL, the slow path relies on the fact that most packets from a given source have the same TTL. Therefore, it emits a cache entry that matches on the TTL value observed in the packet that it is forwarding and overwrites this value with one less than that observed value, an approach we call "match-and-set." For TTL decrement, this solution is acceptable because the OVS designers know that caching this way yields a high hit rate in practice.[2]

Match-and-set is not always appropriate. As a straw man, consider updating the IPv4 or IPv6 checksum given a change in some other IP field. With a match-and-set approach, the cache entry would have to match on every field that contributes to the checksum, that is, every IP field, which would reduce the cache entry's hit rate nearly to zero. The same can be true for simpler arithmetic operations that P4 supports, such as incrementing or decrementing a field value, and in the end PISCES has no way to know whether match-and-set is appropriate in a given case.

[2] In addition, real-world uses of TTL decrement are always paired with a "TTL exceeded" check that would itself cause the cache entry to match on TTL, which would negate the value of a special-case TTL decrement action.


The solution that PISCES takes is to avoid match-and-set when it can, by automatically generating fast-path operations to implement the particular arithmetic operations that a P4 program requires. For example, if the program increments a particular field, PISCES generates a fast-path operation to increment that field. This is effective when the P4 program executes the arithmetic operation "blindly," without otherwise matching on the modified field's value. If the program does match on it, then, following the usual rules for caching, the cache entry must match on the field, so that a match-and-set approach is necessary.

Stage assignment. OVS implements staged lookup [57] to reduce the number of trips to the slow path. Staged lookup divides fields into an ordered list of groups, called stages. The stages are cumulative, so that each stage after the first contains all of the fields from the previous stages plus additional fields. The final stage contains every field. OVS implements each stage as a separate hash table in its tuple-space search classifier. A classifier lookup searches each of these stages in order. If any search yields no match, the overall search terminates, and only the fields included in the last stage must be matched in the cache entry.

OVS uses four such stages: the first stage is metadata fields (such as the packet's ingress port), the second is metadata and layer-2 fields, the third adds layer-3 fields, and the fourth includes all fields (i.e., metadata and layers 2, 3, and 4). This order is based on the principle that stages are most effective when their order corresponds to increasing entropy in the observed values of the fields in a network [66]. In the common case, for example, a cache entry that matches on metadata only is likely to have a higher hit rate than a cache entry that matches only on layer-4 fields, so metadata appears in an earlier stage (the first stage) than do layer-4 fields (the final stage).

Staged lookup generalizes to arbitrary P4 programs. This ordering cannot be inferred from the P4 program, so PISCES needs assistance to choose appropriate stages. We augmented the P4 language to enable a user to annotate each header with a stage number. The number of stages is the same as the number of headers.
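The paper does not show this annotation's concrete syntax; hypothetically, it might attach one stage number to each header declaration:

    @pragma stage 1              // hypothetical: low-entropy L2 fields probed early
    header ethernet_t ethernet;

    @pragma stage 2
    header ipv4_t ipv4;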

5 Evaluation

We compare the complexity and performance of a PISCES virtual software switch with equivalent OVS native packet processing. We compare the resulting programs along two dimensions: (1) complexity, including development and deployment complexity as well as maintainability; and (2) performance, by comparing the packet-forwarding performance of PISCES to the same native OVS functionality.

5.1 Complexity

Complexity indicates the ease with which a program may be modified to fix defects, meet new requirements, simplify future maintenance, or cope with changes in the software environment. We evaluate two categories of complexity: (1) the development complexity of developing baseline features for a software switch; and (2) the change complexity of maintaining an existing software switch.

         LoC      Methods   Method Size
OVS      14,535   106       137.13
PISCES   341      40        8.53

Table 2: Native OVS compared to equivalent baseline functionality implemented in PISCES.

                    Files Changed   Lines Changed
Connection Label:
  OVS [70, 71]      36              633
  PISCES            1               5
Tunnel OAM Flag:
  OVS [27, 28]      21              199
  PISCES            1               6
TCP Flags:
  OVS [61]          20              370
  PISCES            1               4

Table 3: The number of files and lines we needed to change to implement various functionality in P4, compiled with PISCES, compared to adding the same functionality to native OVS.


5.1.1 Development complexity

We evaluate development complexity with three different metrics: lines of code, method count, and average method size. We count lines of code simply by counting line-break characters, and the number of methods by counting the number of subroutines in each program, as measured using ctags [33]. Finally, we divide lines of code by the number of methods to arrive at the average method size. A high average might indicate that (some) methods are too verbose or complex.

Writing a compiler is a one-time cost. Whereas developers update their P4 programs frequently, the compiler is changed much less often—usually when the P4 language specification changes. For PISCES, we write about 1,000 lines of code for compiling P4 to C code, and an extra 1,700 lines of code to extend native OVS to incorporate the generated C code.

ovs.p4 [1] contains the representation of the headers, parsers, and actions that are currently supported in OVS. Much of the code in OVS is out of the scope of P4, so our measurements include only the files that are responsible for protocol definitions and header parsing. Table 2 summarizes each of these metrics for the native OVS header-field and parser implementation, and for the equivalent logic in P4.[3] PISCES reduces the lines of code by about a factor of 40 and the average method size by about a factor of 20.

5.1.2 Change complexity

To evaluate the complexity of maintaining a protocol-independent software switch in PISCES, we compare the effort required to add support for a new header field in a protocol that is otherwise already supported, in OVS and in P4.

[3] We reuse the same code for the match-action tables in both implementations because this logic generalizes to both OVS and a protocol-independent switch such as PISCES.


Figure 4: Topology of our evaluation platform (two MoonGen traffic sources/sinks, each connected by 3×10G links to the PISCES/DPDK switch under test).

Table 3 shows our analysis of the changes needed to add support for three fields: (1) connection label, 128 bits of custom metadata for the connection-tracking interface; (2) tunnel OAM flag, which many networking tools use to distinguish test packets from real traffic; and (3) TCP flags, a modification that adds support for parsing all of the TCP flags. The OVS numbers in Table 3 are based on the public Open vSwitch commits; they are conservative because they include only the changes to one of the three OVS fast-path implementations.

The results demonstrate that modifying just a few lines of code in a single P4 file is sufficient to support a new field, whereas in OVS, the corresponding change often requires hundreds of lines of changes over tens of files. Among other changes, one must add the field to struct flow, describe properties of the field in a global table, implement a parser for the field in the slow path, and separately implement a parser in one or more of the fast paths.

5.2 Forwarding Performance

In this section, we compare OVS and PISCES packet-forwarding performance.

5.2.1 Experiment setup and evaluation metrics

Figure 4 shows the topology of the setup for evaluating the forwarding performance of PISCES. We use three PowerEdge R730xd servers, each with two 8-core, 16-thread Intel Xeon E5-2640 v3 2.6 GHz CPUs, running the Proxmox Virtual Environment [59], an open-source server virtualization platform that uses virtual switches to connect VMs, with Proxmox kernel version 4.2.6-1-pve. Each of our machines is equipped with one dual-port and one quad-port Intel X710 10 Gbps NIC. We configure two of these machines with MoonGen [20] to send minimum-size 64-byte frames at the 14.88 million packets per second (Mpps) full line rate on three of the 10 Gbps interfaces [64], leaving the other interfaces unused. We connect these six interfaces to a third machine, the device under test, sending a total of 60 Gbps of traffic for PISCES to forward.

We consider throughput and packets per second to compare the forwarding performance of PISCES and OVS, using the MoonGen packet generator to generate test traffic for our experiments. We configure PISCES and OVS with six Poll Mode Driver (PMD) threads—one for each 10 Gbps interface—in a Run-to-Completion (RTC) model [35]. Each thread runs on a separate CPU core attached to one of the Non-Uniform Memory Access (NUMA) [45] nodes on the machine. To further understand performance bottlenecks, we use the machine's time-stamp counter (TSC) to measure the number of CPU cycles used by various packet processing operations (i.e., parser, megaflow cache lookup, and actions).

Figure 5: Forwarding performance for OVS with and without the microflow cache enabled, for input traffic of 60 Gbps across all six ports and one flow rule per port, as a function of packet size. (a) Forwarding performance in millions of packets per second, with a standard deviation of less than 0.035 Mpps for all data points. (b) Forwarding performance in gigabits per second, with a standard deviation of less than 0.026 Gbps for all data points.

When reporting CPU cycles, we report the average CPU cycles per packet over all packets forwarded in an experiment run; each run lasts for 30 seconds and has an ingress rate of 89.28 Mpps.

Calibrating OVS to enable performance comparison. To more accurately measure the cost of parsing for both OVS and PISCES in subsequent experiments, we begin by establishing a baseline for OVS performance with minimal parsing functionality. To minimize the cost of parsing, we disable the parser, which ordinarily parses a comprehensive fixed set of headers, so that it reports only the input port. After this change, we send test traffic through the switch with a trivial flow table that matches every packet that ingresses on port 1 and sends it to port 2.

We measured the performance of this modified OVS. Figures 5a and 5b show the maximum throughput that our setup achieves with OVS, with and without the microflow cache, for 60-Gbps traffic. For 64-byte packets, disabling the microflow cache reduces performance by about 35%, because a lookup in the OVS megaflow cache consumes five times as many cycles as the microflow cache (Table 4). For small packets, the OVS switch is CPU-bound on lookups; thus, in this operating regime, the benefit of the microflow cache is clear.

With this calibration in mind, for the remainder of this section, we use the forwarding performance of OVS with the microflow cache disabled as the basis for our performance comparison to PISCES. We disable the microflow cache because it relies on matching a hash of a packet's five-tuple, which most NICs can compute directly in hardware.


Switch Component     With MicroFlow   Without MicroFlow
Parser               19.0             18.9
MicroFlow Cache      18.9             —
MegaFlow Cache       —                92.2
Slow Path            —                —
Fast-Path Actions    39.9             38.8
End-to-End           100.6            166.0

Table 4: Average number of cycles per packet consumed by each element in the virtual switch when processing a 64-byte packet.

Figure 6: Throughput comparison of the L2L3-ACL benchmark application between OVS, PISCES, and PISCES (optimized), in gigabits per second as a function of packet size, with a standard deviation of less than 0.023 Gbps for all data points.

Although OVS's microflow cache significantly improves its forwarding performance, this feature relies on protocol-dependent features (specifically, that the packet has a five-tuple in the first place). Because our goal is to evaluate forwarding rates for protocol-independent switches, we disabled OVS's microflow cache so that we could compare PISCES, a protocol-independent switch, with a version of OVS that has no protocol-dependent optimizations. Comparing PISCES performance to that of OVS with microflow caching disabled thus offers a more apples-to-apples performance comparison, although it makes it difficult to interpret performance versus "real-life Open vSwitch." We expect that implementing a microflow cache in PISCES, by adding P4 annotations for the fields to be hashed and then hashing them in software, would recover most of the performance.

5.2.2 End-to-end performance

We next measure the forwarding performance of a real-world network application for both OVS and PISCES. This evaluation provides a clear illustration of the end-to-end performance costs of programmability. We select a realistic and relatively complex application for which both switch implementations provide all packet-processing features, to enable a fair performance comparison of PISCES in realistic network settings.

Figure 7 shows this application, which we call "L2L3-ACL." It performs the following operations:

• Parse Ethernet, VLAN, IP, TCP, and UDP protocols.

• Perform VLAN encapsulation and decapsulation.

• Perform control-flow and match-action operations according to Figure 7 to implement an access control list (ACL).

• Set Ethernet source, destination, type, and VLAN fields.

• Decrement the IP TTL value.

• Update the IP checksum.

Table 5 shows the forwarding performance results for this application. The most important rows are the last two, which show a "bottom line" comparison between OVS and PISCES, after we apply all compiler optimizations. These results show that both the average number of CPU cycles per packet and the average throughput for PISCES with all compiler optimizations are comparable to OVS with microflow caching disabled: both require just over an average of 400 CPU cycles per packet, and both achieve throughput of just over 13 Gbps—a performance overhead of less than 2%. Figure 6 demonstrates that this result also holds for larger packet sizes. In all cases, PISCES with optimizations enabled in its compiler achieves performance comparable to OVS.

Next, we discuss in more detail the performance benefits that each compiler optimization achieves for this end-to-end application.

Individual compiler optimizations. P4 supports post-pipeline editing, so we start by compiling L2L3-ACL with post-pipeline editing. PISCES requires an average of 737 cycles to process a 64-byte packet. Packet parsing and fast-path actions are primarily responsible for these additional CPU cycles. As our microbenchmarks demonstrate (Section 5.2.3), if the number of adjustments to packets is less than eight, using inline-editing mode provides better forwarding performance. Based on that insight, the PISCES version of the P4 compiler uses inline editing, which reduces the number of cycles consumed by the parser by about 56%. However, the fast-path actions' cycle count increased slightly (still 255 cycles more than OVS).

Next, we introduce incremental checksum updates to reduce the number of cycles consumed by the fast-path actions. The only IP field that is modified is the TTL, but the full checksum verify-and-update design supported by the P4 abstract model runs the checksum over entire headers, once at ingress and once at egress. For our P4 program, we specify that we want to use an incremental checksum. Using this knowledge, instead of recalculating the checksum over all header fields, the P4 compiler applies data-flow analysis to the P4 program (its MATs and control flow), determines that the pipeline modifies only the TTL, and adjusts the checksum using only that field. This reduces the number of cycles consumed by the fast-path actions by 59.7%, a significant improvement. However, PISCES still consumes 23.24 more cycles per packet than OVS.

To further improve the performance, we apply action specialization and coalescing, and parser specialization (Section 4.3). This brings the number of cycles consumed per packet by PISCES to 425.82.

Parser specialization. A protocol-independent switch only needs to parse the packet-header fields for the protocols defined by the programmer. The compiler in PISCES can optimize the parser further, to parse only the header fields that the switch needs to process the packet.


Switch   Optimization            Parser   MegaFlow Cache   Fast-Path Actions   End-to-End (Avg.)   Throughput (Mbps)
PISCES   Baseline                 76.5        209.5             379.5               737.4              7590.7
         Inline Editing          -42.6        —                  +7.5               -45.4              +281.0
         Inc. Checksum            —           —                -231.3              -234.5             +4685.3
         Action Specialization    —           —                 -10.3                -9.2              +191.2
         Parser Specialization   -4.6         —                  —                   -7.6              +282.3
         Action Coalescing        —           —                 -14.6               -14.8              +293.0
         All optimizations        29.7        209.0             147.6               425.8             13323.7
OVS      —                        43.6        197.5             132.5               408.7             13497.5

Table 5: Improvement in the average number of cycles per packet consumed by each element of the virtual switch when processing 64-byte packets, for the L2L3-ACL benchmark application. (Most listed optimizations in the PISCES version of the P4 compiler have no counterpart in OVS, but OVS does implement incremental checksums.)

[Figure 7 diagram omitted: a pipeline of match-action tables. VLAN Ingress Processing (match: ingress_port, vlan.vid; actions: add_vlan, no_op); MAC Learning (match: eth.src; actions: learn, no_op); Switching (match: eth.dst, vlan.vid; actions: forward, bcast); Routing (match: ip.dst; actions: nexthop, drop); Routable (match: eth.src, eth.dst, vlan.vid; action: no_op); ACL (match: ip.src, ip.dst, ip.prtcl, port.src, port.dst; actions: no_op, drop); VLAN Egress Processing (match: egress_port, vlan.vid; actions: remove_vlan, no_op).]

Figure 7: Control flow of the L2L3-ACL benchmark application. Each of these tables contains a list of fields to match on and a set of actions to choose from when installing a flow rule. For example, in VLAN Ingress Processing, one can match on ingress port and VLAN id, and can perform add_vlan or no_op actions.

To evaluate the potential benefits of this specialization, we repeat our end-to-end performance evaluation using two subsets of the L2L3-ACL program: the "L2L3" program, which does not perform the ACL functions, and the "L2" program, which manipulates the Ethernet and VLAN headers and performs VLAN encapsulation, but which does not parse any IP headers or decrement the TTL (and thus does not update the IP checksum). In terms of the control flow of the original L2L3-ACL benchmark program from Figure 7, the L2L3 program removes the dark grey ACL tables, and the L2 program additionally removes the light grey Routable and Routing tables.

Table 6 compares the forwarding performance of OVS and PISCES for these two programs. For L2L3, PISCES consumes four more cycles per packet than OVS. However, PISCES parses packets faster: compared to L2L3-ACL, parsing in L2L3 is about seven cycles per packet cheaper. OVS uses a fixed parser, so its parsing cost remains constant. Parser specialization removes redundant parsing of fields that are not used in the control flow (i.e., the TCP and UDP headers). Because OVS does not know the control-flow and match-action table structure a priori, its parser cannot achieve the same specialization. For the L2 application, the parser can specialize further, since it needs only to parse Ethernet headers. In this case, PISCES actually processes packets more quickly than the protocol-dependent switch. The sketch below illustrates the kind of specialized parser the compiler can emit.
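To make the idea concrete, the following is a minimal sketch, in C, of the kind of specialized parser the compiler could generate for the L2 program, which matches only on Ethernet and VLAN fields. The flow-key layout and names are hypothetical; the generated OVS code differs in detail.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    #include <arpa/inet.h>

    /* Flow key for the L2 program: only the fields its tables match on.
     * IP, TCP, and UDP fields are never used, so they are not extracted. */
    struct l2_flow_key {
        uint8_t  eth_dst[6];
        uint8_t  eth_src[6];
        uint16_t eth_type;  /* network order */
        uint16_t vlan_tci;  /* 0 if untagged */
    };

    static int
    parse_l2(const uint8_t *pkt, size_t len, struct l2_flow_key *key)
    {
        if (len < 14) {
            return -1;                       /* runt frame */
        }
        memcpy(key->eth_dst, pkt, 6);
        memcpy(key->eth_src, pkt + 6, 6);
        memcpy(&key->eth_type, pkt + 12, 2);
        key->vlan_tci = 0;
        if (key->eth_type == htons(0x8100) && len >= 18) {  /* 802.1Q */
            memcpy(&key->vlan_tci, pkt + 14, 2);
            memcpy(&key->eth_type, pkt + 16, 2);
        }
        return 0;   /* parsing stops here: no IP/TCP/UDP extraction */
    }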

5.2.3 Microbenchmarks

We now evaluate the performance of individual components of PISCES. We focus on the parser and the actions, which are applied to every incoming packet and have the largest effect on performance, and benchmark how increasing complexity in each affects the overall performance of PISCES.

Parser performance. Figure 8a shows how per-packet cycle counts increase as the P4 program parses additional protocols, for both post- and inline-editing modes. To parse only the Ethernet header, the parser consumes about 20 cycles in either mode. As we introduce new protocols, the cycle count increases, more rapidly for post-pipeline editing, for which the switch creates an extra copy of the protocol headers for the fast-path actions. For the largest protocol combination in Figure 8a, the parser requires about 133 cycles (almost six times as many as for an Ethernet frame alone) for post-pipeline editing and 54 cycles for inline editing. Figure 8b shows how throughput decreases with the addition of each new protocol to the parser. For input traffic at 60 Gbps, switching throughput decreases about 35%, from 51.1 Gbps to 33.2 Gbps, for post-pipeline editing and about 24%, from 52.4 Gbps to 40.0 Gbps, for inline editing.

Fast-path action performance. Performance-wise, the dominant action in a virtual switch is the set-field (or modify-field) action, in other words, a write action. Figure 9 shows the per-packet cost, in cycles, as we increase the number of set-field actions in the fast path, for both post- and inline-editing modes.


Switch   Program   Optimizations   Parser   MegaFlow Cache   Fast-Path Actions   End-to-End (Avg.)   Throughput (Mbps)
PISCES   L2L3      Optimized        22.9       188.4             130.5               392.3             14159.1
OVS      L2L3      —                43.6       176.0             131.8               388.3             14152.2
PISCES   L2        Optimized        19.7       148.2              90.9               305.7             18118.5
OVS      L2        —                43.6       155.2              78.7               312.1             17131.3

Table 6: Average number of cycles per packet consumed by each element of the virtual switch when processing 64-byte packets, for the L2L3 and L2 benchmark applications.

[Figure 8 plots omitted: (a) per-packet parser CPU cycles and (b) end-to-end throughput in Gbps versus protocol combinations Eth, +IP, +UDP, +VXLAN, +Eth+IP+ICMP, for post-pipeline and inline editing.]

Figure 8: Effect on parser CPU cycles and end-to-end throughput as more protocols are added to the parser. The standard deviation of throughput is less than 0.063 Gbps for all data points.

In post-pipeline editing mode, we apply our changes to a copy of the header fields (extracted from the packet) and, at the end of the pipeline, execute a "deparse" action that writes the changes back to the packet. The "deparse" bar shows that deparsing consumes about 99 cycles even if no fields are modified, whereas inline editing has no cost in this case. As the number of writes increases, the performance difference between the two modes narrows: for 16 writes, the difference is 20 cycles less than for a single write. Still, in both cases the number of cycles increases. In the post-editing case, 16 writes consume 354 cycles, about 3.6 times the cost of a single write; with inline editing, 16 writes consume 319 cycles, or about 5.6 times the cost of a single write. The sketch below contrasts the two strategies.
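The following minimal sketch contrasts the two strategies in C; the types and names are illustrative assumptions, not PISCES-generated code. Inline editing pays per write into the packet buffer, while post-pipeline editing batches writes into an extracted copy and pays the deparse cost once.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Extracted copy of the parsed headers, used in post-pipeline mode. */
    struct hdr_copy {
        uint8_t data[128];
        size_t  len;
    };

    /* Inline editing: each set-field writes straight into the packet. */
    static void
    set_field_inline(uint8_t *pkt, size_t ofs, const void *val, size_t n)
    {
        memcpy(pkt + ofs, val, n);
    }

    /* Post-pipeline editing: set-fields mutate the extracted copy... */
    static void
    set_field_post(struct hdr_copy *h, size_t ofs, const void *val, size_t n)
    {
        memcpy(h->data + ofs, val, n);
    }

    /* ...and one deparse at the end of the pipeline writes everything
     * back: a fixed cost (about 99 cycles in Figure 9) paid even when
     * no field was modified. */
    static void
    deparse(uint8_t *pkt, const struct hdr_copy *h)
    {
        memcpy(pkt, h->data, h->len);
    }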

We also measured cycles per packet for adding and removing headers. Figures 10 and 11 show cycles per packet for an increasing number of add-header and remove-header actions, respectively, in the post-pipeline and inline-editing modes.

[Figure 9 plot omitted: per-packet CPU cycles versus the number of set-field actions (Deparse, x1, x2, x4, x8, x16), for post-pipeline and inline editing.]

Figure 9: Fast-path set-field action performance.

For the add-header action in inline-editing mode, the number of cycles doubles with every new action, because these actions are applied directly to the packet, adjusting the packet size each time. In contrast, post-pipeline editing adjusts the packet size only once, in the "deparse" action, so the number of cycles consumed remains almost constant. For a single add-header action, the post-editing cost is higher, but for four or more actions the inline-editing mode is more costly. For 16 add-header actions, inline editing consumes 577 more cycles per packet than post-pipeline editing.

We observe a similar trend for the remove-header action, with one additional wrinkle: as the number of remove-header actions increases, the cost of post-pipeline editing actually decreases slightly, because fewer bytes need to be adjusted as the packet shrinks. As we increase the number of remove-header actions from 1 to 16, the per-packet cycle count decreases by about 21%. This led us to the following rule of thumb: for fewer than eight packet-size adjustments (i.e., add- and remove-header actions), the compiler uses inline editing; otherwise, it applies post-pipeline editing, as the added cycles the parser needs to generate a copy of the parsed packet headers are offset by the cycles the add- and remove-header actions would require in inline-editing mode.

Slow-path forwarding performance. When OVS must send all packets to the slow path, it takes on average about 3,500 cycles to process a single packet (about 50 times the cycles incurred for a microflow cache hit). In this case, the maximum packet forwarding rate is about 0.66 Mpps, regardless of packet size. This per-packet cycle count for slow-path processing was for the simplest possible program, which sends every packet to the same output port; most real packet-processing programs would require significantly more cycles. For example, for the L2L3-ACL program, slow-path processing required anywhere from 30,000 to 60,000 CPU cycles per packet.


[Figure 10 plot omitted: per-packet CPU cycles versus the number of add-header actions (Deparse, x1, x2, x4, x8, x16), for post-pipeline and inline editing.]

Figure 10: Fast-path add-header performance.

[Figure 11 plot omitted: per-packet CPU cycles versus the number of remove-header actions (Deparse, x1, x2, x4, x8, x16), for post-pipeline and inline editing.]

Figure 11: Fast-path remove-header performance.

These performance numbers indicate the importance of the megaflow cache optimizations described in Section 4.3 for reducing the number of trips to the slow path. Clearly, the number of trips to the slow path depends on the actual traffic mix (because the mix affects hit rates in the megaflow cache), so it is difficult to state general results about the benefits of these optimizations, but computing the slowdown that results from cache misses is straightforward, as the sketch below illustrates.
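As an illustration of that computation, the expected per-packet cost is a blend of the fast- and slow-path cycle counts, weighted by the miss rate. The function below is our own sketch using the measurements quoted above; it is not part of PISCES.

    /* Expected per-packet cost given a megaflow-cache miss rate. */
    static double
    avg_cycles_per_packet(double miss_rate,
                          double fast_path_cycles,
                          double slow_path_cycles)
    {
        return miss_rate * slow_path_cycles
               + (1.0 - miss_rate) * fast_path_cycles;
    }

For example, avg_cycles_per_packet(0.01, 425.8, 30000.0) is roughly 721 cycles: even a 1% miss rate against the lower end of the L2L3-ACL slow-path cost increases the average per-packet cost by about 70%.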

Control flow. Control flow in OVS, and thus in PISCES, is implemented in the slow path. It incurs a small one-time cost at the setup of every new flow, which is impossible to separate from slow-path performance in general.

6 Related Work

The protocols and packet-processing functions of PISCES can be specified using a high-level domain-specific language for packet processing. Although PISCES uses P4 as its high-level language and OVS as its software switch, previous work has developed both domain-specific languages for packet processing and virtual software switches, where our approaches for achieving protocol independence and efficient compilation from a DSL to a software switch may also apply.

Domain-specific languages for packet processing. The P4 language provided the main framework for protocol independence [10]; PISCES realizes protocol independence in a real software switch. P4 itself borrows concepts from prior work [7, 23, 48]; as such, it may be possible to apply the concepts that we have implemented in PISCES to other high-level languages. Although PISCES compiles P4 to OVS source code, the concepts and optimizations that we have developed could apply to other high-level languages and target switches; an intermediate representation such as NetASM [65] could ultimately provide a mechanism for a compiler to apply optimizations across a variety of languages and targets. Pyretic [62] and Frenetic [24] are domain-specific languages that specify how packets should be processed by a fixed-function OpenFlow switch; they would require significant adaptation to take advantage of the abilities of a programmable switch. Compiling packet programs to reconfigurable hardware switches [36] and FPGAs [12, 63] also differs from compiling to software switches: for hardware switches, the focus is on constrained optimization problems where, given a relatively small chip or memory footprint, the goal is to use that space optimally while satisfying dependencies. Such an approach is not likely to be effective for software switches, which do not have the same kinds of constraints.

Virtual software switches. Existing methods and frameworks for building software switches, such as the Linux kernel [46], DPDK [34], Netmap [64], Click [44], and BPF [14, 15, 49], require intimate knowledge of the underlying implementation and thus make it difficult for a network programmer to rapidly adapt these virtual switches and add new features to them. PISCES, on the other hand, allows a programmer to specify packet-processing behavior independent of the underlying implementation details. Open vSwitch (OVS) [57] provides interfaces for populating its match-action tables but does not provide mechanisms to customize protocols and actions.

Other programmable switches. Software routers such as RouteBricks [18], PacketShader [31], and GSwitch [72] rely on general-purpose processors or GPUs to process packets; these designs generally focus on optimizing server, network-interface, and processor scheduling to improve the performance of the software switch. These switches do not enable programmability through a high-level domain-specific language such as P4, and they also do not function as hypervisor switches. CuckooSwitch [74] can be used as a hypervisor switch, but it focuses on providing fast forwarding-table lookups using highly concurrent hash tables based on Cuckoo hashing [54], and it also does not provide a high-level domain-specific language for configuring the switch. SwitchBlade [6] enables some amount of protocol customization and forwards packets at hardware speeds, but it too acts as a standalone switch and requires an FPGA as a target.

Measuring performance. Previous work has both measured [9, 21] and improved [14, 15, 34, 49, 56, 57, 64] the performance of software virtual switches. Work on measurement has converged on a set of performance metrics for comparing switch architectures and implementations; our evaluation uses these metrics to compare the performance of PISCES to that of other virtual switches.

Measuring complexity. Software engineering has developed a number of metrics for measuring the complexity and maintainability of a program written in a domain-specific language [13, 32, 37, 38, 52]. One of the goals of PISCES is to make it easier for the programmer to develop and maintain code.


For our evaluation, we use these software-engineering metrics to compare the complexity of writing a program in P4 with that of directly modifying the OVS source code in C.

7 Conclusion

The increasing use of software hypervisor switches in data centers has introduced the need to rapidly modify the packet-forwarding behavior of these switches. Today, such modifications require both intimate knowledge of the switch codebase and extensive expertise in network protocol design, making the bar for customizing these software switches prohibitively high. As an alternative, we developed PISCES, a programmable, protocol-independent software switch that allows a protocol designer to specify a software switch's custom packet-processing behavior in a high-level domain-specific language (in our case, P4); a compiler then produces source code for the underlying target software switch (in our case, OVS). PISCES programs are about 40 times more concise than the equivalent programs in the native code of the software switch. We demonstrated that, with appropriate compiler optimizations, this drastic reduction in complexity incurs only a small performance overhead compared to the native software switch implementation.

Our prototype demonstrates the feasibility of a protocol-independent software switch using P4 as the programming language and OVS as the target switch. Moreover, our techniques for software-switch protocol independence and for compiling a domain-specific packet-processing language to an efficient low-level implementation should generalize to other languages and targets. One way to achieve language- and target-independence would be to first compile the domain-specific language to a protocol-independent high-level intermediate representation (HLIR), such as protocol-oblivious forwarding [68] or NetASM [65], and then apply the techniques and optimizations from PISCES to the HLIR.

Another future enhancement for PISCES is to enable custom parse, match, and action code to be dynamically loaded into a running protocol-independent switch. PISCES currently requires recompiling the switch source code every time the programmer changes the P4 specification. In certain situations, such as adding new features and protocols to running production switches or temporarily altering protocol behavior to add visibility or defend against an attack, dynamically loading code into a running switch would be valuable. We expect future programmable protocol-independent software switches to support dynamically loading new or modified packet-processing code. Finally, PISCES does not implement P4 features that maintain state across packets (i.e., counters, meters, or registers), which would require extending and generalizing the Open vSwitch caching model to achieve acceptable performance.

It is too early to see the effects of PISCES on protocol development, but the resulting code simplicity should make it easier to deploy, implement, and maintain custom software switches. In particular, protocol designers can maintain their custom software-switch implementations in terms of a high-level domain-specific language like P4 without needing to track the evolution of the (larger and more complex) underlying software-switch codebase. The ability to develop proprietary customizations without having to modify (and track) the source code of a software switch such as OVS might also be a selling point for protocol designers. We intend to study and characterize these effects as we release PISCES and interact with the protocol designers who use it.

Acknowledgments

We thank our shepherd Jeff Mogul, William Tu, and the anonymous SIGCOMM reviewers for their valuable feedback, which helped improve the quality of this paper. We also thank Chaitanya Kodeboyina, Mihai Budiu, Ramkumar Krishnamoorthy, Antonin Bas, Abhinav Narain, and Bilal Anwer for their invaluable support at various stages of this project. This research was supported by the Open Networking Research Center (ONRC), the Stanford Platform Lab, National Science Foundation (NSF) Awards CNS-1531281 and CNS-1162112, and a generous gift from Intel.

References

[1] P4 program for OVS, June 2015. https://github.com/blp/ovs-reviews/blob/p4-workshop/tests/ovs.p4.
[2] P4-vSwitch. https://github.com/P4-vSwitch, 2016.
[3] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison Wesley, 1986.
[4] M. Alizadeh, T. Edsall, S. Dharmapurikar, R. Vaidyanathan, K. Chu, A. Fingerhut, V. T. Lam, F. Matus, R. Pan, N. Yadav, and G. Varghese. CONGA: Distributed Congestion-aware Load Balancing for Datacenters. In ACM SIGCOMM, 2014.
[5] D. G. Andersen, H. Balakrishnan, N. Feamster, T. Koponen, D. Moon, and S. Shenker. Accountable Internet Protocol (AIP). In ACM SIGCOMM, 2008.
[6] M. B. Anwer, M. Motiwala, M. b. Tariq, and N. Feamster. SwitchBlade: A Platform for Rapid Deployment of Network Protocols on Programmable Hardware. In ACM SIGCOMM, 2010.
[7] G. Back. DataScript: A Specification and Scripting Language for Binary Data. In ACM SIGPLAN. Springer-Verlag, 2002.
[8] W. Bai, L. Chen, K. Chen, D. Han, C. Tian, and H. Wang. Information-agnostic Flow Scheduling for Commodity Data Centers. In USENIX NSDI, 2015.
[9] A. Bianco, R. Birke, L. Giraudo, and M. Palacin. OpenFlow Switching: Data Plane Performance. In IEEE ICC, 2010.
[10] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese, and D. Walker. P4: Programming Protocol-independent Packet Processors. ACM SIGCOMM CCR, July 2014.
[11] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, M. Izzard, F. Mujica, and M. Horowitz. Forwarding Metamorphosis: Fast Programmable Match-action Processing in Hardware for SDN. In ACM SIGCOMM, 2013.
[12] G. Brebner. Programmable Hardware for Software Defined Networks. In IEEE ECOC, 2015.
[13] D. Coleman, D. Ash, B. Lowther, and P. Oman. Using Metrics to Evaluate Software System Maintainability. IEEE Computer, 1994.
[14] J. Corbet. BPF: The Universal In-kernel Virtual Machine. Linux Weekly News, Eklektix Inc, 2014.
[15] J. Corbet. Extending BPF. Linux Weekly News, Eklektix Inc, 2014.
[16] B. Davie and J. Gross. A Stateless Transport Tunneling Protocol for Network Virtualization (STT). Internet-Draft draft-davie-stt-08, Internet Engineering Task Force, Apr. 2016. Work in Progress.
[17] M. Dillon and T. Winters. Network Functions Virtualization in Home Networks. Technical report, Open Networking Foundation, 2015. https://www.opennetworking.org/images/stories/downloads/sdn-resources/IEEE-papers/network-func-virt-in-home-networks.pdf.
[18] M. Dobrescu, N. Egi, K. Argyraki, B.-G. Chun, K. Fall, G. Iannaccone, A. Knies, M. Manesh, and S. Ratnasamy. RouteBricks: Exploiting Parallelism to Scale Software Routers. In SOSP, 2009.
[19] N. Dukkipati, G. Gibb, N. McKeown, and J. Zhu. Building a RCP (Rate Control Protocol) Test Network. In HOTI, 2007.
[20] P. Emmerich, S. Gallenmuller, D. Raumer, F. Wohlfart, and G. Carle. MoonGen: A Scriptable High-Speed Packet Generator. In IMC, 2015.
[21] P. Emmerich, D. Raumer, F. Wohlfart, and G. Carle. Performance Characteristics of Virtual Switching. In IEEE CloudNet, 2014.
[22] D. Farinacci, S. P. Hanks, D. Meyer, and P. S. Traina. Generic Routing Encapsulation (GRE). RFC 2784, Mar. 2000.
[23] K. Fisher and R. Gruber. PADS: A Domain-specific Language for Processing Ad Hoc Data. In PLDI, 2005.
[24] N. Foster, R. Harrison, M. J. Freedman, C. Monsanto, J. Rexford, A. Story, and D. Walker. Frenetic: A Network Programming Language. In ICFP, 2011.
[25] T. M. Gil and M. Poletto. MULTOPS: A Data-structure for Bandwidth Attack Detection. In USENIX Security, 2001.
[26] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. VL2: A Scalable and Flexible Data Center Network. In ACM SIGCOMM, 2009.
[27] J. Gross. Tunnel: Add support for matching on OAM packets. Git commit 94872594b79d in [53], May 2014.
[28] J. Gross. Tunneling: Allow matching and setting tunnel 'OAM' flag. Git commit b666962be3b2 in [53], July 2015.
[29] J. Gross and I. Ganga. Geneve: Generic Network Virtualization Encapsulation. Internet-Draft draft-ietf-nvo3-geneve-01, Internet Engineering Task Force, Jan. 2016. Work in Progress.
[30] S. Han, K. Jang, A. Panda, S. Palkar, D. Han, and S. Ratnasamy. SoftNIC: A Software NIC to Augment Hardware. Technical Report UCB/EECS-2015-155, UC Berkeley, May 2015.
[31] S. Han, K. Jang, K. Park, and S. Moon. PacketShader: A GPU-accelerated Software Router. In ACM SIGCOMM, 2010.
[32] N. Heirbaut and T. Van Der Storm. Two implementation techniques for domain specific languages compared: OMeta/JS vs. JavaScript. Master's thesis, Universiteit van Amsterdam, 2009.
[33] D. Hiebert. Ctags User Commands Version 5.8-1. Exuberant Ctags.
[34] Intel. DPDK: Data Plane Development Kit. http://dpdk.org, 2013.
[35] Intel. DPDK: Programmer's Guide, 2013. http://dpdk.org/doc/guides/prog_guide/index.html.
[36] L. Jose, L. Yan, G. Varghese, and N. McKeown. Compiling Packet Programs to Reconfigurable Switches. In USENIX NSDI, 2015.
[37] S. H. Kan. Metrics and Models in Software Quality Engineering. Addison Wesley, 2nd edition, 2002.
[38] C. Kaner et al. Software engineering metrics: What do they measure and how do we know? In IEEE METRICS. CiteSeer, 2004.
[39] D. Katabi, M. Handley, and C. Rohrs. Congestion Control for High Bandwidth-delay Product Networks. In ACM SIGCOMM, 2002.
[40] N. Katta, M. Hira, C. Kim, A. Sivaraman, and J. Rexford. HULA: Scalable Load Balancing Using Programmable Data Planes. In SOSR, 2016.
[41] C. Kim. Programming the Network Dataplane in P4, 2016. http://netseminar.stanford.edu/03_31_16.html.
[42] C. Kim, P. Bhide, E. Doe, H. Holbrook, A. Ghanwani, D. Daly, M. Hira, and B. Davie. In-band Network Telemetry (INT), 2016. http://p4.org/wp-content/uploads/fixed/INT/INT-current-spec.pdf.
[43] C. Kim, A. Sivaraman, N. Katta, A. Bas, A. Dixit, and L. J. Wobker. In-band Network Telemetry via Programmable Dataplanes. In ACM SIGCOMM, 2015. Demo.
[44] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek. The Click Modular Router. ACM TOCS, Aug. 2000.
[45] C. Lameter. NUMA (Non-Uniform Memory Access): An Overview. ACM Queue, 2013.
[46] Linux Kernel Archives. http://kernel.org, 1997.
[47] M. Mahalingam, T. Sridhar, M. Bursell, L. Kreeger, C. Wright, K. Duda, P. Agarwal, and D. Dutt. Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks. RFC 7348, Oct. 2015.
[48] P. J. McCann and S. Chandra. Packet Types: Abstract Specification of Network Protocol Messages. In ACM SIGCOMM, 2000.
[49] S. McCanne and V. Jacobson. The BSD Packet Filter: A New Architecture for User-level Packet Capture. In USENIX, 1993.
[50] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: Enabling Innovation in Campus Networks. ACM SIGCOMM CCR, Mar. 2008.
[51] Network Working Group. RFC 1624: Computation of the Internet Checksum via Incremental Update, May 1994.
[52] P. Oman and J. Hagemeister. Metrics for assessing a software system's maintainability. In Conference on Software Maintenance, 1992.
[53] Open vSwitch. https://github.com/openvswitch/ovs, October 2015.
[54] R. Pagh and F. F. Rodler. Cuckoo hashing. Elsevier Journal of Algorithms, 2004.
[55] I. Pepelnjak. Packet Forwarding in Amazon VPC, December 2013. http://blog.ipspace.net/2013/12/packet-forwarding-in-amazon-vpc.html.
[56] B. Pfaff. P4 Parsing in Open vSwitch, June 2015. P4 Workshop, http://p4workshop2015.sched.org/event/3ZQF.
[57] B. Pfaff, J. Pettit, T. Koponen, E. J. Jackson, A. Zhou, J. Rajahalme, J. Gross, A. Wang, J. Stringer, P. Shelar, K. Amidon, and M. Casado. The Design and Implementation of Open vSwitch. In USENIX NSDI, 2015.
[58] S. Previdi et al. SPRING Problem Statement and Requirements. IETF, June 2015. https://datatracker.ietf.org/doc/draft-ietf-spring-problem-statement.
[59] Proxmox Virtual Environment. https://www.proxmox.com/en/proxmox-ve.
[60] P. Quinn and U. Elzur. Network Service Header. Internet-Draft draft-ietf-sfc-nsh-04, Internet Engineering Task Force, Mar. 2016. Work in Progress.
[61] J. Rajahalme. TCP flags matching support. Git commit dc235f7fbcff in [53], October 2013.
[62] J. Reich, C. Monsanto, N. Foster, J. Rexford, and D. Walker. Modular SDN Programming with Pyretic. USENIX ;login:, 2013.
[63] T. Rinta-Aho, M. Karlstedt, and M. P. Desai. The Click2NetFPGA Toolchain. In USENIX ATC, 2012.
[64] L. Rizzo. Netmap: A Novel Framework for Fast Packet I/O. In USENIX ATC, June 2012.
[65] M. Shahbaz and N. Feamster. The Case for an Intermediate Representation for Programmable Data Planes. In SOSR, 2015.
[66] N. Shelly, E. J. Jackson, T. Koponen, N. McKeown, and J. Rajahalme. Flow Caching for High Entropy Packet Fields. In HotSDN, 2014.
[67] M. Smith and L. Kreeger. VXLAN Group Policy Option. Internet-Draft draft-smith-vxlan-group-policy-02, Internet Engineering Task Force, Apr. 2016. Work in Progress.
[68] H. Song. Protocol-oblivious Forwarding: Unleash the Power of SDN Through a Future-proof Forwarding Plane. In HotSDN, 2013.
[69] V. Srinivasan, S. Suri, and G. Varghese. Packet Classification Using Tuple Space Search. In ACM SIGCOMM, 1999.
[70] J. Stringer. datapath: Allow matching on conntrack label. Git commit 038e34abaa31 in [53], December 2012.
[71] J. Stringer. Add connection tracking label support. Git commit 9daf23484fb1 in [53], October 2013.
[72] M. Varvello, R. Laufer, F. Zhang, and T. Lakshman. Multi-Layer Packet Classification with Graphics Processing Units. In CoNEXT, 2014.
[73] Y.-S. Wang and P. Garg. NVGRE: Network Virtualization Using Generic Routing Encapsulation. RFC 7637, Oct. 2015.
[74] D. Zhou, B. Fan, H. Lim, M. Kaminsky, and D. G. Andersen. Scalable, High Performance Ethernet Forwarding with CuckooSwitch. In CoNEXT, 2013.