

FlueNT10G: A Programmable FPGA-based Network Tester for Multi-10-Gigabit Ethernet

Andreas Oeldemann, Thomas Wild, Andreas Herkersdorf
Chair of Integrated Systems, Technical University of Munich, Germany

E-mail: {andreas.oeldemann,thomas.wild,herkersdorf}@tum.de

Abstract—We present FlueNT10G, an open-source FPGA-based network tester for precise replay of network traces, as well as for accurate packet capture and round-trip latency measurements. FlueNT10G streams replay and capture data between the host system and the FPGA board during active tests. It enables continuous measurements without being constrained by the memory capacity of the FPGA board. FlueNT10G is able to concurrently replay and capture traffic on three 10 Gbit/s network interfaces for all packet sizes. When operated exclusively in replay or capture mode, throughput increases to 4x 10 Gbit/s. Our design yields a temporal resolution of 6.4 ns for precise traffic pattern generation, as well as for accurate arrival timestamping and latency measurements. On the software side, FlueNT10G is complemented by an API enabling the programmable execution of reproducible network measurements. Targeting the automated performance evaluation of different virtualized network function configurations, the API further integrates access to a bidirectional side-band channel for device-under-test reconfiguration and status feedback. FlueNT10G has been implemented on the NetFPGA-SUME platform (Xilinx Virtex-7 XC7VX690T) with an FPGA resource utilization of no more than 25%, which leaves sufficient capacity available for future design extensions.

Keywords-network tester, packet generator, packet capture,latency measurement, FPGA, measurement automation

I. INTRODUCTION

With ever-increasing data rates and the emergence of applications demanding ultra-low latency communication, network processing requirements are constantly growing. When developing network devices, manufacturers must precisely quantify performance characteristics to ensure that the product satisfies customer demands. Likewise, network operators are interested in benchmarking their infrastructure to identify room for performance and efficiency improvements yielding higher revenue. In academia, performance measurements are an essential part of evaluating the feasibility of novel networking concepts.

To satisfy this demand, an entire industry including companies such as Spirent [1] and Ixia [2] has centered their business around providing network performance measurement tools. Unfortunately, these solutions, often based on custom-tailored hardware, exceed what research institutes or small businesses can typically afford. Towards more cost-efficient solutions, several non-commercial soft- and hardware tools for packet generation and capture have been proposed by both industry and academia [3]–[8]. However, their limitations in terms of performance, accuracy and features, which we discuss further in Section II, constrain their practical use.

Fig. 1. FlueNT10G Measurement Setup

Addressing these limitations, we present FlueNT10G, an open-source¹ FPGA-based network tester², which is paired with a comprehensive software framework targeting the automation of reproducible network measurements.

FlueNT10G's hardware architecture, which we describe in Section III, consists of two primary components: 1) a generator for the precise replay of prerecorded or synthetically created Ethernet network traces with arbitrary traffic patterns and 2) a receiver for capturing and accurately timestamping incoming traffic, as well as for per-packet latency recording. FlueNT10G concurrently replays and captures line rate traffic on three 10 Gbit/s network interfaces for all packet sizes. When operated exclusively in replay or capture mode, throughput increases to 4x 10 Gbit/s. A key feature of FlueNT10G is its ability to continuously stream replay and capture data between host machine and FPGA board during active measurements (fluent data movements). Thus, measurement durations are not limited by the memory capacity of the FPGA board.

On the software side, which we detail in Section IV, FlueNT10G is complemented by a library providing an Application Programming Interface (API) for the programmable execution of reproducible network measurements. Targeted especially at benchmarking the performance of virtualized network functions on commodity devices-under-test (DuTs), it provides an optional side-band communication interface to the FlueNT10G Agent. The FlueNT10G Agent is a light-weight software executed on one or more DuTs, through which the network tester can initiate configuration updates (e.g. replace the rule set of an intrusion detection system) and in return obtain DuT status information such as packet queue fill levels and resource loads. This enables the automated evaluation of configuration-specific performance impacts and allows insight into the origin of performance bottlenecks.

¹ https://github.com/aoeldemann/fluent10g (MIT License)
² We use the term network tester to refer to a device or software that is able to precisely generate network traffic, as well as to capture incoming traffic and to accurately record packet arrival times and latencies.

After providing an in-depth evaluation of FlueNT10G's replay precision, timestamping accuracy and performance in Section V, we conclude our paper in Section VI.

To the best of our knowledge, we are the first to contribute an open-source network tester enabling its user to

• precisely replay and capture traffic, as well as to accurately record per-packet latencies at multiple tens of Gbit/s without limitations in replay or capture data size,

• automate reproducible network measurements that include the orchestration of DuT configuration updates, and

• augment measurement results with DuT feedback data.

II. STATE OF THE ART

Due to the high price of commercial network testers, numerous software solutions for packet generation and capture have been proposed in the past (e.g. [7], [8]). Unfortunately, enforcing precise inter-packet transmission times and accurately timestamping packets on arrival using pure software-based tools comes with problems: Modern network interface card (NIC) drivers rely on batch processing to achieve high throughput, thereby distorting inter-packet times. Even without batch processing, asynchronous DMA transfers and PCI Express transmission delays introduce unpredictable jitter [4].

Several publications have proposed hardware support for transmission rate control, timestamping of captured packets and latency measurements. Table I compares the implementations most closely related to FlueNT10G (we limit our comparison to implementations targeting data rates of at least 10 Gbit/s, but refer the reader to [12], [13] for prior work at lower data rates).

Based on the NetFPGA-SUME [14], the Open Source Network Tester (OSNT) [3], [9] relies on an open-source FPGA design to replay network traces and to capture packets at data rates of up to 4x 10 Gbit/s. It supports per-packet latency measurements with a 6.4 ns time resolution. In contrast to FlueNT10G, the entire trace file must be preloaded to the memory of the FPGA board (max. size of 16 GByte [14]) before a test is started. Therefore, replay duration is limited to a few seconds at line rate (however, replaying the same trace multiple times is supported). The authors report that the transfer of a 1 GByte trace file to the FPGA board takes close to half an hour [9], making even short tests time-consuming. With GPS-based time synchronization of multiple OSNT instances, latency measurements where packet generator and receiver are located on different FPGA boards are supported.

OFLOPS-Turbo [11] introduces a unified measurement platform combining OSNT for hardware-supported traffic replay and capture with a software framework for programmable control over OpenFlow switches. In contrast to FlueNT10G, OFLOPS-Turbo is tailored to testing OpenFlow switches and is not generic enough to be applicable to other use cases.

TNT10G [5] and iTester [6] are FPGA-based implementations as well. While TNT10G supports traffic replay and capture at 10 Gbit/s (replay and capture cannot be performed simultaneously), iTester solely focuses on 10 Gbit/s trace replay. Both proposals stream data between host and FPGA board during measurements, but unlike FlueNT10G do not address latency measurements or automation.

In a more generic approach, SoNIC [10] aims at moving the physical and data link layers from hard- to software. Using an FPGA-based implementation, a raw bitstream is exchanged between software and the optical transceivers. One of the proposed use cases is packet generation and capture with sub-nanosecond precision and accuracy. The authors leave open whether SoNIC can be used for latency measurements and network trace replay. While the source code is claimed to be open, it is not available on the project website at the time of writing.

Instead of using an FPGA, MoonGen [4] makes use of commodity Intel 10 GbE NIC features. For packet transmission, NIC queues can be configured to be drained at constant bit rate (CBR). For non-CBR traffic, Ethernet frames with an invalid checksum are inserted in idle gaps and (if supported) dropped by the DuT. For latency measurement, the NIC's Precision Time Protocol (PTP) capabilities are exploited to timestamp custom-crafted PTP packets. However, due to NIC hardware limitations, MoonGen can only timestamp one PTP packet per round-trip time at data rates beyond one Gbit/s, thus constraining the time resolution at which latency variations can be observed. For the same reason, MoonGen does not provide the means to accurately timestamp each arriving packet at multi-gigabit link speeds. Hardware timestamping and rate control is limited to Intel NICs.

TABLE I
COMPARISON TO RELATED WORK

                           FlueNT10G         OSNT [3], [9]      TNT10G [5]     iTester [6]    SoNIC [10]         MoonGen [4]
Hardware Support           FPGA              FPGA               FPGA           FPGA           FPGA               NIC features
Generation Throughput
  - Streaming              3x 10 Gbit/s†     –                  1x 10 Gbit/s   1x 10 Gbit/s   2x 10 Gbit/s       10 Gbit/s / CPU core
  - Preloaded              4x 10 Gbit/s      4x 10 Gbit/s       –              –              –                  –
Capture Throughput         3x 10 Gbit/s†     4x 10 Gbit/s       1x 10 Gbit/s   –              2x 10 Gbit/s       –
Generation Methodology     Trace replay‡     Trace replay‡      Trace replay‡  Trace replay‡  Synthetic traffic  Synthetic traffic‖
Maximum Replay Data Size   not limited       16 GByte           not limited    not limited    not applicable     not applicable
Latency Measurement        Each packet,      Each packet,       –              –              –                  1 packet / RTT,
                           6.4 ns accuracy   6.4 ns accuracy                                                     6.4 ns accuracy
Side-Band Channel for DuT  Generic agent     Tailored for       –              –              –                  –
Configuration & Feedback   software library  OpenFlow in [11]
Multi-Node Timestamp Sync  –                 GPS-based          –              –              –                  –
Open-Source                Yes               Yes                No             No             Yes∗               Yes

† 4x 10 Gbit/s when operated exclusively in replay or capture mode
‡ Prerecorded or synthetically generated traces
‖ Online repository contains source code for trace replay (without performance evaluation)
∗ Not available online on project website

Finally, we point out that companies such as Napatech [15] and Netcope [16] offer FPGA-based products for network performance measurement and analysis. In contrast to these commercial solutions, FlueNT10G is fully open-source, allowing users to customize the design to fit their demands. While initially developed for the widely-used NetFPGA-SUME board, FlueNT10G can be ported to other FPGA boards with modest development effort.

III. HARDWARE ARCHITECTURE

FlueNT10G relies on an FPGA hardware design to replay, capture and timestamp packets at multiple tens of gigabits per second. Figure 2 depicts both transmit (TX) and receive (RX) data paths for a single network interface, as well as the interface to the host system and memories. We first introduce how data is exchanged between host system and FPGA. We then give a detailed description of the TX and RX data paths.

A. Host - FPGA data transfers

Trace and capture data is transferred between the host system running the software and the FPGA board via a PCI Express interface. FlueNT10G has been implemented on the NetFPGA-SUME [14] board, which features an eight-lane Gen3 PCI Express interface providing a nominal duplex data rate of approximately 64 Gbit/s, enough to saturate all four 10 Gbit/s network interfaces. A key feature of FlueNT10G is its support to stream replay and capture data between host system and FPGA board during active measurements. Therefore, FlueNT10G can utilize the full memory and (solid-state drive array) storage capacity of the host to store data whose size exceeds the capacity of the DRAMs of the FPGA board (max. 16 GByte for the NetFPGA-SUME [14]). To absorb throughput variations between host and FPGA and to allow a large DMA transfer size (64 MByte in our implementation), we buffer data in the DRAM of the FPGA board before transmission and after reception from the network interfaces, respectively. For each network interface, a TX and an RX ring buffer are placed in a run-time configurable region in one of the two DRAMs of the FPGA board. The software monitors the fill levels of both buffers and initiates DMA transfers from host to FPGA board and vice-versa when a sufficient amount of data has been drained from the TX buffer by the replay logic or written to the RX buffer by the capture logic. Configuration and status registers of each hardware module, as well as ring buffer read/write pointer positions, are accessible by the software through a PCI Express Base Address Register.
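As an illustration of this fill-level-driven streaming, the following Go sketch models the software side for a single TX ring buffer. All names (ring, feedTxRing, dmaCopy) and the ring size are assumptions made for the example; the pointer bookkeeping stands in for the BAR register reads described above.

package main

import (
    "io"
    "os"
)

const (
    ringSize = 1 << 30  // size of the TX ring buffer region in DRAM (example value)
    dmaChunk = 64 << 20 // 64 MByte DMA transfer size, as in the paper
)

// ring models the read/write pointer bookkeeping of one TX ring buffer.
// In hardware, the replay logic advances rdPtr as it drains data.
type ring struct{ rdPtr, wrPtr uint64 }

func (r *ring) free() uint64 { return ringSize - (r.wrPtr - r.rdPtr) }

// feedTxRing streams trace data into the ring buffer: whenever at least
// one DMA chunk of space has been drained by the replay logic, the next
// chunk is copied. dmaCopy stands in for the host-to-FPGA DMA transfer.
func feedTxRing(r *ring, trace io.Reader, dmaCopy func([]byte)) error {
    buf := make([]byte, dmaChunk)
    for {
        if r.free() < dmaChunk {
            continue // poll the fill level until enough space is free
        }
        n, err := io.ReadFull(trace, buf)
        if n > 0 {
            dmaCopy(buf[:n])
            r.wrPtr += uint64(n)
        }
        if err != nil {
            return err // io.EOF once the trace is fully transferred
        }
    }
}

func main() {
    f, err := os.Open("trace.bin") // hypothetical trace file
    if err != nil {
        panic(err)
    }
    defer f.Close()
    r := &ring{}
    // the stub "drains" the ring immediately, standing in for the hardware
    feedTxRing(r, f, func(b []byte) { r.rdPtr += uint64(len(b)) })
}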

Fig. 2. Hardware data path (TX and RX) for a single network interface, PCI Express interface to host system and ring buffers located in DRAM

All replay and capture data traverse one of the DRAMs of the FPGA board. At duplex 10 Gbit/s line rate, the total memory read and write data rate per network interface adds up to approx. 40 Gbit/s (2x 10 Gbit/s for reading trace and capture data, 2x 10 Gbit/s for writing). We reuse the memory controller configuration provided with the NetFPGA-SUME reference designs, which clocks the memory controllers of the two 64 bit-wide DDR3 DRAMs at 757.57 MHz. The nominal aggregated memory bandwidth is 193.94 Gbit/s. Our design is tailored to perform large burst accesses to incremental memory addresses to utilize the bandwidth as efficiently as possible.
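As a plausibility check, the nominal figure follows directly from the stated parameters (two DDR3 channels, 64 bit each, transferring on both clock edges):

\[
B_{\mathrm{nom}} = 2 \times 64\,\mathrm{bit} \times 757.57\,\mathrm{MHz} \times 2 \approx 193.94\,\mathrm{Gbit/s}
\]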

B. Trace Replay

When trace replay is started (replay on all interfaces can be started in sync), data is continuously read from the assigned TX ring buffer via an AXI4 interface connected to the memory controllers of the DRAMs. To efficiently utilize the interconnect and memory bandwidth, data is transferred in bursts of 16 KByte and stored in a 64 KByte Block RAM FIFO. The Packet Assembly module reads the raw data from the FIFO and reconstructs packets. It utilizes per-packet meta information (8 Byte, prepended to packet data by the software):

• Inter-packet transmission time (32 bit): number of clock cycles until the next packet shall be transmitted
• Wire Length (11 bit): length of the packet on the wire
• Snap Length (11 bit): length of the packet data included in the trace

If the wire length of the packet is larger than the packet data included in the trace, the module appends Wire Length − Snap Length zero bytes to restore the original packet size. This feature is helpful for measurement scenarios in which the payload of the generated packets is of no interest (e.g. benchmarking the performance of an L2 switch or L3 router), because it reduces the amount of data that must be transferred between host and FPGA (yielding higher achievable replay and capture throughput). After the packet has been assembled, it is transferred through the TX pipeline via AXI4-Stream interfaces. The inter-packet transmission time for each packet is passed along via the AXI4-Stream TUSER side-band signal.
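To make the meta format concrete, the sketch below packs the three fields into the 8 byte header. The field widths are taken from the list above; the bit positions and byte order are assumptions made for the example, not the actual encoding used by FlueNT10G.

package main

import (
    "encoding/binary"
    "fmt"
)

// packTxMeta packs the per-packet meta information into the 8 byte header
// prepended to each packet of the trace. Bit layout chosen for the example:
// bits 0..31 inter-packet time, 32..42 wire length, 43..53 snap length.
func packTxMeta(interPacketCycles uint32, wireLen, snapLen uint16) [8]byte {
    word := uint64(interPacketCycles) |
        uint64(wireLen&0x7FF)<<32 |
        uint64(snapLen&0x7FF)<<43
    var meta [8]byte
    binary.LittleEndian.PutUint64(meta[:], word)
    return meta
}

func main() {
    // 64 Byte packet, fully contained in the trace, next packet after
    // 125 clock cycles (125 x 6.4 ns = 800 ns)
    fmt.Printf("%x\n", packTxMeta(125, 64, 64))
}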

The Rate Control module enforces the inter-packet transmission times that are specified in the trace file. When the first data word of a packet is transmitted on the AXI4-Stream master interface, a clock cycle counter is started. As soon as it reaches the value that has been specified in the meta data, transmission of the next packet is started and the process repeats. If the hardware is unable to maintain the timing specified in the trace, an error register is set to notify the software. This happens if data cannot be read from the TX ring buffer fast enough or if the network interface signals a backlog. To achieve high replay precision, the rate control module operates in the same clock domain as the Ethernet Media Access Control (MAC) IP cores and the physical network interfaces, eliminating the need for clock domain crossings and thus timing variations down the pipeline. It is clocked at 156.25 MHz, yielding a replay time precision of 6.4 ns.

Finally, if latency measurements are enabled, the TX Timestamping module inserts the current value of a timestamp counter into the packet data. Based on this timestamp value, the network tester is able to determine the packet latency when it arrives back in the RX path. The bit width and insert position can be configured by software. The timestamp can either be placed at a fixed byte position or in a selectable header field on a per-protocol basis (e.g. IPv4 checksum, IPv6 flow label, if these fields are not inspected/set by the DuT). By default, the timestamp counter is incremented every clock cycle. If the expected latency of the DuT exceeds the value range of the inserted timestamp, the counter can be configured to be incremented at slower rates. However, this comes with a reduction of latency measurement accuracy. Timestamping is implemented right before the packets are passed to the Ethernet MAC to obtain the highest measurement accuracy. Each AXI4-Stream data word received on the slave interface is valid on the master interface in the next clock cycle. Thus, the inter-packet transmission times determined by the rate control module are maintained.

C. Packet Capture

As soon as a packet arrives from the network interface, the RX Timestamping module reads and resets an inter-packet arrival time counter. The counter value specifies the number of clock cycles that have passed since the last packet reception. If latency measurement is enabled, the timestamp inserted on the TX side is extracted from the packet data and the latency is calculated by comparing it to the current value of the timestamp counter. Both packet data and time information are passed on to further modules via an AXI4-Stream interface. The TX and RX timestamping modules on all four interfaces are operated in the same clock domain and thus are synchronous. At a clock frequency of 156.25 MHz, our design yields a temporal resolution of 6.4 ns for both latency and inter-packet arrival time measurements.
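The comparison is a modular subtraction: since the timestamp counter wraps around, the difference is taken modulo the counter's value range. A small sketch of the arithmetic (names and the demo width are illustrative; the actual bit width is software-configurable as described in Section III-B):

package main

import "fmt"

// latencyCycles recovers the packet latency from the timestamp extracted
// on the RX side and the current counter value. The subtraction wraps
// modulo 2^width, so a counter overflow between insertion and extraction
// is harmless as long as the true latency stays below 2^width cycles.
func latencyCycles(rxCounter, txTimestamp uint64, width uint) uint64 {
    mask := uint64(1)<<width - 1
    return (rxCounter - txTimestamp) & mask
}

func main() {
    // counter wrapped between TX (timestamp 2^25 - 10) and RX (value 54)
    cycles := latencyCycles(54, 1<<25-10, 25)
    fmt.Printf("%d cycles = %.1f ns\n", cycles, float64(cycles)*6.4)
}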

Only packets that have been generated by the network tester contain valid timestamps and may be included in the latency evaluation. Although in many isolated measurement setups no other packets arrive at the network interfaces, the user may optionally enable the Packet Filter module to filter arriving packets based on a configurable Ethernet MAC address range.

The Packet Capture module prepends 8 Byte of meta information to the captured packet data and writes both to a 64 KByte Block RAM FIFO. The meta information includes

• Inter-packet arrival time (28 bit): number of clock cycles since the last packet reception
• Latency (25 bit): calculated packet latency
• Wire length (11 bit): length of the packet on the wire

To reduce the amount of data that is transferred between FPGA and host, the software may specify the maximum packet capture length, which is applied to all packets. Before writing packet data to the Block RAM FIFO, data exceeding the configured capture length is cut off. From the Block RAM FIFO, data is moved to the RX ring buffer in DRAM using 16 KByte burst transfers via an AXI4 interconnect.
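The three fields fit exactly into the 8 byte header (28 + 25 + 11 = 64 bit). A decoding sketch follows; as before, the concrete bit positions and byte order are assumptions for illustration only.

package main

import (
    "encoding/binary"
    "fmt"
)

// rxMeta mirrors the capture meta information listed above.
type rxMeta struct {
    arrivalCycles uint32 // clock cycles since the last packet reception
    latencyCycles uint32 // calculated packet latency in clock cycles
    wireLen       uint16 // length of the packet on the wire
}

// decodeRxMeta unpacks the assumed layout: bits 0..27 arrival time,
// bits 28..52 latency, bits 53..63 wire length.
func decodeRxMeta(b [8]byte) rxMeta {
    w := binary.LittleEndian.Uint64(b[:])
    return rxMeta{
        arrivalCycles: uint32(w & 0xFFFFFFF),
        latencyCycles: uint32(w >> 28 & 0x1FFFFFF),
        wireLen:       uint16(w >> 53),
    }
}

func main() {
    var raw [8]byte
    binary.LittleEndian.PutUint64(raw[:], 300|64<<28|1518<<53)
    m := decodeRxMeta(raw)
    fmt.Printf("arrival: %.1f ns, latency: %.1f ns, %d Byte\n",
        float64(m.arrivalCycles)*6.4, float64(m.latencyCycles)*6.4, m.wireLen)
}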

D. IP Cores

Our hardware design utilizes several Xilinx IP cores: DMA/Bridge Subsystem for PCI Express v3.1, 10G Ethernet MAC v15.1, 10G Ethernet PCS/PMA v6.0 and parts of the AXI Interconnect v2.1 suite. Except for the 10G Ethernet MAC, all IP cores come with the Xilinx Vivado Design Suite without additional licensing fees. Our custom IP cores interface with external logic using AXI4, AXI4-Lite and AXI4-Stream interfaces.

IV. SOFTWARE FRAMEWORK

The hardware design is complemented by a software library, which allows programmable control over measurement configuration, execution and evaluation. After introducing FlueNT10G's Application Programming Interface, we present its capabilities targeting the automation of device-under-test configuration and feedback notification, as well as the mechanisms for data exchange between host system and FPGA.

A. Application Programming Interface

Instead of providing a graphical user interface, FlueNT10G's software library offers an Application Programming Interface (API) allowing its user to program measurement applications. This has two key advantages: automation and reproducibility. Measurement applications can utilize all constructs of the underlying programming language (e.g. loops, branches), thus allowing the iterative execution of measurements based on results obtained in previous runs. Measurement applications can be stored and re-run at a later point in time to reproduce the network test. Besides functionality for reading pcap network trace files (including support for nanosecond resolution extensions), we provide several functions for network trace generation (constant bit rate and random traffic patterns), as well as for measurement evaluation (e.g. mean latency calculation, creation of latency histograms, throughput graphs). Similar to reading the replay traces, captured data can be exported to standard trace file formats for further evaluation in tools such as Wireshark.


package main

import ("fluent10g")

func main() {
    // attach to hardware; get generator on interface 0,
    // receiver on interface 1
    nt := fluent10g.NetworkTesterCreate()
    gen := nt.GetGenerator(0)
    recv := nt.GetReceiver(1)

    recv.EnableCapture() // enable capturing

    datarate := 1e9 // 1 Gbit/s

    for {
        // 30s synthetic CBR traffic trace, 64 Byte packets
        trace := fluent10g.CreateTraceCBR(datarate, 64, 30)

        // assign trace to generator
        gen.SetTrace(trace)

        nt.StartCapture() // start capturing (non-blocking)
        nt.StartReplay()  // start replay (blocking)
        nt.StopCapture()  // stop capturing

        // did all transmitted packets arrive back?
        if recv.GetNPkts() < trace.GetNPkts() {
            break // capacity reached, done!
        } else {
            datarate += 1e9 // increment data rate
        }
    }
}

Listing 1. Example measurement application, which iteratively determines the maximum sustainable throughput of a device-under-test.


Our software library has been developed in Go, a compiled, statically typed programming language created by Google employees in 2009. We picked Go because its large standard library enables rapid development and its performance is superior to that of other widely used languages such as Python. Listing 1 shows an example application, which benchmarks the maximum throughput of a DuT. The application generates a CBR traffic trace (starting at a data rate of 1 Gbit/s) and initiates replay on interface 0. As long as all packets arrive back on interface 1, the data rate is incremented by 1 Gbit/s and the measurement is repeated. When the capacity limit of the DuT is reached and packets are lost, the application quits.

B. FlueNT10G Agent

In many cases, the performance characteristics of a device-under-test depend on its configuration. For example, the maximum throughput of an intrusion detection system is impacted by the size and complexity of its rule set, and the performance of a software switch typically decreases with a growing number of installed forwarding table entries [17]. To facilitate the automated characterization of these configuration-dependent impacts, FlueNT10G provides an optional side-band channel to the DuT, through which a measurement application can initiate reconfigurations and obtain feedback information by issuing API function calls. An intermediary software running on one or more DuT(s), called the FlueNT10G Agent, triggers the configuration operations and transmits information back to the measurement application. The side-band channel is implemented using ZeroMQ [18], a widely-used asynchronous messaging library with bindings for popular programming languages such as C, Python and Go. Since execution of the agent requires the possibility to execute arbitrary software on the DuT, it primarily targets benchmarks of virtualized network functions running on commodity computers.

Fig. 3. Interaction of the FlueNT10G Agent with a virtualized network function

Figure 3 depicts the interaction of the FlueNT10G Agent and a virtualized network function. When a measurement application triggers a DuT reconfiguration by issuing an API call, FlueNT10G's software library creates a control message and sends it to the agent. After receiving the message from the ZeroMQ library, the Config Event Dispatch module classifies its type and calls one (of possibly multiple) configuration callback functions. The callback functions are user-defined and may contain arbitrary code performing the actual reconfiguration of the virtualized network function (e.g. load an intrusion detection rule set, adjust the number of assigned CPU cores, change the packet batch processing size). In the reverse direction, feedback information issued by a user-defined monitor function can be passed to the Monitor Notification module, which transmits data back to the measurement application (e.g. packet queue lengths, CPU utilization, cache miss rates). We explicitly decided to leave the implementation of the configuration and monitor functions up to the user, because it allows the FlueNT10G Agent to provide a generic interface for interaction between the measurement application and a diverse set of (virtualized) DuTs. Together with several usage examples, we provide agent implementations both in C and Python. While the former can be directly compiled into C-based network functions, the latter is ideal for performing simple reconfiguration tasks such as issuing a command-line instruction (e.g. turn CPU cores on/off). ZeroMQ performs blocking reads on a TCP socket to receive data. If no data is available and an interrupt-based networking driver is used for the interface receiving control messages, the agent is not scheduled on the CPU and thus does not impact the performance of the executed network function.
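While the provided agent implementations are written in C and Python, the dispatch pattern can be sketched in Go (the language of Listing 1) using ZeroMQ's Go binding (github.com/pebbe/zmq4). The message format, endpoint and callback names below are assumptions for the example, not the agent's actual protocol.

package main

import (
    "strings"

    zmq "github.com/pebbe/zmq4"
)

// configCallbacks maps message types to user-defined reconfiguration
// functions, mirroring the Config Event Dispatch module. Each callback
// returns a status string sent back to the measurement application.
var configCallbacks = map[string]func(arg string) string{
    "ruleset": func(arg string) string {
        // ... load the intrusion detection rule set named by arg ...
        return "ok"
    },
}

func main() {
    sock, err := zmq.NewSocket(zmq.REP)
    if err != nil {
        panic(err)
    }
    defer sock.Close()
    sock.Bind("tcp://*:5555") // side-band channel endpoint (example port)

    for {
        // blocking read: while no control message arrives, the agent is
        // not scheduled and does not disturb the network function
        msg, err := sock.Recv(0)
        if err != nil {
            return
        }
        // assumed message format: "<type>:<argument>"
        parts := strings.SplitN(msg, ":", 2)
        typ, arg := parts[0], ""
        if len(parts) == 2 {
            arg = parts[1]
        }
        if cb, ok := configCallbacks[typ]; ok {
            sock.Send(cb(arg), 0)
        } else {
            sock.Send("error: unknown config event", 0)
        }
    }
}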


C. Host - FPGA data transfers

FlueNT10G uses the Xilinx DMA/Bridge Subsystem for PCI Express to move data between host and FPGA. When replay or capture is started, the software polls the fill levels of the ring buffers through PCI Express Base Address Register reads. As soon as data can be transferred, a DMA copy is initiated and the respective ring buffer pointer is updated. The Linux driver provided by Xilinx offers character devices (one for reads, one for writes), which can be accessed from user space. On our Dell PowerEdge T630 workstation with an Intel Xeon E5-2620 v3 CPU we measured the user-space read and write data rates to the DRAMs of the FPGA board via a single DMA channel at 52 Gbit/s and 47 Gbit/s, respectively (64 MByte transfer size, poll-mode driver, Linux kernel 4.4.0).
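A minimal sketch of such a user-space transfer follows. The device node name follows the XDMA driver's usual h2c/c2h (host-to-card/card-to-host) naming scheme; the name, the target address and the use of Seek to select the card address are assumptions based on that driver, not details taken from the paper.

package main

import (
    "fmt"
    "os"
)

const dmaChunk = 64 << 20 // 64 MByte transfer size, as in the paper

func main() {
    // host-to-card character device created by the Xilinx XDMA driver
    h2c, err := os.OpenFile("/dev/xdma0_h2c_0", os.O_WRONLY, 0)
    if err != nil {
        panic(err)
    }
    defer h2c.Close()

    // Seek selects the destination address in the FPGA board's DRAM
    // (here: the current write position of a TX ring buffer)
    const dramAddr = 0x0
    if _, err := h2c.Seek(dramAddr, 0); err != nil {
        panic(err)
    }

    chunk := make([]byte, dmaChunk) // trace data would be read into this
    n, err := h2c.Write(chunk)      // triggers the DMA copy to the FPGA
    fmt.Println("transferred", n, "bytes, err:", err)
}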

V. HARDWARE EVALUATION

In this section, we evaluate the FPGA implementation of FlueNT10G. We first examine packet replay precision and timestamping accuracy, because they are the key motivation for our hardware design. Afterwards, we detail the achievable throughput and summarize the FPGA resource utilization.

A. Packet Replay Precision

In contrast to pure software-based solutions, the packet generation logic of FlueNT10G is closely coupled with the physical network interfaces. At a clock period of 6.4 ns, it allows fine-grained control over the precise moment when packets are handed to the MAC, from where they are transmitted on the link. While checking whether the generated packet data matches the one specified in the trace file is simple using software capture tools such as DPDKCap [7], verifying the correctness of inter-packet transmission times is much more challenging. Unfortunately, our budget prohibits access to commercial capture cards, which are able to timestamp each packet transmitted by our hardware implementation.

Following a similar approach as [4], we instead exploit the Precision Time Protocol (PTP) capabilities of the Intel X710 10 GbE network interface card [19] to timestamp the packets generated by FlueNT10G. Whenever the NIC receives a PTP packet, it writes the current time into one of four RX timestamp registers (the timestamping logic of the NIC is clocked at 312.5 MHz, providing a resolution of 3.2 ns). Each register remains in a locked state (i.e. its value is not overwritten) until it is read by software. When all four registers are occupied, no new timestamps are taken. Exploiting this mechanism, we are able to quantify the inter-packet arrival times between four consecutive PTP packets. Since the time that passes from packet timestamping until the registers are read by the software exceeds the inter-packet arrival time at high data rates (packet batching, PCI Express transmission delays), PTP packet bursts must be sufficiently spaced to obtain meaningful timestamp values (see Figure 4).

We developed an evaluation software, which reads and saves the RX timestamp register values for the PTP packets received from FlueNT10G. We generate and replay a 60 s long network trace with uniformly distributed packet sizes between 64 and 1518 Byte, as well as exponentially distributed inter-packet gaps. The mean value of the exponential distribution is set such that a mean data rate of 8 Gbit/s is maintained (as opposed to the maximum rate of 10 Gbit/s, which would result in back-to-back packet transmissions without data rate variations). The randomly generated inter-packet transmission times are floating point numbers, which are rounded up or down to the next multiple of clock cycles by our software library. To prevent FlueNT10G from constantly sending too fast (always rounding down) or too slow (always rounding up), the software tracks and minimizes the accumulated rounding error by selectively rounding up or down on a per-packet basis.

Fig. 4. Inter-packet transmission time verification using the Precision Time Protocol (PTP) timestamping mechanism of the Intel X710 NIC

Fig. 5. Precision of inter-packet transmission times (histogram of the probability of absolute deviations from the expected inter-packet transmission time, in 3.2 ns bins from −12.8 to 12.8 ns)
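The rounding scheme can be sketched as follows (function and variable names are illustrative): each gap is rounded down by default and rounded up whenever the replay would otherwise fall behind the ideal schedule, keeping the accumulated error below one clock period.

package main

import (
    "fmt"
    "math"
)

// cyclesForGaps converts the randomly drawn inter-packet gaps (seconds)
// into integer cycle counts of the 156.25 MHz replay clock. Each gap is
// rounded down by default and rounded up whenever the accumulated error
// indicates that the replay is running behind the ideal schedule, so the
// error never grows beyond one clock period.
func cyclesForGaps(gaps []float64) []uint32 {
    const cycle = 6.4e-9 // seconds per clock cycle
    out := make([]uint32, len(gaps))
    errAcc := 0.0 // accumulated (actual - ideal) transmission time
    for i, g := range gaps {
        n := math.Floor(g / cycle)
        if errAcc+n*cycle-g < 0 {
            n++ // running behind: round this gap up instead
        }
        errAcc += n*cycle - g
        out[i] = uint32(n)
    }
    return out
}

func main() {
    // three gaps around 100 ns; note how the rounding direction alternates
    fmt.Println(cyclesForGaps([]float64{100e-9, 97e-9, 103e-9})) // [16 15 16]
}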

Without altering the inter-packet transmission times or packet sizes, we replace four subsequent trace packets with PTP packets every 75 µs and keep track of their expected inter-packet transmission times. We found that inserting PTP bursts more frequently does not give the evaluation software enough time to read and thus unlock the timestamping registers before the next burst of PTP packets arrives. Out of the 74 million generated packets, approx. 3.1 million packets are PTP packets (∼ 4.2%). Since each burst of four PTP packets allows the calculation of three inter-packet times, we obtain more than 2.3 million values over the replay period of 60 s.

We quantify the replay precision of our hardware implementation by subtracting the expected inter-packet transmission times (before rounding to clock cycle multiples) from the inter-packet arrival times recorded by our evaluation software. Figure 5 depicts the distribution of absolute inter-packet transmission time errors at a resolution of 3.2 ns (matching the clock period of the timestamping logic of the NIC). We found that 99.63% of the measured inter-packet arrival times do not deviate from the expected inter-packet transmission times by more than 6.4 ns. Since our replay logic is clocked every 6.4 ns, this result matches our expectation. Due to the rounding mechanism introduced above, the recorded error values appear uniformly distributed between −6.4 and 6.4 ns. We found that 0.37% of recorded inter-packet arrival times deviate from our expected values by up to 16 ns. While we do not have a definitive explanation for this phenomenon at the time of writing, we suspect that the outliers are caused by a rare delay added by either the transceivers of the FPGA board or by the Intel NIC. We note that the TX path of the Xilinx 10G Ethernet MAC is specified with a fixed latency [20].

Fig. 6. Loop-back latency measured for different traffic patterns (latency distributions for CBR 10 Gbit/s traffic with 64 Byte and 1518 Byte packets, and for random traffic with an 8 Gbit/s mean data rate; recorded loop-back latency in ns vs. probability)

B. Timestamping Accuracy

We verify the timestamp accuracy of FlueNT10G by performing loop-back latency measurements. One network interface replays traffic and inserts timestamps, another one captures the traffic and calculates packet latencies. Both network interfaces are directly connected with 10GBASE-SR SFP+ transceivers and multimode OM3 fibers. We perform measurements for constant bit rate (CBR) traffic, as well as for the random traffic pattern we described in Section V-A. Figure 6 shows the obtained latency distributions for 10 Gbit/s CBR traffic with fixed packet sizes of 64 and 1518 Byte, as well as for random traffic with a mean data rate of 8 Gbit/s. The obtained values are multiples of the 6.4 ns clock period of our design. The measurements show that the minimum and maximum recorded latencies differ by up to 25.6 ns. Looking into the origin of the variations, we compared our results with the ones obtained in measurements performed using the Open Source Network Tester (see Section II) and observed the same variations. We concluded that the jitter is not introduced by our design, but instead is caused by the Ethernet subsystem of the FPGA board. We suspect that the origin is the RX Elastic Buffer in the Xilinx 10G Ethernet PCS/PMA IP core [20], which according to its specification can introduce latency variations of up to 86.4 ns (although we did not see values larger than 32 ns in our measurements). While Xilinx UltraScale FPGAs allow the RX buffer to be removed, our Virtex-7 series board does not support this feature. We plan to port our design to an UltraScale board in the future to further investigate the accuracy of our latency measurements. To correct the absolute latency values reported by FlueNT10G, we calibrated our software library to subtract the median round-trip latency obtained in our loop-back measurements (409.6 ns) from all latencies recorded by our hardware implementation.

Fig. 7. Maximum replay and capture throughput (maximum throughput in Gbit/s vs. packet size in Byte; 4 ports: replay only (simplex throughput) and replay & capture (duplex throughput))

Fig. 8. Comparison of required and measured memory bandwidth for concurrent replay and capture at duplex 4x 10 Gbit/s line rate (memory bandwidth in Gbit/s vs. packet size in Byte)

C. Performance

To benchmark the maximum replay and capture throughput, we configure FlueNT10G to replay CBR traffic on all four network interfaces. Ports are looped pair-wise, such that every interface receives the traffic generated by its peer. We perform measurement runs for packet sizes ranging between 64 and 1518 Byte. A run is considered successful if the inter-packet transmission times specified in the trace are maintained and the capture data is transferred to the host without packet loss. Using the bisection method, we adapt the data rate of the generated traces to approach the maximum achievable throughput with an accuracy of 10 Mbit/s. For replay-only measurements, we place two TX ring buffers in DRAM 0 and two TX ring buffers in DRAM 1. For concurrent replay and capture, we place all four TX ring buffers in DRAM 0 and all four RX ring buffers in DRAM 1. The capacity of 4 GByte per DRAM is evenly distributed among the ring buffers.
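The sketch below illustrates this bisection loop for one port pair, reusing the API from Listing 1. The success check is simplified to packet loss, whereas the actual runs additionally evaluate the hardware's timing error register (Section III-B).

package main

import ("fluent10g")

func main() {
    nt := fluent10g.NetworkTesterCreate()
    gen := nt.GetGenerator(0)
    recv := nt.GetReceiver(1)
    recv.EnableCapture()

    lo, hi := 0.0, 10e9 // search interval in bit/s
    for hi-lo > 10e6 {  // stop at the 10 Mbit/s accuracy used in the paper
        rate := (lo + hi) / 2
        trace := fluent10g.CreateTraceCBR(rate, 64, 30) // 30 s CBR, 64 Byte
        gen.SetTrace(trace)

        nt.StartCapture()
        nt.StartReplay()
        nt.StopCapture()

        if recv.GetNPkts() < trace.GetNPkts() {
            hi = rate // packets lost: rate not sustainable
        } else {
            lo = rate // no loss: try a higher rate
        }
    }
    // lo now approximates the maximum sustainable throughput
}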

Figure 7 shows the maximum achievable throughput for replay, as well as for concurrent replay and capture on all four 10 Gbit/s interfaces. While FlueNT10G is able to maintain line rate for all packet sizes when exclusively replaying, throughput slightly drops below duplex 4x 10 Gbit/s with increasing packet sizes when data is captured as well. To explain this behavior, Figure 8 compares the memory bandwidth, which is required for concurrent replay and capture at 4x 10 Gbit/s, with the values we obtained in our measurements. With increasing packet sizes, less meta information needs to be transferred between host and FPGA board (8 Byte per packet, for both replay and capture). However, the links become more efficiently utilized: the number of Ethernet preambles, start of frame delimiters and inter-frame gaps decreases (20 Byte per packet, inserted by the MAC), requiring more actual trace and capture data to be exchanged between host and FPGA board via the DRAMs. The memory bandwidth saturates at approx. 148 Gbit/s (76% of the nominal bandwidth, see Section III-A), thus causing the data rate to drop. Replay-only operation is not memory-bound, because the TX ring buffers are distributed to both DRAMs.

TABLE II
FPGA RESOURCE UTILIZATION ON XILINX VIRTEX-7 XC7VX690T

                 Used         Utilization
Slice LUTs       102,636      23.69%
Slice Registers  103,931      12.00%
Block RAM        10,332 KBit  19.52%

D. FPGA Utilization

Table II summarizes the FPGA resource utilization for the Xilinx Virtex-7 XC7VX690T FPGA. It comprises the full FlueNT10G implementation for replay and capture on all four network interfaces. With no resource type exceeding the 25% mark, there is sufficient headroom for future design extensions.

E. Reproducible Research

To allow the independent reproduction and verification of the presented results, all applications, evaluation scripts and data sets are publicly available with our source code.

VI. CONCLUSION

We presented FlueNT10G, an open-source network tester for replaying network traces, capturing traffic and performing per-packet latency measurements. Empowered by its FPGA implementation, our design is able to concurrently replay and capture network traffic on three 10 Gbit/s interfaces at line rate for all packet sizes. Data is streamed between host system and FPGA board, thus allowing long measurement durations. FlueNT10G is complemented by a software library, which enables programmable control over measurements, as well as over device-under-test configuration and status retrieval. We believe that our combination of hard- and software makes FlueNT10G a powerful tool for many test scenarios in which precise packet generation, high per-packet timestamping accuracy and an automated measurement process are required.

FlueNT10G's performance is memory-bound. While we provide mechanisms to reduce the memory bandwidth demand (e.g. reducing the replay data size by replacing packet payloads with zero bytes, cutting capture data on arrival), we plan to investigate further room for efficiency improvements. Concurrent replay and capture on all four interfaces at line rate is possible if traces are preloaded before a measurement.

Even though we carefully designed and verified our design to meet a timestamping accuracy of 6.4 ns, we found that the 10G Ethernet PCS/PMA IP core introduces small latency variations, which are beyond our control. We plan to port our design to an UltraScale board and to re-evaluate the accuracy. Since all our custom IP cores use standard AXI interfaces, we expect the process of porting the design to be effortless.

We finally point out that FlueNT10G does not aim to compete with commercial network testers. Features such as the generation of stateful TCP or higher-layer traffic, which depends on responses received from the device-under-test, are out of the scope of what FlueNT10G can offer. However, FlueNT10G itself is free (both in terms of cost and liberty). While it requires an FPGA and a license for the 10G Ethernet MAC IP core, we believe that the overall price makes it particularly interesting for researchers working on a tight budget.

ACKNOWLEDGMENTS

The authors would like to thank Spyridon Poursalidis and Stefan Keller for many helpful discussions, as well as the anonymous reviewers for their valuable feedback.

REFERENCES

[1] "Spirent Communications," Accessed 2018-Jul-09. [Online]. Available: https://www.spirent.com
[2] "Ixia," Accessed 2018-Jul-09. [Online]. Available: https://ixiacom.com
[3] G. Antichi, M. Shahbaz, Y. Geng, N. Zilberman, A. Covington, M. Bruyere, N. McKeown, N. Feamster, B. Felderman, M. Blott, A. W. Moore, and P. Owezarski, "OSNT: Open source network tester," IEEE Network, vol. 28, no. 5, pp. 6–12, 2014.
[4] P. Emmerich, S. Gallenmuller, D. Raumer, F. Wohlfart, and G. Carle, "MoonGen: A Scriptable High-Speed Packet Generator," in Proc. of the 2015 Internet Measurement Conf., Tokyo, Japan, pp. 275–287.
[5] J. F. Zazo, M. Forconesi, S. Lopez-Buedo, G. Sutter, and J. Aracil, "TNT10G: A high-accuracy 10 GbE traffic player and recorder for multi-Terabyte traces," in Proc. of the 2014 Int. Conf. on ReConFigurable Computing and FPGAs, Cancun, Mexico, pp. 1–6.
[6] F. Zhang, Y. Xie, J. Liu, L. Luo, Q. Ning, and X. Wu, "ITester: A FPGA based high performance traffic replay tool," in Proc. of the 22nd Int. Conf. on Field Programmable Logic and Applications, Oslo, Norway, 2012, pp. 699–702.
[7] "DPDKCap," Accessed 2018-Jul-09. [Online]. Available: https://github.com/dpdkcap/dpdkcap
[8] K. Wiles, "Pktgen-DPDK," Accessed 2018-Jul-09. [Online]. Available: http://www.dpdk.org/browse/apps/pktgen-dpdk
[9] NetFPGA OSNT project, "OSNT SUME extmem project," Accessed 2018-Jul-09. [Online]. Available: https://github.com/NetFPGA/OSNT-Public/wiki/OSNT-SUME-extmem-project
[10] K.-S. Lee, H. Wang, and H. Weatherspoon, "SoNIC: Precise Realtime Software Access and Control of Wired Networks," in Proc. of the 10th USENIX Conf. on Networked Systems Design and Implementation, Lombard, IL, USA, 2013, pp. 213–225.
[11] C. Rotsos, G. Antichi, M. Bruyere, P. Owezarski, and A. W. Moore, "OFLOPS-Turbo: Testing the next-generation OpenFlow switch," in Proc. of the 2015 IEEE Int. Conf. on Communications, London, UK, pp. 5571–5576.
[12] G. A. Covington, G. Gibb, J. W. Lockwood, and N. McKeown, "A packet generator on the NetFPGA platform," in Proc. of the 2009 17th IEEE Symp. on Field Programmable Custom Computing Machines, Napa, CA, USA, pp. 235–238.
[13] A. Tockhorn, P. Danielis, and D. Timmermann, "A Configurable FPGA-Based Traffic Generator for High-Performance Tests of Packet Processing Systems," in Proc. of the 6th Int. Conf. on Internet Monitoring and Protection, St. Maarten, The Netherlands, 2011, pp. 14–19.
[14] N. Zilberman, Y. Audzevich, G. A. Covington, and A. W. Moore, "NetFPGA SUME: Toward 100 Gbps as research commodity," IEEE Micro, vol. 34, no. 5, pp. 32–41, 2014.
[15] "Napatech," Accessed 2018-Jul-09. [Online]. Available: https://www.napatech.com
[16] "Netcope Technologies," Accessed 2018-Jul-09. [Online]. Available: https://www.netcope.com
[17] P. Emmerich, D. Raumer, F. Wohlfart, and G. Carle, "Performance characteristics of virtual switching," in Proc. of the 2014 IEEE 3rd Int. Conf. on Cloud Networking, Luxembourg, Luxembourg, pp. 120–125.
[18] iMatix Corporation, "ZeroMQ," Accessed 2018-Jul-09. [Online]. Available: http://zeromq.org
[19] Intel Ethernet Controller X710/XXV710/XL710 Datasheet, Intel Corporation, February 2018, revision 3.5.
[20] 10G Ethernet PCS/PMA v6.0, PG068, Xilinx, Inc., October 2016.