


  • The official assignment of the thesis will appear on this page.

    The assignment is signed by the dean and the head of the department; it must be collected from the study office of the Department of Computer Science at Karlovo náměstí. One submitted copy of the thesis will contain the original of this assignment (the original remains at the department after the defense),

    while the other copy will contain an uncertified copy of this document in the same place (this one is returned after the defense).


  • Czech Technical University in Prague

    Faculty of Electrical Engineering

    Department of Computer Science and Engineering

    Master's Thesis

    Acceleration of 10GbE Network Traffic

    Bc. Michael Rohrbacher

    Supervisor: Ing. Jan Kubr

    Study Programme: Electrical Engineering and Information Technology

    Field of Study: Computer Science and Engineering

    December 31, 2012


  • Acknowledgements

    I would like to thank my family and my friends for their support during the time I was writing this thesis. I would also like to thank my supervisor Ing. Jan Kubr for the opportunity to work on this very interesting topic, and Moris Bangoura and Radim Roška for answering the questions I had about their theses.


    Declaration

    I hereby declare that I have completed this thesis independently and that I have listed all the literature and publications used.

    I have no objection to the usage of this work in compliance with §60 of Act No. 121/2000 Coll. (the Copyright Act) and with the rights connected with the Copyright Act, including amendments to the act.

    In Klatovy on Dec 31, 2012 . . . . . . . . . . . . . . . . . . . . . . . . .


  • Abstract

    This thesis explores the capabilities of the WANic 56512, a network card with a packet processor. It documents all the steps required to get this card working in our laboratory. Part of the thesis discusses whether the WANic 56512 is a suitable candidate for packet generating, receiving and switching. Various benchmarks were performed to obtain trustworthy results to support the final conclusion.

    The output of this thesis is an installation guide which speeds up deployment time for future use. A set of tests that can be used to measure the performance of the card is also presented. At the end of the thesis there is a comparison between the WANic 56512 and the prototypes made by Radim Roška and Moris Bangoura.

  • x

  • Contents

    1 Introduction 1
    1.1 Motivation and goal 1
    1.2 Structure of the thesis 2

    2 Related works 3
    2.1 Performance evaluation of GNU/Linux network bridge 3
    2.2 10GbE Routing on PC with GNU/Linux 4

    3 Theoretical description of WANic 56512 7
    3.1 General overview 7
    3.1.1 The MIPS Architecture 8
    3.1.2 Comparison between the MIPS and x86 architecture 10
    3.2 Hardware Acceleration Units 11
    3.3 Packet Flow 11
    3.4 Simple Executive 13

    4 Installation of WANic 56512 15
    4.1 Description/Specification of the host system 15
    4.2 Installation procedures 16
    4.2.1 Installation procedure without buying the SDK 16
    4.2.1.1 Diagnostic mode's kernel 16
    4.2.1.2 cnusers SDK 17
    4.2.2 Installation procedure with buying the SDK 18
    4.2.2.1 PCI console 20
    4.2.2.2 NIC mode 22
    4.2.2.3 Ethernet PCI mode 23

    5 Benchmarks for WANic 56512 25
    5.1 RFC 2544 25
    5.1.1 Throughput 25
    5.1.2 Frame loss rate 25
    5.2 Benchmarks for Linux environment 26
    5.2.1 iperf 26
    5.2.2 netperf 27
    5.2.3 curl-loader 27
    5.2.4 pktgen 28
    5.2.5 bridge-utils 28
    5.3 Benchmarks for the Simple Executive environment 29
    5.3.1 traffic-gen 29
    5.3.2 CrossThru 30

    6 Benchmarking 35
    6.1 iperf 35
    6.2 netperf 36
    6.3 curl-loader 36
    6.4 pktgen 36
    6.5 bridge-utils 37
    6.6 traffic-gen 37
    6.7 CrossThru 38

    7 Analysis of the benchmarks 39
    7.1 iperf 39
    7.2 netperf 39
    7.3 pktgen 40
    7.4 bridge-utils 40
    7.5 traffic-gen 40
    7.6 CrossThru 40
    7.7 Graphs 41

    8 Conclusion 47
    8.1 Future work 48

    A Scripts 51
    A.1 iperf 51
    A.2 netperf 51
    A.3 pktgen 52
    A.4 traffic-gen 54

    B CD content 57

  • List of Figures

    3.1 Block diagram of WANic 56512. Source: [8] 8
    3.2 Block diagram of OCTEON CN5650. Source: [3] 9
    3.3 MIPS architecture, pipelined. Source: [17] 10
    3.4 Packet input. Source: [11] 12
    3.5 SSO and core processing. Source: [11] 13
    3.6 Packet output. Source: [11] 14

    4.1 Topology of the lab. 16
    4.2 The needed DIP switch. Source: [5] 17

    5.1 CrossThru Flowchart. Source: [2] 31

    7.1 TX and RX test - TCP - iperf 41
    7.2 TX test - UDP - iperf 42
    7.3 TCP_STREAM and TCP_SENDFILE test - netperf 42
    7.4 UDP_STREAM test - netperf 43
    7.5 TX test - pktgen 43
    7.6 RFC 2544 - throughput test - brctl 44
    7.7 RFC 2544 - frame loss test - brctl 44
    7.8 RFC 2544 - throughput test - traffic-gen 45
    7.9 RFC 2544 - throughput test - CrossThru FF+L2 45
    7.10 RFC 2544 - frame loss test - CrossThru FF+L2 basic 46
    7.11 RFC 2544 - frame loss test - CrossThru FF+L2 optimized 46

  • Chapter 1

    Introduction

    1.1 Motivation and goal

    Network accelerators are an overlooked area in current research. I was able to find only one scientific article that focuses on the use of the Cavium OCTEON processor, and that article was about IPsec [12]. Therefore, I decided to write my thesis about network accelerators in 10 GbE networks in general.

    These accelerators could offer an interesting trade-off in comparison with hardware switches. For a higher price, customers get a multifunctional device that can be used not only for switching but also for packet generating, routing, firewalling, protocol analysis and many other tasks. On the other hand, the question is whether we need these pricey accelerators for such tasks, or whether we can achieve reasonable results with a much cheaper, common PC.

    Since this area is poorly documented, new challenges arise. We do not know whether we can use the same software we use in our PCs and switches, whether our network infrastructure will need to be upgraded, or whether these accelerators are really worth buying. My thesis should help to answer such questions.

    The goal of this thesis is to understand how network accelerators work and how they can improve performance in 10 GbE networks. This part focuses on the WANic 56512 network card with the Cavium OCTEON packet processor. Another part of the goal is to document all the steps that allow the user to install the WANic 56512. The documentation provided by the manufacturer, GE Intelligent Platforms, was insufficient and contained many mistakes. Therefore, I decided to write my own installation guide that corrects all the mistakes, adds additional steps and procedures, and puts all the information in one place.

    Another important task is to research open source benchmarks that can test the attributes of 10 GbE networks and to suggest which benchmarks could be used with the WANic 56512. My focus is on packet generating, receiving and switching.

    The last goal is a comparison of my approach with the approaches proposed by Radim Roška and Moris Bangoura in their theses [16], [9]. In other words, whether a network accelerator brings any advantages over a regular PC with an optimized kernel and network drivers, and over a regular PC that uses graphics cards.


    1.2 Structure of the thesis

    This thesis has eight chapters:

    Chapter 2 - summarizes the related works by Radim Roška and Moris Bangoura. It also highlights the differences between their approach to packet switching and mine.

    Chapter 3 - describes the theoretical background of the WANic 56512 card, its architecture, related acceleration hardware, packet flow, etc.

    Chapter 4 - explains the installation process of the WANic 56512 card and provides the steps necessary for full operating capability.

    Chapter 5 - discusses the possibilities for benchmarking the WANic 56512 card.

    Chapter 6 - shows the performed tests and the configuration used.

    Chapter 7 - analyzes the results of the tests performed by Radim Roška, Moris Bangoura and Michael Rohrbacher.

    Chapter 8 - summarizes the thesis and all the results. Possible future work is also suggested.

  • Chapter 2

    Related works

    In this chapter I will briefly summarize two master's theses written by my colleagues Radim Roška and Moris Bangoura. They worked on a similar topic but with different hardware and a different approach.

    2.1 Performance evaluation of GNU/Linux network bridge

    This master's thesis was written by Radim Roška in 2011 and deals with the problem of creating a network bridge in 10 GbE networks using the Linux operating system. The output of the thesis is a comparison between the device designed by the author and a hardware switch.

    The author first covers the theory of benchmarking network devices, defines basic terms and describes the necessary steps for different benchmarks. He also identifies the different types of delays we have to take into account in 10 GbE networks.

    The main part of the thesis is about finding the appropriate hardware (CPU, motherboard, chipset, bus, NIC, etc.) for switching traffic in 10 GbE networks, as well as finding the right tools for switching and packet generating. brctl is used for switching and pktgen for packet generating, and the final hardware configuration is as follows:

    2x motherboard: Supermicro X8DAH+-F,

    CPUs: 2x quad-core Xeon 5606, 2.13 GHz, and 2x quad-core Xeon 5620, 2.4 GHz with HT technology,

    2x 3x2GB DDR3-1066 RAM for each computer,

    10GbE NIC: dual port 10GbE Intel controller 82599,

    3x 1GbE NIC with 6 ports and Intel controller 82576.


    This configuration was tested against the following hardware switches:

    H3C S5800 network switch with 24x 1GbE and 4x 10GbE ports,

    Juniper EX3200 switch with 24x 1GbE and 2x 10GbE ports.

    The comparison covers several operating systems, namely GNU/Linux Debian, GNU/Linux Bifrost, and FreeBSD. First of all, the author runs several benchmarks without any optimization on one, two, and four output devices. Further on, the author proposes some optimizations, such as:

    turning off flow control,

    increasing the ring buffer,

    assigning interrupts and setting up SMP affinity,

    setting up the receive/transmit queues.
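As a hedged illustration, these optimizations are typically applied on Linux roughly as follows (the interface name eth0 and the IRQ number 42 are placeholders; real IRQ numbers come from /proc/interrupts, and the ring sizes depend on the NIC):

```shell
# Disable Ethernet flow control (pause frames) on the NIC:
ethtool -A eth0 rx off tx off
# Enlarge the RX/TX ring buffers (4096 is a typical hardware maximum):
ethtool -G eth0 rx 4096 tx 4096
# Pin the NIC's interrupt to CPU 1 (bitmask 2); 42 is a placeholder IRQ:
echo 2 > /proc/irq/42/smp_affinity
```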

    The performance of the proposed solution is rather poor. The system is capable of generating 64B packets at almost wire speed, and at wire speed for packets larger than 128B. But the throughput for 64B packets is only about 4 Mpps, whereas the line rate is about 14.8 Mpps.
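The 14.8 Mpps line rate quoted above follows from standard Ethernet framing: on the wire each frame carries 20 B of fixed overhead (8 B preamble, 12 B inter-frame gap) in addition to its own length. A small sketch of the arithmetic:

```shell
# Maximum frame rate (in Mpps) for a given frame size on a 10 Gbit/s link
line_rate_mpps() {
  # 10^10 bit/s divided by (frame + 20 B preamble/IFG) * 8 bits per frame
  awk -v f="$1" 'BEGIN { printf "%.2f\n", 10e9 / ((f + 20) * 8) / 1e6 }'
}
line_rate_mpps 64    # prints 14.88
line_rate_mpps 128   # prints 8.45
```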

    2.2 10GbE Routing on PC with GNU/Linux

    This master's thesis was written by Moris Bangoura in 2012 and builds on the master's thesis written by Radim Roška. The main difference here is that the computing is done on GPGPU cores, and the author was able to perform not only switching but also routing and firewalling.

    First, the background theory is described, including a description of the PacketShader and netmap frameworks. The author also describes the GPGPU architecture. He then deals with finding the right architecture, hardware and software for switching, routing and firewalling. The final configuration is as follows:

    2x motherboard: Supermicro X8DAH+-F,

    CPUs: 2x quad-core Xeon 5620, 2.4 GHz with HT technology,

    GPGPUs: 2x NVIDIA GTX 580, 3 GB RAM,

    2x 3x2GB DDR3-1066 RAM for each computer,

    10GbE NIC: 2x dual-port 10GbE Intel controller 82599.

    The author modifies the PacketShader I/O module and adds TX IP header checksum computation. He also creates an application based on the PacketShader framework for GPGPU firewalling.


    The results of the proposed prototype were quite positive. The system is able to achieve line-rate speed for:

    transmitting from 64B packets,

    receiving from 256B packets,

    routing from 512B packets,

    firewalling - in this case, the results depend on the number of ACL rules in the routing table. For 0-256 rules, the system is able to operate at almost wire speed. With more rules, the performance drops rapidly.

    The main difference between these theses and mine is that I will use a specialized piece of hardware for packet processing which includes hardware acceleration units for better performance. It is a completely different architecture and uses software written for these tasks in particular.


  • Chapter 3

    Theoretical description of WANic 56512

    3.1 General overview

    The GE Intelligent Platforms WANic 56512 is an intelligent, high-performance packet processor which contains [8]:

    Cavium OCTEON™ Plus 12-core 750 MHz CN5650 processor,

    4 GB of high-speed DDR2 SDRAM via VLP Mini-RDIMMs,

    32 MB of DDR SDRAM persistent memory,

    2 GB USB flash disk,

    2x 10 Gb Ethernet via SR/LR SFP+ transceivers,

    4-lane PCI Express host interface.

    For a better understanding of the WANic's equipment and its interconnections, a block diagram is included in figure 3.1.

    A packet processor is a special type of processor built explicitly to deal with issues which arise in computer networks, such as monitoring, network management, security, etc. These devices perform data inspection, identification, extraction, and all other kinds of data manipulation which can later be used for load balancing, traffic shaping and routing. Their main advantage over normal processors is the presence of software and hardware developed specifically for packet flow. This ensures the best possible line rates.

    The WANic 56512 is equipped with the OCTEON CN5650 packet processor, developed by the company Cavium Networks. This processor is based on the Microprocessor without Interlocked Pipeline Stages (MIPS) architecture, which I will describe in more detail later on. The key feature of Cavium's processors is the presence of Hardware Acceleration Units. They have a huge effect on:


    Figure 3.1: Block diagram of WANic 56512. Source: [8]

    packet I/O processing,

    Quality of Service (QoS),

    TCP,

    security - IPsec, SSL and 3G/UMB/LTE,

    compression/decompression.

    Another key feature, very desirable nowadays, is power consumption. According to the product brief [4], Cavium Networks claims that the power consumption of the CN5650 chip is only 10-30 W. Other considerable features include dedicated DMA engines for each hardware unit, high-speed interconnects between the hardware units, and the ability to group the cores as desired.

    Communication between the host computer and the WANic 56512 card is provided via the PCI Express bus. The control and management of the card is done either via the serial console or the PCI console. This process will be described in more detail later on.

    All described units can be found in the block diagram 3.2.

    3.1.1 The MIPS Architecture

    Since the OCTEON chip is based on the MIPS architecture, I will describe in more detail why the chip is built on this particular architecture. The text in this subsection is based on my bachelor's thesis - Measurement of throughput of Ethernet Cards on Telum NPA-5854 device [15].

    The MIPS architecture is a typical example of a Reduced Instruction Set Computer (RISC) Instruction Set Architecture (ISA), based on the principle of main registers. MIPS64 was first introduced in 1991 and was the first 64-bit architecture in the world. MIPS uses the register-register approach, sometimes also called the load-store architecture.


    Figure 3.2: Block diagram of OCTEON CN5650. Source: [3]

    Advantages of using registers:

    Registers are faster than memory.

    Access to registers can be random.

    Fewer accesses to memory are needed.

    Registers can store intermediate results, local variables and parameters.

    Disadvantages of using registers:

    The number of registers is limited.

    A more complex compiler is needed.

    Context switches take longer.

    Registers cannot store composite data structures.

    Characteristics of RISC:

    Fixed-length instructions (32-bit), which results in simpler decoding.

    Three-address architecture - all three registers need to be specified; for example, add $s0, $s1, $s2 means s0 = s1 + s2.

    A large number of registers to use (32).


    All instructions take the same processing time.

    Very fast instruction processing.

    Pipeline processing is easy to implement.

    Figure 3.3: MIPS architecture, pipelined. Source: [17]

    3.1.2 Comparison between the MIPS and x86 architecture

    The main difference between these two architectures is that MIPS is an example of RISC, while x86 is an example of a Complex Instruction Set Computer (CISC) ISA.

    Other differences:

    MIPS has aligned data; x86 does not.

    MIPS has 32 registers; x86 has only 8.

    MIPS's return address is always register 31; x86 uses the stack.

    MIPS is a load-store architecture (simpler hardware, easier to pipeline, higher performance); x86 is a memory-register architecture (fewer instructions in the program results in smaller code, but more complicated hardware, and more instructions need to be implemented).

    MIPS has fixed-length instructions; x86 has variable-length instructions.


    3.2 Hardware Acceleration Units

    Each OCTEON chip contains several hardware acceleration units that offload work and free up the cores. The units can be divided into:

    Packet-management accelerators - Packet traffic can be enormous in busy networks. Therefore, it is desirable to offload time-consuming packet processing from the cores. The packet-management accelerators are responsible for packet receiving, transmitting, buffering, QoS and packet flow. The packet data buffers are automatically created and freed. For TCP and UDP, the packet headers are automatically checked on receive and the checksum is automatically calculated on transmit. TCP retransmission is also implemented in the timer unit. Packet ordering and scheduling is managed by its own unit.

    Security accelerators - These accelerators are responsible for generating random numbers and accelerating security algorithms and related operations, such as MD5, SHA, 3DES, AES, RC4, KASUMI, RSA and TKIP.

    Application accelerators - The CN5650 chip has units providing acceleration for DEFLATE compression/decompression, CRC checksums for ZLIB and GZIP, and acceleration for RAID 5 and RAID 6.

    3.3 Packet Flow

    Before explaining the packet flow, it is convenient to describe the units involved in this process.

    SSO unit - The Schedule/Synchronization and Order Unit manages packet scheduling and ordering.

    PIP unit - The Packet Input Processor Unit works with the IPD to manage packet input.

    IPD unit - The Input Packet Data Unit works with the PIP to manage packet input.

    PKO unit - The Packet Output Unit manages packet output.

    FPA unit - The Free Pool Allocator Unit manages pools of free buffers, including Packet Data buffers.

    The packet flow process is crucial for understanding how packet processing is performed inside the OCTEON chip, and also for writing new software applications. The whole process can be divided into three main sections.

    Packet input - In this phase the packet is received and checked for errors by the RX port. The packet is then passed to the IPD unit, where the packet data is shared with the PIP unit. The PIP unit is responsible for packet parsing. The IPD unit stores the packet data in a Packet Data Buffer (allocated from the FPA unit) in L2/DRAM; DMA is used for this process. A pointer to the appropriate QoS queue in the SSO unit is also created. See figure 3.4 for more details.


    Figure 3.4: Packet input. Source: [11]

    SSO and core processing - The SSO unit schedules the work to be done based on QoS priority, ingress order and current locks. The cores then process the packet data, which is read and written in L2/DRAM. After this processing, each core sends a pointer to the packet data buffer and the data offset to the appropriate Packet Output Queue in the PKO unit. The output port and packet priority are specified. See figure 3.5 for more details.

    Packet output - In this phase the PKO unit copies the data from the buffer described above into its own memory and adds the TCP or UDP checksums if desired. The PKO unit then sends the data from its memory to the output port, and the packet is transmitted by the TX port. See figure 3.6 for more details.

    Many of the hardware acceleration units described in the previous section play a significant role in the packet flow. This eliminates bottlenecks, because the cores can work on packet processing in parallel without having to classify and prioritize the packets themselves.


    Figure 3.5: SSO and core processing. Source: [11]

    3.4 Simple Executive

    The Simple Executive is an Application Programming Interface (API) which provides a Hardware Abstraction Layer (HAL) to the hardware units included on the OCTEON chip. The functions provided by the Simple Executive API can be used to develop a standalone or a user-mode Simple Executive application. User-mode means that the application is run from the Linux operating system. The difference between these two run-time modes has a huge influence on the overall performance of the application.

    Standalone mode - When running an application in standalone mode, the best possible performance should be assured. There are no context switches and the whole memory is mapped for fast access. There is also a great opportunity for scaling, because all the cores can run the same application.


    Figure 3.6: Packet output. Source: [11]

    User-mode - On the other hand, when an application is run in user-mode, we have to take into account cache and TLB misses, and higher traffic on the buses. At least one core is reserved for Linux, and the memory also has to be divided between the application and the Linux kernel.

    In this chapter I tried to point out the most important pieces of hardware present on the card. In section 3.1.1 I showed why the designers of the OCTEON chip chose the MIPS architecture over x86: this architecture is the best candidate for parallel processing, which is highly desirable for the best packet processing performance. Since handling packets, especially small ones, can be very CPU-intensive, the designers added hardware accelerators to the card to ease the CPU usage. I also described the main difference between standalone and user-mode for applications developed for the card.

  • Chapter 4

    Installation of WANic 56512

    The installation of the card was the most time-consuming and difficult part of my thesis. Therefore, I decided to include the procedures and approaches I tried as a regular chapter.

    4.1 Description/Specification of the host system

    The WANic 56512 is inserted into the following host system:

    Intel® Core™2 Duo CPU E8200 @ 2.66 GHz,

    2x Corsair 2GB DDR2 RAM (CM2X2048-6400C5, 800 MHz),

    Gigabyte GA-EP45-DS4 motherboard,

    nVidia GeForce 9600 GT,

    Linux kernel 2.6.32-40-generic SMP i686, Ubuntu 10.04 LTS,

    Seasonic SS 500GB Active PFC F3 power supply.

    The first problem was finding a power supply powerful enough to run both the computer and the card. After trying several power supplies, I found that the power supply needs to be rated at least 500 W.

    The next issue was cooling the card. The card is designed for an air-cooled chassis environment, which we do not have. First attempts with the card resulted in permanent damage due to overheating. Nevertheless, the overheating was most likely caused by a hardware failure. After the warranty replacement we used an optional fan that provides more airflow and keeps the temperature below the 105 °C required by the manufacturer.

    The card is connected to the host system via the PCI Express x8 bus and to the laboratory network via the H3C S5800 switch. The network topology also includes the generator, which is likewise connected to the switch. The topology is shown in figure 4.1.


    Figure 4.1: Topology of the lab.

    4.2 Installation procedures

    The installation process of the card was much more complicated than I expected. There are basically two approaches to installing the card, depending on whether we buy the SDK or not.

    4.2.1 Installation procedure without buying the SDK

    My first attempt was to simply connect the WANic 56512 to the host system and see whether the card would be recognized by the operating system. Ubuntu successfully recognized the card; the lspci command showed: 02:00.0 MIPS: Cavium Networks Octeon CN57XX Network Processor (CN54XX/CN55XX/CN56XX) (rev 09). However, the interfaces on the card were not recognized and did not show up in ifconfig.
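Whether the card enumerates on the bus can be rechecked at any time with lspci's vendor filter; a hedged one-liner (177d is Cavium Networks' PCI vendor ID):

```shell
# List only Cavium devices, with numeric vendor/device IDs shown:
lspci -nn -d 177d:
```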

4.2.1.1 Diagnostic mode's kernel

    The next move was to use the diagnostic mode. This mode contains a Linux kernel image configured by the manufacturer, stored in the card's memory. To run this image, DIP switch 4 needs to be set to the ON position. The location of the switch is shown in figure 4.2. Next, we need to establish a connection between the card and the host system to see the output. The only option without buying the SDK is a serial console. We already have the needed 20-pin RS-232 connector from my bachelor's thesis. Otherwise, we would have to buy the serial adapter kit.

    A program called minicom is used to connect to the serial output of the card. The parameters of the connection are: 115200 8N1, no flow control. After powering up the card, the serial output is redirected to our console, and after the boot process finishes, a busybox prompt is shown.
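The same settings can be passed to minicom on the command line; a minimal sketch assuming the serial port enumerates as /dev/ttyS0 (a USB adapter would typically appear as /dev/ttyUSB0):

```shell
# Open the card's console at 115200 baud; 8N1 is minicom's default framing
minicom -D /dev/ttyS0 -b 115200
# Hardware flow control can be pre-disabled on the port with:
#   stty -F /dev/ttyS0 115200 cs8 -parenb -cstopb -crtscts
```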

    The busybox environment is very limited. Only a few basic commands are available,such as ls, cat, cp, ping, chmod, netstat and vi.

    I was able to get Debian GNU/Linux running on the card using the following steps. I used the wget program to copy a basic Debian GNU/Linux root file system to the built-in flash memory on the card. Then I created a chroot environment:

    mkdir /mnt/chroot
    mount /home/root/usb /mnt/chroot
    mount -o bind /proc /mnt/chroot/proc
    mount -o bind /dev /mnt/chroot/dev
    chroot /mnt/chroot


    Figure 4.2: The needed DIP switch. Source: [5]

    The chroot environment changes the root directory of the diagnostic kernel to the root directory of the copied Debian GNU/Linux. After this change, we have a fully working operating system on the card and we can install tools for benchmarking. The /dev and /proc directories are needed to provide the network interfaces.

    The problem with this solution is that there is no possibility to change the kernel configuration. For instance, if we want to use pktgen, which is a kernel module for generating packets, we need to recompile the kernel with the required options first. But this is not feasible with this configuration.

    4.2.1.2 cnusers SDK

    The next possibility was to use the freely available cnusers SDK. This SDK can be downloaded at http://www.cnusers.org/ after a successful registration and its approval. It contains, among other things, the source code of U-BOOT, Linux kernel 2.6.32 and a few examples. Unfortunately, it does not contain any patches for the WANic 56512 card. Therefore, the built-in flash memory is not accessible and the octeon-ethernet driver is not optimized.

    Nevertheless, I was able to cross-compile the Linux kernel with the pktgen module and get a fully working kernel. This SDK also contains the octeon-ethernet driver needed for maintaining the network interfaces. But as I mentioned above, the driver is not optimized for the WANic 56512 card; it is merely a generic driver.

    Cross-compilation is a method that gives us the ability to compile source code for an architecture other than the one the compiler runs on. In my case it was from x86-64 to MIPS. The cnusers SDK contains all the necessary tools for this: the toolchains.

  • 18 CHAPTER 4. INSTALLATION OF WANIC 56512

    The advantage of cross-compilation is that we can use the computing power and storage capacity of the host system rather than the limited resources of the embedded device.

    The process of copying a new kernel to the card is not very convenient. I had to use a TFTP server and copy the new kernel over the TFTP protocol. I ran the following commands in the U-BOOT environment:

    setenv ipaddr 10.101.1.101
    setenv serverip 10.101.1.100
    setenv ethact octeth0
    ping 10.101.1.100
    tftpboot 0x20000000 /tftpboot/vmlinux.64
    bootoctlinux 0x20000000 coremask=0xfff

    First, I set up the IP addresses of the card and the TFTP server and selected the active network interface. The ping command is necessary to bring the interfaces up. The tftpboot command copies the kernel image from the TFTP server to a specific address (the address 0x20000000 is recommended by the manufacturer). Finally, the bootoctlinux command boots the copied kernel image. With the coremask parameter, I can specify how many cores will run the loaded image (0xfff = 12).
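    The coremask is interpreted bit by bit: each set bit enables one core. The count can be checked with a small shell loop:

```shell
# Count the set bits in a coremask; each set bit selects one core to run the image.
coremask=$((0xfff))
cores=0
while [ "$coremask" -gt 0 ]; do
  cores=$((cores + (coremask & 1)))
  coremask=$((coremask >> 1))
done
echo "$cores"   # 0xfff has 12 set bits -> 12 cores
```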

    The limitation of this solution is the unavailability of the built-in flash memory: without it, I cannot save any information or install new programs. Only the shared memory is available, and it is cleared after every reboot. There is a possibility to use the NFS protocol, but that solution would require access to one of the ports on the card, and since the card has only two of them, this approach is out of the question.

    4.2.2 Installation procedure with buying the SDK

    From the previous statements, it is obvious that we need to buy the SDK to get better control of the card. We bought the SDK from GE Intelligent Platforms believing that it would contain everything needed for software development. As I discovered later, it contains only the SDK from the cnusers site with patches for our card and some more examples. It does not contain the documentation of the Simple Executive API and its functions. Nevertheless, with this SDK I was able to fully maintain the WANic 56512 card, even via the PCI console.

    I will now describe the necessary steps as I performed them. I have to say that the level of documentation provided by the manufacturer was very poor: it lacks essential information and is full of mistakes. Hence, I will provide the installation steps in their correct form.

    First of all, there is a mistake in the installation script. The shell in the host system could not process lines of the form command >& /dev/null, so I had to change them to command >/dev/null 2>&1. A working installation script is included on the attached CD. After this change, I was able to successfully install the SDK:


    cd /the/main/directory/of/the/CD-ROM
    sh install.sh /home/octeon/

    This command decompresses the files into the desired directory. Then I had to specify the OCTEON model for the GNU toolchains that are needed for the cross-compilation. There is a script that makes the necessary changes in the bash environment (this script needs to be run before every cross-compilation):

    cd /home/octeon/OCTEON-SDK
    source env-setup OCTEON_CN56XX_PASS2

    Then I had to apply the patches that add support for the WANic 56512 card. They include support for the PCI console, the built-in flash memory, the Ethernet driver, etc. The patches can be applied by running:

    cd /home/octeon/OCTEON-SDK/cav-gefes
    make patches-install

    A very useful option is to include the programs needed for the benchmarks directly in the root directory of the embedded file system. For example, if I want to use iperf without running the chroot environment, I have to do the following:

    mkdir /tmp/extra-files
    cp iperf /tmp/extra-files
    cd /home/octeon/OCTEON-SDK/linux/embedded_rootfs
    make menuconfig
        Select the programs I want - bridge-utils, ethtool, tcpdump
        Specify the Embedded rootfs extra-files directory
    make all

    Everything in the /tmp/extra-files directory will be included in the embedded file system. A good point to mention here is that the /tmp/extra-files directory has to be created even if we do not want to include any additional programs. Without this directory, the compilation would be unsuccessful. With a prepared root file system, I could proceed to the cross-compilation of the SDK.

    cd /home/octeon/OCTEON-SDK/cav-gefes
    make menuconfig
        In the Build Options, under the Embedded Linux Options, choose the Manufacturing Build
        In the Build Options, specify the Embedded rootfs extra-files directory
        Select bash for a better shell
    make all

    For a successful cross-compilation I also had to install the yacc, flex and gettext packages. Next, I edited the /home/octeon/OCTEON-SDK/linux/kernel_2.6/kernel.config file to enable the pktgen module. Other kernel options can be specified in this file as well. After the successful cross-compilation, the final kernel image is stored in /home/octeon/OCTEON-SDK/linux/kernel_2.6/linux/ as vmlinux.64.


    4.2.2.1 PCI console

    To get a working PCI console redirection, I first had to recompile the U-BOOT to the new version.

    cd /home/octeon/OCTEON-SDK
    source env-setup OCTEON_CN56XX_PASS2
    cd /home/octeon/OCTEON-SDK/bootloader/u-boot
    make clobber
    make octeon_w56xx_config              # for a regular image
    make octeon_w56xx_ram_debug_config    # for a RAM debug image
    make octeon_w56xx_failsafe_config     # for a Failsafe image
    make

    The PCI console redirection works only with the regular U-BOOT image. For a working PCI console I had to compile and load the PCI driver first. This driver enables the PCI communication between the WANic 56512 and the host system. A very important note for the cross-compilation of the driver: the host system must contain the header files of the kernel it is currently running. Without them, the cross-compilation process ends with a very hard-to-find make error.

    cd /home/octeon/OCTEON-SDK/
    make -C components/driver/
    insmod components/driver/bin/octeon_drv.ko

    After running the insmod command, the driver should be loaded. Verification can be done by running dmesg; the output should look like this:

    [29520.022310] octeon_drv: module license Cavium Networks taints kernel.
    [29520.022314] Disabling lock debugging due to kernel taint
    [29520.027178] -- OCTEON: Loading Octeon PCI driver (base module)
    [29520.027181] OCTEON: Driver Version: PCI BASE RELEASE 2.0.0 build 73
    [29520.027183] OCTEON: System is Little endian (250 ticks/sec)
    [29520.027185] OCTEON: PCI Driver compile options: NONE
    [29520.027209] OCTEON: Found device 177d:50..Initializing...
    [29520.027216] OCTEON: Setting up Octeon device 0
    [29520.027230] Octeon 0000:02:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
    [29520.027236] Octeon 0000:02:00.0: setting latency timer to 64
    [29520.027241] OCTEON[0]: CN56XX PASS2.1...
    [29520.027446] OCTEON[0]: BIST enabled for CN56XX soft reset
    [29520.037373] OCTEON[0]: Reset completed
    [29520.037381] OCTEON[0] Poll Function (Module Starter arg: 0x0) registered
    [29520.039233] OCTEON[0]: Detected 4 PCI-E lanes at 2.5 Gbps
    [29520.039241] OCTEON[0]: Enabling PCI-E error reporting..
    [29520.039247] OCTEON[0]: CN56XX Pass2 Core Clock: 750 Mhz...


    [29520.039299] Octeon 0000:02:00.0: irq 31 for MSI/MSI-X
    [29520.039304] OCTEON[0]: MSI enabled
    [29520.039318] OCTEON: Octeon device 0 is ready
    [29520.039375] -- OCTEON: Octeon PCI driver (base module) is ready!
    [29520.039922] -- OCTEON: Octeon Poll Thread starting execution now!

    Now that I have the U-BOOT image and the driver loaded, I can boot the card over the PCI bus. For the PCI boot, DIP switch 3 needs to be changed to the ON position. These commands boot the card and redirect the output to the PCI console:

    export OCTEON_REMOTE_PROTOCOL=PCI:<N>
    cd /home/octeon/OCTEON-SDK/host/remote-utils
    oct-remote-boot --board=W5651X --filename=<image>
    oct-remote-bootcmd "setenv pci_console_active yes"
    oct-remote-console --noraw

    where <N> is the device number from the dmesg output (usually it is 0) and <image> is the path to the .bin U-BOOT image.

    If the PCI console is not a requirement, I can send input to the card with the oct-remote-bootcmd command.

    And now I have a fully working kernel with all the changes in .config I wanted, plus the PCI console redirection. To start working with the card, I can use the following procedure:

    export OCTEON_REMOTE_PROTOCOL=PCI:0
    cd /home/octeon/OCTEON-SDK/host/remote-utils
    ./oct-remote-boot --board=W5651X ../../bootloader/u-boot-octeon_w56xx.bin
    ./oct-remote-load 0 ../../linux/kernel_2.6/linux/vmlinux.64
    ./oct-remote-bootcmd "bootoctlinux 0 coremask=0xfff console=pci"

    These commands boot the card with the compiled kernel and redirect the output to the PCI console. After a successful boot, the command prompt will be ready. It is convenient to create a chroot environment:

    mkdir /mnt/chroot
    mount /dev/sda /mnt/chroot
    mount -o bind /proc /mnt/chroot/proc
    mount -o bind /dev /mnt/chroot/dev
    modprobe npaDriver
    modprobe octeon-ethernet
    modprobe pktgen
    chroot /mnt/chroot
    bash

    Now we have a fully working Linux environment with everything we need to perform measurements with the benchmarks. The rest of this chapter describes the WANic's features that can simplify working with the card.


    4.2.2.2 NIC mode

    The WANic 56512 can operate in two different modes. The first one is the NIC mode, which allows us to use the front-end ports on the card via the host system. To enable this mode, the following steps need to be performed:

    cd /home/octeon/OCTEON-SDK
    source env-setup OCTEON_CN56XX_PASS2
    cd /home/octeon/OCTEON-SDK/cav-gefes
    make menuconfig
        In the Cavium PCI Components, under the Build Options, choose the 2 x 10G Port NIC mode
        Select all the remaining options in the Cavium PCI Components menu
    make all

    There is a problem in the compilation process with the OCTEON NIC driver because of exported-symbol dependencies. The NIC driver depends on symbols exported by the PCI driver, but the information about them is missing in the module. Therefore, we need to tell the compiler where to look for them. The solution is to copy the Module.symvers file from OCTEON-SDK/components/driver/host/driver/linux to OCTEON-SDK/components/driver/host/driver/linux/octnic. This file contains a list of all the exported symbols used by both the PCI and the NIC driver. Now I can continue with booting the card and loading drivers.
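    For reference, the symbol-file fix amounts to a single copy operation. A sketch, guarded so it is a no-op on machines without the SDK tree (the path is the one assumed throughout this chapter):

```shell
# Copy the exported-symbol list from the PCI driver build to the NIC driver
# directory so the octnic module can resolve the symbols at compile time.
SDK=${SDK:-/home/octeon/OCTEON-SDK}
SRC="$SDK/components/driver/host/driver/linux"
if [ -f "$SRC/Module.symvers" ]; then
  cp "$SRC/Module.symvers" "$SRC/octnic/"
fi
```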

    export OCTEON_REMOTE_PROTOCOL=PCI:0
    cd /home/octeon/OCTEON-SDK/
    insmod components/driver/bin/octeon_drv.ko
    cd /home/octeon/OCTEON-SDK/host/remote-utils
    ./oct-remote-boot --board=W5651X ../../bootloader/u-boot-octeon_w56xx.bin

    Now I need to load an application to initialize and recognize the interface ports. Without this step, the NIC driver would not work properly. There is an example application in the SDK from GE called cvmcs-nic.

    ./oct-remote-load 0 ../../components/driver/bin/cvmcs-nic.strip

    ./oct-remote-bootcmd "bootoct 0 coremask=0xfff"

    Now comes a tricky part. The application is loaded and working, but I had to wait for the first output of driver statistics on the console. Only after this occurrence could I continue with loading another kernel module:


    insmod ../../components/driver/bin/octnic.ko

    Similar messages can be found in the dmesg output:

    [ 1487.306913] OCTEON[0]: Received active indication from core
    [ 1487.306923] OCTEON[0] is running NIC application (core clock: 750000000 Hz)
    [ 1488.136798] OCTEON[0]: Starting module for app type: NIC
    [ 1489.140130] OCTEON[0] Poll Function (Module Starter arg: 0x0) completed (status: Finished)
    [ 1577.558616] -- OCTNIC: Starting Network module for Octeon
    [ 1577.558627] Version: PCI NIC RELEASE 2.0.0 build 73
    [ 1577.558633] OCTNIC: Driver compile options: XAUI_DUAL
    [ 1577.558642] OCTEON: Registered handler for app_type: NIC
    [ 1577.558647] OCTEON[0]: Starting modules for app_type: NIC
    [ 1577.558655] OCTNIC: Initializing network interfaces for Octeon 0
    [ 1577.567103] OCTNIC: oct0 -> 10000 Mbps Full Duplex UP
    [ 1577.569611] OCTNIC: oct1 Link Down
    [ 1577.569621] OCTEON[0] Poll Function (NIC Link Status arg: 0xfa16f000) registered
    [ 1577.569626] OCTNIC: Network interfaces ready for Octeon 0
    [ 1577.569631] -- OCTNIC: Network module loaded for Octeon

    Now I can access the two interfaces on the WANic 56512 via the oct0 and oct1 interfaces on the host system.

    4.2.2.3 Ethernet PCI mode

    This second mode allows us to use the processing power of the OCTEON chip by sending all the traffic of the host system's interfaces to the card as Ethernet frames over PCI. To enable this mode, the following steps need to be performed:

    cd /home/octeon/OCTEON-SDK
    source env-setup OCTEON_CN56XX_PASS2
    cd /home/octeon/OCTEON-SDK/cav-gefes
    make menuconfig
        In the Cavium PCI Components, under the Build Options, choose the EtherPCI mode
        Select all the remaining options in the Cavium PCI Components menu
    make all

    There is exactly the same problem with the driver as in the previous case, and the solution is the same. After the fix, I could continue with booting the card and loading the drivers.

    export OCTEON_REMOTE_PROTOCOL=PCI:0
    cd /home/octeon/OCTEON-SDK/
    insmod components/driver/bin/octeon_drv.ko
    cd /home/octeon/OCTEON-SDK/host/remote-utils
    ./oct-remote-boot --board=W5651X ../../bootloader/u-boot-octeon_w56xx.bin
    ./oct-remote-load 0 ../../../vmlinux.64
    ./oct-remote-bootcmd "bootoctlinux 0 coremask=0xfff console=pci"


    Now I had to load the modified octeon-ethernet driver by performing

    modprobe octeon-ethernet

    on the WANic 56512 card. Then I needed to load the NIC driver on the host system

    cd /home/octeon/OCTEON-SDK/
    insmod components/driver/bin/octnic.ko

    After loading the NIC driver, I could see the octX interfaces on the host system and the pciX interfaces on the WANic 56512 card (oct0 refers to pci0, etc.). These interfaces can be configured using ifconfig.

    Chapter 5

    Benchmarks for WANic 56512

    Benchmarks for the WANic 56512 card can be divided into two main groups, depending on the environment in which the benchmarks are run. Each group can be further divided depending on what we want to measure: RX, TX, switching, etc. The measurements will be based on principles from RFC 2544 (Benchmarking Methodology for Network Interconnect Devices) [10]. I will try to perform identical measurements with different programs to find the best benchmark for each task. I will also try to perform each task on a different number of cores to verify whether the task is core-dependent or not.

    5.1 RFC 2544

    This RFC defines a set of tests that can be used to measure the performance and parameters of the tested network and network devices. In my thesis, I will perform the throughput and the frame loss rate tests.

    RFC 2544 also defines the frame sizes to be used in the measurements. The defined sizes are 64, 128, 256, 512, 1024, 1280 and 1518 bytes.
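    A benchmark run then simply iterates over this fixed set of sizes. Sketched in shell, with a hypothetical run_one_size placeholder standing in for a single trial:

```shell
# Run one trial per RFC 2544 frame size.
run_one_size() { echo "trial with $1-byte frames"; }  # placeholder for a real trial

for size in 64 128 256 512 1024 1280 1518; do
  run_one_size "$size"
done
```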

    5.1.1 Throughput

    To perform this test, we need to send a chosen number of frames (x) at a specific rate (frames per second) to the tested device. Then we need to count the frames that are transmitted by the tested device (y). If x ≠ y, the test needs to be re-run with an adapted rate value.

    The resulting throughput is then the fastest rate at which the count of test frames transmitted by the Device under Test (DUT) is equal to the number of test frames sent to it by the test equipment. [10]
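    The "adapt the rate and re-run" procedure is in effect a search over the offered rate. A sketch of the search logic, where measure simulates a DUT that forwards at most a fixed rate (an assumption for the demo) and would be replaced by a real send-and-count trial:

```shell
# Binary search for the highest rate (in Mbps here) with no frame loss.
DUT_LIMIT=7400                       # simulated DUT forwarding capacity (assumption)
measure() {                          # echo the forwarded rate for an offered rate
  if [ "$1" -le "$DUT_LIMIT" ]; then echo "$1"; else echo "$DUT_LIMIT"; fi
}

lo=0; hi=10000                       # search window: 0 .. line rate
while [ $((hi - lo)) -gt 1 ]; do
  mid=$(( (lo + hi) / 2 ))
  if [ "$(measure "$mid")" -eq "$mid" ]; then
    lo=$mid                          # no loss: try a higher rate
  else
    hi=$mid                          # loss: lower the rate
  fi
done
echo "throughput: $lo Mbps"
```

With the simulated limit above, the search converges to exactly that limit.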

    5.1.2 Frame loss rate

    To perform this test, we need to send a chosen number of frames (x) at a specific rate (frames per second) to the tested device. Then we need to count the frames that are transmitted by



    the tested device (y). The frame loss rate at each point is calculated using the following formula:

    ((x - y) * 100) / x

    This test should start at 100% of the maximum rate of the input media. The whole procedure should then be repeated at 90% of the maximum rate, and so on. This process should be repeated until there are two runs of the test with no frame loss.
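    The formula can be wrapped in a small shell helper (integer percentages, as in the RFC):

```shell
# frame_loss_rate <sent> <forwarded>  -> loss in percent
frame_loss_rate() {
  echo $(( ($1 - $2) * 100 / $1 ))
}

frame_loss_rate 1000 950   # 50 of 1000 frames lost -> prints 5
```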

    5.2 Benchmarks for Linux environment

    In this section I will describe available benchmarks that run on the GNU/Linux operating system. I will focus only on open-source programs and tools. The goal is to find the best benchmark tool for packet generating, transmitting, receiving and switching that will run on the WANic 56512.

    5.2.1 iperf

    The first program in my list of benchmark tools is iperf. This tool is used to measure and verify the bandwidth of the tested network. It is capable of generating both TCP and UDP packets and it uses a client-server model.

    This tool can be used for benchmarking the transmitting and receiving of packets. iperf runs in user space; therefore, the overall performance is limited by the overhead of system calls.

    The whole program is controlled from the command line via its parameters. iperf can be started by running iperf -c x.x.x.x on the client side, where x.x.x.x is the IP address of a server, and accordingly iperf -s on the server side. There are several parameters that can affect the overall performance:

    -w Set the TCP window size.

    -u Use the UDP rather than the TCP.

    -b When using the UDP, sets the bandwidth to send at in bits/sec.

    -l Set the length of the buffer to read or to write. This basically means the size of the packet.

    -d Run the bi-directional test simultaneously.

    -t Set the time duration of the test in seconds.

    -P Set the number of parallel connections when using the TCP protocol.

    -V Use IPv6 instead of IPv4.
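    Combining several of these parameters, a typical UDP run for one frame size could look like the following sketch (the server address, rate and payload size are placeholders, and the command is only printed here so the sketch runs anywhere):

```shell
# Dry-run sketch: build the iperf client command line.
SERVER=10.0.0.1          # placeholder server address
RATE=1000M               # -b: target UDP bandwidth
SIZE=1470                # -l: UDP payload size in bytes

CMD="iperf -c $SERVER -u -b $RATE -l $SIZE -t 60"
echo "$CMD"              # replace echo with direct execution on the test machine
```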


    5.2.2 netperf

    netperf is another tool for benchmarking the transmitting and receiving of packets. netperf is similar to iperf: both programs are based on the client-server architecture, both run in user space, and both can generate TCP as well as UDP packets.

    The advantage of using netperf is the TCP_SENDFILE and UDP_SENDFILE tests. These tests should result in lower CPU utilization and higher throughput because the data can be sent directly from the file system buffer cache.

    The known disadvantage of using netperf is the lack of shaping algorithms. This means that there is no control of the outgoing traffic, which results in flooding the receiver (sending is easier than receiving). This fact can have a severe influence on the performance results when using UDP packets.

    The usage of netperf is similar to iperf's. The client can be run by typing netperf -H x.x.x.x into the command line, where x.x.x.x is the IP address of the server. The server can be started either from the command line by typing netserver or as an inetd service. There are also some important parameters:

    -t Specify the test to perform: TCP_STREAM, TCP_SENDFILE, etc.

    -T Bind netperf to a specific CPU.

    -m Set the size of the buffer passed in to the send calls of a _STREAM test.

    -M Set the size of the buffer passed in to the receive calls of a _STREAM test.

    -s Set the size of the netperf send and receive socket buffers for the data connection.

    -S Set the size of the netserver send and receive socket buffers for the data connection.

    -D Display the results immediately during the performed test.

    With the -T parameter, I can run netperf in multiple instances on separate CPUs to avoid a CPU bottleneck. For instance, if I have 4 CPUs, I can do the following:

    netperf -H x.x.x.x -T0,0 -l 120 &
    netperf -H x.x.x.x -T1,1 -l 120 &
    netperf -H x.x.x.x -T2,2 -l 120 &
    netperf -H x.x.x.x -T3,3 -l 120 &

    The -l 120 argument makes each test run longer, to ensure that the tests run simultaneously.

    5.2.3 curl-loader

    curl-loader is a slightly different benchmark tool. The main difference lies in creating thousands of virtual clients that can connect to a website. curl-loader is often compared to commercial products, such as Spirent Avalanche and IXIA IxLoad.


    The clients can connect and log in to a specific website using one of the following protocols: HTTP, HTTPS, FTP and FTPS. Their IP addresses can be shared, unique or assigned from an IP pool. The number of clients is hardware-dependent; it can range between 2,500 and 100,000, or even more.

    The goal of this tool is to generate as many clients as possible and let them connect to a single web server. This method is called a stress test and resembles the basis of a Denial of Service (DoS) attack. I would like to find out whether the WANic 56512 can be used to generate reasonable traffic for a DoS attack and whether the card can receive and process such traffic.

    curl-loader is controlled from the command line and needs a configuration file to run. To run curl-loader, type curl-loader -t <threads> -f <config file> into the command line, where <threads> is the number of running threads and <config file> is the configuration file.
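    For illustration, a minimal configuration file might look like the fragment below. The field names are recalled from curl-loader's bundled sample configurations and should be verified against the examples shipped with the tool; all addresses are placeholders.

```
########### GENERAL SECTION ###########
BATCH_NAME=stress-test
CLIENTS_NUM_MAX=1000
INTERFACE=eth0
NETMASK=24
IP_ADDR_MIN=10.0.0.10
IP_ADDR_MAX=10.0.3.250
CYCLES_NUM=-1

########### URL SECTION ###########
URLS_NUM=1
URL=http://10.0.0.1/index.html
URL_SHORT_NAME="index"
REQUEST_TYPE=GET
TIMER_URL_COMPLETION=5000
TIMER_AFTER_URL_SLEEP=100
```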

    5.2.4 pktgen

    pktgen is a kernel module for generating packets at wire speed. To use this module, we first have to include the pktgen module in the kernel. To do so, we need to enable CONFIG_NET_PKTGEN in the .config file, recompile the kernel and insmod or modprobe the pktgen module. pktgen then creates a thread for each CPU in the system.
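    The required kernel option is a single line in .config (=m builds pktgen as a loadable module; =y would build it into the kernel):

```
CONFIG_NET_PKTGEN=m
```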

    The advantage of pktgen is the fact that it runs in kernel space. This should ensure the best possible performance for packet generating because there are fewer system calls and interruptions than when running in user space.

    The main disadvantage is that pktgen can generate only UDP packets. Another disadvantage is the impossibility of assigning more than one CPU to one interface device. This option is really crucial since computers nowadays usually have several CPUs. Fortunately, there is a patch for pktgen that solves this problem [14]. This patch adds multiqueue support, which makes it possible to assign more than one CPU to one interface device.

    The usage of pktgen is not very user-friendly. The process of generating packets is configured by a configuration file. The best way to create the configuration file is to have a look at the examples [13] and modify them to our needs.

    pktgen can be run simply by executing our configuration file script. The result of the benchmark can be viewed by typing cat /proc/net/pktgen/name_of_the_device into the command line.
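    A minimal configuration script could look like the sketch below. The interface name, destination address and MAC are placeholders; the command names written to /proc (rem_device_all, add_device, count, pkt_size, dst, dst_mac, start) are the standard pktgen control interface, and PGDEV is overridable so the sequence can be exercised against a scratch directory:

```shell
#!/bin/sh
# Minimal pktgen run: one device on CPU thread 0, fixed-size UDP frames.
PGDEV=${PGDEV:-/proc/net/pktgen}

pg() {  # write one pktgen command to a control file
  echo "$2" > "$PGDEV/$1"
}

configure_pktgen() {
  pg kpktgend_0 "rem_device_all"        # detach any previously bound device
  pg kpktgend_0 "add_device eth1"       # bind eth1 to CPU thread 0 (placeholder)
  pg eth1 "count 1000000"               # number of packets to send
  pg eth1 "pkt_size 64"                 # 64-byte frames
  pg eth1 "dst 10.0.0.2"                # destination IP (placeholder)
  pg eth1 "dst_mac 00:11:22:33:44:55"   # destination MAC (placeholder)
  pg pgctrl "start"                     # blocks until generation finishes
}

# Only touch /proc when the pktgen module is actually loaded.
[ -d "$PGDEV" ] && configure_pktgen
```

The results are then read with cat /proc/net/pktgen/eth1, as described above.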

    5.2.5 bridge-utils

    bridge-utils is an administrative package of utilities that allows the user to set up a Linux bridge. This bridge can operate not only as a bridge or switch, but can also perform filtering and shaping of incoming and outgoing traffic.

    To set up a Linux bridge, enable 802.1d Ethernet Bridging under the Networking menu during the kernel compilation. After a successful compilation, load the module with the modprobe or the insmod command. If the loading process was successful, the brctl command should now be functional. There are some important parameters of brctl:


    addbr <name>              Add a bridge with a specified name.

    addif <bridge> <device>   Add a device to a specified bridge.

    stp <bridge> on/off       Turn the Spanning Tree Protocol on/off for a specific bridge.
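    Putting the brctl commands together, a two-port bridge for switching tests can be set up as in the sketch below (eth0/eth1 are placeholder port names; BRCTL and IFCONFIG default to a dry run that just prints the commands, so point them at the real tools on the test machine):

```shell
# Sketch: build a two-port Linux bridge.
BRCTL=${BRCTL:-echo brctl}          # set BRCTL=brctl to execute for real (as root)
IFCONFIG=${IFCONFIG:-echo ifconfig}

make_bridge() {
  $BRCTL addbr br0        # create the bridge
  $BRCTL addif br0 eth0   # attach the first port (placeholder name)
  $BRCTL addif br0 eth1   # attach the second port (placeholder name)
  $BRCTL stp br0 off      # STP is not needed for a two-port benchmark setup
  $IFCONFIG br0 up        # activate the bridge interface
}

make_bridge
```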

    After creating the bridge, the selected devices enter promiscuous mode. In this mode, the selected devices receive all traffic in the network, which can have a severe impact by overloading the CPUs.

    5.3 Benchmarks for the Simple Executive environment

    In this section I will describe available benchmarks and tools that run in the Simple Executive environment. The choice was very limited; in fact, I was narrowed down to using only benchmarks from GE Intelligent Platforms and Cavium Networks.

    Benchmarks for the Simple Executive environment should provide the best possible performance in packet processing, since they are developed specifically for devices with the OCTEON chip. This means that the applications should use all the available hardware accelerators and should be optimized for the architecture used in these devices.

    Another improvement comes from the fact that there are no interruptions and no communication between the application and an operating system. Applications for the Simple Executive run without an operating system such as GNU/Linux. This brings a further performance improvement: there is no need to reserve at least one core for the operating system. Therefore, all the available cores can be assigned to the single running application.

    5.3.1 traffic-gen

    traffic-gen is a benchmark tool for generating packets on a device with the OCTEON chip. traffic-gen can generate packets of all sizes and of both protocols, TCP and UDP. According to the README file, traffic-gen is capable of sustaining a 10 Gbps line rate for all packet sizes. [7]

    Familiarization with traffic-gen was not easy at the beginning. The version included in the SDK from GE (2.0) had some problems with the XAUI/DXAUI interfaces. Fortunately, this bug is fixed in the latest version of the cnusers SDK (2.3 from 8th of March, 2012). With the old version, the application was very unstable; for example, I was able to run it only once in five tries in a row, and its whole behavior was completely random. At first I thought there was a problem with the hardware flow control, which is needed to pace the output, so I tried different terminal emulators and serial cables, but that did not solve the issue. Luckily, with the new version, everything works fine.

    To run traffic-gen, the following steps need to be performed (the serial console output is needed):


    export OCTEON_REMOTE_PROTOCOL=PCI:0
    cd /home/octeon/sdk/OCTEON-SDK/host/remote-utils/
    ./oct-remote-boot --board=W5651X ../../bootloader/u-boot-octeon_w56xx_ram_debug.bin
    ./oct-remote-load 0 ../../examples/traffic-gen/traffic-gen
    ./oct-remote-bootcmd "bootoct 0 coremask=0xfff"

    After loading and booting the ELF image, traffic-gen shows a huge amount of statistics on the serial console output, which can be very confusing. The following commands show only the important statistics:

    row 1 43 on
    row 58 60 off
    row 79 81 on

    traffic-gen has an enormous number of parameters (83!), but only a few of them are important for our generating purposes (the full list is available in the README file [7]):

    tx.size [[<ports>] <size>]        Set the size of the packet, excluding the frame CRC.

    tx.percent [[<ports>] <percent>]  Set the transmit rate as a % of gigabit.

    tx.payload [<ports>] <type>       Set the data type for the payload.

    tx.type [<ports>] <type>          Set the type of the packet.

    start [<ports>] | all             Start transmitting on these ports.

    stop [<ports>] | all              Stop transmitting on these ports.

    hide [<ports>] | all              Hide the statistics for these ports.

    bridge [[<ports>] <dest port>|off]  Bridge incoming packets for a port.

    5.3.2 CrossThru

    CrossThru is developed by GE Intelligent Platforms, Inc. This benchmark application can be used to perform packet bridging and switching. Depending on the performed task, CrossThru can be compiled and run in two different processing modes:

    Fast Forwarding This mode is the simplest one, with as few operations performed on packets as possible. It basically receives packets on one XAUI port and transmits them out through the second XAUI port. This mode should provide the best possible performance.


    Figure 5.1: CrossThru Flowchart. Source: [2]

    Layer 2 Ethernet Switching With this mode, basic Layer 2 switching operations are possible, such as creating and managing a lookup table. When the MAC address is found in the lookup table, the packet is transmitted through the learned port. When the MAC address is not found in the lookup table, the CrossThru application will broadcast the packet, since the destination MAC address is unknown.

    Each of these modes can be further optimized. Without optimizations, synchronous Packet Order/Work (POW) requests and blocking work-receive operations are used. This means that packet input and output operations cannot be performed simultaneously. Also, there is only one queue for packet output, which is shared by all the cores according to the FIFO principle. The available optimization methods are the following:

    The Asynchronous POW method enables receiving incoming packets in the background while processing the previous packet at the same time. This reduces the wait cycles required to get the next packet.


    Non-blocking permits the incoming packets to be stored and processed later.

    Lockless Packet Output (PKO) allows each core to be assigned its own output queue. This means that cores do not have to wait for an available output queue, which speeds up the packet flow. The number of queues can be specified during the compilation process; the numbers are within a range of 1-6 for each interface. With this method, the software needs to tag packets in software because there can be issues with packet ordering.

    The installation process is similar to the installation of the SDK.

    cd /home/octeon/sdk/OCTEON-SDK
    source env-setup OCTEON_CN56XX_PASS2
    cd cav-gefes
    make menuconfig
        Under the SE/Linux Example Applications menu select CrossThru
        Under the CrossThru Options select the desired optimization
        Under the Build Options menu select Basic Embedded Linux Build
    make all

    There is a fatal mistake in the documentation from GE: CrossThru has to be run in the normal U-BOOT mode. With any other mode, CrossThru will fail to load.

    The problem with running CrossThru is that it does not provide any statistics counters by default. When the application runs, there is no output on the serial or PCI console. Slightly more statistics can be obtained by specifying the debug mask:

    ./oct-remote-bootcmd "bootoct 0 coremask=0xfff --debug <mask>"

    where <mask> is a value between 0x0 and 0xFFF (0xFFF means the "output everything" mask; the full list of codes is included in [2]). But we have to be careful with the debug mask: too much debug information has an adverse effect on the processing performance. And even with the debug mask enabled, there is no useful information and there are no counters for our purposes.

    GE includes another debug utility within its SDK called dbgse. This utility can, according to the documentation [1], display: CPU usage, line usage, port packet statistics counters, and the L2 table. These statistics would perfectly fit our needs, but unfortunately I was not able to run the dbgse utility. First, we need to compile the application. The process is the same as for CrossThru, but a different option has to be selected:

    Under the SE/Linux Example Applications menu select SE Debug Linux Utility

    The problem is that during the load phase of dbgse the whole terminal freezes:

    ./oct-remote-load 0 ../../cav-gefes/examples/simple_exec/crossthru/crossthru

    ./oct-remote-bootcmd "bootoct 0 coremask=0xffe"

    ./oct-remote-load 0 ../../linux/kernel_2.6/linux/vmlinux.64

    ./oct-remote-bootcmd "bootoctlinux 0 coremask=0x1 mem=1024@3072M console=pci"
    dbgse


    Now the application returns an error that the npa device could not be found and the application could not be started. I had to add the npa device:

    cat /proc/devices (search for npa, usually it is #251)
    mknod /dev/npa c 251 0
    mkdir /var/log
    modprobe npaDriver
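    The major number is not guaranteed to be 251, so it is safer to take it from /proc/devices. A sketch that only prints the mknod command so it can be checked before being run as root (npa_mknod_cmd is my own helper name, not part of the GE SDK):

    ```shell
    # Derive the npa major number from a /proc/devices-style listing and
    # print the matching mknod command. npa_mknod_cmd is a hypothetical
    # helper, not part of the GE SDK.
    npa_mknod_cmd() {
        # $1: path to the device listing (normally /proc/devices)
        awk '$2 == "npa" { print "mknod /dev/npa c " $1 " 0" }' "$1"
    }

    if [ -r /proc/devices ]; then
        npa_mknod_cmd /proc/devices
    fi
    ```

    On a machine where the npaDriver module registered under a different major number, this prints the corrected mknod line instead of the hard-coded 251.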

    At this point, the terminal freezes. I did the debugging and successfully localized the exact line and instruction where the program freezes, and I reported it as a bug to GE. As a reply I got a message that they do not know why the program freezes at this point, but they will look into it. After this reply, I sent several e-mails asking about my situation, but I did not receive any new updates from GE.

    This fact rules out CrossThru as a fully operational benchmark tool. But I will still try to perform as many tests as possible to get a general idea about the performance of CrossThru.


Chapter 6

    Benchmarking

    In this chapter I will describe how I ran the proposed benchmarks and show the configuration I used. There was also a request for as much automation as possible, so I focused on minimizing the human interaction needed to run the benchmarks. As a result, various script files are presented and included in the appendix. Nevertheless, the automation was not always feasible.

    6.1 iperf

    As I mentioned before, iperf is a benchmark that requires the Linux operating system. Therefore, we need to boot the Linux kernel first. This procedure is described in subsection 4.2.2.1. For better performance, all available cores should be booted up. Thus, the coremask 0xfff is required.

    iperf is not a common part of the Linux kernel, so I had to add it to the filesystem first. There were two options how to do it. The first option was to use aptitude to download the package from the Internet and save it to the chroot environment. But that would require setting up an outside network connection in our laboratory, a proxy, etc. Instead, I used an approach that adds iperf via the downloaded package directly into the root filesystem. This approach is described in subsection 4.2.2.

    For the server I used the command iperf -s -w 64KB, and for the client I used iperf -c 10.101.1.100 -w 64KB -l 64 -P 12 -t 120.

    The value of the window size was chosen based on the results of various tests I performed. The length is a variable from the set defined by RFC 2544 [10]. The -P parameter reflects the fact that there are 12 cores on the chip: each core is assigned one connection run in parallel. Without this parameter I got really poor results, caused by the fact that all the computing was done on one core, and this core cannot handle such a load. The last parameter sets the duration of the test; the default value of 10 seconds seemed too short for trustworthy results.

    With iperf I successfully measured the transmitting and receiving capabilities of the card in user mode. The tests I performed used the TCP and UDP protocols with IPv4 and IPv6. The sizes of packets were in accordance with RFC 2544.


    For transmitting I used WANic 56512 as a generator and another PC as a receiver. For receiving I used the PC as a generator and WANic 56512 as a receiver.

    The level of automation for iperf is pretty high. The user just needs to run the server and then a script. Separate results for each packet size are then saved to a file. TCP and UDP are measured separately. The source code of the scripts is included in the appendix.

    $ ./iperf_script_tcp - for the TCP
    $ ./iperf_script_udp - for the UDP

    6.2 netperf

    This program is very similar to the previous one. Therefore, the steps necessary to perform my measurements are alike. First, I added netperf the same way as I added iperf. Then, I booted the kernel with all available cores.

    I started the server with the command netserver. On the client side I used netperf -H 10.101.1.100 -t TCP_STREAM -- -m 64 -s 64 -l 120 -D. Binding to a specific CPU did not have any effect on the performance, so I omitted this parameter. I also tried other tests, e.g. TCP_SENDFILE and UDP_STREAM.

    I was able to measure the potential of the card for transmitting and receiving. I used both protocols, TCP and UDP, with IPv4 and IPv6, and all the packet sizes defined in RFC 2544.

    In terms of automation I created a script that runs netperf for every packet size and all three tests. The output is then stored to a new file.

    $ ./netperf_script

    6.3 curl-loader

    Unfortunately, I was not able to cross-compile curl-loader for the MIPS architecture. There were some problems with missing libraries, and when I cross-compiled them, they did not work on the MIPS architecture. On the other hand, this is not a big problem, since I was able to measure the transmitting of packets using other benchmarks. And with curl-loader there is nothing except transmitting to measure.

    6.4 pktgen

    I already described what is needed to run pktgen in subsection 5.2.4. As a configuration script I used an example from [13] and modified it to correspond with my configuration. For a multi-core environment I had to add the following for each core:


    PGDEV=/proc/net/pktgen/kpktgend_0
    pgset "add_device xaui0@0"
    ...
    PGDEV=/proc/net/pktgen/xaui0@0
    pgset "pkt_size 60"
    pgset "flag QUEUE_MAP_CPU"
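    The elided per-core stanzas are repetitive, so they can be generated with a small loop. A sketch that only prints the commands (the core count and the xaui0 device name are taken from the fragment above; gen_pktgen_config is my own helper name):

    ```shell
    # Print the per-core pktgen thread configuration for cores 0-11.
    # This is a sketch: it emits the commands instead of applying them,
    # so the output can be reviewed before being fed to pgset.
    gen_pktgen_config() {
        for cpu in $(seq 0 11); do
            echo "PGDEV=/proc/net/pktgen/kpktgend_$cpu"
            echo "pgset \"add_device xaui0@$cpu\""
        done
    }

    gen_pktgen_config
    ```

    The printed pairs match the hand-written stanzas above, one per kernel pktgen thread.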

    Then, I booted the Linux kernel and added the appropriate modules via modprobe. I was able to measure packet generation for each packet size with respect to RFC 2544. For more accurate results I used the internal counters in the switch. Therefore, there is not much to automate. The user just needs to run a script with the correct packet size.

    $ ./pktgen_<size> - where <size> is the size of packets (64, 128, etc.)

    6.5 bridge-utils

    bridge-utils is the only benchmark to measure switching on Linux I could use. It was quite easy to get bridge-utils working. First, I needed to change the kernel config as I mentioned in subsection 5.2.5. After booting the Linux kernel and creating the chroot environment, I could continue with setting up the bridge.

    ifconfig xaui0 0.0.0.0
    ifconfig xaui1 0.0.0.0
    brctl addbr "MYBRIDGE"
    brctl addif MYBRIDGE xaui0
    brctl addif MYBRIDGE xaui1
    ifconfig MYBRIDGE up

    These commands created the bridge from the two interfaces on the card. After that I just needed to generate traffic at a specific speed and with the desired parameters, such as packet size. For this purpose I used the traffic generator created by Moris Bangoura in his thesis [9]. The packets were received on one interface and transmitted on the other. Throughput and frame loss were obtained from internal counters on our H3C switch. For this measurement I used UDP traffic with packet sizes of 64–1518 B.

    Unfortunately, there is no way to automate this measurement, since I used the internal counters in the switch to get the results. The only thing which could be automated is the process of generating the packets. This has already been done by Moris, and I used his scripts.

    6.6 traffic-gen

    traffic-gen is the first program that uses the Simple Executive API. It is a part of the cnusers SDK and contains only the source code. Therefore, I needed to cross-compile it first. The cross-compilation can be done by:


    cd /home/octeon/OCTEON-SDK
    source env-setup OCTEON_CN56XX_PASS2
    cd examples/traffic-gen
    make

    This will generate a binary file which can later be booted as I already showed in subsection 5.3.1. After a successful boot, I needed to enter commands to generate packets. Receiving is done automatically when traffic comes to one of the two interfaces on the card. Bridging can be turned on by the command bridge. All the necessary commands are included in the scripts. Obtaining results is done by reading the values from the serial output.

    With traffic-gen I was able to successfully measure:

    TX, RX, throughput, frame loss with IPv4, TCP, packet sizes 64–1518 B, all cores.
    TX, RX, throughput, frame loss with IPv4, UDP, packet sizes 64–1518 B, all cores.
    TX, RX, throughput, frame loss with IPv6, TCP, packet sizes 64–1518 B, all cores.
    TX, RX, throughput, frame loss with IPv6, UDP, packet sizes 64–1518 B, all cores.

    Since we want to have things automated, I prepared a set of scripts. The user can run a script by typing:

    $ cat traffic_<size>_[v4|v6]_[tcp|udp] > /dev/ttyX

    where <size> is the size of packets (64, 128, etc.) and X is the port of the serial cable. It is obvious that we need to use a serial connection to perform this measurement.

    Unfortunately, this brings some limitations to the automation: there is no way to retrieve the results from the console automatically. This has to be done manually.

    6.7 CrossThru

    CrossThru is a benchmark that can be used for fast forwarding or switching. It also uses the Simple Executive API and runs in the Standalone mode like traffic-gen. CrossThru is a part of the SDK from GE, and I already described how it can be cross-compiled in subsection 5.3.2.

    After loading the image into the memory and booting, the application does nothing. There is no output on the console, and the application waits for incoming traffic. When traffic comes to one of the interfaces, it is forwarded to the other interface. Since the card has only two interfaces, we can talk about fast forwarding.

    As I mentioned in subsection 5.3.2, CrossThru can operate in two modes, and these modes can be further optimized. I tried all four combinations to see which one gives the best results. I again used the generator made by Moris to generate traffic I could send to WANic 56512. I used pktgen to generate UDP traffic with packet sizes of 64–1518 B. The results were obtained from the internal counters of the H3C switch.

    Since I used the internal counters, there is not much to automate. Again, the only thing that could be automated is the generation of packets using pktgen.

Chapter 7

    Analysis of the benchmarks

    7.1 iperf

    As I expected, the obtained results indicate that it is not really worth running unoptimized benchmarks on WANic 56512. The measured values are so low also due to the fact that iperf runs in user-space mode.

    Transmitting and receiving gave me very similar results. However, they do not meet the expectations for 10 GbE networks. Figure 7.1 shows that iperf is not capable of transmitting or receiving packets at wire speed. Figure 7.2 shows that for the UDP protocol I got even worse results. This was caused by a CPU bottleneck, which is clearly visible in the graph. I tried to distribute the load between the CPUs, but it did not have any effect on the results; the CPUs were still used at 100%. Also, there was no difference between IPv4 and IPv6.

    I cannot compare my results with results from Moris or Radim because they did not use iperf. And a comparison between iperf and pktgen would not be useful, since these programs run in different modes.

    7.2 netperf

    With netperf the situation is very similar to the one with iperf. I would not recommend this benchmark for performing measurements on WANic 56512.

    As I mentioned previously, the advantage of netperf should be the ability to send pieces of a file instead of a generated stream. As figure 7.3 shows, I did not get very different results. That was caused by the fact that there were no CPU bottlenecks for the TCP_STREAM measurement, and sending a file instead of a stream should mainly ease the CPU load.

    There was only the UDP_STREAM test available for the UDP protocol. Figure 7.4 shows that this test gave me better results than the TCP protocol. This happened because UDP traffic is easier to generate, transmit and receive than TCP, and also because there were no CPU bottlenecks. IPv4 and IPv6 gave me similar results.

    I have to note that I omitted the receiving part from the graphs, since it was almost identical to transmitting. And again, I cannot compare my results, since Moris and Radim did not use netperf.


    7.3 pktgen

    From the results (figure 7.5) it is apparent that there is some sort of TX bottleneck. It is very hard to identify the cause in this case, since I could not use ethtool to obtain detailed statistics. The processing load was divided among all cores and they were idle most of the time (~92%). It may be caused by the inability to set the affinity and interrupts.

    This tool could be used as a packet generator, but only for larger sizes. It can generate at wire speed for packets of size 1024 B and above.

    If I compare my results with the results from Radim and Moris, I got worse results than they did. Radim can generate packets at wire speed from 128 B and Moris from 64 B packets. Again, I have to point out that they could optimize their network cards for pktgen, which I could not.

    7.4 bridge-utils

    This benchmark gave me very poor results. It can forward packets almost at wire speed for sizes 512 B and 1024 B, and at wire speed for sizes 1280 B and above. Figure 7.6 shows the comparison between the theoretical maximum and my results. As can be seen in figure 7.7, 64 B packets are massively dropped unless the offered load is lowered to 10%.
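    The frame loss rate plotted in these figures follows the RFC 2544 definition: the difference between sent and received frames, as a percentage of the frames sent. A sketch of the calculation from the switch counters (the counter values below are illustrative placeholders, not measured data):

    ```shell
    # RFC 2544 frame loss rate: (sent - received) * 100 / sent.
    # The counts are illustrative placeholders, not values from the H3C switch.
    sent=1000000
    received=100000

    awk -v s="$sent" -v r="$received" \
        'BEGIN { printf "frame loss: %.1f%%\n", (s - r) * 100 / s }'
    # prints: frame loss: 90.0%
    ```

    The same calculation, repeated per frame size and per offered load, is what the frame loss graphs in section 7.7 plot.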

    These unsatisfactory results are caused by the fact that the brctl application is not optimized for WANic 56512. At the same time, I could not perform any optimizations on WANic 56512 as Moris Bangoura did with the Intel cards. The results I obtained are very similar to the results presented by Radim Roška in his thesis. This fact leads me to the conclusion that similar results can be used as a baseline for the brctl benchmark on a 10 GbE NIC when no optimizations are made.

    7.5 traffic-gen

    With this tool I was able to verify that the card is capable of transmitting, receiving and fast-forwarding all packets at wire speed. I write fast-forwarding intentionally, because this program does not build the MAC table that is needed for switching. It just forwards packets from one interface on the card to the other.

    There were no differences between IPv4 and IPv6, or between the TCP and UDP protocols. I even tried different types of payload, such as random data, ascending/descending sequences and text, and none of this had an effect on the overall performance.

    This tool gave me the best results of all the benchmarks, and it also has a wide range of usage: transmitting, receiving and fast-forwarding. I present only one picture for traffic-gen (figure 7.8) because the graphs for TX and RX would look the same, and the frame loss graph would be empty.

    7.6 CrossThru

    I did not get the results I would expect from a benchmark developed by the manufacturer of WANic 56512. In fact, the results were rather poor for an application developed with the Simple Executive API. I can only guess why the results are that bad, since I could not run the diagnostic program for CrossThru. The technical support of GE Intelligent Platforms could not solve my problem with the dbgse utility. My guess is that the application is not yet as fully optimized for Cavium processors as traffic-gen is; it is still version 0.2.

    But even these poor results were better than with brctl. I noticed that there were only slight differences between the Fast Forwarding and the Layer 2 Ethernet Switching mode. Therefore, each of the graphs applies to both modes. But there were visible improvements between the basic and the optimized version.

    Figure 7.9 shows that the optimized mode is capable of switching packets of 256 B and above at wire speed. The basic mode can do the same for packets starting at size 512 B. Figures 7.10 and 7.11 show that 64 B packets are heavily dropped until the offered load is lowered to 30%.

    7.7 Graphs

    In this section I present all the graphs. The unit of frame rate is Packets per Second (pps).
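    For reference, the theoretical wire-speed frame rate at 10 GbE follows from the frame size plus the 20 B of per-frame overhead on the wire (7 B preamble, 1 B start-of-frame delimiter, 12 B inter-frame gap): rate = 10^10 / ((size + 20) * 8). A small sketch computing it for the RFC 2544 sizes:

    ```shell
    # Theoretical maximum frame rate (pps) at 10 Gbit/s for the RFC 2544
    # frame sizes. Each frame carries 20 B of extra overhead on the wire:
    # 7 B preamble + 1 B SFD + 12 B inter-frame gap.
    for size in 64 128 256 512 1024 1280 1518; do
        awk -v s="$size" \
            'BEGIN { printf "%4d B: %d pps\n", s, int(1e10 / ((s + 20) * 8)) }'
    done
    # the 64 B line is: "  64 B: 14880952 pps"
    ```

    The 64 B value, about 14.88 Mpps, is the wire-speed figure the graphs below are measured against.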

    Figure 7.1: TX and RX test - TCP - iperf

    As we can see from the obtained results, WANic 56512 is not a very suitable candidate for use with the Linux operating system. That was to be expected, since there are no mechanisms to optimize Linux programs to run on the card.

    But the results also show that WANic 56512 is capable of transmitting, receiving and fast-forwarding 64 B packets at wire speed. And I believe that the card has great potential if the right, optimized software is run on it. If we had more money to buy the software toolkits from Cavium Networks [6], we could run more tests on the card. There are toolkits available for TCP/IP, IPsec, protocol analysis, IDS/IPS, etc. It is clear that using the card with software that has no optimization for WANic 56512 or the OCTEON chip really wastes the potential of the card.


    Figure 7.2: TX test - UDP - iperf

    Figure 7.3: TCP_STREAM and TCP_SENDFILE test - netperf


    Figure 7.4: UDP_STREAM test - netperf

    Figure 7.5: TX test - pktgen


    Figure 7.6: RFC 2544 - throughput test - brctl

    Figure 7.7: RFC 2544 - frame loss test - brctl


    Figure 7.8: RFC 2544 - throughput test - traffic-gen

    Figure 7.9: RFC 2544 - throughput test - CrossThru FF+L2


    Figure 7.10: RFC 2544 - frame loss test - CrossThru FF+L2 basic

    Figure 7.11: RFC 2544 - frame loss test - CrossThru FF+L2 optimized

Chapter 8

    Conclusion

    All the requested tasks were successfully accomplished. The installation guide provides in detail all the necessary steps to install WANic 56512 into the PC in our laboratory. This will help to significantly reduce the time required to deploy the card in future work. The guide also shows how to install additional software for the card, which will be needed in future development. A part of this thesis then covers the basics of packet flow and the hardware acceleration units, which are needed to understand how the accelerators work.

    A varied set of benchmarks was proposed in this thesis to test multiple parameters of WANic 56512. Although the results were not completely satisfactory, they helped me come to several conclusions. The first conclusion is that WANic 56512 is not built to work with benchmarks or programs that are written for the Linux operating system. This is due to the fact that these programs cannot use the hardware acceleration units included on the card. Therefore, and this is the second conclusion, we have to buy or develop programs for WANic 56512 in particular. I tried to get such software for our university. I asked Cavium Networks whether they could give us the software they used to get the results presented in their product brief [4]. Unfortunately, the marketing team said that they do not provide free support; it is not in accordance with their business model. And the results presented in the product brief cannot be obtained unless we buy additional software from them [6].

    Still, the mentioned benchmarks gave me a basic idea of what WANic 56512 is capable of. The card can at least generate, transmit, receive and fast-forward 64 B packets at wire speed, which is about 14.8 Mpps. But to achieve such results, optimized software needs to be run.

    In terms of the comparison with the results from Radim Roška and Moris Bangoura, WANic 56512 lies somewhere in between. The card gave me better results than Roška presents in his thesis and results similar to those Bangoura got with the use of GPUs. But Moris went even further in his measurements; he was able to perform routing and firewalling on his prototype. I could not perform such tests because we do not have the required software for these tasks. We also have to consider the price of the proposed solutions. WANic 56512 was 129,000 Kč and the SDK was about 50,000 Kč. On the other hand, Moris needed two Intel 10 GbE network cards for 15,000 Kč each and two graphics cards, also 15,000 Kč each.


    8.1 Future work

    I think that there could be many follow-up projects with this card. But first, we have to buy the SDK from Cavium Networks to be able to develop our own programs for WANic 56512 and the OCTEON chip. Right now, we do not have the required documentation about all the Simple Executive functions that are required for the development of new programs.

    One of the projects could be a cooperation with GE Intelligent Platforms to help them develop and optimize the CrossThru application. This could result in an application capable of switching 64 B packets at wire speed. Then, we could develop a program for routing and firewalling, possibly getting even better results than the prototype made by Moris. And last but not least, we could use WANic 56512 for deep packet inspection and as a real-time packet analyzer.

Bibliography

    [1] User's Guide: Debug Simple Executive Utility for Linux, 2011.

    [2] User's Guide: OCTEON CrossThru Application, 2011.

    [3] Cavium Networks - Products > OCTEON Plus MIPS64 Processors > Silicon [online]. 2012. [cit. 8. 12. 2012].

    [4] OCTEON® Plus CN56XX 8 to 12-Core MIPS64-Based SoCs [online]. 2011. [cit. 9. 12. 2012].

    [5] Reference Manual WANic*-56512 Packet Processor, 2011.

    [6] CSS Software Toolkits [online]. 2011. [cit. 9. 12. 2012].

    [7] Octeon Simple Executive based Traffic Generator, 2012.

    [8] WANic 56512 Packet Processor [online]. 2010. [cit. 8. 12. 2012].

    [9] BANGOURA, M. 10GbE Routing on PC with GNU/Linux [online]. 2012.

    [10] BRADNER, S.; MCQUAID, J. Benchmarking Methodology for Network Interconnect Devices. RFC 2544 (Informational), March 1999. Updated by RFC 6201.

    [11] CURTIS, J. OCTEON® Programmer's Guide, 2010.

    [12] MENG, J. et al. Towards high-performance IPsec on Cavium OCTEON platform. In Proceedings of the Second International Conference on Trusted Systems, INTRUST '10, pp. 37–46, Berlin, Heidelberg, 2011. Springer-Verlag. doi: 10.1007/978-3-642-25283-9_3. ISBN 978-3-642-25282-2.

    [13] OLSSON, R. pktgen examples [online]. 2008. [cit. 9. 12. 2012].

    [14] OLSSON, R. [PATCH] pktgen: multiqueue etc. [online]. 2008. [cit. 9. 12. 2012].


    [15] ROHRBACHER, M. Measurement of throughput of Ethernet Cards on Telum NPA-5854 device [online]. 2010.

    [16] ROŠKA, R. Performance evaluation of GNU/Linux network bridge [online]. 2011.

    [17] Wikipedia contributors. MIPS Architecture (Pipelined) [online]. 2009. [cit. 9. 12. 2012].

Appendix A

    Scripts

    In this appendix I will show examples of the scripts I made for automation.

    A.1 iperf

    SIZE=(64 128 256 512 1024 1280 1518)
    HOST=10.101.1.100

    for size in ${SIZE[*]} ; do
        echo $size
        iperf -c $HOST -w 64KB -l $size -f b -P 12 -t 120 \
            | grep SUM | cut -d" " -f12 | tee -a iperf_tcp

    done

    A.2 netperf

    SIZE=(64 128 256 512 1024 1280 1518)
    HOST=10.101.1.100

    echo TCP_STREAM > netperf_tcp

    for size in ${SIZE[*]} ; do

    echo $size
    netperf -H $HOST -t TCP_STREAM -- -m $size \
        -s $size -D | grep $size | tee -a netperf_tcp

    done

    echo TCP_SENDFILE >> netperf_tcp

    for size in ${SIZE[*]} ; do

    echo $size
    netperf -H $HOST -t TCP_SENDFILE -F big_file.iso -- \
        -m $size -s $size -D | grep $size | tee -a netperf_tcp


    done

    echo UDP_STREAM > netperf_udp

    for size in ${SIZE[*]} ; do

    echo $size
    netperf -H $HOST -t UDP_STREAM -- -m $size \
        -s $size | grep $size | tee -a netperf_udp

    done

    A.3 pktgen

    #! /bin/bash

    #rmmod pktgen
    modprobe pktgen

    # PACKET SIZE - NIC adds 4 bytes CRC
    PKT_SIZE="pkt_size 60"

    COUNT="count 0"   # pkts to send, 0 is infinity
    DELAY="delay 0"   # delay 0 means maximum speed

    # thread config
    SIRQ=1000

    CLONE_SKB="clone_skb 8"

    function pgset() {
        local result

        echo $1 > $PGDEV
        result=`cat $PGDEV | fgrep "Result: OK:"`
        if [ "$result" = "" ]; then
            cat $PGDEV | fgrep Result:
        fi
    }

    function pg() {
        echo inject > $PGDEV
        cat $PGDEV
    }


    # Config Start Here ---------------------------------------------------

    PGDEV=/proc/net/pktgen/kpktgend_0
    pgset "rem_device_all"
    pgset "add_device xaui0@0"
    pgset "max_before_softirq $SIRQ"

    PGDEV=/proc/net/pktgen/kpktgend_1
    pgset "rem_device_all"
    pgset "add_device xaui0@1"
    pgset "max_before_softirq $SIRQ"

    ...

    PGDEV=/proc/net/pktgen/kpktgend_11
    pgset "rem_device_all"
    pgset "add_device xaui0@11"
    pgset "max_before_softirq $SIRQ"

    PGDEV=/proc/net/pktgen/xaui0@0
    echo "Configuring $PGDEV"
    pgset "$COUNT"
    pgset "$CLONE_SKB"
    pgset "$PKT_SIZE"
    pgset "$DELAY"
    pgset "dst_min 10.0.2.100"
    pgset "dst_max 10.0.2.149"
    pgset "udp_dst_min 1"
    pgs