
Performance Modeling and Evaluation of

Network Processors

A Thesis

Submitted For the Degree of

Master of Science (Engineering)

in the Faculty of Engineering

by

Govind S.

Supercomputer Education and Research Centre

Indian Institute of Science

Bangalore – 560 012

DECEMBER 2006


To my mother and father


Acknowledgments

I express my sincere gratitude towards Prof. R. Govindarajan, my research supervisor, for his invaluable support and guidance. I sincerely thank him for being very patient and for being available despite his busy schedule. I am also extremely grateful to Prof. Joy Kuri for providing invaluable guidance and suggestions, especially on the networking part of this thesis. I am also thankful to Prof. W. M. Zuberek, Professor at the University of Newfoundland, Canada, for allowing me to use CNET. I also thank the Chairman of SERC for providing the excellent laboratory facilities and a wonderful atmosphere in the department.

I also thank:

Kaushik Rajan for being involved in a large number of interesting discussions ranging from network processors to controversial cricket and uncontroversial tennis greats. Shyam for providing subsidized, and in most cases free, weekend lunch and dinner. Gvsk for helping me get familiarized with CNET and for tolerating our eccentricities. Rajesh for allowing me to work on "my" machine. Manikantan for wonderful discussions ranging over cricket, tennis, politics, and football. The Soccerpulse community for sharing the wonderful soccer videos. The organizers of the FIFA World Cup, the Champions League, and the EURO for providing the "Jogo Bonito" moments.

I would like to thank Him, The Almighty, for providing me with this rare opportunity. Last but not the least, I thank all my family members for providing moral support and for being patient through the course of this work.


Abstract

In recent years there has been an exponential growth in Internet traffic, resulting in increased network bandwidth requirements which, in turn, have led to stringent processing requirements on network layer devices like routers. Present backbone routers on OC-48 links (2.5 Gbps) have to process four million minimum-sized packets per second. Further, the functionality supported in network devices is also on the increase, leading to programmable processors such as Intel's IXP, Motorola's C-5, and IBM's PowerNP. These processors employ multiple processing engines and multiple threads to exploit the packet-level parallelism inherent in network workloads.

This thesis studies the performance of network processors. We develop a Petri net model for commercial network processors (Intel IXP 2400/2850) for three different applications, viz. IPv4 forwarding, Network Address Translation, and IP security protocols. A salient feature of the Petri net model is its ability to model the application, the architecture, and their interaction in great detail. The model is validated using the Intel proprietary tool (SDK 3.51 for the IXP architecture) over a range of configurations. Our performance evaluation results indicate that:

1. The IXP processor is able to support a throughput of 2.5 Gbps for all modeled applications.

2. Packet buffer memory (DRAM) is the bottleneck resource in a network processor, and even multithreading is ineffective beyond a total of 16 threads for header processing applications and beyond 32 threads for payload processing applications.


Since DRAM is the bottleneck resource, we explore the benefits of increasing the number of DRAM banks and software schemes such as offloading the packet header to SRAM.

The second part of the thesis studies the impact of parallel processing in a network processor on packet reordering and retransmission. Our results indicate that the concurrent processing of packets in a network processor and the buffer allocation scheme in the TFIFO lead to significant packet reordering (61%) on a 10-hop network (with packet sizes of 64 B), which in turn leads to 76% retransmission under the TCP fast-retransmission algorithm. We explore different transmit buffer allocation schemes, namely contiguous, strided, local, and global, which reduce packet retransmission to 24%. Our performance results also indicate that limiting the number of microengines can reduce the extent of packet reordering while providing the same throughput. We propose an alternative scheme, Packetsort, which guarantees complete packet ordering while achieving a throughput of 2.5 Gbps. Further, we observe that Packetsort outperforms, by up to 35%, the built-in schemes in the IXP processor, namely Inter Thread Signaling (ITS) and Asynchronous Insert and Synchronous Remove (AISR).

The final part of this thesis investigates the performance of the network processor under bursty traffic. We model bursty traffic using a Pareto distribution. We consider parallel and pipelined buffering schemes and their impact on packet drop under bursty traffic. Our results indicate that the pipelined buffering scheme outperforms the parallel scheme.


Contents

Acknowledgments i
Abstract ii

1 Introduction 1
1.1 Network Processors 1
1.2 Our Contribution 3
1.2.1 Performance Evaluation and Architecture Exploration 3
1.2.2 Impact of Packet Reordering 4
1.2.3 Performance under Bursty Traffic 6
1.3 Organization of the Thesis 6

2 Background 8
2.1 Network Processors: An overview 8
2.1.1 IXP Architecture 9
2.1.2 Motorola C-5 Processor 13
2.1.3 IBM PowerNP Network Processor 14
2.2 Network Applications 14
2.2.1 IP Forwarding 15
2.2.2 Network Address Translation 15
2.2.3 IP Security 16
2.3 Petri Nets: An Introduction 16

3 Performance Modeling and Evaluation 20
3.1 Introduction 20
3.2 A Single Microengine Petri Net Model 21
3.2.1 Multiple Microengine Petri Net Model 24
3.2.2 Memory Modeling 24
3.3 Performance Evaluation of IXP 26
3.3.1 Simulation Methodology 27
3.3.2 Validation Results 28
3.3.3 Throughput 31
3.3.4 Architecture Exploration 35
3.3.5 Summary 40

4 Packet Reordering in Network Processors 43
4.1 Introduction 43
4.2 Packet Reordering 44
4.2.1 Reordering in Network Processors 45
4.2.2 Transmit Buffer Induced Reordering 46
4.2.3 Packet Ordering Mechanisms in IXP 47
4.2.4 Performance Metric 50
4.3 Packet Reordering in IXP 50
4.3.1 Petri Net Model 50
4.3.2 Validation 52
4.3.3 Performance Results 53
4.4 Reducing Packet Reordering 54
4.4.1 Buffer Allocation Schemes 54
4.4.2 Tuning Architecture Parameters 58
4.4.3 Packet Sort: An Alternative Scheme 60
4.5 Summary 64

5 Performance Analysis of Network Processor in Bursty Traffic 66
5.1 Motivation 67
5.2 Generation of Bursty Traffic 68
5.3 Petri Net Model of the Traffic Generator 70
5.4 Packet Buffering Schemes 71
5.5 Results 72
5.5.1 Impact of Packet Buffering 72
5.6 Summary 75

6 Related Work 77
6.1 Network Processor Performance Evaluation 77
6.2 Packet Reordering in Network Processors 79

7 Conclusions 82
7.1 Summary 82
7.2 Future Directions 85

Bibliography 86


List of Figures

2.1 Internal IXP 2400 Architecture 9
2.2 Microengine - Memory Unit Interface in IXP 2400 12
2.3 Packet Flow in the IXP 2400 13
2.4 Petri Net Example 18

3.1 Petri Net Model for a Single Microengine in IXP 2400 Running IPv4 Application 22
3.2 Petri Net Model for Memory Access in DDR DRAM 25
3.3 Petri Net Model for Memory Access in Rambus DRAM 26
3.4 Transmit Rates from PN and SDK Simulations 30
3.5 Microengine Utilization from PN and SDK Simulations 31
3.6 DRAM Utilization for Different Bank Probabilities 32
3.7 Average Microengine Queue Length for Different Bank Probabilities 34
3.8 Impact of Number of DRAM Banks 36
3.9 Impact of Number of Hash Units 37
3.10 Performance Enhancements from Storing Packet Header in SRAM for IP4 38
3.11 Performance Enhancements from Storing Packet Header in SRAM for NAT 39
3.12 Impact of Limiting Pending DRAM Accesses per Microengine 40

4.1 Packet Reordering in Network Processors 44
4.2 Transmit Buffer Reordering 46
4.3 Inter Thread Signaling in the IXP 48
4.4 Asynchronous Insert Synchronous Reset (AISR) in the IXP 49
4.5 Simulated Network Topology 51
4.6 Packet Reordering in NP 53
4.7 Different Transmit Buffer Allocation Schemes 54
4.8 Impact of Various Buffer Allocation Schemes (64B Packet Size) - CNET Result 56
4.9 Impact of Various Buffer Allocation Schemes (512B Packet Size) - CNET Result 56
4.10 Impact of Number of Microengines (64B Packet Size) - CNET Result 58
4.11 Impact of Number of Microengines (512B Packet Size) - CNET Result 59
4.12 Impact of Number of Threads (64B Packet Size) - CNET Result 60
4.13 Impact of Number of Threads (512B Packet Size) - CNET Result 61
4.14 Packet Sort Implementation in the IXP 61

5.1 Packet Arrival in NP 68
5.2 Bursty Traffic Generation 69
5.3 Petri Net Model of Traffic Generator 70
5.4 Pipelined Buffering Scheme 71
5.5 Bursty Traffic Generated using 48 Sources 73


List of Tables

3.1 Model parameters used in the Petri net model 28
3.2 Time Average DRAM Queue Length and Stall Percentage 32

4.1 Petri Net Model Validation 52
4.2 Impact of Buffer Allocation Schemes on Throughput 57
4.3 Transmit Rates for Different Number of Threads 59
4.4 Comparison of Various Schemes to Overcome Reordering 62

5.1 Output Line Rates Supported with Input Rate of 1.7 Gbps 74
5.2 Output Line Rates Supported with Input Rate of 3.14 Gbps 75
5.3 Output Line Rates Supported with Input Rate of 6 Gbps 75
5.4 Maximum Output Rates for Different Packet Buffering Schemes 75


Chapter 1

Introduction

1.1 Network Processors

In recent years there has been an exponential growth in Internet traffic, leading to increasing network bandwidth requirements. For example, present backbone routers on OC-48 links (2.5 Gbps) have to process four million minimum-sized packets per second. Further, the applications are also changing, with VOIP (Voice over Internet Protocol) and P2P (Peer to Peer) applications gaining increasing popularity. Moreover, with the advent of IPSec [21] there is a need to support encryption/decryption at the network layer. These growing functionalities at the network layer and the increasing bandwidth requirements have resulted in application-specific devices being deployed at the network layer. Network processors [14, 15, 26, 13] are application-specific processors that are specialized to perform network-layer functionalities like IPv4 forwarding [2] and NAT [34]. These processors perform key computational functions like encryption/decryption in hardware. However, these processors, unlike ASICs, are programmable and hence are easily adaptable to changing network standards and applications. This helps in reducing the Time to Market (TTM) compared to ASICs. As a result, a number of network processors have proliferated in the market recently [14, 15, 26, 13].


Commercial network processors [13] [14] are store-and-forward architectures that buffer incoming packets in a buffer memory (usually the DRAM), process the packets, and forward them to the corresponding output port. Networking applications exhibit packet-level parallelism, where the processing of different packets is independent. Network processors exploit this characteristic of applications by employing multiple processors to process packets. Further, they employ hardware-level multithreading to mask the latencies in accessing memory or other application-specific functional units. Hence network processors support multiple threads and have hardware support for low-overhead context switching. For example, the Intel IXP 2400 processor [14] [15] uses a total of 64 threads (8 threads per microengine) to process packets. This enables modern routers to support OC-48 and higher line rates.

The performance evaluation of network processors is complex due to the interaction among multithreading, multiple processors, complex memory structures with varying access times, and application-specific functional units. Earlier work on the performance evaluation of network processors uses either standard queuing models [9], other analytical models [1], or simulation-based approaches [29]. However, many of these works assume that packets are already buffered in the DRAM and do not model the flow of packets in and out of DRAM. Since DRAM is a critical resource that can potentially affect the performance of NPs, due to the high latency involved in an access, not modeling the flow of packets in and out of DRAM affects the accuracy of these studies. We address this problem by developing a detailed Petri net model for network processing applications running on NPs.

Network processors use multiple threads and microengines to exploit packet-level parallelism in network applications. While this can improve the throughput or performance of NPs, it can also adversely affect the packet order at the output of the network processor. Earlier works have shown the adverse impact of packet reordering on TCP throughput. However, earlier work on the performance evaluation of network processors [9, 29] does not study this issue. Similarly, earlier work on packet reordering [5, 24, 19] studies the impact of packet reordering/retransmission on TCP throughput but does not consider the impact of the network processor architecture on packet reordering/retransmission.

1.2 Our Contribution

1.2.1 Performance Evaluation and Architecture Exploration.

In the first part of this thesis we develop a Petri net model of the IXP 2400/2800 processor running three different network applications. The model captures the packet flow in detail, from the Receive FIFO to the Transmit FIFO. The main feature of the model, unlike other Petri net models [11] [33], is its ability to model the processor, the application, and their interaction in sufficient detail. The Petri net model is different for different applications. We consider header processing applications (HPA) such as IPv4 and NAT, and payload processing applications (PPA) such as the IPSec protocols. The IXP processor is able to achieve a throughput of 2.96 Gbps for HPA and 3.6 Gbps in case of PPA applications. The Petri net model thus developed is validated using the Intel simulator for the IXP family of processors. Our performance results indicate that, under a Poisson packet arrival process with minimum-sized packets, the DRAM memory used for packet buffering is the bottleneck. Our study also shows that multithreading is effective only up to a certain number of threads. Beyond this threshold the packet buffer memory (DRAM) is fully utilized and multithreading is not beneficial. Since the transmit rate is limited by the packet buffer memory utilization, we investigate the following approaches to reduce the memory utilization.

• In the IXP processor, although the DRAM is utilized 100%, the SRAM is utilized only up to 27%; hence we explore placing the packet header in SRAM and the packet payload in DRAM. This scheme improves the throughput by up to 20%.

• Increasing the number of DRAM banks from 4 to 8 improves the throughput by up to 3.6 Gbps. However, when the number of banks is 8, the hash unit, a task-specific unit used for performing hardware lookup, becomes the bottleneck. Increasing the number of hash units from 1 to 2 gives an improvement in throughput of up to 60% as compared to the base case. We further observe that even with fewer microengines (4 MEs) and two hash units a similar performance can be sustained.

• When the number of outstanding memory requests in the IXP processor exceeds a threshold, all microengines with memory requests at the head of their command FIFO are stalled. Instead, if the number of pending memory requests from each microengine is limited, an improvement in transmit rate by up to 4.1 Gbps can be achieved.

1.2.2 Impact of Packet Reordering

In the second part of the thesis we analyze the impact of the network processor architecture on packet reordering. Our study indicates that in addition to the concurrent processing in the network processor, the allocation of the transmit buffer also adversely impacts packet ordering. Our results indicate that concurrent processing and naive buffer allocation can result in 31% packet reordering which, in turn, results in 6% retransmission of packets in a single hop for the IPv4 application. The reordering and retransmission rates are measured as the potential number of ACK replies received by the sender. We observe that the reordering and retransmission rates increase with the number of hops, resulting in up to 61% retransmission in case of 10 hops.

We explore different transmit buffer allocation schemes, namely contiguous, strided, local, and global allocation. In global buffer allocation, threads from different microengines compete for the same transmit buffer space. Hence, it involves a mutual exclusion operation across all the microengines and threads. This reduces the retransmission rate in a 10-hop network to 33%, but also drastically reduces the throughput to 1 Gbps, which is unacceptably low. Hence we explore a buffer allocation scheme where only threads from the same microengine compete for a common buffer space. This scheme is called local buffer allocation. Here the mutual exclusion operation is limited to threads within a microengine. This scheme results in a retransmission rate of 45% but with a transmit rate of 2.1 Gbps. In strided buffer allocation, the transmit buffer space is allocated completely statically, such that each microengine writes into transmit buffer locations that are apart by a constant stride. This eliminates the mutual exclusion, and a throughput of 2.96 Gbps is obtained. However, the retransmission is high, at 56% on a 10-hop network. Since a packet traverses 16 hops on average in the Internet [28], the observed retransmission rates can significantly affect network performance.

Our results also indicate that the parallel architecture of the network processor can severely impact reordering and can cause up to 61% retransmission in a 10-hop scenario. Since our performance study in the first part of the thesis reveals that decreasing the number of microengines from 8 to 4 while keeping the number of threads at 16 does not degrade the performance, we study the impact of packet reordering on a network processor with a fewer number of microengines. The retransmission rate reduces from 61%, for a network processor with 8 microengines and 8 threads, to 19% for a network processor with 2 microengines and 8 threads or 4 microengines and 4 threads. This is achieved without sacrificing the throughput (2.96 Gbps). This is because the throughput of the network processor saturates beyond a total of 16 threads. Further, to reduce retransmission rates we propose a scheme, Packet sort, in which a few microengines/threads are dedicated to sorting the packets into order. We compare the performance of Packet sort with the existing in-order schemes in the IXP, namely Inter Thread Signaling (ITS) and Asynchronous Insert Synchronous Reset (AISR). Packet sort achieves a throughput of 2.3 Gbps and performs better than ITS and AISR, which achieve throughputs of 2.1 and 1.1 Gbps respectively.
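To make the contrast between the global and strided allocation schemes concrete, the sketch below shows a lock-protected global allocator next to a lock-free strided allocator in which each microengine owns a statically interleaved set of transmit-buffer slots. The class names, slot count, and slot arithmetic are illustrative assumptions for this sketch, not the IXP data structures.

```python
import threading

TBUF_SLOTS = 16  # assumed transmit-buffer size, for illustration only

class GlobalAllocator:
    """All microengines/threads share one slot counter guarded by a single lock,
    mirroring the mutual exclusion of the global scheme."""
    def __init__(self, slots=TBUF_SLOTS):
        self.lock = threading.Lock()
        self.next_slot = 0
        self.slots = slots

    def alloc(self):
        with self.lock:                       # mutual exclusion across all MEs
            slot = self.next_slot
            self.next_slot = (self.next_slot + 1) % self.slots
            return slot

class StridedAllocator:
    """Each microengine writes only to slots me_id, me_id + stride, ...;
    the partition is static, so no locking is required."""
    def __init__(self, me_id, num_mes, slots=TBUF_SLOTS):
        self.me_id = me_id
        self.stride = num_mes
        self.count = 0
        self.slots = slots

    def alloc(self):
        slot = (self.me_id + self.count * self.stride) % self.slots
        self.count += 1
        return slot

# Example: microengine 2 of 8 cycles through slots 2 and 10 in a 16-entry buffer.
me2 = StridedAllocator(me_id=2, num_mes=8)
print([me2.alloc() for _ in range(4)])        # [2, 10, 2, 10]
```

The trade-off summarized above falls out directly: the global version serializes every allocation, while the strided version never blocks but fixes each packet's transmit slot purely by which microengine handled it, which is what permits reordering.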

1.2.3 Performance under Bursty Traffic

The final part of this thesis investigates the performance of the network processor in a bursty traffic scenario. Earlier works on the performance evaluation of network processors [29] [9] evaluate the performance under a Poisson packet arrival process in a DoS attack scenario. However, earlier work [8] on traffic characterization indicates that, on average, only 10% of the traffic is due to DoS attacks. This work studies the performance of the network processor under bursty traffic. We model bursty traffic using a Pareto distribution. Further, we explore various packet buffering schemes. In particular, we consider parallel and pipelined packet flow architectures. Our results indicate that the parallel scheme incurs considerable packet drop and results in a lower throughput. In contrast, the pipelined scheme results in a higher throughput and a lower packet drop.
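As a rough illustration of how Pareto-distributed on/off sources yield bursty aggregate traffic, the sketch below draws burst and idle durations from a Pareto distribution and superimposes several sources. The shape parameters, number of sources, and time step are illustrative placeholders, not the values used in Chapter 5.

```python
import random

def pareto(alpha, xm):
    """Pareto-distributed sample with shape alpha and minimum value xm."""
    u = 1.0 - random.random()          # u in (0, 1]
    return xm / (u ** (1.0 / alpha))

def on_off_source(alpha_on=1.4, alpha_off=1.2, xm=1.0, horizon=1000.0):
    """Yield (start, end) intervals during which one source is sending packets."""
    t = 0.0
    while t < horizon:
        on = pareto(alpha_on, xm)      # burst (ON) duration
        off = pareto(alpha_off, xm)    # idle (OFF) duration
        yield t, min(t + on, horizon)
        t += on + off

def aggregate_load(num_sources=48, horizon=1000.0, step=1.0):
    """Count how many sources are ON in each time slot; bursty input shows
    slots far above the mean."""
    load = [0] * int(horizon / step)
    for _ in range(num_sources):
        for start, end in on_off_source(horizon=horizon):
            for slot in range(int(start / step), int(end / step)):
                load[slot] += 1
    return load

if __name__ == "__main__":
    load = aggregate_load()
    print("peak active sources:", max(load), "mean:", sum(load) / len(load))
```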

1.3 Organization of the Thesis

The rest of the thesis is organized as follows. The following chapter presents the IXP architecture in detail and also discusses the different network applications used in our study. This chapter also provides an introduction to Petri nets. Chapter 3 presents the Petri net model of the network processor and the model for a memory access. This chapter also presents the validation results of the Petri net model and the architecture exploration of the network processor. Chapter 4 studies packet reordering in network processors. This chapter also presents different ways to reduce packet reordering. In Chapter 5 we evaluate the network processor in a bursty traffic scenario and also evaluate the performance of different packet buffering schemes. Chapter 6 discusses the related work in this area. We present our conclusions and directions for future work in Chapter 7.


Chapter 2

Background

In this chapter we provide the necessary background on:

1. the network processor in general and the Intel IXP processor in particular,

2. the network applications that are used in this study, and

3. Petri nets.

This chapter is organized as follows. Section 2.1 provides an overview of different network processors and describes the IXP 2400 architecture and the packet flow in the IXP processor. Section 2.2 presents the different applications that run on the IXP processor. Section 2.3 provides the necessary background on Petri nets.

2.1 Network Processors: An overview

Commercial network processors [13] [14] are store-and-forward architectures that buffer incoming packets in a buffer memory (usually the DRAM), process the packets, and forward them to the corresponding output port. This section provides an architecture overview of the IXP 2400 processor [14] and also provides a brief overview of the Motorola C-5 [26] and IBM PowerNP [13] processors.


2.1.1 IXP Architecture

IXP processors [15] are multithreaded multiprocessor architectures which are typically employed in backbone routers.

Figure 2.1 Internal IXP 2400 Architecture.

The architecture of the IXP 2400 processor, depicted in Figure 2.1, consists of an XScale core, eight microengines, and application-specific hardware units like the hash and crypto units.

The XScale is a 32-bit RISC processor used to handle control and management plane functions [38], such as routing table updates and loading the microengine instructions. The XScale initializes and manages the chip and also handles exceptions. The IXP 2400 contains eight microengines, each running at 600 MHz. The microengines are specialized to perform network processing. Each microengine contains eight hardware contexts, for a total of 64 threads, and there is no context switch overhead. There are 256 programmable General Purpose Registers in each microengine, equally shared between the eight threads. Further, there is a 4K instruction store associated with each microengine.

The memory architecture of the IXP processor consists of SRAM, DRAM, scratchpad memory, and local memory. Typically, packets are buffered in DRAM while the SRAM stores state information like the routing table and the NAT table. The IXP 2400 supports DRAM and SRAM sizes of up to 512 MB and 8 MB respectively. The RAMs are off chip and communicate with the processor over a high-speed data path with a bandwidth of 6.4 Gbps (for the IXP 2400). The scratchpad is used for communication between the different microengines, for instance for mutex variables. The scratchpad is a low-latency, on-chip memory. Additionally, each microengine contains 640 words of local memory which is used for communication between hardware contexts. The IXP 2400 and 2800 also provide Next Neighbour registers which are used for communication between adjacent microengines.

The off-chip memories, namely the SRAM and DRAM, are accessed through memory controllers which are resident on the IXP chip. There are independent controllers for the SRAM and DRAM memories. A thread requesting a memory access enqueues its request in the corresponding memory controller. The controller sends the request, i.e., the memory address, to the SRAM/DRAM and sends/receives data through the external data bus, which has a bandwidth of 6.4 Gbps. The memory controllers form the interface between the microengines and the memory.

The IXP chip contains task-specific functional units, a hash unit (in the IXP 2400 and 2850) and crypto units (in the IXP 2850), accessible by all the microengines. The hash unit can be used for a hash-based destination address lookup [29]. The IXP 2850 contains two crypto units which implement the 3DES, AES, and SHA-1 algorithms in hardware. When a thread in an ME requests a hash/crypto computation, a context switch occurs.

The IXP chip has a pair of FIFOs, the Receive FIFO and the Transmit FIFO, each of size 8 KB, used to receive/send packets from/to the network ports. A data path exists between the microengines and the DRAM to the FIFOs. A packet, which gets buffered in the RFIFO, is moved to the DRAM through this data path. Similarly, packets are moved out of the DRAM to the TFIFO through this data path.

Other processors in the IXP family include the IXP 1200, 2800, and 2850. These processors have a structure similar to the IXP 2400 but differ in the number of MEs and specialized hardware units. The IXP 1200 contains six MEs with four threads per ME. The IXP 2800 has sixteen MEs, with each ME supporting eight threads. The IXP 2850 is similar to the IXP 2800 but additionally has a crypto unit which implements cryptographic algorithms in hardware.

The IXP processor uses a unique mechanism to access memory (SRAM and DRAM). A detailed understanding of this interaction is needed because the memory latency, as will be discussed in the subsequent chapter, limits the network processor throughput.

2.1.1.1 Microengine-Memory Unit Interaction

When a thread in a microengine requests a memory access or an access to the hash or crypto unit, it places an appropriate request in the respective microengine command queue (MEQ) (refer to Figure 2.2). A maximum of four outstanding requests can be placed in a single microengine command queue. A common command bus arbiter moves requests from the microengine command queues to the respective queues of the task units or memory units.

Figure 2.2 Microengine - Memory Unit Interface in IXP 2400.

Requests in the task units are processed in FIFO order. In the case of memory (DRAM), memory accesses to the same bank are processed in FIFO order. Each unit allows a maximum of sixteen requests to be placed in its queue. If any of the memory/task queues fills to a threshold level (a queue length of 10), the corresponding unit (memory/task unit) applies a back-pressure mechanism on the command bus arbiter [14]. This prevents further issue of requests from all microengines that have a request of this type at the head of their command queue. This consequently can fill the microengine command queue (MEQ). In this scenario a thread in a microengine attempting to place a request in the command queue (MEQ) stalls, since the queue is full. Our performance results indicate that these stalls result in a significant wastage of microengine clock cycles, since other threads waiting to execute on the microengine are prevented from doing so.
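The queueing and back-pressure rule above can be paraphrased as a small arbitration sketch. The 4-entry MEQ, 16-entry unit queue, and back-pressure threshold of 10 follow the description in this subsection; the class and method names are illustrative, not IXP microcode.

```python
from collections import deque

MEQ_DEPTH = 4          # outstanding requests per microengine command queue
UNIT_Q_DEPTH = 16      # requests a memory/task unit queue can hold
BACKPRESSURE_AT = 10   # unit queue length that triggers back pressure

class CommandArbiter:
    def __init__(self, units=("DRAM", "SRAM", "HASH")):
        self.unit_q = {u: deque() for u in units}
        self.meq = {}                          # me_id -> deque of (unit, request)

    def register_me(self, me_id):
        self.meq[me_id] = deque()

    def issue(self, me_id, unit, request):
        """A thread places a request in its MEQ; False means the thread stalls."""
        q = self.meq[me_id]
        if len(q) >= MEQ_DEPTH:
            return False                       # MEQ full: issuing thread stalls
        q.append((unit, request))
        return True

    def arbitrate(self):
        """Move head-of-MEQ requests to their unit queues, unless the target
        unit has asserted back pressure on the command bus."""
        for q in self.meq.values():
            if not q:
                continue
            unit, request = q[0]
            if len(self.unit_q[unit]) >= BACKPRESSURE_AT:
                continue                       # back pressure: this ME cannot issue
            if len(self.unit_q[unit]) < UNIT_Q_DEPTH:
                q.popleft()
                self.unit_q[unit].append(request)

arb = CommandArbiter()
arb.register_me(0)
arb.issue(0, "DRAM", "read packet 0")
arb.arbitrate()
print(len(arb.unit_q["DRAM"]))                 # 1
```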

The following section describes the packet flow in the IXP processor. We assume

that packets have arrived in the input MAC.

2.1.1.2 Packet Flow in IXP Processors

Packets arrive from the external link at the input ports and get buffered in the input port buffers (refer to Figure 2.3). Packets are then transferred to the RFIFO of the NP through a high-speed media interface. When a thread in a microengine is available, it takes control of the packet and transfers the packet from the RFIFO to the DRAM. The packet or the packet header is read from the DRAM by the corresponding thread. The thread processes the packet/header, modifies it as necessary, and writes the new packet/header back to the DRAM. Next, the thread places the packet in the TFIFO of the NP and writes the packet to the corresponding output port buffer through the media interface. Once the packet is transferred to the next hop, the thread is freed. It should be noted that during the packet flow from the RFIFO to the TFIFO, a single thread is responsible for moving the packet. However, during a transfer, e.g., from the RFIFO to DRAM or from DRAM to the TFIFO, the thread relinquishes the microengine (it is context switched).

Figure 2.3 Packet flow in the IXP 2400.
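The flow above amounts to a per-thread loop: one thread shepherds one packet from the RFIFO to DRAM, processes it, and moves it to the TFIFO, yielding the microengine at every transfer. The coroutine-style sketch below is our paraphrase of that flow in Python; the data structures and the process() step are illustrative, not microengine code.

```python
from collections import deque

def process(header):
    """Stand-in for application-specific header processing (e.g. IPv4 lookup)."""
    return dict(header, ttl=header["ttl"] - 1)

def packet_thread(packet, dram, tfifo):
    """One hardware thread handling one packet end to end. Every `yield`
    marks a point where the thread relinquishes the microengine while a
    transfer or memory access completes."""
    yield "rfifo_to_dram"            # swap out: packet moves RFIFO -> DRAM
    dram.append(packet)
    yield "dram_read_header"         # swap out: header fetched from DRAM
    header = packet["header"]
    header = process(header)         # processing runs on the microengine
    yield "dram_write_header"        # swap out: header written back to DRAM
    packet["header"] = header
    yield "dram_to_tfifo"            # swap out: packet moves DRAM -> TFIFO
    tfifo.append(packet)             # thread is freed after transmission

# Round-robin the swap points of a few threads, as the microengine would.
dram, tfifo = deque(), deque()
threads = [packet_thread({"id": i, "header": {"ttl": 64}}, dram, tfifo) for i in range(3)]
while threads:
    threads = [t for t in threads if next(t, None) is not None]
print([p["id"] for p in tfifo])      # all three packets reach the TFIFO
```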

2.1.2 Motorola C-5 Processor

The Motorola C-5 network processor [26] comprises sixteen channel processors (CPs)

with four threads on each CP. Consecutive CPs can be connected in a pipelined

fashion using special registers. CPs can be assigned individually to a port or in an

aggregate mode. In addition to CPs there is an executive processor (XP) that serves

as a centralized computing resource for the C-5 and manages the system interfaces.

Three on-chip buses connect all these computational resources to external memories.

Three specialized units are part of the memory controllers: a table lookup unit (TLU), which accelerates six different types of table lookup algorithms with 11 dedicated instructions; a buffer management unit, which accelerates the creation and destruction of variable-width buffers for payload data stored in SDRAM; and a queue management unit, which accelerates the creation and destruction of queues for packet descriptor data stored in SRAM.


2.1.3 IBM PowerNP Network Processor

The IBM PowerNP [13] has the following main components: embedded processor

complex (EPC), data flow (DF), scheduler, MACs, and coprocessors. The EPC

processors work with coprocessors to execute application software and PowerNP-

related management software. The coprocessors provide hardware-assist functions

for performing common operations such as table searches and packet alterations.

The DF serves as the primary data path for receiving and transmitting network

traffic. It provides an interface to multiple large data memories for buffering data

traffic as it flows through the network processor. The traffic management scheduler

allows traffic flows to be scheduled individually per their assigned QoS class for

differentiated services.

2.2 Network Applications

Network applications can be broadly classified, depending on the type of processing, into two types: Header Processing Applications (HPA) and Payload Processing Applications (PPA) [9]. The processing in HPA is independent of the packet size and the type of packet payload. These applications involve header field interrogation and a table lookup. Examples include IPv4 forwarding and NAT. PPA represent applications that access the entire packet, and the amount of processing is dependent on the size of the packet. These applications typically involve encryption/decryption of the entire packet. Examples include the IP Security protocols.

We have selected two applications each from HPA and PPA in our study. The HPA programs chosen are IPv4 forwarding [2] and NAT [34], and the PPA programs used are the IP Security protocols: Authentication Header (AH) [21] and Encapsulation Security Payload (ESP) [22].

This section describes the different network applications used in our study. We observe that all applications running on routers have similar flows. The application buffers the incoming packets into DRAM, reads the packet or the packet header depending on the application, processes it, writes the packet/packet header back to the DRAM, and transfers the modified packet to the transmit buffer.

2.2.1 IP Forwarding

IP forwarding is a fundamental operation performed by a router. We focus on forwarding for IP version 4 packets [2]. IPv4 forwarding uses the IP header of the packet to determine the destination address. A lookup is performed based on the destination address in the IP header to determine the destination port number and the next-hop address. The routing table is stored in the SRAM. The packet header is modified accordingly. This work uses a hash-based lookup [29]. The time-to-live field in the IP header is decremented and the header checksum is recomputed. The packet is then forwarded to the next hop.
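A minimal sketch of the per-packet steps just described is shown below: a lookup keyed on the destination address, a TTL decrement, and a recomputation of the header checksum. The routing table contents and helper names are illustrative; the thesis performs the lookup with the IXP hash unit and keeps the table in SRAM.

```python
import struct

# Illustrative routing table: destination address -> (output port, next hop).
ROUTING_TABLE = {
    "10.0.0.2": (1, "192.168.1.1"),
    "10.0.0.3": (2, "192.168.2.1"),
}

def header_checksum(header: bytes) -> int:
    """16-bit ones-complement checksum over the IPv4 header (the checksum
    field is assumed to be zeroed in `header` before recomputation)."""
    total = 0
    for (word,) in struct.iter_unpack("!H", header):
        total += word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)   # fold carries back in
    return (~total) & 0xFFFF

def forward(packet):
    """Return (port, next_hop, packet), or None if the packet is dropped."""
    port, next_hop = ROUTING_TABLE.get(packet["dst"], (None, None))
    if port is None or packet["ttl"] <= 1:
        return None                                # no route, or TTL expired
    packet["ttl"] -= 1                             # decrement time-to-live
    packet["checksum"] = header_checksum(packet["header_bytes"])
    return port, next_hop, packet

pkt = {"dst": "10.0.0.2", "ttl": 64, "header_bytes": b"\x45\x00" * 10}
print(forward(pkt))
```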

2.2.2 Network Address Translation

Network Address Translation, or NAT [34], is a method by which many network addresses and their TCP/UDP ports are translated into a single network address and its TCP/UDP ports. We focus on NAT for the TCP protocol. When a host in the LAN, which is assigned a local IP address, initiates a TCP session through the router to an external network, the router changes the source IP address field in the IP header to the globally visible router IP address. In addition, a unique port number is allocated by the router to the session. The port numbers assigned by the router increase in steps of one and wrap around after 65536.

A tuple consisting of the protocol name (TCP or UDP), the source IP address, and the source port number distinguishes a connection. The translation table stores this tuple and the corresponding private IP address. The translation table is maintained by the router. It is used to route packets from the external network to the corresponding local node. The translation table is typically stored in the SRAM.
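The translation described above can be illustrated with a small table keyed by the (protocol, source IP, source port) tuple, with port numbers handed out in steps of one and wrapping after 65536. The router address and starting port below are illustrative assumptions.

```python
ROUTER_IP = "203.0.113.1"        # globally visible router address (illustrative)

class NatTable:
    def __init__(self, first_port=1024):
        self.first_port = first_port
        self.next_port = first_port
        self.out = {}    # (proto, src_ip, src_port) -> translated port
        self.back = {}   # translated port -> (proto, src_ip, src_port)

    def translate_outgoing(self, proto, src_ip, src_port):
        """Rewrite an outgoing connection to (ROUTER_IP, translated port)."""
        key = (proto, src_ip, src_port)
        if key not in self.out:
            port = self.next_port
            self.next_port += 1
            if self.next_port >= 65536:      # port numbers wrap around after 65536
                self.next_port = self.first_port
            self.out[key] = port
            self.back[port] = key
        return ROUTER_IP, self.out[key]

    def translate_incoming(self, dst_port):
        """Map a packet arriving at the router back to the local host, if known."""
        return self.back.get(dst_port)

nat = NatTable()
print(nat.translate_outgoing("TCP", "192.168.0.5", 40000))   # ('203.0.113.1', 1024)
print(nat.translate_incoming(1024))                          # ('TCP', '192.168.0.5', 40000)
```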

2.2.3 IP Security

IPSec protocols [36] are used to provide privacy and authentication services at the IP layer. The two protocols supported by IPSec are the Authentication Header (AH) [21] and the Encapsulation Security Payload (ESP) [22]. The IP Authentication Header (AH) is used to provide connectionless integrity and data origin authentication for IP datagrams, while the IP Encapsulation Security Payload (ESP) encrypts the TCP/UDP segment in addition to providing the AH features.

IPSec protocols use a network handshake mechanism between the source and the destination using a security association (SA). The security association is a 3-tuple containing the security protocol (AH or ESP), the source IP address, and a 32-bit connection identifier referred to as the SPI (Security Parameter Index). The SPI is associated with a shared key used for encryption and with the encryption algorithm. After the handshake, the requisite computation is done depending on the protocol. A digital signature is computed over the packet payload. The key shared with the destination is used in the signature computation. The AH/ESP header is placed after the IP layer protocol header but before the higher-level protocols. In the case of ESP, in addition to all the processing needed in AH, the higher-layer protocol data is encrypted and placed after the ESP header.

We assume that the SPI data, in particular the shared key, is stored in SRAM.
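The authentication step can be sketched as a keyed digest computed over the packet with the shared key looked up from the security association. HMAC-SHA-1 is used below purely as an example of such a keyed digest; it is not presented as the exact AH/ESP processing performed by the IXP crypto unit.

```python
import hmac
import hashlib

# Illustrative security-association store: (protocol, source IP, SPI) -> shared key.
SA_TABLE = {
    ("AH", "10.0.0.2", 0x1001): b"shared-secret-key",
}

def auth_digest(proto, src_ip, spi, packet_bytes):
    """Keyed digest over the packet, analogous to the AH integrity check value."""
    key = SA_TABLE[(proto, src_ip, spi)]
    return hmac.new(key, packet_bytes, hashlib.sha1).digest()

def verify(proto, src_ip, spi, packet_bytes, received_digest):
    expected = auth_digest(proto, src_ip, spi, packet_bytes)
    return hmac.compare_digest(expected, received_digest)

tag = auth_digest("AH", "10.0.0.2", 0x1001, b"example payload")
print(verify("AH", "10.0.0.2", 0x1001, b"example payload", tag))   # True
```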

2.3 Petri Nets: An Introduction

A Petri net is a mathematical modeling tool which is commonly used to model concurrency and conflicts in systems. A Petri net is a particular kind of directed graph, together with an initial state called the initial marking. The underlying graph of a Petri net is a directed, weighted, bipartite graph consisting of two kinds of nodes, called places and transitions, where each arc runs from a place to a transition or from a transition to a place. Places of a Petri net usually represent conditions, such as the availability of resources in the system, while transitions model the activities in the system. A token in a place is interpreted as holding the truth of the condition associated with the place. It can also indicate the availability of a resource. A place can be associated with zero or more tokens. The number of tokens in each place of a Petri net model represents the marking of the Petri net model. The marking of a model represents the state of the system (the underlying system being modeled). A system normally starts with an initial marking which is representative of the initial state of the system. A system moves from one state to another as the transitions in the model fire, resulting in new markings.

A transition usually represents the occurrence of an event. A transition has a certain number of input and output places representing the pre-conditions and post-conditions of the event. An input place p of a transition t has an incoming arc (p, t) into the transition, and an output place q has an outgoing arc (t, q) from the transition. A transition can have zero or more input places and zero or more output places. A transition fires (or an event is said to take place) only when all the input places associated with the particular transition have at least one token each.¹ The firing time of a transition represents the delay associated with the occurrence of the event. A transition can be classified as either an instantaneous or a timed transition. Instantaneous transitions, represented by thin lines, represent events which take zero time. Timed transitions take a finite amount of time and are represented by thick lines. A Petri net with both timed and instantaneous transitions is referred to as a Stochastic Petri net (SPN). In the case of timed transitions, the firing time takes values from a firing function. For example, in the case of timed transitions in a Generalized Stochastic Petri net (GSPN), the firing function can either be a constant or take exponentially distributed values.

¹Once the transition fires, the tokens that enabled the transition are removed from the respective input places and the required number of tokens are deposited in all the output places of the transition.


Figure 2.4 Petri Net Example

Figure 2.4 shows a simple Petri net model of a simplistic round-robin CPU scheduler.

The place READY represents the ready queue, with a fixed number of ready tasks in the initial state. The place CPU with a token represents the availability of the processor. The transition EXEC models the execution of a ready process for a scheduling quantum amount of time. In this example we assume that this transition takes a fixed amount of time q. Thus each firing of EXEC removes a token each from the places CPU and READY, takes q amount of time, and places a token each in the output places, DECIDE and CPU. The DECIDE place models a conflict. The output arc from this place to the transition CONTINUE has a probability of 0.9, while that to the transition END has a probability of 0.1. Accordingly, one of these transitions is enabled. These transitions are instantaneous and place a token in one of the output places, READY or TERMINATE.
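The behaviour of this small net can be mimicked with a few lines of simulation: each firing of EXEC consumes the CPU and READY tokens, advances time by the quantum q, returns the CPU token, and the conflict at DECIDE is then resolved probabilistically. This is a hand-rolled sketch of this one example under the probabilities of Figure 2.4, not the CNET simulator used in the thesis.

```python
import random

def run(ready_tasks=5, quantum=1.0, p_continue=0.9, seed=0):
    """Simulate the round-robin scheduler net of Figure 2.4.
    READY holds the waiting tasks, CPU holds 0 or 1 token."""
    random.seed(seed)
    ready, cpu, terminated, clock = ready_tasks, 1, 0, 0.0
    while ready > 0:
        # EXEC is enabled only when both READY and CPU contain a token.
        ready -= 1
        cpu -= 1
        clock += quantum               # timed transition: fixed firing time q
        cpu += 1                       # token returned to CPU by EXEC
        # DECIDE: conflict resolved by the arc probabilities 0.9 / 0.1.
        if random.random() < p_continue:
            ready += 1                 # CONTINUE: task goes back to READY
        else:
            terminated += 1            # END: a token is placed in TERMINATE
    return clock, terminated

print(run())   # (total simulated time, number of terminated tasks)
```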

Petri nets are capable of modeling concurrency and conflicts. A conflict occurs

when a place is an input place to multiple transitions. Such conflicts are resolved

by assigning probabilities to each of the output arcs of the place or equivalently the

input arcs of the respective conflicting transitions. The probability associated with

the arc, and hence with the respective transition, represents the probability with which

the transition will be chosen from among the conflicting transitions, provided the

transition is otherwise ready. Concurrency is naturally modeled in Petri nets as



multiple transitions that are ready can fire simultaneously.

Colored Petri nets [20] have the same descriptive power as classical Petri nets, but are more concise from a graphical viewpoint. This conciseness is achieved by merging analogous places in a model into a single place, and by associating colors with tokens, places, and transitions to distinguish among the various elements. A transition can fire with respect to each of its colors. When a transition fires, tokens are updated in the normal way, except that a functional dependency is specified between the color of the transition firing and the color of the involved tokens.

Petri net models are generally used to analyse systems and establish properties such as liveness, reachability, safety, and boundedness [27]. They can also be used for performance evaluation, either through analysis or through simulation [11, 33]. In this thesis we use the latter approach.


Chapter 3

Performance Modeling and Evaluation

3.1 Introduction

The architecture models of earlier works [7] [38] [9] do not study the impact of NP-specific architectural features like hash units, crypto units, CRC units, FIFOs, and multiple processors. These have complex interactions among the subsystems, and it is necessary to model them appropriately to get more accurate performance measures. Further, the DRAM is involved in all the stages of the packet flow in the network processor. Hence, it is important to model the packet flow end-to-end (i.e., including the transfer from the MAC to the RFIFO to DRAM, and from DRAM to the TFIFO to the MAC). The large number of DRAM accesses and the high latency involved in a memory access suggest that DRAM can be a potential bottleneck. However, earlier works on network processors [29] [31] [9] assume that packets are already buffered (resident) in the memory.

In this chapter, we develop a Petri net model for both the network processor and the flow of packets in the network processor. Unlike some of the earlier works on Petri net modeling for multithreaded processors [11] [33], which focused on modeling the processor architecture, and earlier performance models of network processors [31] [9], our model captures the architecture, the application, and their interaction in great detail. Hence each application-architecture combination is modeled as a separate Petri net. In the following subsection we describe the Petri net model for a single microengine running the IPv4 forwarding algorithm and later provide the extension to multiple microengines. The models for the other applications are developed in a similar manner. Our model is validated using the Intel proprietary simulator [16] for different parameters and for different applications. We use this model in the subsequent performance evaluation and architecture exploration.

This chapter is organized as follows. The Petri net model of the IXP processor for different applications is presented in Section 3.2. Section 3.3 describes the simulation methodology. Section 3.4 provides a detailed performance analysis and evaluation of the model.

3.2 A Single Microengine Petri Net Model

Figure 3.1 shows a part of the Petri net model for a single microengine running the IPv4 application. For clarity, only the part of the model which captures the flow of packets from the external link to the DRAM through the MAC is shown. The firing time of a timed transition in our model takes either deterministic or exponentially distributed values. In the following description, words in italics represent places/transitions.

The place INPUT-LINE represents the external link. Packets arrive at IMAC, the input MAC, from the external link at line speed.¹

¹The line speed corresponds to 2.5 Gbps or higher.


Figure 3.1 Petri Net Model for a Single Microengine in IXP 2400 Running IPv4 Application


in the MAC is free and if there is sufficient MAC memory, i.e., at least a token in

IMAC, the packet gets buffered in MAC. A token in RMACMEM indicates that

a packet has been buffered in the MAC. If a thread is free, denoted by a token in

place THREAD, it takes control of the packet and transfers the packet to the receive

buffer (RFIFO). The initial marking of place THREAD denotes the total number of

threads in a microengine. If the microengine is free, represented by a token in place

UE, the thread executes for UE-PROCESSING amount of clock cycles, and moves

the packet from RFIFO to DRAM. The thread swaps out, denoted by the arc from

SWAP-OUT to UE, after initiating a memory transaction by placing the request for

memory access in the microengine command queue (UE-CMD-Q). The availability

of a free entry in the command queue is denoted by a token in the place UE-CMD-

Q. The memory request is then moved from UE-CMD-Q to DRAM-Q through the

command bus arbiter (CMD-BUS). We defer the discussion on modeling memory

access to section 3.2.3. The memory request gets processed by DRAM and a token

is placed in DRAM-XFER indicating the completion of the memory operation.

The places UE, DRAM, CMD-BUS represent conflicts, i.e., two events competing

for a common resource. Conflicts are resolved by assigning probabilities to the con-

flicting events. Our Petri net model assigns equal probabilities for accessing shared

resources. We defer the discussion on modeling memory access to section 3.2.2.

The transitions, MAC-FIFO and RFIFO-DRAM represent packet flow from

MAC to RFIFO, and RFIFO to DRAM respectively. These transitions also represent

a part of the packet flow in the network processor. The places UE, THREAD, UE-

CMD-Q, DRAM-Q, CMD-BUS represent various resources in the architecture and

hence model the processor architecture. The timed transitions UE-PROCESSING

and RFIFO-DRAM represent the specific tasks and the time taken by these transi-

tions models the time taken by the corresponding tasks in the specific unit. Thus

the Petri net model is able to capture the processor architecture, applications, and

their interaction in detail.
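To make the marking and firing behaviour described above concrete, the following minimal Python sketch (illustrative only; it is not the CNET input, and the token counts are assumptions based on the description above) fires the MAC-FIFO transition when a buffered packet and a free thread are available:

# Minimal illustration of the marking/firing behaviour described above.
# Place and transition names follow the text; the actual CNET model is
# different and far more detailed.
marking = {
    "RMACMEM": 1,   # one packet buffered in the MAC
    "THREAD": 8,    # assumed 8 hardware threads in the microengine
    "RFIFO": 0,
    "UE": 1,
}

transitions = {
    # MAC-FIFO needs a buffered packet and a free thread, and deposits
    # the packet (token) into the receive FIFO.
    "MAC-FIFO": {"inputs": ["RMACMEM", "THREAD"], "outputs": ["RFIFO"]},
}

def enabled(name):
    return all(marking[p] > 0 for p in transitions[name]["inputs"])

def fire(name):
    assert enabled(name)
    for p in transitions[name]["inputs"]:
        marking[p] -= 1
    for p in transitions[name]["outputs"]:
        marking[p] += 1

if enabled("MAC-FIFO"):
    fire("MAC-FIFO")
print(marking)   # {'RMACMEM': 0, 'THREAD': 7, 'RFIFO': 1, 'UE': 1}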


3.2.1 Multiple Microengine Petri Net Model

In network applications, each microengine processes packets which are independent

of packets processed by other microengines [7]. The processing done by each mi-

croengine (described in the earlier subsection) is represented by a colour. We use

coloured Petri nets for modeling multiple microengines. The number of microengines

is represented by the number of initial tokens, of different colours, in the place UE.

Since the NP is a store-and-forward architecture, the processor-memory interaction is critical to performance. The following subsection describes how memory accesses are modeled in the IXP processor.

3.2.2 Memory Modeling

Figures 3.2 and 3.3 show the detailed Petri net models of DRAM accesses in the IXP chip. The DRAM in the IXP has four banks. The memory architecture differs between the 2400 and the 28XX processors.

IXP 2400 supports DDR-DRAMs while the 28XX series provides support only

for Rambus DRAM. The Rambus DRAMs differ from DDR-DRAMs in that they

support pipelined memory accesses [18]. Our Petri net model is able to model both

these types of DRAMs. The rest of this subsection describes the Petri net modeling of

memory accesses in DDR-DRAM and Rambus DRAM.

Figure 3.2 shows the Petri net model for memory accesses to a DDR-DRAM. A

token in DRAM indicates that the DRAM is free for memory access. We give an

initial marking of 4 for the place DRAM to represent four available DRAM banks.

We say that a bank conflict arises when two memory accesses are attempting

access to the same bank. Figure 3.2 models the bank conflict as follows. A token

is placed in MEMR2 for accessing the DRAM. The token is either returned back


Figure 3.2 Petri Net model for Memory Access in DDR DRAM.


to MEMR2 after BANK-CONFLICT clock cycles with probability (1 - p1), or the token is placed in WAITMEM1 with probability p1. Thus, when a memory request cannot be processed immediately, it is made to wait for BANK-CONFLICT clock cycles before accessing the DRAM again. The BANK-CONFLICT time is chosen as the average DRAM memory access time for a packet. So in this model, memory accesses go to different (conflict-free) memory banks with probability p1.
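As a rough illustration of this conflict model (a hand-written Python sketch, not part of the CNET model; the 50-cycle figures are placeholder values, while p1 and the BANK-CONFLICT delay are the parameters named above), the delay seen by a single request can be estimated as follows:

import random

def dram_access_delay(p1, bank_conflict_cycles, access_cycles):
    # With probability p1 the request finds a free bank and proceeds;
    # otherwise it waits BANK-CONFLICT cycles and tries again.
    delay = 0
    while random.random() >= p1:        # conflict with probability (1 - p1)
        delay += bank_conflict_cycles
    return delay + access_cycles

# Average over many requests, e.g. p1 = 0.5 and a 50-cycle conflict penalty.
samples = [dram_access_delay(0.5, 50, 50) for _ in range(100_000)]
print(sum(samples) / len(samples))      # ~ 50 + 50*(1 - p1)/p1 = 100 cycles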

Figure 3.3 shows the Petri net model for memory accesses to a Rambus DRAM.

Memory accesses to a Rambus DRAM are pipelined. We assume four pipe stages,

represented by places PIPESTAGE1, PIPESTAGE2, PIPESTAGE3, PIPESTAGE4.

Bank conflicts in each pipe stage incur a PIPE-CONFLICT penalty. The pipe con-

flict time is roughly one fourth of the bank conflict time. Hence the bank conflicts

in Rambus DRAMs incur a smaller penalty.


Figure 3.3 Petri Net model for Memory Access in Rambus DRAM.


3.3 Performance Evaluation of IXP

In this section we present the performance evaluation results of the NP running

different applications. We use the IPv4, NAT, and IPSec applications as benchmarks. The

applications are described in section 2.3.


3.3.1 Simulation Methodology

We have developed Petri net models for IPv4 and NAT running on IXP 2400 and

IPSec protocols (AH and ESP) running on IXP 2850². The PN model for each application is simulated using CNET [40], an event-driven simulator for our timed Petri net models. The simulator maintains a queue called the event-queue.³ This list is ordered by the time at which the events are scheduled to occur, with the event scheduled to occur in the nearest future at the head of the list. The simulation time is advanced according to the events to be executed in the event-queue, and each event may trigger further events to be enqueued. The simulation stops either when there are no events left in the event-queue or when a specified simulation time is exceeded. In our case, the simulation is run for 10^8 microengine clock cycles. The simulator outputs the following performance metrics: the total number of tokens in a place, the time-averaged number of tokens in a place, and the minimum and maximum number of tokens in a place at any given time instant. We use these metrics in the following performance analysis.
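The core of such an event-driven simulation is a time-ordered priority queue. The sketch below is a simplified Python illustration (it is not CNET and only schedules packet arrivals); it assumes that the λ in footnote 4 below is interpreted as the mean packet inter-arrival time:

import heapq
import random

MEAN_INTERARRIVAL_US = 0.24   # lambda from footnote 4 (assumed mean inter-arrival time)
SIM_END_US = 1000.0           # illustrative simulation horizon

event_queue = []              # (time, event name) pairs, earliest event first

def schedule(time, name):
    heapq.heappush(event_queue, (time, name))

def on_packet_arrival(now):
    # In the full model this would enable transitions such as MAC-FIFO;
    # here we only schedule the next Poisson arrival.
    schedule(now + random.expovariate(1.0 / MEAN_INTERARRIVAL_US), "arrival")

schedule(random.expovariate(1.0 / MEAN_INTERARRIVAL_US), "arrival")

clock = 0.0
while event_queue and clock < SIM_END_US:
    clock, name = heapq.heappop(event_queue)   # always advance to the earliest event
    if name == "arrival":
        on_packet_arrival(clock)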

The simulations were performed for different microengine/thread configurations, 16 configurations in total. We use the notation 2X8 to denote a configuration with 2 microengines, each executing 8 threads. We model packet arrivals as a Poisson process [6] with mean inter-arrival time λ. In our study we assume a line rate of 6 Gbps and a fixed packet size of 64 bytes⁴. Further, we assume that to access 8 B of data the DRAM and SRAM take 50 nanoseconds and 8 nanoseconds respectively [18]. However, to access larger chunks of data (like 64 B) in DRAM that lie in contiguous memory locations, only an additional 5 nanoseconds per 8 B is required [12].
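As a back-of-the-envelope check (our own arithmetic based on the figures above, not a value quoted in [12] or [18]), the DRAM access time for one minimum-sized 64 B packet stored contiguously is then

    t_DRAM(64 B) = 50 ns + 7 x 5 ns = 85 ns,

that is, 50 ns for the first 8 B and 5 ns for each of the remaining seven 8 B chunks.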

²IXP 2400 does not provide specific support for cryptographic applications; the crypto unit is present only in the 28XX series.

³The event in this description refers to a transition.
⁴For these parameters λ is 0.24 micro-seconds.


Tasks                     IPv4   AH    NAT   ESP
Total thread proc          120   300   123   350
Hash proc                   85    -     85    -
Crypto proc                  -    75     -   160
Each FIFO MAC transfer      32    32    32    32

Table 3.1: Model parameters used in the Petri net model

Table 3.1 provides the model parameters used in our Petri net model. Note that

these parameters are on a per thread basis and are given in terms of the number of

processor clock cycles consumed.

We make the following assumptions in our simulation. The packet sizes are

assumed to be constant and of minimum size (64 B); this assumption models the worst-case scenario that arises during DoS attacks and hence corresponds to the worst performance of the IXP processors for the various applications. Other performance studies

[7] also evaluate network processors under similar conditions. In case of NAT, we

assume a constant session size of 10 kilobytes. We also assume that packets from

the external link and packets from local network arrive in the NP from mutually

exclusive ports.

In order to validate the Petri net results we have implemented all the applications

in MicroengineC [17], a high-level programming language for Intel network processors, and simulated them on the Intel SDK 3.51 [16], an instruction-level simulator for the IXP chip developed by Intel Corporation.

3.3.2 Validation Results

In this subsection, we first provide a validation of the Petri net simulation

results. In the subsequent subsections, we use the Petri net approach for the per-

formance study of the network applications on the base IXP architectures as well as


for architecture explorations.

The following performance parameters have been measured from the SDK simu-

lation and the PN simulation. We use these parameters to compare the results from

the PN model and the Intel SDK simulator.

• Throughput: The throughput of the NP, measured in Gigabits per second, represents the aggregate rate (over all ports) at which packets are transmitted.

• Microengine Utilization: This parameter gives the average utilization, where the average is measured as a time-average [3] (see the formula after this list). The utilization metric mea-

sured from the SDK simulation includes the time the microengine is executing,

aborted, and stalled. Execution is stalled when the microengine command

queue is full (4 entries) and the executing thread does not swap out.

• Microengine command queue length : This parameter gives the time averaged

command queue length of a single microengine. Note that the command queue

queues all requests for DRAM, SRAM, and hash issued by a microengine, as described in Section 2.2.1.

• DRAM Queue Length: This metric is the time averaged queue length of the

DRAM queue. The DRAM queue stores the requests from all MEs, waiting

to be serviced by the DRAM.

• Microengine Stall Percentage: This metric gives the percentage of time a

thread in a microengine is stalled.
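For reference, the time average used in the utilization and queue-length metrics above is the standard one (stated here for clarity; the notation is ours, not reproduced from [3]):

    Q_avg = (1/T) * ∫[0,T] Q(t) dt,

where Q(t) is the instantaneous queue length (or the number of busy contexts, for utilization) and T is the total simulated time.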

In this subsection we initially analyze the results for header processing applica-

tions (IPv4 and NAT) and later for payload processing applications (AH, ESP). The

results presented have been arranged in the increasing order of the total number of

threads.


Figure 3.4 shows the transmit rates obtained from Petri net and SDK simula-

tions for all applications. In the Petri net simulation, we used different bank conflict

probabilities. For all applications we observe that the transmit rates obtained from

the Petri net simulations follow a similar trend to the SDK simulation. In particular,

the throughput rates from the SDK simulation closely follow that of the Petri net

simulation for bank conflict probabilities 0.5 and 0.7. Even though the variation for

other bank conflict probabilities is somewhat higher, the Petri net simulation is able to predict the general trend well.

Figure 3.4 Transmit Rates from PN and SDK Simulations.

[Four panels, (a) IP4, (b) NAT, (c) AH, (d) ESP, plot the transmit rate (Gbps) against microengine x thread configurations from 1x1 to 8x8, comparing the SDK result with CNET results for bank conflict probabilities between 0.3 and 0.9.]

In Figure 3.5, we compare the average microengine utilization as observed

from the Petri net and SDK simulations. Once again these values closely match and

follow the same trend. These results essentially validate our Petri net model and


Figure 3.5 Microengine Utilization from PN and SDK Simulations.

[Four panels, (a) IP4, (b) NAT, (c) AH, (d) ESP, plot the per-ME utilization (%) against microengine x thread configurations from 1x1 to 8x8, comparing the SDK result with CNET results for bank conflict probabilities between 0.3 and 0.9.]

the performance results obtained from it.

3.3.3 Throughput

First we report the impact that multithreading and multiple microengines have on

the transmit rates achieved by various applications. The performance results are

obtained by simulating our detailed Petri net model.

Figure 3.4 shows the transmit rates for all applications. We observe that as

the total number of threads increases, the throughput increases and reaches 3 Gbps

for header processing applications (IPv4 and NAT) and nearly 4 Gbps for payload processing applications (AH and ESP). These correspond to OC-48 and higher line rates.


Figure 3.6 DRAM utilization for Different Bank Probabilities.

[Four panels, (a) IP4, (b) NAT, (c) AH, (d) ESP, plot the DRAM utilization (%) against microengine x thread configurations from 1x1 to 8x8 for CNET bank conflict probabilities between 0.3 and 0.9.]

The reason for the higher throughput of the PPA applications as compared to the HPA applications is that the PPA applications are computationally more intensive and hence result in higher microengine utilization (refer to Figure 3.5).

Also, we observe that the throughput saturates beyond a total of 16 threads, which occurs due to high DRAM utilization.

Config     DRAM Q length                     Stall %
           IP4    NAT    AH     ESP          IP4    NAT    AH      ESP
4X8        10.9   9.7    4.7    5.08         29.5   11.5   13.8    22.4
8X8        10.8   8.3    9.2    9.63         76.8   65     31.96   61.5

Table 3.2: Time Average DRAM Queue Length and Stall Percentage.

We further observe that the throughput drops in SDK simulations in case of IPv4


for 4X4 configurations and beyond. This is due to the negative feedback mechanism,

as discussed in section 2.2.1, which arises when the average DRAM queue length is

greater than 10 (10.81 for IPv4). Since the DRAM queue length for IPv4 exceeds this threshold, the throughput saturates beyond 16 threads; the resulting stalls (whose percentage is much higher for IPv4) reduce the throughput.

Figure 3.6 plots the DRAM utilization for different applications and different

numbers of microengines and threads. The DRAM utilization in the PPA applications is lower (less than 60%) for 16 or more threads⁵. In the PPA applications, for certain memory accesses, such as packet header accesses, only the first pipeline stage in the DRAM is used; hence the DRAM utilization is lower. Recall that these applications are

executed in the IXP 2850 which has Rambus memory. As discussed in section 3.2.2,

the Rambus DRAM is pipelined into 4 stages. Further, we note that the throughput rate for the payload processing applications is higher by 33% compared to that for the header processing applications. This is because of the faster accesses in the Rambus DRAM, which help in supporting a higher number of memory requests and hence the higher

throughput.

Figure 3.5 and Figure 3.7 show the microengine utilization and the average mi-

croengine command queue length respectively on a per microengine basis. (Recall

that the microengine-command-queue queues all requests for accesses to DRAM,

SRAM, Hash.) Both these parameters follow a triangular pattern for all the appli-

cations. This can be explained as follows. In a 1X8 configuration, all the 8 threads

execute on the same microengine whereas in a 8X1 the eight threads execute on dif-

ferent microengines. This leads to a higher microengine utilization for 1X8, nearly

60% for IPv4. In comparison, in an 8X1 configuration the utilization is only 10% for

IPv4.

⁵DRAM utilization for Rambus DRAM is calculated as the average utilization over all four pipe stages.


Figure 3.7 Average Microengine Queue Length for Different Bank Probabilities.

[Four panels, (a) IP4, (b) NAT, (c) AH, (d) ESP, plot the average ME command queue length against microengine x thread configurations from 1x1 to 8x8, comparing the SDK result with CNET results for bank conflict probabilities between 0.3 and 0.9.]


3.3.4 Architecture Exploration

The main advantage of the PN model over the Intel SDK simulator is the relative

ease with which new architectural features can be evaluated. Further, while the SDK simulation takes several hours to simulate a single configuration, the PN simulation takes only 1 hour. Having validated the Petri net approach using the SDK simulator,

we can now use the former for evaluating the performance of a few enhancements

that we propose to improve the throughput of the network processor.

We explore the memory architecture only for header processing applications since

their performance is limited by the DRAM utilization.

3.3.4.1 Impact of DRAM Banks and Hash Units

The results in Section 3.3.3 indicate that the DRAM limits the throughput significantly in IPv4 and NAT. Hence a larger number of DRAM banks can be beneficial. Since the number of banks is typically a power of 2, we consider increasing the number of DRAM banks to 8. Since the DRAM is off-chip, the pin count is a constraint on increasing the number of banks.

To keep the pin count the same in the IXP processor, we assume the width of the DRAM channel to be the same as in the base IXP processor and model the channel accordingly. Note that DRAM banking can still be beneficial, as the maximum number of parallel accesses to the DRAM is increased to 8. In Figure 3.8 we plot the impact of

increasing the number of memory banks.

The performance results in Figure 3.8 indicate an improvement in throughput

by up to 20% (3.6 Gbps) with respect to the base case. In particular, the perfor-

mance improvement increases when the number of threads increases from 8 to 16 (for

configurations like 2X8 and 4X4). Further we observe that the DRAM utilization

decreases from 90% to 60%. As the DRAM utilization reduces by up to 40%, the

utilization of the hash unit increases to more than 90% and it becomes the bottle-

neck.


Figure 3.8 Impact of Number of DRAM Banks.

[Three panels, (a) Transmit rate (Gbps), (b) DRAM Utilization (%), (c) HASH Utilization (%), plotted against microengine x thread configurations from 1x1 to 8x8 for 4-bank and 8-bank DRAM with bank probabilities 0.5 and 0.9.]

Next we evaluate the impact of increasing the number of hash units. We consider

an NP with 2 hash units. We obtain a throughput of 4.8 Gbps (shown in Figure 3.9(a)), an improvement of up to 60% in comparison with the base IXP architecture. Further,

we observe that the transmit rate does not increase beyond 4 microengines, especially

for configurations such as 8X2, 8X4. Also note that with 2 hash units the utilization

of hash units decreases to 60% and the DRAM utilization also remains around 60%.

So an IXP architecture with only 4 microengines and 2 hash units gives a significant

throughput improvement (66%) but consumes almost the same area as the base

IXP processor. This is based on the area estimates given in [9] where a hash unit

consumes almost the same area as four microengines. So we believe that future

network processor architectures will need to scale special processing units like hash


Figure 3.9 Impact of Number of Hash Units.

[Three panels, (a) Transmit rate (Gbps), (b) DRAM Utilization (%), (c) HASH Utilization (%), plotted against microengine x thread configurations from 1x1 to 8x8, comparing 8 banks with 8 banks plus 2 hash units for bank probabilities 0.5 and 0.9.]

units to support higher line rates.

3.3.4.2 Better Utilization of SRAM

The performance results for HPA indicate that DRAM is saturated beyond 16

threads whereas the SRAM utilization is only 27% (refer to Figures 3.10(c) and 3.11(c)). Further, the memory access time for DRAM is at least 5 times greater than that of a similar SRAM access. In order to better utilize the SRAM and improve the packet throughput, we consider placing the packet header, of fixed length 20 bytes, in SRAM and the packet payload in DRAM.

The performance results for this scheme are shown in Figure 3.10 for IPv4 and

Figure 3.11 for NAT. This results in a performance improvement of up to 20% (Fig-

ure 3.10) in case of IPv4 and 6% in case of NAT (Figure 3.11). The performance


Figure 3.10 Performance Enhancements from Storing Packet Header in SRAM for IP4

[Three panels, (a) Transmit rate (Gbps), (b) DRAM Utilization (%), (c) SRAM Utilization (%), plotted against microengine x thread configurations from 1x1 to 8x8, comparing the 4-bank base case with the packet-header-in-SRAM scheme for bank probabilities 0.5 and 0.9.]

improvement is due to the lesser memory access time for SRAM as compared to

DRAM and the reduction in contention for accessing the DRAM. However, the per-

formance saturates beyond 16 threads, as the SRAM utilization increases to greater

than 90%. It is interesting to note that while the IP4 forwarding application gives a

throughput improvement of 20%, NAT gives an improvement of around 6% (refer to

Figure 3.11). This occurs due to the larger number of SRAM accesses involved in

NAT, since the translation table is stored in SRAM. A question that arises with this

approach is the buffering space in SRAM, as the SRAM size is typically limited to

8 MB or 16 MB and it stores state information like lookup table or NAT table.

However, even with an 8 MB SRAM and leaving 2 MB for lookup table and

other state information, we can still store as many as (6 MB/20 B = 300,000) packet

headers in SRAM. Hence the buffering space in SRAM is not really a concern. This


Figure 3.11 Performance Enhancements from Storing Packet Header in SRAM for NAT

[Three panels, (a) Transmit rate (Gbps), (b) DRAM Utilization (%), (c) SRAM Utilization (%), plotted against microengine x thread configurations from 1x1 to 8x8, comparing the 4-bank base case with the packet-header-in-SRAM scheme for bank probabilities 0.5 and 0.9.]

scheme is particularly attractive since a significant performance improvement can

be achieved without any additional hardware overhead. This also indicates that alternative ways of buffering packet headers in existing on-chip memory, like the scratch pad and local memory, can give a significant performance improvement without any additional cost.

3.3.4.3 Limiting the Number of Pending DRAM Requests

In IPv4, we observe that the DRAM queue length is greater than 10 and that stalls account for about 75% of the microengine utilization (refer to Table 3.2).


Whenever the DRAM queue length exceeds 10, the feedback mechanism (dis-

cussed in Section 2.2.1) prevents further issue of DRAM accesses from the micro-

engine command queue to the DRAM command queue. This also results in blocking

of other requests such as SRAM requests or hash requests. To alleviate this, we limit

the number of pending DRAM requests from each microengine. This allows the ex-

ecution of ready threads as well as prevents blocking of other accesses from each

microengine command queue.
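The sketch below illustrates the proposed policy (a simplified Python model written for this discussion; the class and variable names are our own, not Intel's): a thread may place a DRAM command in its microengine's command queue only while that microengine has fewer than COUNT DRAM requests outstanding, so SRAM and hash commands are never stuck behind a DRAM command that cannot be issued.

COUNT = 2   # per-microengine limit on pending DRAM requests (1, 2 or 3 in Figure 3.12)

class Microengine:
    def __init__(self):
        self.cmd_queue = []        # commands waiting for the command bus
        self.pending_dram = 0      # DRAM requests issued but not yet serviced

    def try_issue_dram(self, request):
        # A thread calls this before placing a DRAM command in UE-CMD-Q.
        # If the per-ME limit is reached, the thread swaps out and retries later,
        # so no unissuable DRAM command ever blocks the head of the queue.
        if self.pending_dram >= COUNT:
            return False           # caller swaps out; nothing is enqueued
        self.pending_dram += 1
        self.cmd_queue.append(("DRAM", request))
        return True

    def issue_other(self, kind, request):
        self.cmd_queue.append((kind, request))   # SRAM/hash requests are never throttled

    def dram_serviced(self):
        self.pending_dram -= 1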

Figure 3.12 Impact of Limiting Pending DRAM Accesses per Microengine

[Single panel, (a) Transmit rate (Gbps), plotted against configurations 1x4 to 8x8 for IPv4 with COUNT = 1, 2, 3 and with no counter.]

In Figure 3.12 we plot the transmit rate under various configurations when the number of pending DRAM requests (COUNT) per microengine is limited to 1, 2, or 3. Limiting the pending DRAM requests to 2 or 3 increases the throughput by up to 47% compared to the base case. Note that the throughput obtained in this case is also the maximum throughput obtained for IPv4 (refer to Figure 3.4).

3.3.5 Summary

This chapter develops a Petri Net model for a commercial network processor (Intel

IXP 2400 and 2850) for different applications. The PN model is developed for three

different applications, namely IPv4, NAT, and the IPSec protocols, and validated using the


Intel proprietary SDK simulator. The model is validated for different processor pa-

rameters like processor utilization, queue length, and transmit rate. This validation

is done across different thread configurations. The salient feature of our model is

its ability to capture the architecture, applications and their interaction in great

detail. Our performance results show that while multithreading helps to improve

the throughput, increasing the total number of threads beyond a certain point (16

threads for HPA and 32 threads for PPA) results in performance saturation. Since

the transmit rate is limited by the packet buffer memory utilization, we investigate

different approaches to reduce the memory utilization. Our performance results

indicate the following:

• In the IXP processor, although the DRAM is utilized 100%, the SRAM is utilized only up to 27%; hence we explore placing the packet header in SRAM and the packet payload in DRAM. This gives an improvement in transmit rate of up to 20%. This scheme is particularly attractive since it does not involve any

additional hardware and there exists sufficient space in the SRAM to buffer

packet headers.

• Increasing the number of DRAM banks from 4 to 8 improves the throughput by up to 20%. However, when the number of banks is 8, the hash unit, a task-specific unit used for performing hardware lookup, becomes the bottleneck. Increasing the number of hash units from 1 to 2 gives an improvement in throughput of up to 60% as compared to the base case. We further observe that an identical improvement is obtained by using two hash units but a smaller number of microengines (4 MEs). So, given a fixed die area, an NP architecture with fewer processors but more task-specific units than the base IXP architecture gives better performance.

• When the number of outstanding memory requests in the IXP processor ex-

ceeds a threshold, all microengines with memory requests at the head of their command FIFO are stalled. Instead, if the number of pending memory


requests from each microengine is limited, an improvement in transmit rate of up to 47% compared to the base case can be achieved.


Chapter 4

Packet Reordering in Network

Processors

4.1 Introduction

NPs employ multiple parallel processors (microengines) to exploit the packet-level parallelism inherent in network workloads in order to support OC-48 line rates (using IXP 2400), as reported in the previous chapter. Each microengine processes packets independently of the other microengines. Since packets can get allocated to threads in different microengines, packet order at the output of the NP cannot

be guaranteed. Earlier works [5] [24] study the impact of packet reordering on

the TCP throughput in routers. However they do not consider the impact of NP

architecture on reordering. This chapter studies the impact of network processor

architecture on packet reordering and packet retransmission. We extend the Petri

net model developed in the previous chapter to evaluate the impact of reordering in

network processors.

This chapter is organized as follows. In the following section we describe packet

reordering in the IXP architecture. Section 4.3 presents the performance results.


Section 4.4 describes different ways to reduce packet reordering. Section 4.5 sum-

marizes this chapter.

4.2 Packet Reordering

When packets belonging to a single flow, having the same source and destination

IP address and port number, arrive at the destination in an order different from the

sequence order, we say that the packets are reordered. Packet reordering is a well-

known phenomenon in the Internet [4] [5].

Figure 4.1 Packet Reordering in Network Processors.

[Figure: packets P1, P2, P3, ... arriving at the RFIFO are allocated to threads T1-T4 of microengines ME1-ME4 and leave through the TFIFO, potentially out of order.]

Studies on backbone traffic measurement [8] suggest that TCP accounts for

80% of the Internet traffic. When packets get reordered, the TCP receiver begins to

generate duplicate ACKs. On receiving duplicate ACKs, the TCP sender concludes

that packet drops have occurred due to congestion. The Congestion Avoidance

algorithm [30] now kicks in and reduces the congestion window to roughly half

its current value. We explain the effect of reordering with the following scenario.


Assume that packets P0, P1, P2, P3, P4, P5 are packets of the same flow being sent

by A (sender) to B (receiver). The sender transmits packets P0, P1, P2, P3, P4, and

P5 strictly in that order. But, due to network delays/router processing, B receives

packets in the order, P0, P3, P4, P5, P2, P1. When B receives P3, instead of P1, it

sends an ACK for the last in-order packet received, i.e., in this case, the ACK for P0.

B continues to send ACKs for P0 when it receives P4, P5 and P2. When the sender

(A) sees 3 duplicate ACKs for packet P0, it concludes that the network is congested,

and according to the Congestion Avoidance and Fast Retransmit algorithms [30], it

halves the transmit window size. As a result, the TCP sender transmits fewer packets

than what the network can actually accommodate. Thus the effect of reordering

is not only the retransmission of packets that are already transmitted but also an

unnecessary reduction of the sender’s congestion window leading to under-utilization

of the network resources. The following subsection explains the impact of the network processor architecture on packet reordering.

4.2.1 Reordering in Network Processors

A network processor, being a multithreaded multiprocessor, can process packets of

the same flow in different microengines and different threads. This may result in

packets getting forwarded in an order different from the transmitted order.

Consider the scenario shown in Figure 4.1. Assume packets P1, P2, P3, P4 of

the same flow arrive at the receive buffer (RFIFO) of the network processor in order.

Let the packets be allocated to threads in different microengines in the following way: P1, P2, P3, P4 are allocated to ME1-T1 (Microengine1-Thread1), ME2-T1, ME3-

T1 and ME4-T1 respectively. Now packet P1, being processed by ME1-T1, can get

delayed with respect to P2, P3, P4. This can happen due to various reasons, e.g.,

processing of other threads in ME1, or pending memory requests in DRAM FIFO.

So the thread ME1-T1 completes the processing of P1 only after ME2-T1, ME3-T1,

and ME4-T1 have processed their respective packets. So packet P1 is delayed with

respect to P2, P3, P4 and is transmitted only after P2, P3, P4 have been forwarded.


This may result in a retransmission of P1 when multiple duplicate ACKs for P0

are received by the sender. This example explains how the concurrent processing

of packets can affect the ordering of packets. Note that multiple microengines are a feature of network processors; a network processor such as the IXP 2400 [15] has 64 threads and hence can process up to 64 packets concurrently. This potentially

increases the chances of packet reordering.

4.2.2 Transmit Buffer Induced Reordering

In this subsection we explain the impact of the transmit buffer on packet reordering.

The transmit buffer is a shared resource in the IXP architecture, so all the threads

compete for a common transmit buffer space.

Figure 4.2 Transmit Buffer Reordering.

[Figure: threads ME1-T1 through ME8-T8 are mapped to fixed slots TFIFO1, TFIFO2, ... of the transmit buffer, which is drained in FIFO order from head to tail.]

Hence, to ensure proper access of the transmit buffer, all threads should execute

a mutual exclusion operation. This, as shown later in Section 4.4.1.1, results in a significant drop in the throughput (a 61% drop in the transmit rate). So transmit buffer locations are allocated a priori to different threads. However, the transmit buffer dequeues

packets in a strict FIFO order. This aggravates packet reordering as illustrated in

the following example.

We consider a contiguous buffer allocation where different threads in different

microengines are allocated contiguous space in the transmit buffer. More specifically,

we will assume that ME1-T1 (Microengine1- Thread1) is allocated the first 64 bytes,

ME1-T2 is allocated the next 64 bytes and so on (refer to Figure 4.2). Assume


packets P1, P2, P3, P4 from flow F1 arrive strictly in that order in the receive

buffer. Further, assume that P1, P2, P3, P4 are allocated to ME1-T1, ME2-T1,

ME3-T1, and ME4-T1 respectively. After processing by different microengines, the

packets P1, P2, P3, and P4 are stored in TFIFO1, TFIFO9, TFIFO17 and TFIFO25

respectively. However, as mentioned earlier, packets are dequeued in a strict order

of the transmit buffer location. Thus, before P2 is dequeued from TFIFO9 location,

other packets from TFIFO2 to TFIFO8 will be dequeued. If packets from the same

flow as P2 are allocated to threads in microengine 1, they will get forwarded before

P2, causing the packet reordering problem.

Hence, the transmit buffer can induce reordering on its own: in this example, even if packets P1, P2, P3, and P4 complete processing in their arrival order, the dequeuing of packets from the transmit buffer still causes reordering. We explore different transmit buffer schemes and study their effect on reordering in Section 4.4.

4.2.3 Packet Ordering Mechanisms in IXP

The IXP processor supports the following two mechanisms to maintain packet order

in the network processor.

• Inter Thread Signaling (ITS): In this mechanism the start and finish tasks

of IPv4 forwarding are executed sequentially. However, the packet processing

functions are done in parallel and independently across all microengines (refer to Figure 4.3). The task of writing packets from DRAM to the TFIFO also takes place sequentially. Each thread waits for a signal from the previous thread before it can transfer its packet to DRAM. Once the packet has been transferred to the TFIFO, the next thread is signaled. Thus the sequential processing at the beginning and at the end of IPv4 ensures that packets are allocated in the transmit buffer and transmitted out in order.

In this scheme each thread is allocated a packet in sequential order. Assume


Figure 4.3 Inter Thread Signaling in the IXP.

[Figure: packet allocation (PKTALLOC) and packet transmission (PKT-Tx) are executed serially, thread by thread from ME1-T1 to ME8-T8, while the packet processing (PKT PROC) phase runs in parallel across all threads.]

that packets P1 and P2 arrive in the system in that order. An implicit logi-

cal ordering of the threads across all microengines, namely ME1-T1, ME1-T2, ..., ME1-T8, ME2-T1, ..., ME8-T8, is assumed. Further, ME1-T1 is assigned P1 and ME1-T2 is assigned P2. This assignment occurs in sequential order across all the threads in the processor. This ordering of threads is enforced using Inter Thread Signaling (ITS): each thread waits for a signal to start the sequential task, performs the allocation or transmission of a packet, and signals the neighboring thread. For example, ME1-T1 signals

ME1-T2 and ME1-T8 signals ME2-T1 as depicted in Figure 4.3.

• Asynchronous Insert Synchronous Remove: In this scheme, packet forwarding

is divided into four stages, namely the packet buffering stage, the packet processing stage, the reordering stage, and the transmit stage. In the initial stage, the

packet buffering stage, every packet is assigned a sequence number and buffered

in the memory (DRAM).


Figure 4.4 Asynchronous Insert Synchronous Remove (AISR) in the IXP.

[Figure: a pipeline of four stages, the packet RX stage, the packet processing stage, the reordering stage, and the transmit stage, mapped onto different microengines (ME1 through ME8).]

The sequence number is maintained for all the packets arriving in the system.

The packet sequencing is done by a single microengine and eight threads in that

microengine (refer to Figure 4.4). The sequence number of a newly arriving

packet in the system is one greater than that of the previous packet. After the packets are assigned sequence numbers, the packet processing stage processes packets independently and passes the packet handle to the next stage, the reordering stage. The packet processing stage is executed in parallel by 4 microengines (32 threads). In the reordering stage, a counting sort of the packet handles is carried out by the reordering block to restore packet ordering; here the packets are also assigned their transmit buffer addresses. A single microengine performs the counting sort. The transmit buffer address is passed on to the last stage, the transmit stage. The transmit block moves the packet out of the DRAM to the network interfaces; 3 microengines (24 threads) are involved in this final stage. (A simplified sketch of the reordering step is given below.)
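The fragment below illustrates the idea behind the reordering stage (an illustrative Python sketch only; the real stage performs a counting sort over packet handles inside a single microengine): packets carry the sequence numbers assigned at the RX stage, and the reorder block releases them to the transmit stage strictly in sequence order, holding back any packet that completes early.

def reorder_stage(processed_packets):
    # processed_packets: (sequence number, packet) pairs in completion order,
    # i.e. possibly out of order.  Yields packets in sequence-number order.
    next_seq = 0
    waiting = {}                      # early packets, keyed by sequence number
    for seq, pkt in processed_packets:
        waiting[seq] = pkt
        while next_seq in waiting:    # release every packet that is now in order
            yield waiting.pop(next_seq)
            next_seq += 1

# Example: packets 0..4 complete processing out of order.
completed = [(0, "P0"), (3, "P3"), (1, "P1"), (2, "P2"), (4, "P4")]
print(list(reorder_stage(completed)))   # ['P0', 'P1', 'P2', 'P3', 'P4']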


4.2.4 Performance Metric

The extent of packet reordering is measured using two performance metrics, namely

packet reordering and retransmission rates. Reordering is measured as the number

of duplicate ACKs that will be sent by the destination back to the source. Re-

transmission corresponds to the number of retransmitted packets, where 3 or more duplicate ACKs cause a retransmission. Both reordering and retransmission are reported as a percentage of the total number of packets transmitted (refer to Equations 4.1 and 4.2):

Reordering Rate = (Number of Duplicate ACKs) / (Total Number of Packets Sent)                (4.1)

Retransmission Rate = (Number of Retransmitted Packets) / (Total Number of Packets Sent)     (4.2)
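Both metrics can be computed from a trace of (flow, sequence number) pairs observed at the output of the NP. The helper below is an illustrative Python reimplementation of these definitions (our own sketch, not the instrumented CNET code); it mimics a receiver that buffers out-of-order packets, ACKs the last in-order packet, and triggers one retransmission per run of 3 duplicate ACKs.

def reorder_metrics(trace):
    # trace: (flow id, sequence number) pairs in the order packets leave the NP.
    # Returns (reordering rate, retransmission rate) following Eq. 4.1 and 4.2.
    expected, buffered, dup_acks = {}, {}, {}
    total_dup_acks = retransmissions = 0
    for flow, seq in trace:
        nxt = expected.setdefault(flow, 0)
        buf = buffered.setdefault(flow, set())
        if seq == nxt:                   # in-order packet: cumulative ACK advances
            nxt += 1
            while nxt in buf:
                buf.remove(nxt)
                nxt += 1
            expected[flow] = nxt
            dup_acks[flow] = 0
        elif seq > nxt:                  # gap: receiver re-ACKs the last in-order packet
            buf.add(seq)
            total_dup_acks += 1
            dup_acks[flow] = dup_acks.get(flow, 0) + 1
            if dup_acks[flow] == 3:      # 3 duplicate ACKs trigger one retransmission
                retransmissions += 1
    n = len(trace)
    return total_dup_acks / n, retransmissions / n

# Example: packet 1 of a flow is delayed behind packets 2-5.
print(reorder_metrics([(0, 0), (0, 2), (0, 3), (0, 4), (0, 5), (0, 1)]))
# four duplicate ACKs and one retransmission out of six packets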

We use packet forwarding throughput (Gbps) as a measure of the network processor

performance. In the following section we study the extent of packet reordering

induced by the architectural features of IXP processor discussed in Section 4.2.1

and 4.2.2.

4.3 Packet Reordering in IXP

4.3.1 Petri Net Model

We extend the Petri net model introduced in Section 3.2 for IPv4 running on IXP

2400. In order to take into account the flow information, used in determining packet

sequence, each token is given two distinct attributes, a flow number and a sequence

number. The CNET Petri net simulator is modified to retain the flow information

associated with tokens. This is used in determining the packet order at the transmit

stage, and hence the reorder and retransmission rates.


4.3.1.1 Petri Net Model for Multiple Hops

In order to study the impact of retransmission on the TCP throughput, the entire

end to end packet flow across multiple routers needs to be modeled. Packet re-

ordering induced by each router can cumulatively add up, leading to a significant

degradation in the TCP throughput. Packets in the Internet traverse, on an aver-

age, 16 hops to reach the destination [28]. We have simulated a network topology

(depicted in Figure 4.5) with multiple routers. We assume that each router in the

Figure 4.5 Simulated Network Topology.

[Figure: SOURCE 1 reaches DEST 1 through ROUTER 1, ROUTER 2 (both IXP 2400) and further routers, with other sources/routers feeding traffic into each router.]

above topology uses IXP2400 to forward packets. The multi hop environment is

incorporated by extending the IXP 2400 Petri net model, where the output of one

router (one Petri net model) is given as input to the next router.

We measure packet reordering for one flow, between SOURCE 1 and DEST 1.

Packets from other flows are used to simulate the network workload in the router.

To reduce the complexity of the simulator and the time taken to simulate, we use the

traffic going out of one router itself as the traffic from other sources/routers. This is

reasonable since, in the steady state, the amount and characteristics of traffic leaving

a router is similar to the traffic entering the next router. Hence, in our simulation,


we model only multiple flows from a single source to destination through multiple

routers, but we measure the reorder/retransmit rates for 1 out of n flows (we use n = 10), leaving the other (n-1) flows to model the network traffic entering/exiting the routers along the multiple hops. This approximates a real network scenario.

4.3.2 Validation

The Petri net model is simulated using the CNET Petri net simulator [40]. We

simulated up to 100,000 packets in each simulation. As before we validate the Petri

net results using the implementation in MicroengineC [17] which is executed on

SDK 3.51 [16]. The validation of the model is performed for different processor

parameters. In this section, the validation is restricted to reordering and retrans-

mission rates for a single hop. The validation is performed for different flow sizes: 640 B, 6.4 KB, and 64 KB. We assume a contiguous transmit buffer allocation in the validation. Further, we assume a 64 B packet size and each flow to contain a fixed number of packets. In our discussion we use 6.4 KB as the default flow size, which is also the average flow size reported in the Internet [28].

We assume a network traffic of 3 Gbps, which is higher than the maximum line

rate currently supported (2.5 Gbps for OC-48) by the IXP 2400. We do not model the network flow of ACK packets (from destination to source), nor do we assume rate reduction at the source on retransmission. This is done in order to simulate worst-case scenarios, as in DoS attacks. Table 4.1 shows the comparison of reordering

           Flow Size = 640 B          Flow Size = 6.4 KB         Flow Size = 64 KB
           Reorder     Retrans        Reorder     Retrans        Reorder     Retrans
CNET       31.7%       5.8%           35.85%      8.35%          36.36%      9.05%
SDK        32.4%       4.7%           33.4%       7.1%           33%         8.2%

Table 4.1: Petri Net Model Validation

and retransmission rates obtained from the Petri net (CNET) and SDK simulations

for a single hop. We observe that the reordering and retransmission rates obtained


from the Petri net model closely match the SDK simulations for different flow sizes.

This essentially validates the Petri net model.

4.3.3 Performance Results

Next we study packet reordering under multiple hops using our Petri net model and

the CNET simulation. In this study we assume packet sizes of 64 B and 512 B and a flow size of 6.4 KB. The 64 B packets represent a worst-case scenario and the 512 B packets correspond to the average packet size in the Internet. Figure 4.6 shows

Figure 4.6 Packet Reordering in NP.

(a) Reordering (b) Retransmission

reordering and retransmission rates for various packet sizes and for different hops.

We observe that the reordering and retransmission rates increase with the number

of hops for all of the packet sizes. Further, for a 64 B packet size, the percentage of

retransmitted packets is as high as 61% for 10 hops.

However, for a packet size of 512 B, the average packet size in the Internet [28], the reordering and retransmission rates are much lower (46% and 14% respectively). This occurs because only 8 KB/512 B = 16 packets can be buffered in the receive buffer at any given time with a 512 B packet size, whereas with a packet size of 64 B as many as 128 packets can be buffered. So only 25% of the total number of threads,


i.e., 16 out of 64 threads are busy in the IXP processor. This reduces the extent of

concurrent processing and correspondingly the packet retransmission in the network

processor.

Although the retransmission rate is much lower for 512 B packets compared to

that of 64 B, a 14% retransmission rate is still very high [24]. Earlier studies [24] indicate that a retransmission rate greater than 10% can result in a significant reduction (up to 60%) in packet throughput. For a 16-hop network, the average number of hops for packets in the Internet, the retransmission rate can be aggravated further.

In the following section we explore different architectural ways to reduce packet

reordering.

4.4 Reducing Packet Reordering

In order to reduce packet reordering and its impact, we explore a few transmit buffer

allocation schemes, as well as architectural parameter tuning in this section.

4.4.1 Buffer Allocation Schemes

The transmit buffer allocation, as observed in Section 4.2.2, can independently induce

packet reordering. Hence, we explore the following buffer allocation schemes to

reduce reordering.

Figure 4.7 Different Transmit Buffer Allocation Schemes.

[Three panels, (a) Global, (b) Local, (c) Strided, showing how threads ME1-T1 through ME8-T8 are mapped to transmit buffer slots TBUF1, TBUF2, ..., with synchronization (SYNCH) required in the global and local schemes.]


• Global Buffer Allocation: In this scheme (depicted in Figure 4.7(a)) the com-

peting threads are allocated transmit buffer space as and when a thread is

ready to move the packet to the TFIFO. Since the transmit buffer is shared

across all the microengines the allocation has to be done using global syn-

chronization, a mutual exclusion operation. The mutual exclusion operation

is performed across all threads in all microengines. The mutex variable is

stored in the scratch pad as it is common to all the MEs. Since synchroniza-

tion is performed across all the microengines this can result in a drop in the

throughput.

• Local Buffer Allocation: In this scheme, shown in Figure 4.7(b), contiguous

sets of locations are allocated to different microengines. But threads within a

microengine compete for a common chunk allocated to that microengine and

access it through a mutual exclusion operation. The transmit buffer is allo-

cated by performing a mutual exclusion operation locally within a microengine.

There is one mutex variable for each microengine. Since only threads within

a microengine share a single mutex variable, the overheads are relatively low

compared to the global buffer allocation scheme.

• Strided Buffer Allocation: This scheme (refer to Figure 4.7(c)) allocates buffers to microengines and threads a priori. However, unlike the contiguous case, the buffer is allocated in a strided way. The stride depends on the number of active microengines, so an NP running on 8 microengines has a stride of 8. The threads ME1-T1, ME2-T1, ..., ME8-T1, ME1-T2 place packets in TFIFO1, TFIFO2, ..., TFIFO8, TFIFO9 respectively (the index computation is sketched below).

A disadvantage of contiguous and strided allocation, as compared to local or global buffer allocation, is that they assume a fixed buffer size. In our study we assume a packet size of 64 B (as in a DoS attack, the worst-case scenario) [29] or 512 B (the average packet size) [28]. In a general situation, as the packet size may vary from the minimum to the maximum packet size, a buffer size equal to the


maximum packet size (1.5 KB) needs to be allocated. This may result in

under-utilization of the transmit buffer when the packet sizes vary widely. On

the positive side, the contiguous and strided buffer allocation schemes enjoy

the benefit of not requiring any synchronization, which leads to better packet

throughput.
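To make the slot assignment concrete, the following C sketch (ours, not code from the thesis) computes the transmit-buffer entry used by a given microengine/thread pair under the contiguous and strided schemes, assuming the 8-microengine, 8-thread configuration discussed above; indices here are 0-based, while the figure numbers TBUFs from 1.

#include <stdio.h>

#define NUM_MES      8
#define NUM_THREADS  8

/* Contiguous: each microengine owns a block of consecutive slots. */
static int tbuf_contiguous(int me, int thread)
{
    return me * NUM_THREADS + thread;
}

/* Strided: successive slots go to successive microengines, so the stride
 * between two slots of the same thread equals NUM_MES
 * (ME1-T1 -> slot 0, ME2-T1 -> slot 1, ..., ME8-T1 -> slot 7, ME1-T2 -> slot 8). */
static int tbuf_strided(int me, int thread)
{
    return thread * NUM_MES + me;
}

int main(void)
{
    printf("ME1-T1 strided slot: %d\n", tbuf_strided(0, 0)); /* 0 */
    printf("ME8-T1 strided slot: %d\n", tbuf_strided(7, 0)); /* 7 */
    printf("ME1-T2 strided slot: %d\n", tbuf_strided(0, 1)); /* 8 */
    printf("ME1-T2 contiguous slot: %d\n", tbuf_contiguous(0, 1)); /* 1 */
    return 0;
}

Because both mappings are fixed a priori, no run-time synchronization is needed, which is exactly the throughput advantage noted above.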

Figure 4.8 Impact of Various Buffer Allocation Schemes (64B Packet Size) - CNET Result. (a) Reordering (b) Retransmission

Figure 4.9 Impact of Various Buffer Allocation Schemes (512B Packet Size) - CNET Result. (a) Reordering (b) Retransmission

4.4.1.1 Performance Evaluation of Buffer Allocation Schemes

We have implemented different buffer allocation schemes on the SDK 3.5 simulator.

Table 4.2 reports the throughput achieved for different buffer allocation schemes

in a single hop network for 64 B and 512 B packet sizes.

Scheme        Throughput (Gbps)
              64 B      512 B
Contiguous    2.96      3.068
Strided       2.96      3.068
Local         2.1       2.3
Global        1.1       1.4

Table 4.2: Impact of Buffer Allocation Schemes on Throughput.

We observe that the local

and global allocation schemes suffer significant reduction in throughput. As ex-

plained in the previous section the performance degradation is due to the MUTEX

and the synchronization overhead. On the other hand, the strided buffer allocation

performs as well as contiguous allocation maintaining high packet throughput. The

throughput remains the same in each scheme as the packets go through multiple

hops. Hence we do not report throughput results for multiple hops. The impact of

various schemes on reordering and retransmission is shown in Figures 4.8 and 4.9

for 1, 5, and 10 hops for 64 B and 512 B packets. First let us look at the perfor-

mance results of 64 B packets. While strided and contiguous allocation result in

significant retransmission rates (greater than 55%) for 10 hops, the local and global

schemes reduce the retransmission rates to 45% and 33% respectively. However,

the throughputs achieved by the local and global schemes, 2.1 Gbps and 1.1 Gbps respectively, are unacceptably low. Although the retransmission rate is alarmingly high for a 10 hop network with 64 B packets, it may not be a major concern, as a large part of 64 B traffic usually corresponds to a DoS attack, and the sender may not react to duplicate ACKs.

duplicate ACKs.

In a more realistic situation, when the packet size is 512 B, the retransmission rates are 15%, 12%, 3%, and 2% for contiguous, strided, local, and global buffer allocation respectively. While the local and global allocation schemes achieve very low retransmission rates, their throughput is also very low. From this discussion we observe that there

exists a trade-off between the throughput and the retransmission rate achieved by


each scheme. The retransmission rate of the strided allocation scheme (12%) is still high enough to cause significant degradation in TCP performance [24]. The global scheme completely eliminates packet reordering due to transmit buffer allocation; the reordering/retransmission experienced in this scheme is entirely due to the concurrent processing of packets by the MEs and threads.

We study the impact of architecture parameters, such as the number of microengines and the number of threads, on retransmission in the following subsection. Further, since the strided buffer allocation reduces the retransmission rate compared to the contiguous allocation without affecting the throughput, we use strided buffer allocation in the following sections.

4.4.2 Tuning Architecture Parameters

Performance studies in the earlier section indicate that the packet throughput saturates beyond a total of 16 threads (refer to Section 3.3.2). The throughput results for different numbers of threads for packet forwarding are reported in Table 4.3 for easy reference.

Figure 4.10 Impact of Number of Microengines (64B Packet Size) - CNET Result.

(a) Reordering (b) Retransmission

The performance saturation occurs as the memory (DRAM) saturates beyond 16

threads. The additional threads (beyond 16) do not contribute to performance improvement, while at the same time they could adversely impact packet reordering and retransmission. Therefore we study the effects of the number of microengines and the number of threads using our Petri net model in the following subsections.

Figure 4.11 Impact of Number of Microengines (512B Packet Size) - CNET Result. (a) Reordering (b) Retransmission

4.4.2.1 Impact of the Number of Microengines

A network processor with fewer microengines and/or fewer threads, while giving the same throughput, can reduce the reordering due to concurrent processing. Figures 4.10 and 4.11 show the impact of the number of microengines (each microengine running 8 threads) on packet reordering/retransmission. We observe that the packet retransmission drastically reduces from 56% (64 B) and 12% (512 B), for 8 ME x 8 threads, to 19% (64 B) and 5% (512 B), for 2 ME x 8 threads. This reduction in

retransmission rates is achieved without any penalty on packet throughput.

Number of Threads     Transmit Rate (Gbps)
64 (8x8)              2.96
32 (4x8, 8x4)         2.96
16 (2x8, 4x4, 8x2)    2.96

Table 4.3: Transmit Rates for Different Numbers of Threads.

Thus, a network processor using 2 or 3 microengines can reduce retransmission by up to 27% for 64 B packets and 5% for 512 B packets, while providing the same transmit rate as an 8-microengine configuration. Further, reducing the number of microengines reduces the VLSI area demand in the NP; the freed area could instead be used for accelerators or


functional units like the hash unit or crypto units.

4.4.2.2 Impact of the Number of Threads

Figures 4.12 and 4.13 compare the impact of number of active threads on reordering

and retransmission.

Figure 4.12 Impact of Number of Threads (64B Packet Size) - CNET Result.

(a) Reordering (b) Retransmission

In the above figure, a 4x8 configuration refers to 4 microengines with each mi-

croengine running 8 threads. It is interesting to note that configurations running

the same total number of threads give different retransmission rates. For example,

a 4x8 configuration reduces the retransmission for 1, 5, and 10 hops by up to 21%

as compared to an 8x4 configuration. Both configurations give a throughput of 2.96

Gbps (refer to Table 4.3). This indicates that the impact of multiple microengines

on packet ordering is more severe compared to multiple threads for the proposed

strided allocation. A similar trend, as explained earlier, is observed for the 512 B packet size, although the reductions in the retransmission/reorder rates are lower. This is due to the limited buffering possible with the 512 B packet size.

4.4.3 Packet Sort: An Alternative Scheme

Figure 4.13 Impact of Number of Threads (512B Packet Size) - CNET Result. (a) Reordering (b) Retransmission

Our study on buffer allocation schemes indicates that while the global and local buffer allocation schemes can reduce the retransmission rates, they also incur a significant performance penalty due to synchronization. Hence, in this subsection we explore an algorithmic approach to eliminate reordering while ensuring that the performance penalty

in throughput is minimized. We propose a packet forwarding scheme, Packet sort,

where the packet processing is pipelined (refer to Figure 4.14).

Figure 4.14 Packet Sort Implementation in the IXP. [Figure: a packet processing stage (threads of ME1-ME4 running PKT PROC), an ordering stage (ME5-T1 performing an insertion sort), and a transmit stage (threads of ME6-ME8 running PKT TX), pipelined over time.]

In this scheme the packet processing is partitioned into three stages. In the first

stage, the packet processing stage, 4 microengines concurrently move the packets


from RFIFO to DRAM and subsequently process them (based on the packet for-

warding application). Packets are placed in DRAM, by the threads from the first

stage, based on the flow information. In the second stage, the packet ordering stage,

a single microengine sorts the packets based on the flow information and stores the result in the

scratch pad. The overhead involved in the sorting is minimal as the microengine

utilization is low in the second stage. The sorted packet addresses and the corre-

sponding transmit buffer addresses are stored in the scratch pad and communicated

to the remaining 3 microengines which execute the third stage of Packet sort, the

transmit stage. In this stage the packet is moved from the DRAM to the TFIFO

and further to the MAC by the 3 microengines. The scratch pad is used for the

communication between the pipe stages. We have implemented the Packet sort ap-

proach for packet forwarding in MicroengineC and measured its performance using

SDK. We have also developed the Petri net model of Packet sort and compared the

performance obtained from SDK and the Petri net simulation.
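The ordering stage can be pictured as a small insertion sort over packet descriptors held in the scratch pad. The following C sketch is illustrative only; the field names, batch handling, and helper structure are our assumptions and not the thesis MicroengineC code. It sorts descriptors left by the processing stage by flow and sequence number before they are handed to the transmit microengines.

#include <stdint.h>
#include <stdio.h>

struct pkt_desc {
    uint32_t flow_id;    /* derived from the packet's flow information  */
    uint32_t seq;        /* sequence number within the flow             */
    uint32_t dram_addr;  /* where the packet body was buffered in DRAM  */
};

/* Insertion sort by (flow_id, seq); cheap for the small batches that fit
 * in the scratch pad. */
static void order_batch(struct pkt_desc *d, int n)
{
    for (int i = 1; i < n; i++) {
        struct pkt_desc key = d[i];
        int j = i - 1;
        while (j >= 0 && (d[j].flow_id > key.flow_id ||
               (d[j].flow_id == key.flow_id && d[j].seq > key.seq))) {
            d[j + 1] = d[j];
            j--;
        }
        d[j + 1] = key;
    }
}

int main(void)
{
    struct pkt_desc batch[] = {
        { 1, 3, 0x100 }, { 1, 1, 0x200 }, { 2, 2, 0x300 }, { 1, 2, 0x400 }
    };
    order_batch(batch, 4);
    for (int i = 0; i < 4; i++)
        printf("flow %u seq %u -> DRAM 0x%x\n",
               (unsigned)batch[i].flow_id, (unsigned)batch[i].seq,
               (unsigned)batch[i].dram_addr);
    return 0;
}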

Packet sort completely eliminates packet reordering and gives a throughput of

2.5 Gbps. This approach is attractive as the network processor is able to support

current line rates (2.5 Gbps) for 10 or more hops.

Scheme         Concurrent Flows    Throughput (Gbps)
                                   SDK      CNET
Packet Sort    32                  2.56     2.3
               10                  2.5      2.3
               1                   1.7      1.6
ITS            NA                  2.3      2.1
AISR           NA                  1.1      0.960

Table 4.4: Comparison of Various Schemes to Overcome Reordering.

Table 4.4 reports the performance of Packet sort for different numbers of concurrent flows. In this experiment, we assume a constant line rate (of 3 Gbps) and a fixed packet size of 64 B. The number of concurrent flows determines the flow size as well as the number of packets per flow. The throughput of Packet sort decreases from 2.5 Gbps to 1.7 Gbps as the number of flows decreases from 32 to 1 (refer to Table 4.4).

Number of Concurrent Flows    ME Utilization (%)
1                             76.24
10                            60
32                            40

Note that even with 10 concurrent flows, which corresponds to the average

flow size in the Internet, the throughput achieved by Packet sort is 2.5 Gbps. The Petri net simulation results also exhibit a similar trend. The reason for the decrease in throughput with fewer concurrent flows is as follows. With a fixed line rate, the number of packets per flow grows as the number of concurrent flows decreases, resulting in a larger overhead in the sorting operation (refer to the microengine utilization table above).

Next we compare the performance of Packet sort with those of in-built schemes

namely AISR and ITS. For this purpose we implement AISR with 1 microengine

performing the buffering operation, 4 microengines performing the packet processing

block, 1 microengine executing the reordering block, and 2 microengines running

the transmit block. The ITS runs totally parallel, with threads in all microengines

performing the complete IPv4 forwarding. Note that the ITS and AISR schemes are

not affected by the number of concurrent flows, as these schemes maintain a strict

packet order. It is interesting to note that AISR performs poorly as compared to the

other schemes. In particular, ITS is able to support a line rate of 2.3 Gbps, which is close to the OC 48 line rate, but AISR supports only a line rate of 1.1 Gbps.

This occurs because there are only 8 threads buffering the packets to the DRAM in the first stage of AISR. This, coupled with the saturation of the DRAM, results in a lower throughput. However, an increase in the number of threads to 16 for the first stage reduced the throughput to 0.9 Gbps, since a global synchronization needs to be performed across all 16 threads. Our implementation of AISR may not be the most efficient. Hence, to estimate an upper bound on the AISR performance,


we take into account all DRAM transactions, including RFIFO-to-DRAM and DRAM-to-TFIFO transfers.1 To obtain the upper bound we consider only the receive block of AISR

to be running in the SDK. We observe that the maximum possible throughput is

2.1 Gbps. Hence the AISR throughput is limited to a maximum of 2.1 Gbps. So

packet sort gives a throughput improvement of at least 16% with respect to the

upper bound of AISR.

4.5 Summary

This chapter studies the impact of parallel processing in a network processor on packet reordering and retransmission. We observe that, in addition to the reordering due to parallel processing, the transmit buffer allocation adversely impacts reordering. We

summarize our contributions as follows:

• Our results reveal that the transmit buffer allocation significantly impacts

reordering and results in a packet retransmission rate of up to 61%. We explore

different transmit buffer allocation schemes, namely contiguous, strided, local, and global. The strided buffer allocation reduces the packet

retransmission by up to 24% while retaining the packet throughput of 3 Gbps.

Global and local buffer allocation schemes reduce retransmission rates further

but at the expense of performance.

• We study the impact of architecture parameters, namely, number of micro-

engines and number of threads on packet reordering. A network processor with

fewer microengines (2 or 3) or fewer threads (4 threads per microengine) can significantly reduce the retransmission rate while achieving the same through-

put.

1Earlier studies assume that the packets are already available in DRAM and do not account for RFIFO-to-DRAM or DRAM-to-TFIFO transfers. Our performance evaluation studies in Section 3.3 indicate that the DRAM saturates and limits the performance.


• We propose an alternative scheme, Packet sort, which dedicates a certain number of threads to sort the packets to eliminate retransmission. This scheme provides a line rate of 2.5 Gbps, which is close to the current line rate. We observe that Packet sort outperforms the in-built schemes in the IXP processor, namely Inter Thread Signaling (ITS) and Asynchronous Insert and Synchronous Remove (AISR), by up to 35%.


Chapter 5

Performance Analysis of Network

Processor in Bursty Traffic

The previous two chapters evaluated the performance of the network processor with

a Poisson packet arrival, where the packet size is constant and equal to 64B. In

this traffic the minimum packet size of 64 B is assumed to simulate the worst case

scenario encountered by a router. Earlier works on network processor performance

evaluation [29] [9] also consider only a similar scenario. However, earlier works on

traffic characterization [8] indicate that, on average, only 10% of the traffic is due to DoS attacks. Hence, the performance of the network processor under more realistic traffic needs to be evaluated. Earlier work on Internet traffic characterization observes that the traffic is self-similar and bursty in nature and develops a mathematical model

for the traffic.

A bursty traffic possesses the property of self-similarity [25]. Self-similarity is

the property in which a stochastic process (in this case the packet arrival) has the

same statistical properties at any time scale. So, a self-similar traffic will be bursty

at small time scales without any constant burst length. In effect, it is impossible

to predict (using mathematical models) whether a burst will occur and the burst

length. We develop a Petri net model of a realistic traffic based on the theoretical

model proposed for bursty traffic [23]. The performance of the network processor is


evaluated using this model. Further, this chapter evaluates the necessity of a store-

forward architecture for a network processor and explores various packet buffering

architectures.

The rest of the chapter is organized as follows. The following section provides

the motivation for this study. Section 5.2 describes the traffic model and Section

5.3 develops a Petri net model of the traffic generator used in the study. In Sec-

tion 5.4 we present the different packet buffering schemes. Section 5.5 presents the

detailed performance analysis and evaluation of the network processor. We provide

concluding remarks in Section 5.6.

5.1 Motivation

In this section we explain the need for a store-forward architecture in a network

processor [7]. The store-forward architecture is used in a network processor due to

the following reasons:

• Limited Buffering Space in RFIFO. The Receive FIFO size is only 8 KB. So if packets of size 1536 B are streaming into the network processor, the RFIFO has space to buffer only 5 packets; even with 512 B packets it can buffer only about 15 packets. An application such as IP forwarding requires on average 2666 nanoseconds to process a single packet of size 512 B [29], whereas the inter-arrival time between such packets is 1638 nanoseconds. Thus packets arrive in the system at a higher rate than they can be processed, and the RFIFO alone cannot absorb the backlog. Hence there is a need to buffer packets in larger memories such as DRAM; this makes the network processor a store-forward architecture. (A back-of-envelope check of these numbers appears after this list.)

• Bursty Traffic. Consider the traffic arrival in a router as shown in Figure 5.1.

Packets arrive at the receive FIFO of R1 which uses a network processor to

forward the packets. The traffic as seen in the RFIFO of R1 may contain

peaks and troughs as in a bursty arrival of packets.

Figure 5.1 Packet Arrival in NP. [Figure: number of bytes arriving at the router over time; the input traffic shows a peak at time t1 and a trough at t2.]

The network processor in R1 buffers the incoming packets in DRAM, processes these packets and

forwards the packets to the next hop. At time instant t1 the packet arrival

rate may be higher than the maximum supported line rate of the network

processor. However at time instant t2 the arrival rate drops. So if the network

processor buffers the packets at t1 temporarily then it can process the packets

with minimal packet drop. Otherwise a significant packet drop will occur at

time t1. So a store-forward architecture is used to minimize packet drops that

occur due to sudden bursts in the network traffic.
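As a back-of-envelope check of the figures quoted in the first bullet, the short C program below (our calculation, assuming the 2.5 Gbps OC-48 line rate used elsewhere in this thesis) recomputes the packet inter-arrival time and the RFIFO capacity for 512 B packets.

#include <stdio.h>

int main(void)
{
    const double line_rate_bps = 2.5e9;    /* assumed OC-48 line rate       */
    const int    pkt_bytes     = 512;      /* average packet size           */
    const int    rfifo_bytes   = 8 * 1024; /* Receive FIFO size             */

    /* Inter-arrival time at line rate: 512 B * 8 / 2.5 Gbps ~ 1638 ns. */
    double interarrival_ns = pkt_bytes * 8 / line_rate_bps * 1e9;

    /* Number of 512 B packets the RFIFO can hold (the text counts about 15). */
    int rfifo_pkts = rfifo_bytes / pkt_bytes;

    printf("inter-arrival time: %.0f ns\n", interarrival_ns);
    printf("RFIFO capacity:     %d packets of %d B\n", rfifo_pkts, pkt_bytes);
    return 0;
}

Since the per-packet processing time (2666 ns [29]) exceeds this inter-arrival time, the RFIFO alone cannot keep up, which is exactly the argument for buffering packets in DRAM.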

In this chapter we study various packet buffering schemes for a store-forward ar-

chitecture. The following section describes the traffic generator model used in the

simulation.

5.2 Generation of Bursty Traffic

Earlier works on traffic measurement indicate that the Internet traffic is bursty and

self-similar in nature [8]. We use a traffic model similar to [25] to simulate a bursty

traffic. The rest of the section describes the model of the traffic generator.

Figure 5.2 shows the traffic generation model [23] used in this study.

Figure 5.2 Bursty Traffic Generation. [Figure: ON/OFF sub-streams 1 through 48, with constant packet sizes of 64 B, 80 B, 96 B, ..., 1536 B, feed an aggregator that produces the synthetic self-similar traffic.]

In this

model the traffic is assumed to be an aggregate of different sub-streams. Each sub-

stream is assumed to be of constant packet size with finite ON/OFF periods. In

the ON time each sub-stream generates packets of constant size and restarts the

packet generation after the OFF period. In the ON period, the packet inter-arrival time within a sub-stream is equal to (packet size)/(line rate). The ON and OFF times of each

sub-stream are Pareto distributed with a probability distribution function f(x) given

by

f(x) = αβ^α / x^(α+1)    (5.1)

where α represents the shape parameter and β represents the scale parameter [32].

The shape parameter α can take values between 1 and 2, and the scale parameter β sets the minimum ON/OFF time. The resultant traffic generated is used as the input to the

network processor.
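For illustration, the ON and OFF durations of one sub-stream can be drawn by inverse-transform sampling of the Pareto distribution in Equation (5.1), x = β / u^(1/α) with u uniform on (0, 1]. The C sketch below is ours; the α and β values are placeholders, not the parameters used in the thesis experiments.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Draw one Pareto(alpha, beta) sample: x = beta / u^(1/alpha), u in (0,1]. */
static double pareto(double alpha, double beta)
{
    double u = (rand() + 1.0) / ((double)RAND_MAX + 1.0);
    return beta / pow(u, 1.0 / alpha);
}

int main(void)
{
    const double alpha    = 1.4;    /* shape: 1 < alpha < 2 (illustrative)   */
    const double beta_on  = 1000.0; /* scale: minimum ON time (illustrative) */
    const double beta_off = 4000.0; /* scale: minimum OFF time (illustrative)*/

    for (int i = 0; i < 5; i++)
        printf("ON %.0f cycles, OFF %.0f cycles\n",
               pareto(alpha, beta_on), pareto(alpha, beta_off));
    return 0;
}

Heavy-tailed ON/OFF periods drawn this way, aggregated over many sub-streams, are what give the generated traffic its bursty, self-similar character.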

It has been shown that the traffic generated using the above methodology is

bursty and similar to the traffic commonly encountered by routers [25]. The above

traffic generator is modeled using Petri nets. The following subsection describes the

Petri net model of the traffic generator.


5.3 Petri net Model of the Traffic Generator

Figure 5.3 shows the Petri net model of the traffic generator. The place NET-

WORK1 represents the external link. Packets arrive in the MAC at a constant rate

equal to LINERATE for BURST-TIME amount of time. This models the ON

period. These packets are buffered in the MAC, which is represented by the WAITMAC place. After BURST-TIME, a token is placed in WAIT1, which subsequently results in the firing of the transition BURST1. This removes the token from NETWORK1, which temporarily stops the traffic generation. The traffic generation is resumed after IDLE-TIME when a token is placed in NETWORK1 and NET1.

The IDLE-TIME corresponds to the OFF period.

Figure 5.3 Petri Net Model of the Traffic Generator. [Figure: places NETWORK1, NET1, WAIT1, IDLE1, and WAITMAC, transition BURST1, and the timing parameters LINERATE, BURST-TIME, and IDLE-TIME.]

In this model NETWORK1 and NET1 are initially assigned 1 token each to

start the packet generation. The multiple sub-streams (sources) are modeled using

colored Petri nets where each source is assigned a given color. The traffic generated

from each sub-stream is combined in WAITMAC. The packets generated by each

sub-stream can have different packet sizes. The attribute of the token is modified

to return the packet size. In our simulation experiments we varied the packet sizes

in steps of 16B starting from 64B. Further, a total of 48 different sub-streams are

used to generate the traffic.


5.4 Packet Buffering Schemes

We evaluate the performance of the following packet buffering schemes:

• Parallel Execution Model. In this scheme each thread in the network processor

buffers the packet in DRAM, processes the packet, and moves the packet to the

TFIFO. This packet flow in the network processor is the same as that described

in Section 2.1.1.2.

• Packet Buffering with a Pipelined Packet Flow. In this scheme the packet flow is divided into three stages (refer to Figure 5.4), namely, the packet receive stage, the packet processing stage, and the packet transmit stage. Each of the three stages is assigned to a different set of microengines: 2 microengines receive and buffer the packets, 4 microengines run the packet processing algorithm in parallel, and the remaining 2 microengines run the transmit stage of the pipeline. Hence, in this scheme each microengine is responsible for a subtask in the execution of the packet (a sketch of this stage assignment follows the list).

Figure 5.4 Pipelined Buffering Scheme. [Figure: ME1 and ME2 run the packet Rx stage, ME3-ME6 run the packet processing stage, and ME7 and ME8 run the packet Tx stage, pipelined over time.]
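A minimal C sketch of the static stage-to-microengine assignment implied by Figure 5.4 is given below. The particular mapping (ME1-ME2 for receive, ME3-ME6 for processing, ME7-ME8 for transmit) is our reading of the figure, shown only to make the 2/4/2 partitioning explicit.

#include <stdio.h>

enum stage { RX_STAGE, PROC_STAGE, TX_STAGE };

/* Map a 0-based microengine index to its pipeline stage: two microengines
 * move packets from the RFIFO to DRAM, four run the forwarding application,
 * and two move packets from DRAM to the TFIFO. */
static enum stage stage_of_me(int me)
{
    if (me < 2) return RX_STAGE;
    if (me < 6) return PROC_STAGE;
    return TX_STAGE;
}

int main(void)
{
    static const char *name[] = { "Rx", "Processing", "Tx" };
    for (int me = 0; me < 8; me++)
        printf("ME%d -> %s stage\n", me + 1, name[stage_of_me(me)]);
    return 0;
}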

We extend the Petri net model developed in Section 3.2 to simulate these packet

buffering schemes running the IPv4 forwarding application. The traffic generator,


described in the previous section, is used as the input to the validated Petri net model of the IXP 2400 processor, developed in Section 3.3, and simulated using CNET [40].

5.5 Results

In our study we assume a maximum line rate of 20 Gbps. Further, we assume

that to access the initial 8B of data, the DRAMs and SRAMs take 50 nanoseconds

and 8 nanoseconds respectively [18]. However to access larger chunks of data

(like 64B) in DRAM which are in contiguous memory locations, only an additional

5 nanoseconds per 8 B is required [12]. The traffic is generated using 48 sources, with each source generating traffic of a constant packet size, but the packet sizes generated by different sources vary. In our simulation the packet size increases in steps of 16 B from one source to the next; i.e., packets generated by source 1 are 64 B, packets generated by source 2 are 80 B, and so on (refer to Figure 5.2).

Figure 5.5 shows the traffic generated, in terms of the number of bytes, using

the Petri net model described in Section 5.3. This traffic is generated on a 2.5 Gbps (OC 48) link. We observe that this traffic is characterized by alternating peaks and troughs. The number of bytes in a burst, i.e., in a time frame of 1 millisecond1, varies from 5x10^6 to 5x10^7 bytes. This traffic, shown in Figure 5.5, is used as an

input to the network processor and used in the performance evaluation.

The following subsection evaluates the performance of different packet buffering

schemes in a bursty traffic scenario.

5.5.1 Impact of Packet Buffering

Tables 5.1, 5.2 and 5.3 compare the performance of the different packet buffering schemes explained in Section 5.4 for bursty traffic at different input line rates. We

observe that packets are dropped in case of the parallel buffering scheme even at low

1 1 millisecond = 6x10^5 IXP clock cycles


Figure 5.5 Bursty Traffic Generated using 48 Sources. [Plot: number of bytes in a burst (per interval of 6x10^5 clock cycles) versus time interval, varying between roughly 5x10^6 and 5x10^7 bytes over 10000 intervals; legend: 48 sources with idle time of 6x10^8.]

line rates of 1.7 Gbps. This occurs due to insufficient buffering space in the RFIFO

and/or the thread non-availability.

To justify this inference we measure the percentage utilization of RFIFO, per-

centage thread non-availability and percentage RFIFO non-availability.

• Thread Non-Availability: This parameter gives the percentage of time an in-

coming packet (to the RFIFO) is not processed due to the unavailability of

free threads. Note that in the pipelined buffering scheme the thread non-

availability is measured only for the receive stage of the pipeline (see Figure

5.4).

• RFIFO Non-Availability: This parameter gives the percentage of time the

RFIFO occupancy is full when a packet arrives at the RFIFO. A packet is

dropped either when the RFIFO is full or when there are no free threads

available.

• RFIFO Utilization: The time-averaged number of free bytes available in the RFIFO. When this metric is expressed as a percentage it is

RFIFO UTIL = (Number of Free Bytes Available in RFIFO / 8192 B) × 100    (5.2)

(a toy computation of this metric is given after this list).
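As a toy illustration of Equation (5.2), the snippet below time-averages a handful of made-up per-interval samples of the free space in the 8 KB RFIFO; the sample values are purely illustrative and not taken from the simulation.

#include <stdio.h>

#define RFIFO_BYTES 8192.0

int main(void)
{
    /* Made-up per-interval samples of free bytes in the RFIFO. */
    const double free_bytes[] = { 1024, 2048, 512, 4096 };
    const int n = sizeof(free_bytes) / sizeof(free_bytes[0]);

    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += free_bytes[i];

    /* Eq. (5.2): time-averaged free bytes as a percentage of the RFIFO size. */
    printf("RFIFO utilization = %.1f%%\n", (sum / n) / RFIFO_BYTES * 100.0);
    return 0;
}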


We observe that the RFIFO Non-Availability is 8% and 45% for the parallel

scheme for line rates of 1.7 Gbps and 6 Gbps respectively. In contrast the pipelined

buffering scheme has a non-availability of 0% and 15% for line rates of 1.7 Gbps

and 6 Gbps respectively. Packet drops occur in the parallel scheme even at lower line rates under bursty traffic, whereas in the pipelined buffering scheme no packets are dropped. This trend of higher packet drop in the parallel buffering scheme is even more pronounced at higher line rates. This is because under the parallel buffering scheme a single thread is responsible for the entire processing of a packet and is occupied for the entire period of that processing. Thus, once all 64 threads in the IXP 2400 are busy, a burst of packets can be dropped due to the lack of buffering. In the pipelined scheme, on the other hand, a dedicated set of microengines and threads, even though fewer in number, keeps moving packets from the RFIFO to DRAM; a thread in the Rx stage is responsible only for this transfer. As evidence, we present the utilization of the RFIFO (refer to Tables 5.1, 5.2 and 5.3). The pipelined buffering scheme buffers packets in the Rx stage (refer to Figure 5.4) and processes these packets when the line rate drops. However, in the parallel scheme packets are either processed immediately or they are dropped.

Scheme                       Output Rate (Gbps)   Drop Rate (Gbps)   RFIFO Util.   RFIFO Non-Avail.   Thread Non-Avail.
Parallel Buffering Scheme    1.6                  0.1                14%           8%                 7%
Pipelined Buffering Scheme   1.7                  0                  15%           0%                 0%

Table 5.1: Output Line Rates Supported with an Input Rate of 1.7 Gbps.

Table 5.4 compares the performance of the packet buffering schemes for an

exponential arrival with fixed packet size of 64B (DoS attack scenario). In contrast

to the bursty traffic scenario both these schemes support line rates of up to 2.96

Gbps without any packet drop. We observe that in both the schemes the line rate

is limited by DRAM bandwidth. Note that in this case packet buffering does not provide any advantage, as the network traffic is streaming in at a constant rate.

Scheme                       Output Rate (Gbps)   Drop Rate (Gbps)   RFIFO Util.   RFIFO Non-Avail.   Thread Non-Avail.
Parallel Buffering Scheme    2.29                 0.85               47%           37%                11%
Pipelined Buffering Scheme   2.9                  0.24               23%           5%                 7%

Table 5.2: Output Line Rates Supported with an Input Rate of 3.14 Gbps.

Scheme                       Output Rate (Gbps)   Drop Rate (Gbps)   RFIFO Util.   RFIFO Non-Avail.   Thread Non-Avail.
Parallel Buffering Scheme    4.3                  1.7                65%           45%                21%
Pipelined Buffering Scheme   5.3                  0.7                27%           15%                11%

Table 5.3: Output Line Rates Supported with an Input Rate of 6 Gbps.

Scheme                       Transmit Rate (Gbps)
Parallel Buffering Scheme    2.96
Pipelined Buffering Scheme   2.98

Table 5.4: Maximum Output Rates for Different Packet Buffering Schemes.

5.6 Summary

This chapter evaluates the performance of a network processor in bursty traffic.

Further, this chapter also examines the need for a store-forward architecture in a

network processor and also explores different packet buffering schemes. A bursty

traffic is simulated based on a mathematical model developed in earlier work. This

model is integrated into the Petri net model of the network processor and simulated

using CNET. Our results indicate that the parallel buffering scheme suffers a significant drop in packet throughput (up to 30%) due to packet drops under bursty traffic

even at line rates of 3.14 Gbps. However, the pipelined buffering scheme is able to

support a line rate of 3.14 Gbps. For an input line rate of 6 Gbps the pipelined


buffering scheme achieves line rates of up to 5.3 Gbps. This scheme also has fewer packet drops than the parallel scheme at lower line rates. However, in the case of a constant packet size with exponential arrivals, the parallel and pipelined schemes give similar transmit rates.


Chapter 6

Related Work

In this chapter we present a discussion on related work on network processors.

Section 6.1 discusses the related work in the performance evaluation of network

processors. Section 6.2 deals with the related work in the area of packet reordering

and bursty traffic evaluation.

6.1 Network Processor Performance Evaluation.

Crowley et al. [7] evaluate the performance of different processor architectures

for network applications using a trace-driven simulation approach. This work eval-

uates the performance of four architectures namely - superscalar, fine grain mul-

tithreaded (FGMT), simultaneous multithreaded (SMT), and chip multiprocessors

(CMP). This work evaluates the performance of these architectures on three dif-

ferent applications, viz., IPv4 forwarding, Message Digest 5, and Data encryption

applications. This work explores the impact of different processor parameters like

number of threads, processor clock rate for the different architectures and applica-

tions. This work concludes that the SMT architecture is best suited for network

processors. However, a major drawback of this work is that they do not model the

buffering of packets in the network processor. Our study shows that packet buffer-

ing is the bottleneck and the processor-memory interaction needs to be modeled in


greater detail.

Spalink et al. [29] evaluate the performance of IPv4 forwarding running on a

IXP 1200 processor. This study evaluates the performance of the processor using

the Intel SDK 3.5 tool [16]. This work also uses a DoS traffic model for incoming

traffic with packets of 64 B size. This work also studies the impact of number of

threads on the throughput of the network processor. Their results indicate that

DRAM is the bottleneck resource and limits the throughput to 1.377 Gbps.

Wolf et al. [9] use an analytical approach to model a network processor. This

model derives mathematical equations for throughput, processor utilization, and

memory access based on existing models for multithreaded architectures [1] and

standard queuing models. Further, this study evaluates the impact of number of

threads, processor clock rate, data cache and instruction cache sizes. This work

also evaluates the impact on the chip area due to these parameters and uses a

performance-area metric to evaluate the performance of the network processor. Fi-

nally, this work gives insights, based on their results, into future directions in network

processor trends. A major drawback with this work is that it does not model the

processor-memory interaction in detail. Our work, on the other hand, uses a Petri

net model for performance evaluation. A salient feature of our approach is that

the PN model models the network processor, architecture, application and their

interaction.

Thiele et al. [31] develop a genetic algorithm framework for the design space

exploration of network processors. Given an application, flow characterization, and

available resources, their framework schedules the tasks of the applications and

binds them to the resources. More specifically, their framework consists of a task

and resource usage model. The task model models the different packet processing

functions such as header or payload processing functions. The resource model cap-

tures the utilization of various resources. Further, this work explores the different

resource-task mapping and finds an optimum mapping based on constraints like the

chip area.


Weng and Wolf [37] construct an annotated directed acyclic graph of the ap-

plications using run time traces and use this to perform design space exploration.

They use a randomized algorithm to perform mapping of nodes to processors and

memories. The system throughput is determined by modeling the processing time

for each processing element in the system based on its workload, the memory con-

tention on each memory interface, and the communication overhead between the

pipeline stages. Memory contention is modeled using a queuing network approach.

At the end of the mapping process the best overall mapping is reported. While the

work reported in [31, 37] deal with design space exploration of application program,

and its corresponding mapping to a given network processor architecture, our work

deals with a few architecture enhancements for performance improvement.

Wolf et al. [38] develop a benchmark namely, Commbench, to evaluate the

performance of network processors. This benchmark suite classifies the network ap-

plication broadly into header processing and payload processing applications. The

classification is done based on the amount of processing required for applications.

Based on simulations, this work contrasts the performance of Commbench with the

standard SPEC benchmark. This work characterizes the performance of Comm-

bench applications with respect to instruction, data set locality, type of instructions

like the number of loads/stores. A drawback with this work is that it evaluates the

performance on a generic processor rather than a network processor.

6.2 Packet Reordering in Network Processors.

Bennett et al. [5] are one of the earliest to report the problem of reordering and

its potential impact on the network throughput. They study the impact of packet

reordering in a backbone link. They observe that the parallelism existing in the

Internet due to multiple parallel paths leads to reordering of packets. They pro-

pose a modification in the TCP protocol to take into account the impact of packet

reordering.

Laor et al. [24] artificially introduce packet reordering and study the impact


of reordering and retransmission on throughput. This work evaluates the effect

of packet reordering on the application throughput by simulating a backbone link.

This study evaluates the impact on application throughput by artificially inducing

reordering for multiplexed flows. Further, this work also evaluates the impact of

number of concurrent flows arriving in the router. This study indicates that a

retransmission of 10% of packets can reduce the network bandwidth significantly by

up to 60%.

Jaiswal et al. [19] classify the reordered/retransmitted packets as those arising

from routing loops, network duplication, due to the loss of the packet, and due to

the parallelism in packet transmission. They propose an evaluation methodology

to classify the packets and evaluate it on a backbone link. Their results indicate

a retransmission rate of about 5%, most of which is due to packet loss. The impact of

network anomalies like reordering is smaller than that of packet loss. However, this work does not measure the impact of the network anomalies on the application throughput. While these works evaluate packet reordering and its impact on packet retransmission, their studies are not specifically on network processors. Hence they

do not consider the effects of concurrency and FIFO ordering on packet reordering.

Our work focuses on the impact of the network processor architecture on the packet

reordering/retransmission. Modern network processors have in-built mechanisms

(ITS and AISR) to overcome packet reordering. We compare the performance of

these schemes with our proposed scheme in Section 4.4.3.

There have been several attempts to characterize the Internet traffic. Zhang et

al. [39] are one of the earliest to characterize the Ethernet traffic. They observe that

the Ethernet traffic is self-similar in nature. They conclude that the traffic exhibits

fractal-like behavior. They further conduct a rigorous statistical analysis on Ethernet traffic collected between 1989 and 1992. The mathematical model developed in this

paper is further extended to the Internet traffic. Taqqu et al. [35] provide a unique

way to generate self-similar traffic. This work uses a superposition of a number of

ON/OFF sources (referred to as packet trains) which generates a Constant Bit Rate


traffic in the ON period. This method is used in several traffic generators such as NS. Kramer et al. [23] extend this traffic generation approach to simulate Internet traffic.

They use a Poisson distribution for the ON time. This thesis uses a similar traffic

generation model.


Chapter 7

Conclusions

7.1 Summary

This thesis deals with the performance evaluation of network processors. We develop

a Petri Net model for a commercial network processor (Intel IXP 2400, 2850) for dif-

ferent applications. The Petri net model is developed for three different applications

viz., IPv4 forwarding, Network Address Translation and IP security protocols. The

performance results are obtained by simulating Petri nets using CNET simulator.

Further this model is validated using the Intel proprietary SDK simulator. A salient

feature of our model is its ability to capture the architecture, applications and their

interaction in great detail. Initially, we study the performance of network processors

using a Poisson arrival process. The IXP 2400 achieves a throughput of 2.96 Gbps for

IPv4 and NAT and the IXP 2850 can achieve a throughput of 3.6 Gbps for IPSec

application. Our performance results indicate that the DRAM memory used for

packet buffering is the bottleneck. Our study shows that multithreading is effective

only up to a certain number of threads. Beyond this threshold packet buffer mem-

ory (DRAM) is fully utilized and increasing the number of threads is not beneficial.

Further, we observe that increasing the number of microengines beyond 4 provides

no additional gain in the throughput. In order to reduce the utilization of DRAM


we store packet headers in SRAM. We obtain up to 20% improvement in the trans-

mit rate at no additional hardware cost. Since DRAM is the bottleneck we explore

increasing the number of DRAM banks. Our results indicate that a network processor with 8 DRAM banks improves the throughput by up to 20%. However, when the

number of DRAM banks is 8, the hash unit, a task specific unit used for performing

hardware lookup, becomes the bottleneck. Increasing the number of hash units from

1 to 2 gives an improvement in the throughput to 4.8 Gbps, an improvement of 60%

as compared to the base case. We further observe that an identical improvement is

obtained by using two hash units but a smaller number of microengines (4 MEs). So, given a fixed die area, an NP architecture with fewer processors (supporting 16 or more threads) but more task-specific units, relative to the base IXP architecture, gives better performance.

The second part of the thesis studies the impact of packet-level parallel processing in a network processor on packet reordering and retransmission under the

fast retransmission model. The Petri net model developed in Chapter 3 is extended

to take into account packet attributes like Source and Destination IP addresses and

sequence number. Further, the model is extended to study the impact of reordering

on a multi-hop network. We evaluate the impact on reordering and retransmission

for a multi-hop environment of 1, 5, and 10 hops and for different packet sizes, 64

B and 512 B. We observe that the reordering/retransmission increases non-linearly

with the number of hops, reaching up to 60% for 10 hops. Our results indicate that

the parallel architecture of the network processor can severely impact reordering

and can cause up to 60% retransmission in a 10 hop scenario. Further, we observe

that in addition to reordering due to parallel processing, transmit buffer allocation

for each thread in a microengine severely impacts packet reordering. This is due

to the strict FIFO order dequeuing from the transmit buffer explained in detail in

Section 2. Hence we explore the following buffer allocation schemes - global, local,

contiguous, and strided buffer allocation.


In the global buffer allocation, transmit buffer space for threads from different

microengines are allocated in a critical section and the threads compete for a mutex

lock to enter the critical section. This reduces reordering to 14% but also drastically

reduces the throughput to 1 Gbps. Hence we explore a local buffer allocation scheme

where only threads from the same microengine compete for a common buffer space

and different microengines are allocated a fixed buffer space a-priori. This scheme

results in a reordering of 18% but gives a throughput of 2 Gbps.

The global and local allocation schemes use mutual exclusion for transmit buffer

allocation. These schemes significantly reduce the packet reordering, 14% in global

and 18% in local, but also result in lower performance, 1 Gbps in global and

2 Gbps in local buffer allocation. On the other hand, a static buffer allocation,

contiguous and strided, without any mutual exclusion gives a transmit throughput

of 3 Gbps but with a packet reordering of up to 33.4%. In strided buffer allocation

threads within a microengine are allocated a fixed space, decided a-priori, with

a fixed stride between them. Threads from different microengines are allocated

successive locations in the transmit buffer.

Further, our results indicate that packet reordering reduces for a network proces-

sor with a smaller number of microengines without significantly affecting the throughput

rates. The retransmission rate reduces from 61% (for 10 hops), for a network proces-

sor with 8 microengines and 8 threads, to 19% (for 10 hops) for a network processor

with 2 microengines and 8 threads or 4 microengines and 4 threads. This is achieved

without sacrificing the performance (2.96 Gbps). This is because the throughput of

the network processor saturates beyond a total of 16 threads as observed in Section

3.4.3. Based on this observation we propose a scheme, Packet sort, in which a few mi-

croengines/threads are dedicated to sort the packets in-order at the transmit buffer

side. Packet sort is able to support a line rate of up to 2.5 Gbps without any packet


reordering. Our results indicate that Packet sort achieves a significant throughput improvement of up to 35% over the in-built schemes in the IXP, namely Inter

Thread Signaling (ITS) and Asynchronous Insert and Synchronous Remove (AISR).

The final part of this thesis investigates the performance of the network processor

in a bursty traffic scenario. We model bursty traffic using a Pareto distribution.

Further, we explore various packet buffering schemes. In particular, we consider parallel and pipelined packet flow architectures. Our results indicate that the parallel

scheme supports line rates up to 4.3 Gbps and the pipelined scheme supports line

rates up to 5.3 Gbps.

7.2 Future Directions

Below we list a few possibilities for extending our work.

• The Petri net model developed in Chapter 3 executes a single application.

However, with increasing functionalities at the network layer there is a need

to support multiple concurrent applications in routers. For example, modern

edge routers will run a cryptographic application, a port-scan application for

attacks, a virus scan for detecting viruses, in addition to forwarding. Further,

the same router might also run network address translation. Hence it will be interesting to evaluate the performance of the network processor running multiple concurrent applications.

• Our study on packet reordering does not model the network flow of ACK

packets (from destination to source). Nor do we assume rate reduction at

source on retransmissions in a fast retransmit algorithm. Our study can be

extended to take these into account and study the performance. Further,

the retransmission of a packet due to reordering results in the packet being

resent. It is important to note that these retransmitted packets will not need

any processing in case they are still available in the DRAM. Heuristics can be developed to forward these packets without any additional processing costs


being incurred by the network processor.


Bibliography

[1] A. Agarwal. Performance tradeoffs in multithreaded processors. IEEE Transac-

tions on Parallel and Distributed Systems, 3(5):525-539, Sept 1992.

[2] F. Baker. RFC 1812 - Requirements for IP Version 4 Routers, June 1995.

[3] J. Banks, J. Carson, and B. Nelson. Discrete-Event System Simulation. Prentice

Hall International, 1998.

[4] J. Bellardo, S. Savage. Measuring packet reordering. Proceedings of the ACM

SIGCOMM IMW, Marseille, France, November 2002.

[5] J. Bennett, C. Partridge, and N. Shectman. Packet Reordering is not Patholog-

ical Network Behavior. IEEE/ACM Transactions on Networking, 7(6):789-798,

1999.

[6] J. Cao, William S. Cleveland, Dong Lin, Don X. Sun. On the Non stationarity

of Internet Traffic. Proceedings of ACM SIGMETRICS, pp 102–112, 2001.

[7] P. Crowley, M. Fiuczynski, J.L. Baer, B. Bershad. Characterizing processor

architectures for programmable network interfaces. In Proceedings of Interna-

tional Conference on Supercomputing, Feb 2000.

[8] C. Fraleigh, S. Moon, C. Diot, B. Lyles, and F. Tobagi. Packet-level traffic

measurements from a tier-1 IP backbone. Technical Report TR01-ATL110101,

Sprint ATL Technical Report, November 2001.


[9] M. Franklin, T. Wolf. A Network Processor Performance and Design Model with

Benchmark Parameterization. First Workshop on Network Processors, Cam-

bridge, MA, February 2002.

[10] L. Garber. Denial of Service attacks rip the Internet. IEEE Computer, 33(4):12-17, Apr. 2000.

[11] R. Govindarajan, F. Suciu, W. Zuberek. Timed Petri Net Models of Multi-

threaded Multiprocessor Architectures, Proc. of the 7th International Work-

shop on Petri Nets and Performance Models, pp.163-172, Saint Malo, France,

June 1997.

[12] J. Hasan, Satish Chandra, T. N. Vijaykumar. Efficient Use of Memory Band-

width to Improve Network Processor Throughput. In Proceedings of Interna-

tional Symposium on Computer Architecture, June 2003.

[13] IBM. The Network Processor: Enabling Technology for high performance Net-

working. IBM Microelectronics, 1999.

[14] Intel Corporation, Intel IXP 1200 Network Processor Hardware Reference Man-

ual. Revision 8, pp 27-29,102-104, August 2001.

[15] Intel Corporation, Intel IXP 2400 Network Processor Hardware Reference Man-

ual. Revision 7, November 2003.

[16] Intel IXP2400/IXP2800 Development Tools Users Guide. Revision 11, March

2004.

[17] Intel IXP2400/IXP2800 Network Processors Microengine C Language Support

Reference Manual. Revision 9, November 2003.

[18] B. Jacob and D. Wang. DRAM: Architectures, Interfaces, and Systems: A Tutorial. (http://www.ee.umd.edu/~blj/talks/DRAM-Tutorial-isca2002-2.pdf)

[19] S. Jaiswal, G. Iannaccone, C. Diot, J. Kurose, and D. Towsley. Measurement and classification of out-of-sequence packets in a Tier-1 IP Backbone. International Measurement Workshop (IMW), 2003.

[20] K. Jensen. A Brief Introduction to Coloured Petri Nets. In Proceedings of the TACAS 1997 Workshop, Lecture Notes in Computer Science Vol. 1217, Springer-Verlag, 1997, pp. 203–208.

[21] S. Kent, R. Atkinson. RFC 2402 - IP Authentication Header (AH), November

1998.

[22] S. Kent, R. Atkinson. RFC 2406 - IP Encapsulating Security Payload (ESP),

November 1998.

[23] G. Kramer. Generation of self-similar traffic using Traf Gen 3. (http://wwwcsif.cs.ucdavis.edu/~kramer/code/trf_gen3.html)

[24] M. Laor and L. Gendel. The Effect of Packet Reordering in a Backbone Link on Application Throughput. IEEE Network, September/October 2002.

[25] W. Leland, M. S. Taqqu, W. Willinger, and D. Wilson. On the Self-Similar Nature of Ethernet Traffic. Proc. of SIGCOMM, September 1993.

[26] Motorola C-5 Network Processor Hardware Reference Manual, Revision 1.7,

October 2001.

[27] Y. Narahari and N. Viswanadham. Performance Modeling of Automated Manufacturing Systems, Prentice Hall, 1992.

[28] National Laboratory for Applied Network Research (NLANR). Insights into Current Internet Traffic Workloads. (http://www.nlanr.net/NA/tutorial.html)

[29] T. Spalink, S. Karlin, and L. Peterson. Evaluating Network Processors for IP Forwarding. Technical Report TR-626-00, Department of Computer Science, Princeton University, November 2000.

[30] R. Stevens. TCP/IP Illustrated, Volume 1: The Protocols, Addison-Wesley,

1994.

[31] L. Thiele, S. Chakraborty, M. Gries, and S. Kunzli. Design space exploration of network processor architectures. First Workshop on Network Processors, Cambridge, MA, February 2002.

[32] K. Trivedi. Probability and Statistics with Reliability, Queuing and Computer Science Applications, Wiley Interscience, 2002.

[33] R. Saavedra-Barrera, D. Culler, and T. Von Eicken. Analysis of multithreaded

architectures for parallel computing. In Second Annual ACM Symposium on

Parallel Algorithms and Architectures, pages 169–178, July 1990.

[34] P. Srisuresh and K. Egevang. RFC 3022 - Traditional IP Network Address Translator (Traditional NAT), January 2001.

[35] M. S. Taqqu, W. Willinger, and R. Sherman. Proof of a Fundamental Result in Self-Similar Traffic Modeling. ACM SIGCOMM Computer Communication Review, vol. 27, pp. 5–23, 1997.

[36] R. Thayer, N. Doraswamy, R. Glenn. RFC 2411 - IP Security Document

Roadmap, November 1998.

[37] N. Weng and T. Wolf. Pipelining vs. multiprocessing: Choosing the right network processor topology. Proceedings of the Advanced Networking and Communication Hardware Workshop (ANCHOR 2004), held in conjunction with the 31st Annual International Symposium on Computer Architecture (ISCA 2004), June 2004.

[38] T. Wolf and M. Franklin. CommBench: A Telecommunications Benchmark for Network Processors. In Proceedings of the International Symposium on Performance Analysis of Systems and Software, April 2000, pp. 154–162.

[39] Y. Zhang, L. Breslau, V. Paxson and S. Shenker. On the Characteristics and

Origins of Internet Flow Rates. In Proceedings of SIGCOMM, August 2002.

[40] W. M. Zuberek. Modeling using Timed Petri Nets - event-driven simulation. Technical Report No. 9602, Dept. of Computer Science, Memorial Univ. of Newfoundland, St. John's, Canada, 1996 (ftp://ftp.ca.mun.ca/pub/techreports/tr-9602.ps.Z).