Network Processor Algorithms: Design and Analysis


Page 1: Network Processor Algorithms:  Design and Analysis

Network Processor Algorithms: Design and Analysis

Stochastic Networks Conference

Montreal

July 22, 2004


Balaji Prabhakar

Stanford University

Page 2: Network Processor Algorithms:  Design and Analysis

Overview

• Network Processors
  – What are they?
  – Why are they interesting to industry and to researchers?
• SIFT: a simple algorithm for identifying large flows
  – The algorithm and its uses
• Traffic statistics counters
  – The basic problem and algorithms
• Sharing processors and buffers
  – A cost/benefit analysis

Page 3: Network Processor Algorithms:  Design and Analysis

IP Routers

[Photos: Cisco GSR 12416: capacity 160 Gb/s, power 4.2 kW, roughly 6 ft tall, 19 in wide, 2 ft deep. Juniper M160: capacity 80 Gb/s, power 2.6 kW, roughly 3 ft tall, 19 in wide, 2.5 ft deep. Capacity is the sum of the rates of the line cards.]

Page 4: Network Processor Algorithms:  Design and Analysis

A Detailed Sketch

[Diagram: line cards, each containing a network processor, a lookup engine, and packet buffers, connected through an interconnection fabric (the switch, with an output scheduler) to the outputs.]

Page 5: Network Processor Algorithms:  Design and Analysis

Network Processors

• Network processors are an increasingly important component of IP routers
• They perform a number of tasks (essentially everything except switching and route lookup)
  – buffer management
  – congestion control
  – output scheduling
  – traffic statistics counters
  – security …
• They are programmable, hence add great flexibility to a router's functionality

Page 6: Network Processor Algorithms:  Design and Analysis

Network Processors

• But, because they operate under severe constraints
  – very high line rates
  – heat constraints
  the algorithms that they can support should be lightweight
• They have become very attractive to industry
• They give rise to some interesting algorithmic and performance analytic questions

Page 7: Network Processor Algorithms:  Design and Analysis

Rest Of The Talk

• SIFT: a simple algorithm for identifying large flows
  – The algorithm and its uses (with Arpita Ghosh and Costas Psounis)
• Traffic statistics counters
  – The basic problem and algorithms (with Sundar Iyer, Nick McKeown and Devavrat Shah)
• Sharing processors and buffers
  – A cost/benefit analysis (with Vivek Farias and Ciamac Moallemi)

Page 8: Network Processor Algorithms:  Design and Analysis

SIFT: Motivation

• Current egress buffers on router line cards serve packets in a FIFO manner
• But, giving the packets of short flows a higher priority, e.g. using the SRPT (Shortest Remaining Processing Time) policy, reduces average flow delay (see the sketch below)
  – given the heavy-tailed nature of the Internet flow size distribution, the reduction in delay can be huge

[Diagram: the same egress buffer contents served in FIFO order vs. SRPT order.]
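To make the delay gain concrete, here is a minimal sketch (the flow sizes are my own illustrative choice) comparing mean completion times under FIFO and SRPT when all flows are present at time 0 at a unit-rate server; in that special case SRPT coincides with shortest-flow-first:

```python
# Mean flow completion time under FIFO vs. SRPT at a unit-rate server.
# All flows arrive at time 0, so SRPT reduces to shortest-flow-first.
# Flow sizes (in packets) are illustrative and heavy-tail-like.
flows = [1, 2, 1, 3, 100, 1, 2, 1, 50, 1]

def mean_completion(sizes):
    t, total = 0, 0
    for s in sizes:
        t += s          # this flow finishes at time t
        total += t
    return total / len(sizes)

print("FIFO:", mean_completion(flows))          # 77.4
print("SRPT:", mean_completion(sorted(flows)))  # 26.7 -- short flows finish early
```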

Page 9: Network Processor Algorithms:  Design and Analysis

But …

• SRPT is unimplementable
  – the router needs to know residual flow sizes for all enqueued flows: virtually impossible to implement
• Other pre-emptive schemes like SFF (shortest flow first) or LAS (least attained service) are likewise too complicated to implement
• This has led researchers to consider tagging flows at the edge, where the number of distinct flows is much smaller
  – but, this requires a different design of edge and core routers
  – more importantly, it needs extra space on IP packet headers to signal flow size
• Is something simpler possible?

Page 10: Network Processor Algorithms:  Design and Analysis

SIFT: A randomized algorithm

• Flip a coin with bias p (= 0.01, say) for heads on each arriving packet, independently from packet to packet
• A flow is "sampled" if one of its packets has a head on it

[Illustration: a packet stream marked T T T T H …; the flow is sampled at its first head.]

Page 11: Network Processor Algorithms:  Design and Analysis

SIFT: A Randomized Algorithm

• A flow of size X has roughly a 0.01X chance of being sampled
  – flows with fewer than 15 packets are sampled with prob below ≈ 0.15
  – flows with more than 100 packets are sampled with prob approaching 1
  – the precise probability is 1 - (1 - 0.01)^X
• Most short flows will not be sampled, most long flows will be (see the sketch below)
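A minimal sketch of the per-packet coin flip and the exact sampling probability (the function and variable names are mine); note how the linear 0.01X estimate overshoots for large flows:

```python
import random

P = 0.01  # per-packet probability of heads

def flow_is_sampled(num_packets, p=P):
    """Flip a biased coin on each packet; sampled on the first head."""
    return any(random.random() < p for _ in range(num_packets))

def sample_prob(num_packets, p=P):
    """Exact probability that a flow of this size is sampled."""
    return 1 - (1 - p) ** num_packets

for size in (5, 15, 100, 500):
    print(size, round(sample_prob(size), 3))
# 5 -> 0.049, 15 -> 0.14, 100 -> 0.634, 500 -> 0.993
```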

Page 12: Network Processor Algorithms:  Design and Analysis

The Accuracy of Classification

• Ideally, we would like to sample like the blue curve
• Sampling with prob p gives the red curve
  – there are false positives and false negatives
• Can we get the green curve?

[Figure: Prob(sampled) vs. flow size, showing the ideal step (blue), the one-head sampling curve (red), and a sharper achievable curve (green).]

Page 13: Network Processor Algorithms:  Design and Analysis

SIFT+

• Sample with a coin of bias q = 0.1
  – say that a flow is "sampled" if it gets two heads!
  – this reduces the chance of making errors
  – but, you have to count the number of heads (sketch below)
• So, how can we use SIFT at a router?
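A sketch of the two-heads variant (the names and the exact-probability helper are mine). Requiring two heads from a q = 0.1 coin gives a sharper cutoff between short and long flows than one head from a p = 0.01 coin:

```python
import random

Q = 0.1           # per-packet bias for SIFT+
HEADS_NEEDED = 2  # a flow is "sampled" on its second head

def sift_plus_sampled(num_packets, q=Q, needed=HEADS_NEEDED):
    """Count heads packet by packet; True once the second head appears."""
    heads = 0
    for _ in range(num_packets):
        heads += random.random() < q
        if heads >= needed:
            return True
    return False

def sift_plus_prob(x, q=Q):
    """Exact probability of at least two heads in x flips."""
    p0 = (1 - q) ** x                # no heads
    p1 = x * q * (1 - q) ** (x - 1)  # exactly one head
    return 1 - p0 - p1

for size in (5, 15, 100):
    print(size, round(sift_plus_prob(size), 3))
# 5 -> 0.081, 15 -> 0.451, 100 -> 1.0 (fewer errors at both ends)
```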

Page 14: Network Processor Algorithms:  Design and Analysis

SIFT at a router

• Sample incoming packets
• Place any packet with a head (or the second such packet) in the low priority buffer
• Place all further packets from this flow in the low priority buffer (to avoid mis-sequencing); an enqueue sketch follows below

[Diagram: all flows (total buffer B) pass through sampling; short flows get a high-priority buffer of size B/2, long flows a low-priority buffer of size B/2.]
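A minimal sketch of the enqueue side (all names are mine; the dequeue side simply gives the short-flow queue strict priority):

```python
import random

P = 0.01
sampled_flows = set()  # ids of flows that have drawn a head

def enqueue(flow_id, short_queue, long_queue, p=P):
    """Classify one arriving packet of the given flow."""
    if flow_id in sampled_flows:
        long_queue.append(flow_id)       # later packets follow the flow
    elif random.random() < p:            # this packet drew a head
        sampled_flows.add(flow_id)
        long_queue.append(flow_id)       # low-priority buffer
    else:
        short_queue.append(flow_id)      # high-priority buffer

short_q, long_q = [], []
for pkt_flow in [1, 1, 2, 3, 1, 2] * 200:
    enqueue(pkt_flow, short_q, long_q)
print(len(short_q), len(long_q))
```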

Page 15: Network Processor Algorithms:  Design and Analysis

Simulation results

• Topology: [Diagram: traffic sources connected through the router under test to traffic sinks.]

Page 16: Network Processor Algorithms:  Design and Analysis

Overall Average Delays

[Figure: overall average flow delays, SIFT vs. FIFO.]

Page 17: Network Processor Algorithms:  Design and Analysis

Average Delay for Short Flows

[Figure: average delay for short flows, SIFT vs. FIFO.]

Page 18: Network Processor Algorithms:  Design and Analysis

Average Delay for Long Flows

[Figure: average delay for long flows, SIFT vs. FIFO.]

Page 19: Network Processor Algorithms:  Design and Analysis

Implementation Requirements

• SIFT needs
  – two logical queues in one physical buffer
  – to sample arriving packets
  – a table for maintaining the ids of sampled flows
  – to check whether an incoming packet belongs to a sampled flow or not
• All quite simple to implement

Page 20: Network Processor Algorithms:  Design and Analysis

A Big Bonus

• The buffer of the short flows has very low occupancy
  – so, can we simply reduce it drastically without sacrificing performance?
• More precisely, suppose
  – we reduce the buffer size for the small flows, increase it for the large flows, keeping the total the same as FIFO

Page 21: Network Processor Algorithms:  Design and Analysis

SIFT Incurs Fewer Drops

[Figure: packet drops, SIFT vs. FIFO. Buffer_Size(Short flows) = 10; Buffer_Size(Long flows) = 290; Buffer_Size(Single FIFO Queue) = 300.]

Page 22: Network Processor Algorithms:  Design and Analysis

Reducing Total Buffer Size

• Suppose we reduce the buffer size of the long flows as well
• Questions:
  – will packet drops still be fewer?
  – will the delays still be as good?

Page 23: Network Processor Algorithms:  Design and Analysis

Drops With Less Total Buffer

[Figure: packet drops, SIFT (PRQ0 + PRQ1) vs. a single FIFO queue. Buffer_Size(PRQ0) = 10; Buffer_Size(PRQ1) = 190; Buffer_Size(One Queue) = 300.]

Page 24: Network Processor Algorithms:  Design and Analysis

Delay Histogram for Short Flows

[Figure: histogram of transfer times for short flows; percentage vs. transfer time (0 to ~0.91 sec), SIFT vs. FIFO.]

Page 25: Network Processor Algorithms:  Design and Analysis

Delay Histogram for Long Flows

[Figure: histogram of transfer times for long flows; percentage vs. transfer time (0 to ~0.91 sec), SIFT vs. FIFO.]

Page 26: Network Processor Algorithms:  Design and Analysis

Why SIFT Reduces Buffers

• The amount of buffering needed to keep links fully utilized
  – old formula: B = RTT × C = 0.25 s × 10 Gb/s = 2.5 Gb
  – corrected to: B = RTT × C / √N ≈ 250 Mb (worked numbers below)
• But, this formula is for large (elephant) flows, not for short (mice) flows
  – elephant arrival rate is 0.65 or 0.7 of C; hence smaller buffers suffice for them
  – mice buffers are almost empty due to high priority; mice don't cause elephant packet drops
  – elephants use TCP to regulate their sending rate according to congestion feedback

[Diagram: SIFT separates mice and elephants into their own buffers.]
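A worked version of the buffer arithmetic (the √N form is the buffer-sizing rule the slide's corrected figure is consistent with; N = 100 is my assumption, chosen to reproduce the 250 Mb number):

```python
import math

C   = 10e9   # line rate, bits/s
RTT = 0.25   # round-trip time, s
N   = 100    # assumed number of long-lived (elephant) TCP flows

old_rule  = C * RTT                  # bandwidth-delay product
corrected = C * RTT / math.sqrt(N)   # sqrt(N) correction for N flows
print(f"old rule:  {old_rule / 1e9:.1f} Gb")   # 2.5 Gb
print(f"corrected: {corrected / 1e6:.0f} Mb")  # 250 Mb
```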

Page 27: Network Processor Algorithms:  Design and Analysis

Conclusions for SIFT

• A randomized scheme; preliminary results show that
  – it has a low implementation complexity
  – it reduces delays drastically (users are happy)
  – with 30-35% smaller buffers at egress line cards (router manufacturers are happy)
• Leads to a "15 packets or fewer" lane on the Internet; could be useful
• Further work needed
  – at the moment we have a good understanding of how to sample, and extensive (and encouraging) simulation tests
  – need to understand the effect of reduced buffers on end-to-end congestion control algorithms

Page 28: Network Processor Algorithms:  Design and Analysis


Traffic Statistics Counters: Motivation

• Switches maintain statistics, typically using counters that are incremented when packets arrive

• At high line rates, memory technology is a limiting factor for the implementation of counters; for example, in a 40 Gb/s switch, each packet must be processed in 8 ns

• To maintain a counter per flow at these line rates, we would like an architecture with the speed of SRAM, and the density (size) of DRAM

Page 29: Network Processor Algorithms:  Design and Analysis

Hybrid Architecture

• Shah, Iyer, Prabhakar, and McKeown (2001) proposed a hybrid SRAM/DRAM architecture (a toy sketch follows below)

[Diagram: arrivals (at most one per time slot) increment one of N counters in SRAM; the counter management algorithm updates the corresponding counter in DRAM and empties the SRAM counter, once every b time slots.]
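A toy sketch of the architecture, with LCF (introduced on the next slides) as the counter management algorithm; all names and the traffic pattern are mine:

```python
# Minimal sketch of the hybrid architecture: small SRAM counters absorb
# per-packet increments; every b time slots the counter management
# algorithm flushes one SRAM counter to full-width DRAM.
N, b = 8, 2
sram = [0] * N
dram = [0] * N

def arrival(i):
    """At most one increment per time slot -- must run at SRAM speed."""
    sram[i] += 1

def flush_largest():
    """LCF: empty the largest SRAM counter into DRAM."""
    j = max(range(N), key=lambda i: sram[i])
    dram[j] += sram[j]
    sram[j] = 0

for t in range(1, 33):          # all traffic hammers counter 0
    arrival(0)
    if t % b == 0:
        flush_largest()
print(max(sram), dram[0] + sram[0])  # SRAM values stay tiny; counts stay exact
```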

Page 30: Network Processor Algorithms:  Design and Analysis


Counter Management Algorithm

• Shah et al. place a requirement on the counter management algorithm (CMA) that it must maintain all counter values accurately

• That is, given N and b, what should the size of each SRAM counter be so that no counts are missed?

Page 31: Network Processor Algorithms:  Design and Analysis

Some CMAs

• Round robin
  – maximum counter value is bN
• Largest Counter First (LCF)
  – optimal in terms of SRAM memory usage
  – no counter can have a value larger than ln(bN) / ln(b/(b-1))

Page 32: Network Processor Algorithms:  Design and Analysis

Analysis of LCF

• This upper bound is proved by establishing a bound on a potential (Lyapunov) function
  – let Qi(t) be the size of counter i at time t; the proof bounds an exponential function of the Qi(t)
• Hence, the size of the largest counter is at most ln(bN) / ln(b/(b-1))
• E.g. for b = 2, this is log2(2N), about 21 for N = 1 million counters (worked numbers below)
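Plugging numbers into the bound (a small script; it assumes the bound as reconstructed above):

```python
import math

def lcf_bound(N, b):
    """Reconstructed LCF bound: max counter value <= ln(bN) / ln(b/(b-1))."""
    return math.log(b * N) / math.log(b / (b - 1))

for b in (2, 4, 8):
    v = lcf_bound(10**6, b)
    bits = math.ceil(math.log2(v))
    print(f"b={b}: max value ≈ {v:.0f}, i.e. {bits}-bit SRAM counters")
```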

Page 33: Network Processor Algorithms:  Design and Analysis

An Implementable Algorithm

• LCF is difficult to implement
  – with one counter per flow, we would like to support at least 1 million counters
  – maintaining a sorted list of counters to determine the largest counter takes too much SRAM memory
• Ramabhadran and Varghese (2003) proposed a simpler algorithm with the same memory usage as LCF

Page 34: Network Processor Algorithms:  Design and Analysis

LCF with Threshold

• The algorithm keeps track of the counters that have value at least as large as b
• At any service time, let j be the counter with the largest value among those incremented since the previous service, and let c be its value
  – if c ≥ b, serve counter j
  – if c < b, serve any counter with value at least b; if no such counter exists, serve counter j
• Maintaining the set of counters with values at least b is a non-trivial problem; it is solved using a bitmap and an additional data structure (a simplified sketch follows below)
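The sketch referenced above: a simplified rendering of the threshold rule, with the bitmap and auxiliary structure abstracted into a Python set (all names are mine):

```python
b = 4
sram = {}               # counter id -> current value
over_threshold = set()  # ids with value >= b (a bitmap in the real design)
recent = {}             # counters incremented since the last service

def increment(i):
    sram[i] = sram.get(i, 0) + 1
    recent[i] = sram[i]
    if sram[i] >= b:
        over_threshold.add(i)

def service():
    """Once every b slots: choose a counter and flush it to DRAM."""
    global recent
    j = max(recent, key=recent.get) if recent else None
    if j is not None and sram[j] >= b:
        target = j                           # c >= b: serve j
    elif over_threshold:
        target = next(iter(over_threshold))  # else any counter >= b
    else:
        target = j                           # else fall back to j
    recent = {}
    if target is not None:
        sram[target] = 0                     # dram[target] += ... in hardware
        over_threshold.discard(target)
```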

• Is something even simpler possible?

Page 35: Network Processor Algorithms:  Design and Analysis

Some Simpler Algorithms …

• Possible approaches for a CMA that is simpler to implement:
  – arrival information (serve the largest counter among those incremented)
  – random sampling
  – round-robin pointer
• Trade-off between simplicity and performance: more SRAM is needed in the worst case for these schemes

Page 36: Network Processor Algorithms:  Design and Analysis

An Alternative Architecture

• Decision problem: given a counter with a particular value and the occupancy of the buffer, when should the counter value be moved to the FIFO buffer? What size counters does this lead to?
  – an interesting question; tractable with Poisson arrivals and exponential services

[Diagram: N counters in SRAM drain through a FIFO buffer into DRAM, under a counter management algorithm.]

Page 37: Network Processor Algorithms:  Design and Analysis

The Cost of Sharing

• We have seen that there is a very limited amount of buffering and processing capability in each line card
• In order to fully utilize these resources, it will become necessary to share them amongst the packets arriving at each line card
• But, sharing imposes a cost
  – we may need to traverse the switch fabric more often than needed
  – each of the two processors involved in a migration will need to do some processing; e.g. 1 local + 1 remote, instead of just 1
  – or, the host processor may simply be worse at the processing; e.g. 1 local versus K (> 1) remote
• Need to understand the tradeoff between costs and benefits
  – will focus on a specific queueing model
  – interested in simple rules
  – benefit measured in reduction of backlogs

Page 38: Network Processor Algorithms:  Design and Analysis

The Setup

• Does sharing reduce backlogs?

[Diagram: K parallel single-server queues, numbered 1 to K, each with Poisson(λ) arrivals and exp(1) service times.]

Page 39: Network Processor Algorithms:  Design and Analysis

Additive Threshold Policy

• Job arrives at queue 1
• Send the job to queue 2 if Q1 exceeds Q2 by more than a fixed threshold T
• Otherwise, keep the job in queue 1
• Analogous policy for jobs arriving at queue 2 (a toy simulation follows below)
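A toy discrete-event simulation of the two-queue model (the exact threshold rule and the parameter values are my reconstruction; the migration costs of the full model are omitted for brevity, so this illustrates only the backlog effect):

```python
import random

# Two shared queues under an additive threshold policy: a job arriving
# at queue i migrates to the other queue when Q_i >= Q_other + T.
T, LAM, MU = 3, 0.85, 1.0

def simulate(events=200_000, sharing=True, seed=0):
    rng = random.Random(seed)
    q = [0, 0]
    t = area = 0.0
    for _ in range(events):
        rates = [LAM, LAM, MU * (q[0] > 0), MU * (q[1] > 0)]
        dt = rng.expovariate(sum(rates))
        area += (q[0] + q[1]) * dt
        t += dt
        e = rng.choices(range(4), weights=rates)[0]
        if e < 2:                                  # arrival at queue e
            i, other = e, 1 - e
            if sharing and q[i] >= q[other] + T:
                q[other] += 1                      # migrate across the fabric
            else:
                q[i] += 1
        else:                                      # departure from queue e-2
            q[e - 2] -= 1
    return area / t                                # time-average total backlog

print("no sharing:   ", round(simulate(sharing=False), 2))
print("additive, T=3:", round(simulate(sharing=True), 2))
```

Because this toy omits migration and remote-processing costs, it does not exhibit the throughput loss of the theorem below; in the full cost model those overheads are what make the additive policy unstable at high load.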

Page 40: Network Processor Algorithms:  Design and Analysis

Additive Thresholds - Queue Tails

[Figure: tail probabilities of the backlog on a log scale (1e-5 to 1) vs. backlog (0 to 14), additive thresholds vs. no sharing.]

Page 41: Network Processor Algorithms:  Design and Analysis

Additive Thresholds - Stability

• Theorem: the additive policy is stable for arrival rates λ below a critical value strictly less than 1, and unstable above it
• So some throughput is sacrificed relative to no sharing, which is stable for all λ < 1

Page 42: Network Processor Algorithms:  Design and Analysis

Inference

• The pros/cons of sharing
  – reduction in backlogs
  – loss of throughput

Page 43: Network Processor Algorithms:  Design and Analysis

Multiplicative Threshold Policy

• Job arrives at queue 1
• Send the job to queue 2 if Q1 exceeds a constant multiple of Q2
• Otherwise, keep the job in queue 1
• Theorem: the multiplicative policy is stable for all λ < 1
• Interestingly, this policy improves delays while preserving throughput! (a sketch of the rule follows below)
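The multiplicative rule, reconstructed as migrating when one backlog exceeds a constant multiple of the other (the constant α is my choice); it can be swapped into the simulate() sketch above in place of the additive test:

```python
ALPHA = 2.0  # assumed multiplier; the slide's exact constant is not preserved

def should_migrate(q_here, q_other, alpha=ALPHA):
    """Multiplicative test: the bar for migrating rises with the backlog."""
    return q_here >= alpha * q_other + 1  # +1 avoids migrating from an empty queue
```

Because the migration bar rises in proportion to the competing backlog, migrations become rare as queues grow, which is consistent with stability for all λ < 1 alongside the delay improvement.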

Page 44: Network Processor Algorithms:  Design and Analysis

Multiplicative Thresholds - Queue Tails

[Figure: tail probabilities of the backlog on a log scale (1e-5 to 1) vs. backlog (0 to 14), multiplicative thresholds vs. no sharing.]

Page 45: Network Processor Algorithms:  Design and Analysis

Multiplicative Thresholds - Delay

[Figure: average delay (2 to 6.5) vs. the threshold parameter (1 to 5).]

Page 46: Network Processor Algorithms:  Design and Analysis

Conclusions

• Network processors add useful features to a router's function
• There are many algorithmic questions that come up
  – simple, high-performance algorithms are needed
• For the theorist, there are many new and interesting questions; we have seen three examples briefly
  – SIFT: a sampling algorithm
  – designing traffic statistics counters
  – sharing: a cost-benefit analysis