ip lookup arvind computer science & artificial intelligence lab

30
February 17, 2009 http:// csg.csail.mit.edu/arvind L06-1 IP Lookup Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology

Upload: rafael-duke

Post on 04-Jan-2016

31 views

Category:

Documents


0 download

DESCRIPTION

IP Lookup Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology. Line Card (LC). Arbitration. Packet Processor. Control Processor. SRAM (lookup table). Switch. Queue Manager. IP Lookup. Exit functions. LC. LC. LC. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 http://csg.csail.mit.edu/arvind L06-1

IP Lookup

Arvind Computer Science & Artificial Intelligence LabMassachusetts Institute of Technology

Page 2: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-2http://csg.csail.mit.edu/arvind

IP Lookup block in a router

QueueManager

Packet Processor

Exit functions

ControlProcessor

Line Card (LC)

IP Lookup

SRAM(lookup table)

Arbitration

Switch

LC

LC

LC

A packet is routed based on the “Longest Prefix Match” (LPM) of it’s IP address with entries in a routing tableLine rate and the order of arrival must be maintained line rate 15Mpps for 10GE

Page 3: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-3http://csg.csail.mit.edu/arvind

18

2

3

IP address Result M Ref

7.13.7.3 F

10.18.201.5 F

7.14.7.2

5.13.7.2 E

10.18.200.7 C

Sparse tree representation

3

A…

A…

B

C…

C…

5 D

F…

F…

14

A…

A…

7

F…

F…

200

F…

F…

F*

E5.*.*.*

D10.18.200.5

C10.18.200.*

B7.14.7.3

A7.14.*.* F…F…

F

F…

E5

7

10

255

0

14

4A In this lecture:Level 1: 16 bits Level 2: 8 bits Level 3: 8 bits

1 to 3 memory accesses

Page 4: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-4http://csg.csail.mit.edu/arvind

“C” version of LPMintlpm (IPA ipa) /* 3 memory lookups */{ int p;

/* Level 1: 16 bits */ p = RAM [ipa[31:16]]; if (isLeaf(p)) return value(p);

/* Level 2: 8 bits */ p = RAM [ptr(p) + ipa [15:8]]; if (isLeaf(p)) return value(p);

/* Level 3: 8 bits */ p = RAM [ptr(p) + ipa [7:0]];

return value(p); /* must be a leaf */}

Not obvious from the C code how to deal with - memory latency - pipelining

216 -1

0

…28 -1

0

28 -1

0

Must process a packet every 1/15 s or 67 ns

Must sustain 3 memory dependent lookups in 67 ns

Memory latency ~30ns to 40ns

Page 5: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-5http://csg.csail.mit.edu/arvind

Longest Prefix Match for IP lookup:3 possible implementation architectures

Rigid pipeline

Inefficient memory usage but simple design

Linear pipeline

Efficient memory usage through memory port replicator

Circular pipeline

Efficient memory with most complex control

Designer’s Ranking:

1 2 3Which is “best”?

Arvind, Nikhil, Rosenband & Dave ICCAD 2004

Page 6: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-6http://csg.csail.mit.edu/arvind

Circular pipeline

The fifo holds the request while the memory access is in progress

The architecture has been simplified for the sake of the lecture. Otherwise, a “completion buffer” has to be added at the exit to make sure that packets leave in order.

enter?enter?done?done?RAM

yesinQ

fifo

no

outQ

Page 7: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-7http://csg.csail.mit.edu/arvind

interface FIFO#(type t); method Action enq(t x); // enqueue an item method Action deq(); // remove oldest entry method t first(); // inspect oldest itemendinterface

FIFO

n = # of bits needed to represent a value of type t

not full

not empty

not empty

rdyenab

n

n

rdyenab

rdy

enq

deq

first

FIFO

module

Page 8: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-8http://csg.csail.mit.edu/arvind

Addr

Readyctr

(ctr > 0) ctr++

ctr--

deq

Enableenq

Request-Response Interface for Synchronous Memory

Synch MemLatency N

interface Mem#(type addrT, type dataT);method Action req(addrT x);

method Action deq(); method dataT peek();endinterface

Data

Ack

DataReady

req

deq

peek

Making a synchronous component latency- insensitive

Page 9: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-9http://csg.csail.mit.edu/arvind

Circular Pipeline Code rule enter (True); IP ip = inQ.first(); ram.req(ip[31:16]); fifo.enq(ip[15:0]); inQ.deq();endrule

enter?enter?done?done?RAM

inQ

fifo

rule recirculate (True); TableEntry p = ram.peek(); ram.deq(); IP rip = fifo.first(); if (isLeaf(p)) outQ.enq(p); else begin fifo.enq(rip << 8); ram.req(p + rip[15:8]); end fifo.deq();endrule

When can enter fire?

inQ has an element and ram & fifo each has space

done? Is the same as isLeaf

Page 10: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-10http://csg.csail.mit.edu/arvind

Circular Pipeline Code: discussionrule enter (True); IP ip = inQ.first(); ram.req(ip[31:16]); fifo.enq(ip[15:0]); inQ.deq();endrule

enter?enter?done?done?RAM

inQ

fifo

rule recirculate (True); TableEntry p = ram.peek(); ram.deq(); IP rip = fifo.first(); if (isLeaf(p)) outQ.enq(p); else begin fifo.enq(rip << 8); ram.req(p + rip[15:8]); end fifo.deq();endrule

When can recirculate fire?

ram & fifo each has an element and ram, fifo & outQ each has space

Is this possible?

Page 11: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-11http://csg.csail.mit.edu/arvind

module mkFIFO1 (FIFO#(t)); Reg#(t) data <- mkRegU(); Reg#(Bool) full <- mkReg(False); method Action enq(t x) if (!full); full <= True; data <= x; endmethod method Action deq() if (full); full <= False; endmethod method t first() if (full); return (data); endmethod method Action clear(); full <= False; endmethodendmodule

One Element FIFOenq and deq cannot even be enabled together much less fire concurrently!

n

not empty

not full rdyenab

rdyenab

enq

deq

FIFO

module

The functionality we want is as if deq “happens” before enq; if deq does not happen then enq behaves normally

We can build such a FIFO

more on this later

Page 12: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-12http://csg.csail.mit.edu/arvind

Dead cyclesrule enter (True); IP ip = inQ.first(); ram.req(ip[31:16]); fifo.enq(ip[15:0]); inQ.deq();endrule

enter?enter?done?done?RAM

inQ

fifo

rule recirculate (True); TableEntry p = ram.peek(); ram.deq(); IP rip = fifo.first(); if (isLeaf(p)) outQ.enq(p); else begin fifo.enq(rip << 8); ram.req(p + rip[15:8]); end fifo.deq();endrule

Can a new request enter the system when an old one is leaving?

Is this worth worrying about?

assume simultaneous enq & deq is allowed

Page 13: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-13http://csg.csail.mit.edu/arvind

The Effect of Dead Cycles

enterenterdone?done?RAM

yesin

fifo

no

What is the performance loss if “exit” and “enter” don’t ever happen in the same cycle?

>33% slowdown! Unacceptable

Circular Pipeline RAM takes several cycles to respond to a request Each IP request generates 1-3 RAM requests FIFO entries hold base pointer for next lookup and

unprocessed part of the IP address

Page 14: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-14http://csg.csail.mit.edu/arvind

The compiler issueCan the compiler detect all the conflicting conditions?

Important for correctness

Does the compiler detect conflicts that do not exist in reality?

False positives lower the performance The main reason is that sometimes the compiler

cannot detect under what conditions the two rules are mutually exclusive or conflict free

What can the user specify easily? Rule priorities to resolve nondeterministic choice

yes

yes

In many situations the correctness of the design is not enough; the design is not done unless the performance goals are met

Page 15: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-15http://csg.csail.mit.edu/arvind

Scheduling conflicting rules

When two rules conflict on a shared resource, they cannot both execute in the same clockThe compiler produces logic that ensures that, when both rules are applicable, only one will fire Which one? source annotations

(* descending_urgency = “recirculate, enter” *)

Page 16: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-16http://csg.csail.mit.edu/arvind

So is there a dead cycle? rule enter (True); IP ip = inQ.first(); ram.req(ip[31:16]); fifo.enq(ip[15:0]); inQ.deq();endrule

enter?enter?done?done?RAM

inQ

fifo

rule recirculate (True); TableEntry p = ram.peek(); ram.deq(); IP rip = fifo.first(); if (isLeaf(p)) outQ.enq(p); else begin fifo.enq(rip << 8); ram.req(p + rip[15:8]); end fifo.deq();endrule

In general these two rules conflict but when isLeaf(p) is true there is no apparent conflict!

Page 17: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-17http://csg.csail.mit.edu/arvind

Rule Splitingrule foo (True); if (p) r1 <= 5; else r2 <= 7;endrule

rule fooT (p); r1 <= 5;endrule

rule fooF (!p); r2 <= 7;endrule

rule fooT and fooF can be scheduled independently with some other rule

Page 18: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-18http://csg.csail.mit.edu/arvind

Spliting the recirculate rulerule recirculate (!isLeaf(ram.peek())); IP rip = fifo.first(); fifo.enq(rip << 8); ram.req(ram.peek() + rip[15:8]); fifo.deq(); ram.deq();endrule

rule exit (isLeaf(ram.peek())); outQ.enq(ram.peek()); fifo.deq(); ram.deq();endrule

rule enter (True); IP ip = inQ.first(); ram.req(ip[31:16]); fifo.enq(ip[15:0]); inQ.deq();endrule

Now rules enter and exit can be scheduled simultaneously, assuming fifo.enq and fifo.deq can be done simultaneously

Page 19: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-19http://csg.csail.mit.edu/arvind

module mkFIFO1 (FIFO#(t)); Reg#(t) data <- mkRegU(); Reg#(Bool) full <- mkReg(False); method Action enq(t x) if (!full); full <= True; data <= x; endmethod method Action deq() if (full); full <= False; endmethod method t first() if (full); return (data); endmethod method Action clear(); full <= False; endmethodendmodule

Back to the fifo problem

n

not empty

not full rdyenab

rdyenab

enq

deq

FIFO

module

The functionality we want is as if deq “happens” before enq; if deq does not happen then enq behaves normally

Page 20: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-20http://csg.csail.mit.edu/arvind

RWire to rescue

interface RWire#(type t);method Action wset(t x);method Maybe#(t) wget();

endinterface

Like a register in that you can read and write it but unlike a register

- read happens after write- data disappears in the next cycle

RWires can break the atomicity of a rule if not used properly

Page 21: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-21http://csg.csail.mit.edu/arvind

module mkLFIFO1 (FIFO#(t)); Reg#(t) data <- mkRegU(); Reg#(Bool) full <- mkReg(False); RWire#(void) deqEN <- mkRWire(); method Action enq(t x) if

(!full || isValid (deqEN.wget())); full <= True; data <= x; endmethod method Action deq() if (full); full <= False; deqEN.wset(?); endmethod method t first() if (full); return (data); endmethod method Action clear(); full <= False; endmethodendmodule

One Element “Loopy” FIFO

not empty

not full rdyenab

rdyenab

enq

deq

FIFO

module

or

!full

This works correctly in both cases (fifo full and fifo empty).

Page 22: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-22http://csg.csail.mit.edu/arvind

Problem solved!

rule recirculate (True); TableEntry p = ram.peek(); ram.deq(); IP rip = fifo.first(); if (isLeaf(p)) outQ.enq(p); else begin fifo.enq(rip << 8); ram.req(p + rip[15:8]); end fifo.deq();endrule

LFIFO fifo <- mkLFIFO; // use a loopy fifo

RWire has been safely encapsulated inside the Loopy FIFO – users of Loopy fifo need not be aware of RWires

Page 23: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-23http://csg.csail.mit.edu/arvind

Packaging a module:Turning a rule into a method

inQ

enter?enter?done?done?RAM

fifo

rule enter (True); IP ip = inQ.first(); ram.req(ip[31:16]); fifo.enq(p[15:0]); inQ.deq();endrule

method Action enter (IP ip); ram.req(ip[31:16]); fifo.enq(ip[15:0]);endmethod

outQ

Similarly a method can be written to extract elements from the outQ

Page 24: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-24http://csg.csail.mit.edu/arvind

Circular pipeline with Completion Buffer

luReq

luResp

Completion buffer- gives out tokens to control the entry into the circular pipeline- ensures that departures take place in order even if lookups complete out-of-order

The fifo holds the token while the memory access is in progress: Tuple2#(Bit#(16), Token)

enter?enter?done?done?RAM

cbufyes

getToken

inQ

fifo

no

remainingIP

Page 25: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-25http://csg.csail.mit.edu/arvind

Circular Pipeline Codewith Completion Buffer

rule enter (True); Token tok <- cbuf.getToken(); IP ip = inQ.first(); ram.req(ip[31:16]); fifo.enq(tuple2(ip[15:0], tok)); inQ.deq();endrule

enter?enter?done?done?RAM

cbufinQ

fifo

rule recirculate (True); TableEntry p <- ram.resp(); match {.rip, .tok} = fifo.first(); if (isLeaf(p)) cbuf.put(tok, p); else begin fifo.enq(tuple2(rip << 8, tok)); ram.req(p+rip[15:8]); end fifo.deq();endrule

Page 26: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-26http://csg.csail.mit.edu/arvind

Completion bufferinterface CBuffer#(type t); method ActionValue#(Token) getToken(); method Action put(Token tok, t d); method ActionValue#(t) getResult();endinterface

module mkCBuffer (CBuffer#(t)) provisos (Bits#(t,sz)); RegFile#(Token, Maybe#(t)) buf <- mkRegFileFull(); Reg#(Token) i <- mkReg(0); //input index Reg#(Token) o <- mkReg(0); //output index Reg#(Int#(32)) cnt <- mkReg(0); //number of filled slots…

I

I

V

I

V

Icnt

i

o

buf

Elements must be representable as bits

typedef Bit#(TLog#(n)) TokenN#(numeric type n);typedef TokenN#(16) Token;

Page 27: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-27http://csg.csail.mit.edu/arvind

Completion buffer

// state elements // buf, i, o, n ... method ActionValue#(t) getToken() if (cnt < maxToken); cnt <= cnt + 1; i <= i + 1; buf.upd(i, Invalid); return i; endmethod

method Action put(Token tok, t data); return buf.upd(tok, Valid data); endmethod

method ActionValue#(t) getResult() if (cnt > 0) &&& (buf.sub(o) matches tagged (Valid .x)); o <= o + 1; cnt <= cnt - 1; return x; endmethod

I

I

V

I

V

Icnt

i

o

buf

Home work: Think about concurrency Issues, i.e., can these methods be executed concurrently? Do they need to?

Page 28: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-28http://csg.csail.mit.edu/arvind

Longest Prefix Match for IP lookup:3 possible implementation architectures

Rigid pipeline

Inefficient memory usage but simple design

Linear pipeline

Efficient memory usage through memory port replicator

Circular pipeline

Efficient memory with most complex control

Which is “best”?Arvind, Nikhil, Rosenband & Dave ICCAD 2004

Page 29: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-29http://csg.csail.mit.edu/arvind

Implementations of Static pipelines

Two designers, two results

LPM versions Best Area(gates)

Best Speed(ns)

Static V (Replicated FSMs) 8898 3.60

Static V (Single FSM) 2271 3.56

Replicated:

RAM

FSM

MUX / De-MUX

FSM FSM FSM

Counter

MUX / De-MUX

resultIP addr

FSM

RAM

MUX

result

IP addr

BEST:

Each packet is processed by one FSM

Shared FSM

Page 30: IP Lookup Arvind  Computer Science & Artificial Intelligence Lab

February 17, 2009 L06-30http://csg.csail.mit.edu/arvind

Synthesis resultsLPM versions

Code size(lines)

Best Area(gates)

Best Speed(ns)

Mem. util. (random workload)

Static V 220 2271 3.56 63.5%

Static BSV 179 2391 (5% larger) 3.32 (7% faster) 63.5%

Linear V 410 14759 4.7 99.9%

Linear BSV 168 15910 (8% larger) 4.7 (same) 99.9%

Circular V 364 8103 3.62 99.9%

Circular BSV 257 8170 (1% larger) 3.67 (2% slower) 99.9%

Synthesis: TSMC 0.18 µm lib

V = Verilog;BSV = Bluespec System Verilog

- Bluespec results can match carefully coded Verilog- Micro-architecture has a dramatic impact on performance- Architecture differences are much more important than language differences in determining QoR