Relaxing constraints in stateful network data plane design
Carmelo Cascone^, Roberto Bifulco*, Salvatore Pontarelli+, Antonio Capone^
^Politecnico di Milano (Italy), *NEC Laboratories Europe (Germany), +Univ. Roma Tor Vergata (Italy)
Partly funded by EU H2020 projects: "BEBA" grant agreement 644122, "VirtuWind" grant agreement 671648
1. INTRODUCTION
Program and run, at line rate, algorithms that read and modify the data plane's state
§ E.g.: Stateful firewall, dynamic NAT, flowlet load balancing, AQM, measurement, etc.
[Figure: a function f(pkt, state) processes each packet and updates the state.]
State of the art
Ø Reconfigurable Match Table (RMT) [SIGCOMM 2013]
§ Programmable parser, actions, table size; stateless
§ 640 Gb/s non-blocking line rate; 6.5 Tb/s Tofino chip
Ø Domino/Banzai [SIGCOMM 2016]
§ Extends RMT with support for stateful actions named "Atoms"
§ Atoms can be pipelined to implement complex algorithms
§ Strict constraints on atom execution time
§ State read-modify-write must be executed in 1 clock tick
Packet size distribution
[Figure: CDF of packet sizes in bytes (0 to 1400) for traces mawi-15, chi-15, sj-12, fb-web.]
3. OBSERVATIONS
1. The pipeline's header processing rate depends on packet size
§ Packets are read from input ports in chunks, e.g. 80 bytes in RMT
§ 80 bytes * 1 GHz (chip clock freq.) = 640 Gb/s line rate
§ Larger packets cause inter-packet idle cycles ➞ minimize the risk of data hazards
2. Distinguish between per-flow and global state
§ Global: shared among all packets
§ Per-flow: shared by packets of the same flow ➞ different flows can be processed in parallel
§ E.g. a stateful firewall needs only per-flow state, a DNAT needs both
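The arithmetic in observation 1 can be checked with a small sketch (our own, not the paper's code): with an 80-byte read chunk, a packet needs ceil(size / chunk) cycles to arrive, and since only one header per packet enters the pipeline, the remaining arrival cycles are idle.

```python
import math

def line_rate_gbps(chunk_bytes=80, clock_ghz=1.0):
    # 80 bytes/cycle * 1 GHz clock = 640 Gb/s, the RMT line rate.
    return chunk_bytes * 8 * clock_ghz

def idle_cycles(pkt_bytes, chunk_bytes=80):
    # A packet arrives over ceil(pkt_bytes / chunk_bytes) cycles, but
    # contributes one header; the other arrival cycles are idle.
    return math.ceil(pkt_bytes / chunk_bytes) - 1

print(line_rate_gbps())   # → 640.0
print(idle_cycles(1500))  # → 18, matching the RMT 1500-byte example
print(idle_cycles(80))    # → 0: minimum-size packets keep the pipeline busy
```

More idle cycles between headers mean more slack to finish a multi-cycle stateful function before the next packet of the same flow arrives, which is why larger packets reduce the risk of data hazards.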
4. MOTIVATING EXPERIMENTS
Evaluate the risk of data hazards using simulations with real traffic traces
[Figure: headers and packet data enter the pipeline over a data path of fixed width, e.g. 1280 bits in RMT (640 bits of headers + 640 bits of metadata). Minimum-size packets arrive back-to-back; larger packets leave inter-packet idle cycles (e.g. 18 in RMT with a 1500-byte packet).]
[Figure: FDH (99th percentile) vs. pipeline depth (clock cycles). Per-flow state panels (FDH up to 0.24): traces mawi-15, chi-15, sj-12, each with flow keys 5-tuple, ipdst, ipdst/16. Global state panel (FDH up to 0.4): traces mawi-15, chi-15, sj-12, fb-web.]
Simulation results: fraction of data hazards (FDH) per batch of 100k packets
§ Pipeline's read chunk = 80 bytes
§ All packets back-to-back, i.e. 100% pipeline utilization
Trace   | Provider | Description                                                          | Date         | Num pkts | Num flows per 1m pkts (5-tuple / ipdst / ipdst/16)
chi-15  | CAIDA    | 10Gb/s backbone link in Chicago. Usual conditions.                   | Feb 19, 2015 | 3.5b     | 100.6k / 57.7k / 4.6k
sj-12   | CAIDA    | 10Gb/s backbone link in San Jose. Unusually high number of 5-tuples. | Nov 15, 2012 | 3.6b     | 249k / 17k / 2k
mawi-15 | MAWI     | 1Gb/s backbone link in Japan. High volume of anomalous traffic.      | Jul 21, 2015 | 135m     | 40.8k / 17.3k / 1.7k
fb-web  | Facebook | Packet samples from the 10 most active ToR switches in a web cluster | 2015         | 447m     | n/a / n/a / n/a
5. APPROACH: MEMORY LOCKING
If two packets of the same flow arrive back-to-back, processing is paused for the second packet until the first one has left the stage's pipeline.
Table 2: Clock cycle budget (and latency) when using memory locking.
Maximum number of clock cycles (up to 30) per processing function, to sustain a given throughput. In all cases W = 4 bits. "Global" represents the case when packets need to access global state. Latency values are given for a 1 GHz clock frequency, i.e. 1 clock cycle = 1 ns. For each trace, values are listed in flow-key order 5-tuple, ipdst, ipdst/16, global (fb-web: global only); global and fb-web values appear only on the Q=1 rows, as they span all Q in the original layout.

Thrpt 100%, Qlen 10: budget is 1 for every trace, flow key, and Q in {1, 4, 8, 16}.

Thrpt 100%, Qlen 100:
Q=1: chi-15: 20 (174ns), 20 (190ns), 21 (230ns), 8 (282ns); sj-12: 4 (49ns), 1, 1, 1; mawi-15: 2 (18ns), 2 (20ns), 2 (35ns), 1; fb-web: 2 (10ns)
Q=4: chi-15: 30 (175ns), 30 (192ns), 30 (320ns); sj-12: 8 (152ns), 1, 1; mawi-15: 2 (12ns), 2 (14ns), 2 (25ns)
Q=8: chi-15: 30 (137ns), 30 (144ns), 30 (259ns); sj-12: 8 (133ns), 1, 1; mawi-15: 2 (12ns), 2 (14ns), 2 (25ns)
Q=16: chi-15: 30 (122ns), 30 (126ns), 30 (221ns); sj-12: 8 (123ns), 1, 1; mawi-15: 2 (11ns), 2 (14ns), 2 (24ns)

Thrpt 99.9%, Qlen 10:
Q=1: chi-15: 8 (16ns), 8 (16ns), 8 (18ns), 4 (18ns); sj-12: 2 (5ns), 1, 1, 1; mawi-15: 1, 1, 1, 1; fb-web: 2 (10ns)
Q=4: chi-15: 14 (33ns), 14 (31ns), 14 (38ns); sj-12: 2 (4ns), 1, 1; mawi-15: 1, 1, 1
Q=8: chi-15: 16 (39ns), 15 (30ns), 16 (44ns); sj-12: 2 (4ns), 1, 1; mawi-15: 1, 1, 1
Q=16: chi-15: 17 (37ns), 18 (43ns), 16 (42ns); sj-12: 2 (5ns), 1, 1; mawi-15: 1, 1, 1

Thrpt 99.9%, Qlen 100:
Q=1: chi-15: 27 (568ns), 27 (618ns), 26 (605ns), 8 (282ns); sj-12: 6 (143ns), 2 (86ns), 2 (84ns), 1; mawi-15: 3 (42ns), 3 (48ns), 3 (100ns), 2 (60ns); fb-web: 2 (10ns)
Q=4: chi-15: 30 (175ns), 30 (192ns), 30 (320ns); sj-12: 15 (526ns), 2 (79ns), 2 (77ns); mawi-15: 4 (41ns), 4 (52ns), 4 (135ns)
Q=8: chi-15: 30 (137ns), 30 (144ns), 30 (259ns); sj-12: 22 (731ns), 2 (79ns), 2 (78ns); mawi-15: 4 (38ns), 4 (50ns), 4 (131ns)
Q=16: chi-15: 30 (122ns), 30 (126ns), 30 (221ns); sj-12: 25 (741ns), 2 (79ns), 2 (72ns); mawi-15: 4 (37ns), 4 (49ns), 4 (129ns)

Thrpt 99%, Qlen 10:
Q=1: chi-15: 21 (80ns), 21 (80ns), 21 (89ns), 7 (60ns); sj-12: 3 (14ns), 1, 1, 1; mawi-15: 2 (13ns), 2 (14ns), 1, 1; fb-web: 2 (10ns)
Q=4: chi-15: 30 (142ns), 30 (148ns), 30 (184ns); sj-12: 10 (45ns), 1, 1; mawi-15: 5 (34ns), 4 (31ns), 2 (18ns)
Q=8: chi-15: 30 (129ns), 30 (138ns), 30 (184ns); sj-12: 11 (47ns), 1, 1; mawi-15: 6 (42ns), 4 (31ns), 2 (18ns)
Q=16: chi-15: 30 (116ns), 30 (122ns), 30 (180ns); sj-12: 12 (47ns), 1, 1; mawi-15: 7 (52ns), 4 (31ns), 2 (18ns)

Thrpt 99%, Qlen 100:
Q=1: chi-15: 30 (950ns), 30 (922ns), 30 (1.1us), 9 (842ns); sj-12: 8 (268ns), 2 (86ns), 2 (84ns), 2 (171ns); mawi-15: 9 (422ns), 8 (380ns), 5 (316ns), 4 (379ns); fb-web: 2 (10ns)
Q=4: chi-15: 30 (175ns), 30 (192ns), 30 (320ns); sj-12: 22 (1.1us), 3 (285ns), 3 (283ns); mawi-15: 24 (1.7us), 17 (1.3us), 7 (572ns)
Q=8: chi-15: 30 (137ns), 30 (144ns), 30 (259ns); sj-12: 30 (1.9us), 3 (290ns), 3 (292ns); mawi-15: 30 (2.1us), 23 (2.2us), 8 (753ns)
Q=16: chi-15: 30 (122ns), 30 (126ns), 30 (221ns); sj-12: 30 (940ns), 3 (296ns), 3 (293ns); mawi-15: 30 (1.3us), 25 (2.5us), 8 (759ns)
and the latency as the number of clock cycles from when the packet is completely received to when it is served by the scheduler, i.e. it enters the function pipeline. For simplicity we consider that when N = 1, i.e. no locking required, latency is 0. Latency is computed for each packet; for each batch we take the 99th percentile among all latency values; finally we take the maximum among all batches for a given trace. For example, a latency value of 5 means that in the worst case, 99% of the packets experienced a latency of no more than 5 clock cycles, e.g. 5 ns at 1 GHz.
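The percentile-of-batches procedure above can be sketched as follows (our reading of the metric; the nearest-rank percentile convention is an assumption):

```python
import math

def worst_case_latency(batches):
    """For each batch of per-packet latencies (in clock cycles), take the
    99th percentile; return the maximum across all batches of a trace."""
    def p99(values):
        ordered = sorted(values)
        # Nearest-rank 99th percentile (assumed convention).
        return ordered[min(len(ordered) - 1, math.ceil(0.99 * len(ordered)) - 1)]
    return max(p99(batch) for batch in batches)

# Two batches: latencies spread over 0..99 cycles, and all-zero.
print(worst_case_latency([list(range(100)), [0] * 100]))  # → 98
```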
We evaluated these metrics when varying the different parameters described in Section 4 for the different traces. We present here a subset of the results; a more detailed collection of results can be found at [20].
Table 2 shows results in terms of clock cycle budget, which is the maximum number of clock cycles allowed for a stateful function to complete execution while sustaining a given throughput. For example, to sustain 100% throughput, using queues of size 10 (headers) does not provide any benefit, as the clock cycle budget is 1 for each trace and flow key. However, adding more capacity to queues, up to 100, improves the budget even when using only 1 queue, allowing for functions spanning 20 clock cycles for all flow keys with chi-15, and 4 clock cycles with sj-12, but only when aggregating packets per 5-tuple. Clearly, long queues impact latency. Both clock cycle budget and latency improve if we can accept a lower throughput of 99.9%, i.e. allowing for a 0.1% drop probability. Clearly, reducing utilization (100% in our experiments) further reduces the risk of drop while maintaining the same cycle budget.
6. DISCUSSION
Issues with blocking architectures. While the proposed solution enables the execution of more complex operations directly in the data plane, it implements a blocking architecture. That is, for particular workloads, the data plane is unable to offer line-rate forwarding throughput. As a consequence, the processing programmed in the data plane should be adapted to the expected network load characteristics for the line rate to be achievable. Our work helps in defining the boundaries of the achievable performance for a given workload and set of operations. An additional problem of the dependency on the workload is the possibility to exploit such dependency, e.g., to perform a denial-of-service attack on the data plane. However, the ability to program stateful algorithms in the data plane should help in detecting and mitigating such exploitations at little cost.
What can we do more with more clock cycles? We do not yet have a concrete example of an application; however, if one can tolerate a blocking architecture, she can trade the complexity of investigating hardware circuit design for a specific function, e.g. to enforce atomic execution as in Banzai [7], for the possibility of using simpler but slower hardware blocks. At the far end, we envision the possibility of using a general-purpose packet processor, carefully programmed to complete execution in a longer, but bounded, clock cycle budget, with predictable performance when the traffic characteristics are known. Finally, another option is that of supporting larger memories (e.g. DRAM), which have slower access times (> 1 clock tick at 1 GHz), to read and write values. The same multi-queue scheduling approach could be used to coordinate access to multiple parallel memory banks, where each bank is associated with a queue.
7. CONCLUSION
This paper presented a model for a packet processing pipeline which allows execution of complex functions that read and write the data plane's state at line rate, when read and write operations are performed at different stages. Prevention of data hazards is performed by stalling the pipeline. By using simulations on real traffic traces from both carrier and datacenter networks, we show that this model can be applied with little or no throughput degradation. The exact clock cycle budget and latency depend on the packet size distribution (more budget with larger packets) and on the granularity of the flow key used to access state (more with longer flow keys, e.g. the 5-tuple). The code used for the simulations and additional results are available at [20].
Acknowledgements
This work has been partly funded by the EU in the context of the H2020 "BEBA" project (Grant Agreement: 644122).
Programmable stateful switching chip
Clock cycle budget (and latency) for all traces: maximum number of clock cycles (limited to 30) per processing function, to sustain a given throughput. W = 4 bits. Latency values are given for a 1 GHz clock frequency, i.e. 1 clock cycle = 1 ns.
[Figure: ingress queues feed the parser and mixer of the switching chip.]
2. PROBLEM STATEMENT
Ø Pipelining is the way to scale to high throughput (Tb/s)
Ø When pipelining, accessing state at different stages of the pipeline can cause data hazards
Research question:
Ø If we allow stateful processing blocks that span many clock cycles:
§ What is the risk of data hazards under realistic traffic conditions?
§ What is the throughput when using locking to access the memory?
§ How much silicon is needed to implement such a locking scheme?
Example: a packet counter; for each packet, increment the value of X.
[Figure: a 3-stage pipeline (Read X, X++, Write X) sharing state X. With two back-to-back packets, the second reads X before the first has written it back, so it sees outdated state: a data hazard.]
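The counter hazard in the example above can be reproduced in a few lines (a minimal sketch of ours, not the paper's simulator): with a 3-stage read/increment/write pipeline and packets arriving back-to-back, half of the increments are lost.

```python
# Counter X updated by a read / increment / write pipeline of `depth`
# stages. A packet reads X on entry and writes X+1 back depth-1 cycles
# later; meanwhile, newer packets have already read the stale X.

def count_with_hazard(n_packets, depth=3):
    x = 0
    reads = {}  # packet index -> value of X observed at the read stage
    for t in range(n_packets + depth - 1):
        writer = t - (depth - 1)   # packet at the write stage this cycle
        if 0 <= writer < n_packets:
            x = reads[writer] + 1  # commit the (possibly stale) increment
        if t < n_packets:
            reads[t] = x           # read stage: observe current X
    return x

print(count_with_hazard(6))  # → 3: half the increments are lost
```

With depth = 2 (write lands one cycle after the read, visible to the next packet) the count comes out correct, which is exactly the single-clock-tick read-modify-write constraint that Banzai atoms enforce.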
Banzai atomic model
[Figure: read, increment, and write of X (state) performed within a single stage.]
Pros:
§ If an algorithm can be mapped to an atom pipeline, it is guaranteed to run at line rate, with any traffic workload.
Cons:
§ Memory must operate at pipeline frequency: limited size, can't support slower but larger memories (e.g. DRAM)
§ Supports limited actions, with simple circuitry that can meet pipeline timing; e.g. can't do SQRT in 1 clock cycle at 1 GHz
Switch’s pipeline
Parser ….Stage 1 Stage S
Stage’s pipeline
Simulation results:
1. Each packet is associated with a flow key (FK)
2. A dispatcher enqueues packets by hashing on the FK
3. A round-robin scheduler decides if a queue can be served by looking at the head-of-line's FK and comparing it to what is currently in the stage's pipeline
4. Comparison is performed by reducing the space of the FK to few bits (W)
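A simplified software model of steps 1-4 (our sketch under stated assumptions: Python's hash() stands in for the hardware hash, and the pipeline holds one W-bit digest per stage):

```python
from collections import deque

class LockingScheduler:
    """Hash-dispatch into Q queues; round-robin service gated by a
    W-bit comparison against the packets in the N-stage pipeline."""

    def __init__(self, num_queues=4, w_bits=4, depth=3):
        self.queues = [deque() for _ in range(num_queues)]
        self.w_mask = (1 << w_bits) - 1        # W-bit flow-key digest
        self.pipeline = deque([None] * depth)  # digests in the N stages
        self.rr = 0                            # round-robin pointer

    def digest(self, fk):
        # Reducing the FK space to W bits: collisions can only cause
        # spurious stalls, never a missed hazard.
        return hash(fk) & self.w_mask

    def enqueue(self, fk):
        self.queues[hash(fk) % len(self.queues)].append(fk)  # q = hash(FK) mod Q

    def tick(self):
        """One clock cycle: the oldest packet leaves the pipeline, then
        the first eligible head-of-line packet (round-robin) enters."""
        self.pipeline.popleft()
        served = None
        for i in range(len(self.queues)):
            q = self.queues[(self.rr + i) % len(self.queues)]
            if q and self.digest(q[0]) not in self.pipeline:
                served = q.popleft()
                self.rr = (self.rr + i + 1) % len(self.queues)
                break
        self.pipeline.append(None if served is None else self.digest(served))
        return served

s = LockingScheduler(num_queues=2, w_bits=8, depth=3)
s.enqueue("flowA"); s.enqueue("flowA")
# The second flowA packet is stalled until the first leaves the pipeline.
print([s.tick() for _ in range(5)])  # → ['flowA', None, None, 'flowA', None]
```

Because the comparator only checks W-bit digests, its cost stays small (the O(Q x W x N) logic quantified in the silicon overhead section) at the price of occasional false conflicts between distinct flows.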
[Figure (how it works): a flow key extractor computes the FK from each header; a dispatcher enqueues headers into queue q = hash(FK) mod Q (queues 1 … Q); a round-robin scheduler with a W-bit comparator admits headers into the N-stage processing function pipeline (stages 1 … N), which accesses state via read(FK) and write(FK).]
§ Trace sj-12 (worst case)
§ Queue size = 10 headers, W = 4 bits
[Figure: minimum throughput (top row) and maximum 99th-percentile latency in clock cycles, log scale 10^0 to 10^3 (bottom row), vs. pipeline depth (clock cycles), for 1, 4, and 8 queues. One curve per flow key: * (global), 5-tuple, ipdst, ipdst/16.]
Silicon overhead
Queues: memory requirements
§ 88-byte header size (80 + 8 bytes metadata)
§ 4 queues
§ Queue size = 100 headers
§ 35.2 KB ~ 3.5% of an RMT stage
Scheduler: combinatorial logic
§ O(Q×W×N)
§ A few tens of thousands of logic gates with the Q, W, N used in simulations
§ Today's ASICs have more than 10^8 gates
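The queue memory figure checks out arithmetically (our quick calculation from the bullet parameters above):

```python
# 88-byte headers (80 + 8 bytes metadata), 4 queues of 100 headers each.
header_bytes = 80 + 8
num_queues, queue_len = 4, 100
total_kb = header_bytes * num_queues * queue_len / 1000
print(total_kb)  # → 35.2 (KB), ~3.5% of an RMT stage's memory
```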