Relaxing constraints in stateful network data plane design
Carmelo Cascone^, Roberto Bifulco*, Salvatore Pontarelli+, Antonio Capone^
^Politecnico di Milano (Italy), *NEC Laboratories Europe (Germany), +Univ. Roma Tor Vergata (Italy)
Partly funded by EU H2020 projects: "BEBA" grant agreement 644122, "VirtuWind" grant agreement 671648
1. INTRODUCTION
Program and run, at line rate, algorithms that read and modify the data plane's state
§ E.g.: Stateful firewall, dynamic NAT, flowlet load balancing, AQM, measurement, etc.
[Figure: a function f(pkt, state) processes each packet and updates the state.]
State of the art
Ø Reconfigurable Match Table (RMT) [SIGCOMM 2013]
§ Programmable parser, actions, table size; stateless
§ 640 Gb/s non-blocking line rate; 6.5 Tb/s Tofino chip
Ø Domino/Banzai [SIGCOMM 2016]
§ Extends RMT with support for stateful actions named "Atoms"
§ Atoms can be pipelined to implement complex algorithms
§ Strict constraints on atom execution time
§ State read-modify-write must be executed in 1 clock tick
Packet size distribution
[Figure: CDF of packet sizes in bytes (0 to 1400) for traces mawi-15, chi-15, sj-12, fb-web.]
3. OBSERVATIONS
1. The pipeline's header processing rate depends on packet size
§ Packets are read from input ports in chunks, e.g. 80 bytes in RMT
§ 80 bytes * 1 GHz (chip clock freq.) = 640 Gb/s line rate
§ Larger packets cause inter-packet idle cycles ➞ minimize the risk of data hazards
2. Distinguish between per-flow and global state
§ Global: shared among all packets
§ Per-flow: shared by packets of the same flow ➞ different flows can be processed in parallel
§ E.g. a stateful firewall needs only per-flow state, a DNAT needs both
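The arithmetic in observation 1 can be checked with a small sketch (our own, not the paper's code): with an 80-byte read chunk, a packet needs ceil(size / chunk) cycles to arrive, and since only one header per packet enters the pipeline, the remaining arrival cycles are idle.

```python
import math

def line_rate_gbps(chunk_bytes=80, clock_ghz=1.0):
    # 80 bytes/cycle * 1 GHz clock = 640 Gb/s, the RMT line rate.
    return chunk_bytes * 8 * clock_ghz

def idle_cycles(pkt_bytes, chunk_bytes=80):
    # A packet arrives over ceil(pkt_bytes / chunk_bytes) cycles, but
    # contributes one header; the other arrival cycles are idle.
    return math.ceil(pkt_bytes / chunk_bytes) - 1

print(line_rate_gbps())   # → 640.0
print(idle_cycles(1500))  # → 18, matching the RMT 1500-byte example
print(idle_cycles(80))    # → 0: minimum-size packets keep the pipeline busy
```

More idle cycles between headers mean more slack to finish a multi-cycle stateful function before the next packet of the same flow arrives, which is why larger packets reduce the risk of data hazards.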
4. MOTIVATING EXPERIMENTS
Evaluate the risk of data hazards using simulations with real traffic traces
[Figure: headers and packet data enter the pipeline over a data path of fixed width, e.g. 1280 bits in RMT (640 bits of headers + 640 bits of metadata). Minimum-size packets arrive back-to-back; larger packets leave inter-packet idle cycles (e.g. 18 in RMT with a 1500-byte packet).]
[Figure: FDH (99th percentile) vs. pipeline depth (clock cycles). Per-flow state panels (FDH up to 0.24): traces mawi-15, chi-15, sj-12, each with flow keys 5-tuple, ipdst, ipdst/16. Global state panel (FDH up to 0.4): traces mawi-15, chi-15, sj-12, fb-web.]
Simulation results: fraction of data hazards (FDH) per batch of 100k packets
§ Pipeline's read chunk = 80 bytes
§ All packets back-to-back, i.e. 100% pipeline utilization
Trace   | Provider | Description                                                          | Date         | Num pkts | Num flows per 1m pkts (5-tuple / ipdst / ipdst/16)
chi-15  | CAIDA    | 10Gb/s backbone link in Chicago. Usual conditions.                   | Feb 19, 2015 | 3.5b     | 100.6k / 57.7k / 4.6k
sj-12   | CAIDA    | 10Gb/s backbone link in San Jose. Unusually high number of 5-tuples. | Nov 15, 2012 | 3.6b     | 249k / 17k / 2k
mawi-15 | MAWI     | 1Gb/s backbone link in Japan. High volume of anomalous traffic.      | Jul 21, 2015 | 135m     | 40.8k / 17.3k / 1.7k
fb-web  | Facebook | Packet samples from the 10 most active ToR switches in a web cluster | 2015         | 447m     | n/a / n/a / n/a
5. APPROACH: MEMORY LOCKING
If two packets of the same flow arrive back-to-back, processing is paused for the second packet until the first one has left the stage's pipeline.
Table 2: Clock cycle budget (and latency) when using memory locking.
Maximum number of clock cycles (up to 30) per processing function, to sustain a given throughput. In all cases W = 4 bits. "Global" represents the case when packets need to access global state. Latency values are given for a 1 GHz clock frequency, i.e. 1 clock cycle = 1 ns. For each trace, values are listed in flow-key order 5-tuple, ipdst, ipdst/16, global (fb-web: global only); global and fb-web values appear only on the Q=1 rows, as they span all Q in the original layout.

Thrpt 100%, Qlen 10: budget is 1 for every trace, flow key, and Q in {1, 4, 8, 16}.

Thrpt 100%, Qlen 100:
Q=1: chi-15: 20 (174ns), 20 (190ns), 21 (230ns), 8 (282ns); sj-12: 4 (49ns), 1, 1, 1; mawi-15: 2 (18ns), 2 (20ns), 2 (35ns), 1; fb-web: 2 (10ns)
Q=4: chi-15: 30 (175ns), 30 (192ns), 30 (320ns); sj-12: 8 (152ns), 1, 1; mawi-15: 2 (12ns), 2 (14ns), 2 (25ns)
Q=8: chi-15: 30 (137ns), 30 (144ns), 30 (259ns); sj-12: 8 (133ns), 1, 1; mawi-15: 2 (12ns), 2 (14ns), 2 (25ns)
Q=16: chi-15: 30 (122ns), 30 (126ns), 30 (221ns); sj-12: 8 (123ns), 1, 1; mawi-15: 2 (11ns), 2 (14ns), 2 (24ns)

Thrpt 99.9%, Qlen 10:
Q=1: chi-15: 8 (16ns), 8 (16ns), 8 (18ns), 4 (18ns); sj-12: 2 (5ns), 1, 1, 1; mawi-15: 1, 1, 1, 1; fb-web: 2 (10ns)
Q=4: chi-15: 14 (33ns), 14 (31ns), 14 (38ns); sj-12: 2 (4ns), 1, 1; mawi-15: 1, 1, 1
Q=8: chi-15: 16 (39ns), 15 (30ns), 16 (44ns); sj-12: 2 (4ns), 1, 1; mawi-15: 1, 1, 1
Q=16: chi-15: 17 (37ns), 18 (43ns), 16 (42ns); sj-12: 2 (5ns), 1, 1; mawi-15: 1, 1, 1

Thrpt 99.9%, Qlen 100:
Q=1: chi-15: 27 (568ns), 27 (618ns), 26 (605ns), 8 (282ns); sj-12: 6 (143ns), 2 (86ns), 2 (84ns), 1; mawi-15: 3 (42ns), 3 (48ns), 3 (100ns), 2 (60ns); fb-web: 2 (10ns)
Q=4: chi-15: 30 (175ns), 30 (192ns), 30 (320ns); sj-12: 15 (526ns), 2 (79ns), 2 (77ns); mawi-15: 4 (41ns), 4 (52ns), 4 (135ns)
Q=8: chi-15: 30 (137ns), 30 (144ns), 30 (259ns); sj-12: 22 (731ns), 2 (79ns), 2 (78ns); mawi-15: 4 (38ns), 4 (50ns), 4 (131ns)
Q=16: chi-15: 30 (122ns), 30 (126ns), 30 (221ns); sj-12: 25 (741ns), 2 (79ns), 2 (72ns); mawi-15: 4 (37ns), 4 (49ns), 4 (129ns)

Thrpt 99%, Qlen 10:
Q=1: chi-15: 21 (80ns), 21 (80ns), 21 (89ns), 7 (60ns); sj-12: 3 (14ns), 1, 1, 1; mawi-15: 2 (13ns), 2 (14ns), 1, 1; fb-web: 2 (10ns)
Q=4: chi-15: 30 (142ns), 30 (148ns), 30 (184ns); sj-12: 10 (45ns), 1, 1; mawi-15: 5 (34ns), 4 (31ns), 2 (18ns)
Q=8: chi-15: 30 (129ns), 30 (138ns), 30 (184ns); sj-12: 11 (47ns), 1, 1; mawi-15: 6 (42ns), 4 (31ns), 2 (18ns)
Q=16: chi-15: 30 (116ns), 30 (122ns), 30 (180ns); sj-12: 12 (47ns), 1, 1; mawi-15: 7 (52ns), 4 (31ns), 2 (18ns)

Thrpt 99%, Qlen 100:
Q=1: chi-15: 30 (950ns), 30 (922ns), 30 (1.1us), 9 (842ns); sj-12: 8 (268ns), 2 (86ns), 2 (84ns), 2 (171ns); mawi-15: 9 (422ns), 8 (380ns), 5 (316ns), 4 (379ns); fb-web: 2 (10ns)
Q=4: chi-15: 30 (175ns), 30 (192ns), 30 (320ns); sj-12: 22 (1.1us), 3 (285ns), 3 (283ns); mawi-15: 24 (1.7us), 17 (1.3us), 7 (572ns)
Q=8: chi-15: 30 (137ns), 30 (144ns), 30 (259ns); sj-12: 30 (1.9us), 3 (290ns), 3 (292ns); mawi-15: 30 (2.1us), 23 (2.2us), 8 (753ns)
Q=16: chi-15: 30 (122ns), 30 (126ns), 30 (221ns); sj-12: 30 (940ns), 3 (296ns), 3 (293ns); mawi-15: 30 (1.3us), 25 (2.5us), 8 (759ns)
and the latency as the number of clock cycles from when the packet is completely received to when it is served by the scheduler, i.e. it enters the function pipeline. For simplicity we consider that when N = 1, i.e. no locking required, latency is 0. Latency is computed for each packet; for each batch we take the 99th percentile among all latency values; finally we take the maximum among all batches for a given trace. For example, a latency value of 5 means that in the worst case, 99% of the packets experienced a latency of no more than 5 clock cycles, e.g. 5 ns at 1 GHz.
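The percentile-of-batches procedure above can be sketched as follows (our reading of the metric; the nearest-rank percentile convention is an assumption):

```python
import math

def worst_case_latency(batches):
    """For each batch of per-packet latencies (in clock cycles), take the
    99th percentile; return the maximum across all batches of a trace."""
    def p99(values):
        ordered = sorted(values)
        # Nearest-rank 99th percentile (assumed convention).
        return ordered[min(len(ordered) - 1, math.ceil(0.99 * len(ordered)) - 1)]
    return max(p99(batch) for batch in batches)

# Two batches: latencies spread over 0..99 cycles, and all-zero.
print(worst_case_latency([list(range(100)), [0] * 100]))  # → 98
```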
We evaluated these metrics when varying the different parameters described in Section 4 for the different traces. We present here a subset of the results; a more detailed collection of results can be found at [20].
Table 2 shows results in terms of clock cycle budget, which is the maximum number of clock cycles allowed for a stateful function to complete execution while sustaining a given throughput. For example, to sustain 100% throughput, using queues of size 10 (headers) does not provide any benefit, as the clock cycle budget is 1 for each trace and flow key. However, adding more capacity to queues, up to 100, improves the budget even when using only 1 queue, allowing for functions spanning 20 clock cycles for all flow keys with chi-15, and 4 clock cycles with sj-12, but only when aggregating packets per 5-tuple. Clearly, long queues impact latency. Both clock cycle budget and latency improve if we can accept a lower throughput of 99.9%, i.e. allowing for a 0.1% drop probability. Clearly, reducing utilization (100% in our experiments) further reduces the risk of drop while maintaining the same cycle budget.
6. DISCUSSION
Issues with blocking architectures. While the proposed solution enables the execution of more complex operations directly in the data plane, it implements a blocking architecture. That is, for particular workloads, the data plane is unable to offer line-rate forwarding throughput. As a consequence, the processing programmed in the data plane should be adapted to the expected network load characteristics for the line rate to be achievable. Our work helps in defining the boundaries of the achievable performance for a given workload and set of operations. An additional problem of the dependency on the workload is the possibility to exploit such dependency, e.g., to perform a denial-of-service attack on the data plane. However, the ability to program stateful algorithms in the data plane should help in detecting and mitigating such exploitations at little cost.
What can we do more with more clock cycles? We do not yet have a concrete example of an application; however, if one can tolerate a blocking architecture, she can trade the complexity of investigating hardware circuit design for a specific function, e.g. to enforce atomic execution as in Banzai [7], for the possibility of using simpler but slower hardware blocks. At the far end, we envision the possibility of using a general-purpose packet processor, carefully programmed to complete execution in a longer, but bounded, clock cycle budget, with predictable performance when the traffic characteristics are known. Finally, another option is that of supporting larger memories (e.g. DRAM), which have slower access times (> 1 clock tick at 1 GHz), to read and write values. The same multi-queue scheduling approach could be used to coordinate access to multiple parallel memory banks, where each bank is associated with a queue.
7. CONCLUSION
This paper presented a model for a packet processing pipeline which allows execution of complex functions that read and write the data plane's state at line rate, when read and write operations are performed at different stages. Prevention of data hazards is performed by stalling the pipeline. By using simulations on real traffic traces from both carrier and datacenter networks, we show that this model can be applied with little or no throughput degradation. The exact clock cycle budget and latency depend on the packet size distribution (more budget with larger packets) and on the granularity of the flow key used to access state (more with longer flow keys, e.g. the 5-tuple). The code used for the simulations and additional results are available at [20].
Acknowledgements
This work has been partly funded by the EU in the context of the H2020 "BEBA" project (Grant Agreement: 644122).
Programmable stateful switching chip
Clock cycle budget (and latency) for all traces: maximum number of clock cycles (limited to 30) per processing function, to sustain a given throughput. W = 4 bits. Latency values are given for a 1 GHz clock frequency, i.e. 1 clock cycle = 1 ns.
[Figure: ingress queues feed the parser and mixer of the switching chip.]
2. PROBLEM STATEMENT
Ø Pipelining is the way to scale to high throughput (Tb/s)
Ø When pipelining, accessing state at different stages of the pipeline can cause data hazards
Research question:
Ø If we allow stateful processing blocks that span many clock cycles:
§ What is the risk of data hazards under realistic traffic conditions?
§ What is the throughput when using locking to access the memory?
§ How much silicon is needed to implement such a locking scheme?
Example: a packet counter; for each packet, increment the value of X.
[Figure: a 3-stage pipeline (Read X, X++, Write X) sharing state X. With two back-to-back packets, the second reads X before the first has written it back, so it sees outdated state: a data hazard.]
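The counter hazard in the example above can be reproduced in a few lines (a minimal sketch of ours, not the paper's simulator): with a 3-stage read/increment/write pipeline and packets arriving back-to-back, half of the increments are lost.

```python
# Counter X updated by a read / increment / write pipeline of `depth`
# stages. A packet reads X on entry and writes X+1 back depth-1 cycles
# later; meanwhile, newer packets have already read the stale X.

def count_with_hazard(n_packets, depth=3):
    x = 0
    reads = {}  # packet index -> value of X observed at the read stage
    for t in range(n_packets + depth - 1):
        writer = t - (depth - 1)   # packet at the write stage this cycle
        if 0 <= writer < n_packets:
            x = reads[writer] + 1  # commit the (possibly stale) increment
        if t < n_packets:
            reads[t] = x           # read stage: observe current X
    return x

print(count_with_hazard(6))  # → 3: half the increments are lost
```

With depth = 2 (write lands one cycle after the read, visible to the next packet) the count comes out correct, which is exactly the single-clock-tick read-modify-write constraint that Banzai atoms enforce.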
Banzai atomic model
[Figure: read, increment, and write of X (state) performed within a single stage.]
Pros:
§ If an algorithm can be mapped to an atom pipeline, it is guaranteed to run at line rate, with any traffic workload.
Cons:
§ Memory must operate at pipeline frequency: limited size, can't support slower but larger memories (e.g. DRAM)
§ Supports limited actions, with simple circuitry that can meet pipeline timing; e.g. can't do SQRT in 1 clock cycle at 1 GHz
Switch’s pipeline
Parser ….Stage 1 Stage S
Stage’s pipeline
Simulation results:
1. Each packet is associated with a flow key (FK)
2. A dispatcher enqueues packets by hashing on the FK
3. A round-robin scheduler decides if a queue can be served by looking at the head-of-line's FK and comparing it to what is currently in the stage's pipeline
4. Comparison is performed by reducing the space of the FK to few bits (W)
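A simplified software model of steps 1-4 (our sketch under stated assumptions: Python's hash() stands in for the hardware hash, and the pipeline holds one W-bit digest per stage):

```python
from collections import deque

class LockingScheduler:
    """Hash-dispatch into Q queues; round-robin service gated by a
    W-bit comparison against the packets in the N-stage pipeline."""

    def __init__(self, num_queues=4, w_bits=4, depth=3):
        self.queues = [deque() for _ in range(num_queues)]
        self.w_mask = (1 << w_bits) - 1        # W-bit flow-key digest
        self.pipeline = deque([None] * depth)  # digests in the N stages
        self.rr = 0                            # round-robin pointer

    def digest(self, fk):
        # Reducing the FK space to W bits: collisions can only cause
        # spurious stalls, never a missed hazard.
        return hash(fk) & self.w_mask

    def enqueue(self, fk):
        self.queues[hash(fk) % len(self.queues)].append(fk)  # q = hash(FK) mod Q

    def tick(self):
        """One clock cycle: the oldest packet leaves the pipeline, then
        the first eligible head-of-line packet (round-robin) enters."""
        self.pipeline.popleft()
        served = None
        for i in range(len(self.queues)):
            q = self.queues[(self.rr + i) % len(self.queues)]
            if q and self.digest(q[0]) not in self.pipeline:
                served = q.popleft()
                self.rr = (self.rr + i + 1) % len(self.queues)
                break
        self.pipeline.append(None if served is None else self.digest(served))
        return served

s = LockingScheduler(num_queues=2, w_bits=8, depth=3)
s.enqueue("flowA"); s.enqueue("flowA")
# The second flowA packet is stalled until the first leaves the pipeline.
print([s.tick() for _ in range(5)])  # → ['flowA', None, None, 'flowA', None]
```

Because the comparator only checks W-bit digests, its cost stays small (the O(Q x W x N) logic quantified in the silicon overhead section) at the price of occasional false conflicts between distinct flows.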
[Figure (how it works): a flow key extractor computes the FK from each header; a dispatcher enqueues headers into queue q = hash(FK) mod Q (queues 1 … Q); a round-robin scheduler with a W-bit comparator admits headers into the N-stage processing function pipeline (stages 1 … N), which accesses state via read(FK) and write(FK).]
§ Trace sj-12 (worst case)
§ Queue size = 10 headers, W = 4 bits
[Figure: minimum throughput (top row) and maximum 99th-percentile latency in clock cycles, log scale 10^0 to 10^3 (bottom row), vs. pipeline depth (clock cycles), for 1, 4, and 8 queues. One curve per flow key: * (global), 5-tuple, ipdst, ipdst/16.]
Silicon overhead
Queues: memory requirements
§ 88-byte header size (80 + 8 bytes metadata)
§ 4 queues
§ Queue size = 100 headers
§ 35.2 KB ~ 3.5% of an RMT stage
Scheduler: combinatorial logic
§ O(Q×W×N)
§ A few tens of thousands of logic gates with the Q, W, N used in simulations
§ Today's ASICs have more than 10^8 gates
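The queue memory figure checks out arithmetically (our quick calculation from the bullet parameters above):

```python
# 88-byte headers (80 + 8 bytes metadata), 4 queues of 100 headers each.
header_bytes = 80 + 8
num_queues, queue_len = 4, 100
total_kb = header_bytes * num_queues * queue_len / 1000
print(total_kb)  # → 35.2 (KB), ~3.5% of an RMT stage's memory
```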