brandon heller block design review: substrate decap and ipv4 parse
Brandon Heller
http://www.arl.wustl.edu/projects/techX
Block Design Review:
Substrate Decap and IPv4 Parse
2 - Brandon Heller - 05/03/23
Revision History
9/26/06 (BDH):
» Released
9/28/06 (BDH):
» SD now at 5Gbps+

Contents
[Diagram: IPv4 MR pipeline blocks (Rx, Substrate Decap, Parse, Lookup, Header Format, QM, Tx); slide taken from PlanetLab_Design.ppt]
For SD and Parse:
» overview
» block diagram
» memory usage
» code locations
» test procedures
Performance analysis
» Unexpected interactions
» Future work

Substrate Decap

Substrate Decap
[Diagram: IPv4 MR pipeline blocks; slide taken from PlanetLab_Design.ppt]
Main functions:
» validate & consume Ethernet header
» look up code_option and slice_data_ptr based on VLAN tag
» validate & consume substrate UDP/IP headers
» pass relevant fields to IPv4 parse
Single code path
NN communication
Uses 8 threads
Name change from Demux

IPv4 MR Functional Blocks
[Diagram: Substrate Decap input and output formats within the IPv4 MR pipeline; slide taken from PlanetLab_Design.ppt]
Input ring data (one line per 32-bit word):
    Buf Handle (32b)
    Port (8b), Reserved (8b), Eth. Frame Len (16b)
Packet in buffer:
    Ethernet Header: DstAddr (6B), SrcAddr (6B), Type=802.1Q (2B), VLAN (2B), Type=IP (2B)
    IP Header: Ver/HLen/Tos/Len (4B), ID/Flags/FragOff (4B), TTL (1B), Protocol = UDP (1B), Hdr Cksum (2B), Src Addr (4B), Dst Addr (4B), IP Options (0-40B)
    UDP Header: Src Port (2B), Dst Port (2B), UDP length (2B), UDP checksum (2B)
    UDP Payload (MN Packet)
    Ethernet Trailer: PAD (nB), CRC (4B)
Output ring data (one line per 32-bit word):
    Buf Handle (32b)
    Slice ID (VLAN) (16b), Rx UDP DPort (16b)
    MN Frm Offset (16b), MN Frm Length (16b)
    Rx IP SAddr (32b)
    Rx UDP SPort (16b), Reserved (12b), Code (4b)
    Slice Data Ptr (32b)
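
As a rough illustration of the Substrate Decap output ring data on this slide, the six 32-bit words could be packed and unpacked as sketched below. The struct, the function names, and the bit positions inside each word (Slice ID in the upper half, Code in the low 4 bits) are assumptions made for the example; the actual definitions live in ring_formats.h.

```c
#include <stdint.h>

#define SD_RING_WORDS 6

/* Hypothetical unpacked view of the SD output ring data. */
typedef struct {
    uint32_t buf_handle;
    uint16_t slice_id;        /* VLAN */
    uint16_t rx_udp_dport;
    uint16_t mn_frm_offset;
    uint16_t mn_frm_length;
    uint32_t rx_ip_saddr;
    uint16_t rx_udp_sport;
    uint8_t  code;            /* only the low 4 bits are used */
    uint32_t slice_data_ptr;
} sd_ring_fields_t;

/* Pack the fields into six words; the 12 reserved bits are left zero. */
void sd_ring_pack(const sd_ring_fields_t *f, uint32_t w[SD_RING_WORDS])
{
    w[0] = f->buf_handle;
    w[1] = ((uint32_t)f->slice_id << 16) | f->rx_udp_dport;
    w[2] = ((uint32_t)f->mn_frm_offset << 16) | f->mn_frm_length;
    w[3] = f->rx_ip_saddr;
    w[4] = ((uint32_t)f->rx_udp_sport << 16) | (uint32_t)(f->code & 0x0Fu);
    w[5] = f->slice_data_ptr;
}

/* Inverse of sd_ring_pack(). */
void sd_ring_unpack(const uint32_t w[SD_RING_WORDS], sd_ring_fields_t *f)
{
    f->buf_handle     = w[0];
    f->slice_id       = (uint16_t)(w[1] >> 16);
    f->rx_udp_dport   = (uint16_t)(w[1] & 0xFFFFu);
    f->mn_frm_offset  = (uint16_t)(w[2] >> 16);
    f->mn_frm_length  = (uint16_t)(w[2] & 0xFFFFu);
    f->rx_ip_saddr    = w[3];
    f->rx_udp_sport   = (uint16_t)(w[4] >> 16);
    f->code           = (uint8_t)(w[4] & 0x0Fu);
    f->slice_data_ptr = w[5];
}
```

Keeping pack and unpack side by side makes it easy to check that the producer (Substrate Decap) and the consumer (Parse) agree on the layout.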

Ethernet Validation
No alignment necessary
Counters kept in non-VLAN-specific region
Tests for
» invalid Ethernet packet length
» non-VLAN tag protocol ID
» non-locally-addressed packet
» unrecognized VLAN
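
The four Ethernet tests could be sketched in portable C as below. This is not the microengine code in substrate_decap.c: the function name, the result enum, the vlan_valid lookup array, and the minimum-length threshold are all hypothetical. Only the 0x8100 tag protocol ID and the 12-bit VLAN ID mask come from the 802.1Q frame format.

```c
#include <stdint.h>
#include <string.h>

#define ETH_TPID_8021Q   0x8100u  /* IEEE 802.1Q tag protocol ID */
#define ETH_VLAN_HDR_LEN 18u      /* DstAddr 6 + SrcAddr 6 + TPID 2 + VLAN 2 + Type 2 */

typedef enum {
    ETH_OK,
    ETH_DROP_LEN,   /* invalid Ethernet packet length */
    ETH_DROP_TPID,  /* non-VLAN tag protocol ID */
    ETH_DROP_DST,   /* non-locally-addressed packet */
    ETH_DROP_VLAN   /* unrecognized VLAN */
} eth_result_t;

/* vlan_valid is a hypothetical 4096-entry array, nonzero = recognized VLAN. */
eth_result_t eth_validate(const uint8_t *frame, uint16_t frame_len,
                          const uint8_t my_mac[6], const uint8_t *vlan_valid)
{
    uint16_t tpid;
    uint16_t vlan;

    /* Assumed threshold: the frame must at least hold the VLAN header. */
    if (frame_len < ETH_VLAN_HDR_LEN)
        return ETH_DROP_LEN;

    tpid = (uint16_t)((frame[12] << 8) | frame[13]);
    if (tpid != ETH_TPID_8021Q)
        return ETH_DROP_TPID;

    if (memcmp(frame, my_mac, 6) != 0)
        return ETH_DROP_DST;

    /* Low 12 bits of the tag control field carry the VLAN ID. */
    vlan = (uint16_t)(((frame[14] << 8) | frame[15]) & 0x0FFF);
    if (!vlan_valid[vlan])
        return ETH_DROP_VLAN;

    return ETH_OK;
}
```

Each non-OK result maps onto one of the drop counters, so a caller can increment exactly one counter per rejected frame.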

VLAN Table
VLAN     code_opt   slice_data_ptr
0        0          0
1        0          0
…        …          …
0xaaa    1          (points to slice data: SD data, P data, HF data)
…        …          …
0xfff    0          0
code_option = 0 implies invalid slice
» “on switch” for a slice in the data plane
SD data is currently only counters
64B slice data
SRAM space for all 4096 VLANs
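
A minimal sketch of the table lookup, assuming a flat array of 4096 entries indexed by VLAN ID and 64B of slice data per VLAN laid out contiguously. The type and function names are hypothetical (the real entry type is substrate_decap_vlan_table_entry_t), and the contiguous layout is an assumption.

```c
#include <stdint.h>

#define VLAN_COUNT       4096u
#define SLICE_DATA_BYTES 64u

/* Hypothetical table entry; code_option == 0 marks an invalid slice. */
typedef struct {
    uint32_t code_option;
    uint32_t slice_data_ptr;
} vlan_entry_t;

/* Turn a slice "on" by giving it a nonzero code_option and a data pointer. */
void vlan_enable(vlan_entry_t *table, uint16_t vlan,
                 uint32_t code_option, uint32_t slice_data_base)
{
    uint32_t idx = vlan & 0x0FFFu;
    table[idx].code_option = code_option;
    table[idx].slice_data_ptr = slice_data_base + idx * SLICE_DATA_BYTES;
}

/* Returns 1 and fills the outputs for an enabled slice, 0 otherwise. */
int vlan_lookup(const vlan_entry_t *table, uint16_t vlan,
                uint32_t *code_option, uint32_t *slice_data_ptr)
{
    const vlan_entry_t *e = &table[vlan & 0x0FFFu];
    if (e->code_option == 0)
        return 0;  /* invalid slice: caller counts an unrecognized VLAN */
    *code_option = e->code_option;
    *slice_data_ptr = e->slice_data_ptr;
    return 1;
}
```

Because code_option doubles as the on switch, a zero-filled table rejects every VLAN until a slice is explicitly enabled.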

Substrate UDP/IP Validation
Header checks per RFC1812:
» IP ver other than 4
» invalid header length
» length too small
» IP len doesn't match Enet-deduced IP len
» UDP len doesn't match IP-deduced UDP len
NOTE: need to check Ethernet length, to ensure that padded 64B packets are using the correct length
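
The two length-consistency checks, including the padding caveat in the note, might look like the sketch below. It assumes the frame length handed to the block excludes the 4B FCS, so a minimum-size frame is 60B; that convention and the function names are assumptions, not taken from the source code.

```c
#include <stdint.h>

#define ETH_VLAN_HDR_LEN  18u
#define ETH_MIN_FRAME_LEN 60u  /* 64B minimum frame minus 4B FCS (assumed convention) */

/* Does the IP total length agree with the length deduced from the Ethernet frame? */
int ip_len_matches_frame(uint16_t frame_len, uint16_t ip_total_len)
{
    uint16_t enet_ip_len;

    if (frame_len < ETH_VLAN_HDR_LEN)
        return 0;
    enet_ip_len = (uint16_t)(frame_len - ETH_VLAN_HDR_LEN);

    /* A minimum-size frame may carry pad bytes after the IP datagram,
     * so the IP length is only required to fit. */
    if (frame_len <= ETH_MIN_FRAME_LEN)
        return ip_total_len <= enet_ip_len;

    return ip_total_len == enet_ip_len;
}

/* Does the UDP length agree with the IP total length minus the IP header? */
int udp_len_matches_ip(uint16_t ip_total_len, uint8_t ip_hdr_words, uint16_t udp_len)
{
    uint16_t ip_hdr_len = (uint16_t)(ip_hdr_words * 4u);

    if (ip_total_len < ip_hdr_len)
        return 0;
    return udp_len == (uint16_t)(ip_total_len - ip_hdr_len);
}
```

For example, a 100B frame must carry an IP total length of exactly 82, while a 60B frame accepts any IP total length up to 42.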

SD Block Diagram
[Block diagram; recoverable labels grouped by function:]
init
dl_source(): Wait for prev ctx, Signal next ctx, NN Dequeue
substrate_decap(): Read Eth/IP Hdrs, Validate Ethernet, Read VLAN table, Validate IP, Read UDP hdr, Validate UDP, Prepare ring data
dl_sink(): Wait for prev ctx, Signal next ctx, NN Enqueue
signal
mem access:
» DRAM: 5 8B reads
» SRAM: 2 4B reads
» DRAM: 2 8B reads
add one 4B SRAM increment per counter (none currently for common case)

File locations (in …/IPv4_MR/)
Code
» src/substrate_decap/PL/substrate_decap.[c,h]
» src/dispatch_loop/PL/substrate_decap_dl.[c,h]
» src/dispatch_loop/PL/dl_source.[c,h]
    dl_source() and dl_sink() functions add ordered thread synchronization if the following are defined: DL_ORDERED, FIRST_ORDERED_ME, LAST_ORDERED_ME
» src/IXP2XXX_book/Chapter09/ordered_signal.[c,h]
    functions for ordered thread synchronization
» src/dispatch_loop/PL/nn_rings.[c,h]
    functions for enqueuing and dequeuing NN ring data
Data formats
» src/PL/ipv4_common.h
    IP and UDP structure definitions
» src/PL/substrate_common.h
    Ethernet VLAN structure definitions
» src/dispatch_loop/PL/ring_formats.h
    ring data struct defs
» build/PL/dispatch_loop/dl_system.h
    memory locations

Required Includes
Files
» IXA_SDK_4.0\microengineC\src\intrinsic.c
» IXA_SDK_4.0\microengineC\src\rtl.c
Directories
» IXA_SDK_4.0\src\library\microblocks_library\microc\
» IXA_SDK_4.0\MicroengineC\include\..\..\..\..\
» IXA_SDK_4.0\src\library\dataplane_library\microc\
These are required to gain access to the buffer libraries and intrinsic functions!

SD Initialization
All memory locations defined in dl_system.h, incl:
» locations for MAC address
    IPV4_SD_MAC_ADDR_HI32
    IPV4_SD_MAC_ADDR_LO16
» non-VLAN-specific counters
    IPV4_SD_COUNTERS_BASE
    IPV4_SD_COUNTERS_SIZE
» VLAN table
    IPV4_SD_VLAN_CODE_OPT_TABLE_x (BASE, SIZE, ENTRY_SIZE)
» VLAN-specific memory
    SLICE_DATA_TABLE_x (BASE, SIZE, ENTRY_SIZE, ENTRY_TOTAL)
    IPV4_SD_SLICE_DATA_ENTRY_OFFSET
At least one slice must be initialized to send packets
» Call init_slice() from system_init.ind
» Currently 0xaaa initialized by default
» All counters zeroed
SD caches MAC address in registers
Thread 0 waits for signal from rx

Substrate Decap Validation
All validation tests done with 1 thread and substrate_decap_tests.tcs
» Ethernet validation/counter tests
    invalid Ethernet packet length
    non-VLAN tag protocol ID
    non-locally-addressed packet
    unrecognized VLAN
» UDP/IP validation/counter tests
    IP ver other than 4
    invalid header length
    length too small
    IP len doesn't match Enet-deduced IP len
    UDP len doesn't match IP-deduced UDP len
» Watched counters for proper number of increments
Fully valid packet: vlan_ip_udp_ip_udp/tcp (speed_test_all_valid.tcs)
» Verified all fields of output ring data were as expected
» Single-thread plus 8-thread
Hardware testing
» Uses Fred’s sp++ utility with a logged trace of the above packets
» observed exact same behavior as in simulation

SD Other
Bugs
» substrate IP proto not checked, should correspond to UDP
Untested
» buffer drops
Data Structures
» substrate_decap_vlan_table_entry_t
» substrate_decap_stats_t
» substrate_decap_vlan_stats_t
» vlan_ip_header
    ipv4_header_struct
    vlan_header_struct
» udp_header
Performance
» coming later

IPv4 Parse

IPv4 Parse
[Diagram: IPv4 MR pipeline blocks; slide taken from PlanetLab_Design.ppt]
Main functions
» Read/align IP header
» Validate and consume IP header (per RFC1812 5.2.2)
» Update IP header
    Dec TTL
    Recalc IP checksum
    Write updated checksum to DRAM
» Read/align L4 (UDP/TCP/other) header
» Mark exceptions for Header Format
» Extract fields for Lookup

IPv4 MR Functional Blocks
IPv4 Exception Bits
» Bit 0: TTL = 0 or 1
» Bit 1: Options
[Diagram: Parse input and output formats within the IPv4 MR pipeline]
Input ring data (one line per 32-bit word):
    Buf Handle (32b)
    Slice ID (VLAN) (16b), Rx UDP DPort (16b)
    MN Frm Offset (16b), MN Frm Length (16b)
    Rx IP SAddr (32b)
    Rx UDP SPort (16b), Reserved (12b), Code (4b)
    Slice Data Ptr (32b)
Output ring data (one line per 32-bit word):
    Buf Handle (32b)
    IP Pkt Length (16b), IP Pkt Offset (16b)
    Lookup Key[143-112] Slice ID/Rx UDP DPort (32b)
    Lookup Key[111-80] DA (32b)
    Lookup Key[79-48] SA (32b)
    Lookup Key[47-16] Ports (32b)
    Lookup Key[15-0] Proto/TCP_Flags (16b), Exception Bits (12b), LFlags (4b)
    Slice Data Ptr (32b)
    Reserved (28b), Code (4b)
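
The two exception bits can be derived directly from the IPv4 header, as in this sketch. The macro and function names are hypothetical; the header offsets (header length in the low nibble of byte 0, TTL in byte 8) are the standard IPv4 layout, and a header length above 5 words means options are present.

```c
#include <stdint.h>

#define IPV4_EXC_TTL     (1u << 0)  /* Bit 0: TTL = 0 or 1 */
#define IPV4_EXC_OPTIONS (1u << 1)  /* Bit 1: Options */

/* Compute the exception bits for an IPv4 header starting at ip_hdr. */
uint16_t ipv4_exception_bits(const uint8_t *ip_hdr)
{
    uint16_t bits = 0;
    uint8_t hdr_words = (uint8_t)(ip_hdr[0] & 0x0F);
    uint8_t ttl = ip_hdr[8];

    if (ttl <= 1)
        bits = (uint16_t)(bits | IPV4_EXC_TTL);
    if (hdr_words > 5)
        bits = (uint16_t)(bits | IPV4_EXC_OPTIONS);
    return bits;
}
```

A packet with both conditions set yields the value 3, so a later block can test each bit independently.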

IPv4 Internal Header Formats
Internal header layout: Zeros (4b), Type (6b), Len (6b), Type Dependent Data (8B)
Fields shown in the diagram: Rx UDP DPort (2B), Tx UDP DPort (2B), Tx UDP SPort (2B), Tx IP DAddr (4B)
Source: Ingress LC
    Category: Normal Fwd
    Internal Hdr: None
    RMPE Action: Classify and fwd
Source: GPE
    Category: No Classify (w/ FwdKey**)
    Type bit field: [0]
    Reason: Original pkt, reinjected to data path
    Internal Hdr: Rx UDP DPort + FwdKey
    RMPE Action: Perform substrate lookup to resolve LCAddr, port and QID
Source: GPE
    Category: Classify (w/o FwdKey)
    Type bit field: [1]
    Reason: ICMP or local traffic
    Internal Hdr: Rx UDP DPort
    RMPE Action: Classify and fwd
4 bits at start discriminate between IPv4 and internal headers
for more details see planetlab_IPv4_MR_parse_hdr_format.ppt in bdh4\techx\IPv4_MR_shared
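
A sketch of how the leading 4 bits could be used to tell an IPv4 header from an internal header. An IPv4 header always starts with version 4, and the internal header starts with four zero bits. The placement of Type and Len in the remaining 12 bits of the first two bytes (most significant bit first) is an assumption, as are all names.

```c
#include <stdint.h>

typedef enum { HDR_IPV4, HDR_INTERNAL, HDR_UNKNOWN } hdr_kind_t;

/* Classify the header that starts at p by its top 4 bits. */
hdr_kind_t classify_header(const uint8_t *p)
{
    uint8_t top4 = (uint8_t)(p[0] >> 4);

    if (top4 == 4)
        return HDR_IPV4;      /* IPv4 version field */
    if (top4 == 0)
        return HDR_INTERNAL;  /* Zeros (4b) */
    return HDR_UNKNOWN;
}

/* Assumed layout of the first 16 bits: 0000 tttttt llllll */
uint8_t internal_hdr_type(const uint8_t *p)
{
    uint16_t w = (uint16_t)((p[0] << 8) | p[1]);
    return (uint8_t)((w >> 6) & 0x3F);
}

uint8_t internal_hdr_len(const uint8_t *p)
{
    return (uint8_t)(p[1] & 0x3F);
}
```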

Parse Validation
IPv4_parse_tests.tcs
» Invalid internal header
    invalid len for internal header type
    internal header type unknown
» Invalid IPv4 (RFC 1812 checks)
    IP ver other than 4
    invalid header length
    length too small
    SD IP len doesn't match packet IP len
    invalid header checksum
» IPv4 Exceptions
    options flag set in packet
    TTL equals zero
    TTL equals one
IPv4_parse_valid.tcs
» Fully valid, no-exceptions packets
    from GPE, classify
    from GPE, non-classify
    ingress, TCP
    ingress, UDP

Parse Block Diagram
[Block diagram; recoverable labels grouped by function:]
init
dl_source(): Wait for prev ctx, Signal next ctx, NN Dequeue
ipv4_parse(): Read Int Hdr, Handle Internal, Read IP, Validate IP, Read L4, Handle L4, Prepare ring data
dl_sink(): Wait for prev ctx, Signal next ctx, NN Enqueue
signal
Checksum
mem access:
» DRAM: 2 8B reads
» DRAM: 4 8B reads
» DRAM: 4 8B reads
» (DRAM: 2 8B reads)
add one 4B SRAM increment per counter (none currently for common case)

File locations (in …/IPv4_MR/)
Code
» src/ipv4/PL/ipv4_parse.[c,h]
» src/dispatch_loop/PL/parse_dl.[c,h]
» src/parse/PL/parse.[c,h]
» src/dispatch_loop/PL/dl_source.[c,h]
    dl_source() and dl_sink() functions add ordered thread synchronization if the following are defined: DL_ORDERED, FIRST_ORDERED_ME, LAST_ORDERED_ME
» src/IXP2XXX_book/Chapter09/ordered_signal.[c,h]
    functions for ordered thread synchronization
» src/dispatch_loop/PL/nn_rings.[c,h]
    functions for enqueuing and dequeuing NN ring data
Data formats
» src/PL/ipv4_common.h
    IP and UDP structure definitions
» src/dispatch_loop/PL/ring_formats.h
    ring data struct defs
» build/PL/dispatch_loop/dl_system.h
    memory locations

Parse Initialization
All memory locations defined in dl_system.h, incl:
» VLAN-specific memory
    SLICE_DATA_TABLE_x (BASE, SIZE, ENTRY_SIZE, ENTRY_TOTAL)
    IPV4_PARSE_SLICE_DATA_ENTRY_OFFSET
At least one slice must be initialized to send packets
» Call init_slice() from system_init.ind
» Currently 0xaaa initialized by default
» All counters zeroed

Other
Bugs
» none?
Untested
» buffer drops
Unimplemented
» checksum for IP options not handled yet
Data Structures
» parse_vlan_stats_t
» ipv4_header_struct
» udp_header_struct
» tcp_header_struct
Performance
» coming next

Performance

Packet Sizes
Ethernet VLAN Header        18B
Substrate Header
    IPv4 Header             20B
    UDP Header              8B
Metanet Frame
    GPE to MPE              n
    IPv4 Header             20B
    UDP Header              8B
    Payload                 n
Ethernet Pad                0
Ethernet FCS                4B
Total                       78B + internal + payload
Ethernet IFS                12B
Total Physical              90B + internal + payload

Cycle Budget (min eth packets)
To hit 5Gb rate:
» 76B per min Ethernet packet (64B min Eth + 12B IFS)
» 1.4 GHz clock rate
» 5 Gb/sec * 1B/8b * packet/76B = 8.22 Mp/sec
» 1.4 Gcycle/sec * 1 sec/8.22 Mp = 170.3 cycles per packet
» compute budget: 170 cycles
» latency budget: (threads*170)
    4 threads: 680 cycles
    8 threads: 1360 cycles

Cycle Budget (IPv4 MN packets)
To hit 5Gb rate:
» 90B per min IPv4 packet (78B min IPv4 MN + 12B IFS)
» 1.4 GHz clock rate
» 5 Gb/sec * 1B/8b * packet/90B = 6.94 Mp/sec
» 1.4 Gcycle/sec * 1 sec/6.94 Mp = 201.7 cycles per packet
» compute budget: 201 cycles
» latency budget: (threads*201)
    4 threads: 804 cycles
    8 threads: 1608 cycles
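
The cycle-budget arithmetic for both packet sizes can be reproduced with integer math, as in this sketch. The function and macro names are hypothetical; the 1.4 GHz clock, 5 Gb/s line rate, and the 76B and 90B on-wire sizes are the figures from the slides. Computed without intermediate rounding, the per-packet values are 170.24 and 201.6 cycles, which round down to the same 170 and 201 cycle budgets.

```c
#include <stdint.h>

#define CLOCK_HZ 1400000000ULL  /* 1.4 GHz microengine clock */
#define LINE_BPS 5000000000ULL  /* 5 Gb/s target rate */

/* Whole cycles available per packet, rounded down. */
uint32_t cycle_budget(uint32_t bytes_on_wire)
{
    return (uint32_t)((CLOCK_HZ * bytes_on_wire * 8ULL) / LINE_BPS);
}

/* With N threads working on N packets at once, each packet may take
 * N times the per-packet compute budget end to end. */
uint32_t latency_budget(uint32_t bytes_on_wire, uint32_t threads)
{
    return threads * cycle_budget(bytes_on_wire);
}
```

Calling cycle_budget(76) and cycle_budget(90) gives the compute budgets for minimum Ethernet and minimum IPv4 MN packets respectively.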

Performance Anomalies
Substrate Decap
Spot the issue!
[Figure annotations: more DRAM contention; unhidden DRAM latency]
these issues have since been fixed!

Substrate Decap Performance
Optimized common case (ingress, no options)
» Combined initial header checks
» No options assumed: single DRAM read
153 cycles typical
~650 cycles latency
337 control store instructions
Expected performance
» (201/153)*5Gb = ~6.5Gb expected performance
Simulated performance (as of 9/26/2006)
» >5 Gb, but something else slows down 6Gb input

SD Optimizations
possible optimizations
» caching VLAN-to-CodeOption table in Local Memory
» optimize nn_dequeue_incr() via assembly coding
» move VLAN counter computation off fast path?
» use transfer regs directly
    saves 9 cycles
» remove volatile statements

Parse Performance
single-threaded
» ~380 cycles for computation
» 1708 cycles latency
» 556 control store insts
Expected performance
» (201/380)*5Gb = <3Gb expected performance
Going to optimize a bit before adding all 8 threads
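
The expected-performance figures scale the 5Gb target by the ratio of the cycle budget to the measured cycles per packet. A small sketch of that arithmetic, in Mb/s to stay in integers (the function name is hypothetical):

```c
#include <stdint.h>

/* Expected rate = target rate * (budget cycles / measured cycles), rounded down. */
uint32_t expected_rate_mbps(uint32_t budget_cycles, uint32_t measured_cycles,
                            uint32_t target_mbps)
{
    return (uint32_t)(((uint64_t)target_mbps * budget_cycles) / measured_cycles);
}
```

With the 201-cycle budget, 153 cycles for Substrate Decap works out to roughly 6.57 Gb/s and 380 cycles for Parse to roughly 2.64 Gb/s, consistent with the ~6.5Gb and <3Gb figures quoted on the slides.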

Parse Optimizations
possible optimizations
» incremental IPv4 checksum update per RFC1624
» checksum computation in assembler
» optimized 5LW alignment for IP read
» combined initial error-check to optimize common case
    reduces branch delays
    slows down exception path
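
To illustrate the first item, here is a portable-C sketch of the RFC 1624 incremental update (equation 3, HC' = ~(~HC + ~m + m')) applied to a TTL decrement, alongside a full recomputation for comparison. The function names are hypothetical and this is not the microengine implementation; it also assumes the caller has already handled TTL values of 0 or 1 as exceptions.

```c
#include <stdint.h>
#include <stddef.h>

/* Fold a 32-bit accumulator into a 16-bit ones' complement sum. */
uint16_t csum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Full checksum over len bytes; yields 0 over a header whose
 * checksum field is already correct. */
uint16_t ipv4_full_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    return (uint16_t)~csum_fold(sum);
}

/* RFC 1624 eqn. 3: new checksum from old checksum and one changed 16-bit word. */
uint16_t csum_incr_update(uint16_t hc, uint16_t m_old, uint16_t m_new)
{
    uint32_t sum = (uint16_t)~hc;

    sum += (uint16_t)~m_old;
    sum += m_new;
    return (uint16_t)~csum_fold(sum);
}

/* Decrement TTL (byte 8) and patch the checksum (bytes 10-11) in place. */
void ipv4_dec_ttl(uint8_t *hdr)
{
    uint16_t hc = (uint16_t)((hdr[10] << 8) | hdr[11]);
    uint16_t m_old = (uint16_t)((hdr[8] << 8) | hdr[9]);  /* TTL | Protocol */
    uint16_t m_new;

    hdr[8]--;
    m_new = (uint16_t)((hdr[8] << 8) | hdr[9]);
    hc = csum_incr_update(hc, m_old, m_new);
    hdr[10] = (uint8_t)(hc >> 8);
    hdr[11] = (uint8_t)(hc & 0xFF);
}
```

The incremental form touches only the changed word and the checksum word, so it avoids re-reading the whole header that the full recomputation needs.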

Implementation Status
Parse needs
» error testing
» IP options with checksum
» multithreading
» drop tests

Extra Slides

Parse Memory Usage
Memory reads/writes
» 2 8B DRAM reads: unaligned internal header
» 2 8B DRAM reads: unaligned internal header + FwdKey
» 4 8B DRAM reads: unaligned IPv4 header
» [0,6] DRAM reads: unaligned IPv4 header options
» 4 8B DRAM reads: unaligned L4 header
» 1 SRAM increment: per counter
» 1 DRAM write: updated TTL and checksum

Ethernet Validation
First, read packet from memory, guaranteed aligned
Not specific to any VLAN - in separate mem area
For efficiency, can keep counters in LM and update to RAM when a signal is triggered
typedef struct _substrate_decap_stats_t
{
    unsigned int rx;        // received
    unsigned int pass;      // passed to next stage
    unsigned int dropLen;   // invalid Ethernet packet length
    unsigned int dropTPID;  // non-VLAN tag protocol ID
    unsigned int dropDst;   // non-locally-addressed packet
    unsigned int dropVLAN;  // unrecognized VLAN
} substrate_decap_stats_t;
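
The idea of accumulating counters in LM and pushing them out to RAM on a signal could look like the following sketch. It models both memories as plain C structs, so it shows only the bookkeeping, not the microengine memory operations; the sd_stats_t type is a hypothetical mirror of substrate_decap_stats_t and sd_stats_flush() is an invented name.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of substrate_decap_stats_t. */
typedef struct {
    uint32_t rx;
    uint32_t pass;
    uint32_t dropLen;
    uint32_t dropTPID;
    uint32_t dropDst;
    uint32_t dropVLAN;
} sd_stats_t;

/* Add the locally accumulated counts to the shared copy, then clear
 * the local copy so the next flush only adds new events. */
void sd_stats_flush(sd_stats_t *local, sd_stats_t *shared)
{
    shared->rx       += local->rx;
    shared->pass     += local->pass;
    shared->dropLen  += local->dropLen;
    shared->dropTPID += local->dropTPID;
    shared->dropDst  += local->dropDst;
    shared->dropVLAN += local->dropVLAN;
    memset(local, 0, sizeof *local);
}
```

The fast path then only increments the local copy, and the cost of the shared-memory update is paid once per flush instead of once per packet.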

UDP/IP Validation
typedef struct _substrate_decap_slice_stats_t
{
    unsigned int dropIPVer;        // IP ver other than 4
    unsigned int dropHdrLen;       // invalid header length
    unsigned int dropLenSmall;     // length too small
    unsigned int dropLenMismatch;  // IP len doesn't match Enet IP len
    unsigned int dropUDPLen;       // UDP len doesn't match IP UDP len
    unsigned int pass;             // passed to next stage
} substrate_decap_slice_stats_t;

RFC 1812 5.2.2 IP Header Validation
(1) The packet length reported by the Link Layer must be large enough to hold the minimum length legal IP datagram (20 bytes).
(2) The IP checksum must be correct.
(3) The IP version number must be 4. If the version number is not 4 then the packet may be another version of IP, such as IPng or ST-II.
(4) The IP header length field must be large enough to hold the minimum length legal IP datagram (20 bytes = 5 words).
(5) The IP total length field must be large enough to hold the IP datagram header, whose length is specified in the IP header length field.
from http://www.faqs.org/rfcs/rfc1812.html
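
The five checks can be sketched as a single function in portable C. The names and the result enum are hypothetical. The checksum test (2) is run last here because it needs a validated header length, and one extra guard (header must fit in the received bytes) is added for memory safety; neither ordering nor guard is taken from the Parse source.

```c
#include <stdint.h>
#include <stddef.h>

typedef enum {
    IP_OK,
    IP_DROP_LEN_SMALL,  /* (1) link-layer length too small */
    IP_DROP_CKSUM,      /* (2) bad header checksum */
    IP_DROP_VER,        /* (3) version is not 4 */
    IP_DROP_HDR_LEN,    /* (4) header length below 5 words */
    IP_DROP_TOTAL_LEN   /* (5) total length smaller than header */
} ip_check_t;

/* Folded ones' complement sum of the header; 0xFFFF when the checksum is valid. */
uint16_t ip_hdr_sum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)sum;
}

ip_check_t rfc1812_check(const uint8_t *pkt, size_t link_len)
{
    size_t hlen;
    size_t total;

    if (link_len < 20)
        return IP_DROP_LEN_SMALL;                 /* (1) */
    if ((pkt[0] >> 4) != 4)
        return IP_DROP_VER;                       /* (3) */
    hlen = (size_t)(pkt[0] & 0x0F) * 4;
    if (hlen < 20)
        return IP_DROP_HDR_LEN;                   /* (4) */
    if (hlen > link_len)
        return IP_DROP_LEN_SMALL;                 /* extra guard, not in the RFC list */
    total = ((size_t)pkt[2] << 8) | pkt[3];
    if (total < hlen)
        return IP_DROP_TOTAL_LEN;                 /* (5) */
    if (ip_hdr_sum(pkt, hlen) != 0xFFFF)
        return IP_DROP_CKSUM;                     /* (2) */
    return IP_OK;
}
```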