
©2017 Open-NFP

Accelerating Networked Applications with Flexible Packet Processing

Antoine Kaufmann, Naveen Kr. Sharma, Thomas Anderson, Arvind Krishnamurthy

Timothy Stamler, Simon Peter

University of Washington · The University of Texas at Austin


Networks are becoming faster

[Figure: Ethernet bandwidth (bits/s) vs. year of standard release, 1990–2020, from 100MbE through 1GbE, 10GbE, 40GbE, 100GbE, and 400GbE]

5 ns inter-arrival time for 64B packets at 100 Gbps


...but software packet processing is slow

Recv+send TCP stack processing time (2.2 GHz):
▪ Linux: 3.5 µs
▪ Kernel bypass: ~1 µs

Single-core performance has stalled. Parallelize? Assuming 1 µs per packet at 100 Gb/s, excluding Amdahl's law:
▪ 64B packets => 200 cores
▪ 1KB packets => 14 cores

Many cloud apps are dominated by packet processing:
▪ Key-value storage, real-time analytics, intrusion detection, file service, ...
▪ All rely on small messages: latency & throughput equally important
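The core-count estimate above can be reproduced with simple arithmetic (a sketch; the 1 µs figure is the kernel-bypass processing time quoted above, and Ethernet framing overhead is ignored, which is why the results land slightly below the slide's rounded 200 and 14):

```python
# Cores needed to keep up with a link when each packet costs fixed CPU time.
# Ignores Ethernet framing overhead (preamble, inter-frame gap, CRC).

def cores_needed(link_bps, packet_bytes, per_packet_us):
    packets_per_sec = link_bps / (packet_bytes * 8)
    # One core supplies one CPU-second per second, so the number of cores
    # equals the total CPU-seconds of packet processing demanded per second.
    return packets_per_sec * per_packet_us * 1e-6

print(round(cores_needed(100e9, 64, 1.0)))    # -> 195 cores for 64B packets
print(round(cores_needed(100e9, 1024, 1.0)))  # -> 12 cores for 1KB packets
```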


What are the alternatives?

RDMA
▪ Bypasses server software entirely
▪ Not well matched to client/server processing (security, two-sided for RPC)

Full application offload to NIC (FPGA, etc.)
▪ Application now limited to slower hardware-development speed
▪ Difficult to change once deployed

Fixed-function offloads (segmentation, checksums, RSS)
▪ Good start!
▪ Too rigid for today's complex server & network architecture (next slide)

Flexible function offload to NIC (NFP, FlexNIC, etc.)
▪ Break down functions (e.g., RSS) and provide an API for software flexibility


Fixed-function offloads are not well integrated

Wasted CPU cycles
▪ Packet parsing and validation repeated in software
▪ Packets formatted for the network, not for software access
▪ Multiplexing and filtering repeated in software

Poor cache locality, extra synchronization
▪ NIC steers packets to cores by connection
▪ Application locality may not match connection locality


A more flexible NIC can help

With multi-core, the NIC needs to pick a destination core
▪ The "right" core is application specific

The NIC is perfectly situated – it sees all traffic
▪ Can scalably preprocess packets according to software needs
▪ Can scalably forward packets among host CPUs and the network

With kernel bypass, only the NIC can enforce OS policy
▪ Need flexible NIC mechanisms, or go back into the kernel


Talk Outline

• Motivation
• FlexNIC model
• Experience with Agilio-CX as prototyping platform
• Accelerating packet-oriented networking (UDP, DCCP)
  • Key-value store
  • Real-time analytics
  • Network intrusion detection
• WiP: Accelerating stream-oriented networking (TCP)


FLEXNIC MODEL


FlexNIC: A Model for Integrated NIC/SW Processing [ASPLOS'16]

• Implementable at Tbps line rate & low cost

Match+action pipeline:

[Figure: Packet → Parser → extracted header fields → M+A Stage 1 → M+A Stage 2 → ... → modified fields; each stage pairs a match table with an action ALU]


Match+Action Programs

Supports:
▪ Steer packets
▪ Calculate hashes/checksums
▪ Initiate DMA operations
▪ Trigger reply packets
▪ Modify packets

Does not support:
▪ Loops
▪ Complex calculations
▪ Keeping large state

Example program:
Match:  IF udp.port == kvs
Action: core = HASH(kvs.key) % ncores
        DMA hash, kvs TO Cores[core]
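The semantics of that steering program can be mimicked in software (a minimal Python sketch; the UDP port number and CRC-style hash are illustrative assumptions, not the NIC's actual implementation):

```python
import zlib

NCORES = 4
KVS_PORT = 11211  # assumed port for the key-value service

def steer(udp_port, kvs_key):
    """Match: IF udp.port == kvs; Action: core = HASH(kvs.key) % ncores."""
    if udp_port != KVS_PORT:
        return None  # match failed: packet falls through to the default path
    # The hash is over the application-level key, not the connection tuple.
    return zlib.crc32(kvs_key) % NCORES  # then: DMA hash,kvs TO Cores[core]

# Every request for the same key lands on the same core, regardless of
# which client connection it arrived on.
assert steer(KVS_PORT, b"foo") == steer(KVS_PORT, b"foo")
```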


FlexNIC: M+A for NICs

Efficient application-level processing in the NIC
▪ Improve locality by steering to cores based on app criteria
▪ Transform packets for efficient processing in SW
▪ DMA directly into and out of application data structures
▪ Send acknowledgements on the NIC

[Figure: NIC architecture with an ingress pipeline, an egress pipeline, a DMA pipeline, and queues connecting them]


Netronome Agilio-CX

We use the Agilio-CX to prototype FlexNIC
• Implement M&A programs in P4
• Run on the NIC

Our experience with the Agilio-CX:
▪ Improve locality by steering to cores based on app criteria
▪ Transform packets for efficient processing in SW
▪ DMA directly into and out of application data structures
▪ Send acknowledgements on the NIC


ACCELERATING PACKET-ORIENTED NETWORKING


Example: Key-Value Store

[Figure: Receive-side scaling (core = hash(connection) % N) spreads clients' requests across two cores' hash tables. Client 1 sends keys 3, 4; Client 2 sends keys 4, 7; Client 3 sends keys 7, 8 — so keys 4 and 7 end up accessed from both cores.]

• Lock contention
• Poor cache utilization


Key-based Steering

[Figure: With key-based steering, keys 3 and 4 always go to Core 1's hash-table partition and keys 7 and 8 to Core 2's, regardless of which client sent them.]

Match:  IF udp.port == kvs
Action: core = HASH(kvs.key) % N
        DMA hash, kvs TO Cores[core]

• No locks needed
• Higher cache utilization


Custom DMA

DMA to application-level data structures. Requires packet validation and transformation.

[Figure: The NIC appends items to the item log and posts fixed-size entries to an event queue. GET entries carry (ClientID, Hash, Key); SET entries carry (ClientID, ItemPointer).]
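The event-queue entries above can be sketched as fixed-size records (a Python illustration; the field widths, 16-byte key slot, and CRC hash are assumptions, not the actual wire format used in the work):

```python
import struct
import zlib

# Hypothetical fixed-size event-queue entry layouts (widths assumed):
#   GET: opcode, client id, precomputed key hash, key (fixed at 16B here)
#   SET: opcode, client id, pointer to the item appended to the item log
GET_ENTRY = struct.Struct("<BIQ16s")
SET_ENTRY = struct.Struct("<BIQ")
OP_GET, OP_SET = 0, 1

def nic_transform_get(client_id, key):
    """What the NIC could DMA after validating a GET request packet."""
    key_hash = zlib.crc32(key)  # the NIC computes the hash once, up front
    return GET_ENTRY.pack(OP_GET, client_id, key_hash, key.ljust(16, b"\0"))

entry = nic_transform_get(7, b"somekey")
assert len(entry) == GET_ENTRY.size  # software reads fixed-size records
```

Because every entry has a known size and a precomputed hash, the software side can index its hash table without re-parsing or re-hashing the packet.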


Evaluation of the Model

• Measure impact on application performance
• Key-based steering: use the NIC
• Custom DMA: software emulation of the M&A pipeline

• Workload: 100k 32B keys, 64B values, 90% GET
• 6-core Sandy Bridge Xeon 2.2 GHz, 2×10G links


Key-based steering

• Better scalability
▪ PCIe is the bottleneck for 4+ cores
• 45% higher throughput
• Processing time reduced to 310 ns

[Figure: Throughput (Mop/s) vs. number of CPU cores (1–5) for FlexKVS/Key, FlexKVS/RSS, FlexKVS/Linux, and Memcached]

Custom DMA reduces processing time to 200 ns


Real-time Analytics System

(De-)multiplexing threads are the performance bottleneck
• 2 CPUs required for 10 Gb/s => 20 CPUs for 100 Gb/s

[Figure: In software, a demux thread (which also generates ACKs) and a mux thread sit between the NIC's Rx/Tx queues and the Count and Rank workers]


Real-time Analytics System

Offload (de)multiplexing and ACK generation to FlexNIC
• No CPUs needed => energy efficiency

[Figure: The same pipeline with demux, ACK generation, and mux moved onto the NIC; the Count and Rank workers talk directly to NIC queues]


Performance Evaluation

[Figure: Throughput (Mtuples/s) on balanced and grouped workloads for Apache Storm, FlexStorm/Linux, FlexStorm/Bypass, and FlexStorm/FlexNIC; annotated speedups of 0.5x / 1x / 2x (balanced) and 0.3x / 1x / 2.5x (grouped)]

• Cluster of 3 machines
• Determine top-n Twitter posters (real trace)
• Measure attainable throughput


Network Intrusion Detection

Snort sniffs packets and analyzes them
• Parallelized by running multiple instances
• Status quo: receive-side scaling

FlexNIC:
• Analyze rules loaded into Snort
• Partition rules among cores to maximize caching
• Fine-grained steering to cores

Result: 1.6x higher throughput, 30% fewer cache misses
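The partitioning idea can be sketched as follows (illustrative only; the rule analysis in the actual work is more involved than grouping by destination port, and the rules and steering function here are made up):

```python
# Partition detection rules by destination port so each core keeps a small,
# cache-friendly rule subset, then steer packets to the core owning the
# rules that could match them.
NCORES = 4

# Hypothetical rule set: (rule name, destination port it inspects).
rules = [("http", 80), ("tls", 443), ("dns", 53), ("ssh", 22)]

def owner_core(dst_port):
    # Stand-in for the NIC's fine-grained match+action steering.
    return dst_port % NCORES

partitions = {c: [] for c in range(NCORES)}
for name, port in rules:
    partitions[owner_core(port)].append(name)

# A packet to port 80 reaches exactly the core holding the HTTP rules,
# so that rule subset stays hot in one core's cache.
assert "http" in partitions[owner_core(80)]
```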


ACCELERATING STREAM-ORIENTED NETWORKING


Ongoing work: Stream protocols

Full TCP processing is too complex for M&A processing
▪ Significant connection state required
▪ Tricky edge cases: reordering, drops
▪ Complicated algorithms for congestion control

But the common case is simpler: it can be offloaded
▪ Reduces the critical path in software

Opportunity: enforce correct protocol behavior on untrusted apps
▪ Focus: congestion control


FlexTCP ideas

Safety-critical & common-case processing on NIC
▪ Includes filtering, validating ACKs, enforcing rate limits

Handle all non-common cases in software
▪ E.g. packet drops, reordering, timeouts, …

Requires only small per-flow state
▪ 64 bytes (SEQ/ACK, queues, rate limit, …)


FlexTCP overview


Flexible congestion control offload

NIC enforces per-flow rate limits set by the trusted kernel
▪ Flexibility to choose congestion control

Example: DCTCP

Common-case processing on NIC
▪ Echo ECN marks in generated ACKs
▪ Track fraction of ECN-marked packets per flow

Kernel implements control policy (DCTCP)
▪ Use NIC-reported fraction of ECN-marked packets
▪ Adapt rate limit according to the DCTCP protocol

Result: indistinguishable from pure software implementations
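The division of labor described above — the NIC counts ECN marks, the kernel adapts the rate limit — can be sketched with the standard DCTCP update (the EWMA gain g = 1/16 is DCTCP's customary choice, an assumption here; additive increase on unmarked intervals is omitted for brevity):

```python
# Kernel-side DCTCP control policy, driven by the NIC-reported counts of
# ECN-marked vs. total packets for a flow over one control interval.
G = 1.0 / 16  # DCTCP EWMA gain (typical value; not stated on the slide)

def kernel_update(rate, alpha, marked, total):
    """One control interval: smooth the mark fraction, then cut the rate."""
    frac = marked / total if total else 0.0
    alpha = (1 - G) * alpha + G * frac    # EWMA estimate of congestion extent
    rate = rate * (1 - alpha / 2)         # DCTCP multiplicative decrease
    return rate, alpha                    # new rate becomes the NIC's limit

# 25% of packets marked in this interval: the rate limit drops gently,
# in proportion to the smoothed congestion estimate.
rate, alpha = kernel_update(10e9, 0.0, marked=25, total=100)
```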


FlexTCP overhead evaluation

• We implemented FlexTCP in P4
• Run on the Agilio-CX with a null application
• Compare throughput to a basic NIC (wiretest)

[Figure: Throughput (Gb/s) vs. packet size (256, 512, 1024, 1500 bytes) for Basic and Full]


Summary

Networks are becoming faster, CPUs are not
▪ Server applications need to keep up
▪ Fast I/O requires an efficient I/O path to the application

Flexible offloads can eliminate inefficiencies
▪ Application control over where packets are processed
▪ Efficient steering, validation, transformation

Case studies: key-value store, real-time analytics, IDS
▪ Up to 2.5x throughput & latency improvement vs. kernel bypass
▪ Vastly more energy-efficient (no CPUs for packet processing)
