User-space Network Processing


Upload: ryousei-takano

Post on 13-Apr-2017


TRANSCRIPT

Page 1: User-space Network Processing

2015/12/18

Page 2: User-space Network Processing

[Summary slide in Japanese, partially unrecoverable: user-space TCP/IP; mTCP applied to memcached; figures of 35% and 10x appear.]

Page 3: User-space Network Processing

[Slide in Japanese, partially unrecoverable: goals; mTCP + Intel DPDK (code on github: mTCP+DPDK); Key-Value Store porting targets Redis and Memcached.]

Page 4: User-space Network Processing

Facebook's memcached workload (Atikoglu et al.)

[Figure 2 panels: "Key size CDF by appearance" (key size in bytes, 0 to 100), "Value size CDF by appearance" and "Value size CDF by total weight" (value size in bytes, 1 to 1e+06), each plotted for the USR, APP, ETC, VAR, SYS pools.]

Figure 2: Key and value size distributions for all traces. The leftmost CDF shows the sizes of keys, up to Memcached's limit of 250B (not shown). The center plot similarly shows how value sizes distribute. The rightmost CDF aggregates value sizes by the total amount of data they use in the cache, so for example, values under 320B or so in SYS use virtually no space in the cache; 320B values weigh around 8% of the data, and values close to 500B take up nearly 80% of the entire cache's allocation for values.

3.2 Request Sizes

Next, we look at the sizes of keys and values in each pool (Fig. 2), based on SET requests. All distributions show strong modalities. For example, over 90% of APP's keys are 31 bytes long, and value sizes around 270B show up in more than 30% of SET requests. USR is the most extreme: it only has two key size values (16B and 21B) and virtually just one value size (2B). Even in ETC, the most heterogeneous of the pools, requests with 2-, 3-, or 11-byte values add up to 40% of the total requests. On the other hand, it also has a few very large values (around 1MB) that skew the weight distribution (rightmost plot in Fig. 2), leaving less caching space for smaller values.

Small values dominate all workloads, not just in count, but especially in overall weight. Except for ETC, 90% of all cache space is allocated to values of less than 500B. The implications for caching and system optimizations are significant. For example, network overhead in the processing of multiple small packets can be substantial, which explains why Facebook coalesces as many requests as possible in as few packets as possible [9]. Another example is memory fragmentation. The strong modality of each workload implies that different Memcached pools can optimize memory allocation by modifying the slab size constants to fit each distribution. In practice, this is an unmanageable and unscalable solution, so instead Memcached uses many (44) slab classes with exponentially growing sizes, in the hope of reducing allocation waste, especially for small sizes.
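The class geometry is easy to reproduce. A minimal sketch in C (the 48B base chunk and 1.25 growth factor mirror Memcached's -n/-f defaults, an assumption here; the text above only states 44 exponentially growing classes, and with these values the loop prints about 45):

    #include <stdio.h>

    /* Sketch of Memcached-style slab classes: each class serves items
     * up to its chunk size, and chunk sizes grow geometrically so that
     * per-item waste is bounded by the growth factor.
     * 48B base and 1.25 factor follow Memcached's -n/-f defaults
     * (assumed; the paper only says "exponentially growing sizes"). */
    int main(void)
    {
        double size = 48.0;                       /* smallest chunk */
        const double factor = 1.25;               /* growth between classes */
        const double max_item = 1024.0 * 1024.0;  /* 1MB item size limit */

        for (int cls = 1; size <= max_item; cls++) {
            printf("class %2d: chunk size %8.0f bytes\n", cls, size);
            size *= factor;                       /* next class is 25% larger */
        }
        return 0;
    }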

3.3 Temporal Patterns

To understand how production Memcached load varies over time, we look at each trace's transient request rate over its entire collection period (Fig. 3). All traces clearly show the expected diurnal pattern, but with different values and amplitudes. If we increase our zoom factor further (as in the last plot), we notice that traffic in ETC bottoms out around 08:00 and has two peaks around 17:00 and 03:00. Not surprisingly, the hours immediately preceding 08:00 UTC (midnight in Pacific Time) represent night time in the Western Hemisphere.

The first peak, on the other hand, occurs as North America starts its day, while it is evening in Europe, and continues until the later peak time for North America. Although different traces (and sometimes even different days in the same trace) differ in which of the two peaks is higher, the entire period between them, representing the Western Hemisphere day, sees the highest traffic volume. In terms of weekly patterns, we observe a small traffic drop on most Fridays and Saturdays, with traffic picking up again on Sundays and Mondays.

The diurnal cycle represents load variation on the order of 2×. We also observe the presence of traffic spikes. Typically, these can represent a swift surge in user interest on one topic, such as occur with major news or media events. Less frequently, these spikes stem from programmatic or operational causes. Either way, the implication for Memcached development and deployment is that one must budget individual node capacity to allow for these spikes, which can easily double or even triple the normal peak request rate. Although such budgeting underutilizes resources during normal traffic, it is nevertheless imperative; otherwise, the many Web servers that would take to this sudden traffic and fail to get a prompt response from Memcached would all query the same database nodes. This scenario could be debilitating, so it must remain hypothetical.

4. CACHE BEHAVIOR

The main metric used in evaluating cache efficacy is hit rate: the percentage of GET requests that return a value. The overall hit rate of each server, as derived from the traces and verified with Memcached's own statistics, is shown in Table 2. This section takes a deeper look at the factors that influence these hit rates and how they relate to cache locality, user behavior, temporal patterns, and Memcached's design.

Table 2: Mean cache hit rate over entire trace.

Pool      APP     VAR     SYS     USR     ETC
Hit rate  92.9%   93.7%   98.7%   98.2%   81.4%

4.1 Hit Rates over Time

When looking at how hit rates vary over time (Fig. 4), almost all traces show diurnal variance, within a small band of a few percentage points. USR's plot is curious: it appears to be monotonically increasing (with diurnal undulation). This behavior stems from the usage model for USR. Recall

B. Atikoglu, et al., “Workload Analysis of a Large-Scale Key-Value Store,” ACM SIGMETRICS 2012.

here. It is important to note, however, that all Memcached instances in this study ran on identical hardware.

2.3 Tracing Methodology

Our analysis called for complete traces of traffic passing through Memcached servers for at least a week. This task is particularly challenging because it requires nonintrusive instrumentation of high-traffic volume production servers. Standard packet sniffers such as tcpdump [2] have too much overhead to run under heavy load. We therefore implemented an efficient packet sniffer called mcap. Implemented as a Linux kernel module, mcap has several advantages over standard packet sniffers: it accesses packet data in kernel space directly and avoids additional memory copying; it introduces only 3% performance overhead (as opposed to tcpdump's 30%); and unlike standard sniffers, it handles out-of-order packets correctly by capturing incoming traffic after all TCP processing is done. Consequently, mcap has a complete view of what the Memcached server sees, which eliminates the need for further processing of out-of-order packets. On the other hand, its packet parsing is optimized for Memcached packets, and would require adaptations for other applications.
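mcap itself is not public, but in-kernel capture without extra copies can be illustrated with a netfilter hook. A hypothetical sketch (the module layout, 11211 port filter, and counter are my inventions; the nf_register_net_hook() form shown is from 4.x-and-later kernels, while 3.x kernels used nf_register_hook(); note also that a LOCAL_IN hook sees packets before TCP processing, unlike mcap's post-TCP capture point):

    #include <linux/module.h>
    #include <linux/netfilter.h>
    #include <linux/netfilter_ipv4.h>
    #include <linux/ip.h>
    #include <linux/tcp.h>
    #include <linux/atomic.h>
    #include <net/net_namespace.h>

    /* Hypothetical sketch in the spirit of mcap: observe packets in
     * kernel space, with no copy to user space and no interference. */
    static atomic64_t pkts_seen = ATOMIC64_INIT(0);

    static unsigned int capture_hook(void *priv, struct sk_buff *skb,
                                     const struct nf_hook_state *state)
    {
        struct iphdr *iph = ip_hdr(skb);

        if (iph->protocol == IPPROTO_TCP) {
            struct tcphdr *tcph = tcp_hdr(skb);
            /* Count traffic to Memcached's default port (11211). */
            if (ntohs(tcph->dest) == 11211)
                atomic64_inc(&pkts_seen);
        }
        return NF_ACCEPT;  /* never interfere with production traffic */
    }

    static struct nf_hook_ops capture_ops = {
        .hook     = capture_hook,
        .pf       = NFPROTO_IPV4,
        .hooknum  = NF_INET_LOCAL_IN,
        .priority = NF_IP_PRI_LAST,
    };

    static int __init capture_init(void)
    {
        return nf_register_net_hook(&init_net, &capture_ops);
    }

    static void __exit capture_exit(void)
    {
        nf_unregister_net_hook(&init_net, &capture_ops);
    }

    module_init(capture_init);
    module_exit(capture_exit);
    MODULE_LICENSE("GPL");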

The captured traces vary in size from 3TB to 7TB each. This data is too large to store locally on disk, adding another challenge: how to offload this much data (at an average rate of more than 80,000 samples per second) without interfering with production traffic. We addressed this challenge by combining local disk buffering and dynamic offload throttling to take advantage of low-activity periods in the servers.

Finally, another challenge is this: how to effectively process these large data sets? We used Apache HIVE [3] to analyze Memcached traces. HIVE is part of the Hadoop framework that translates SQL-like queries into MapReduce jobs. We also used the Memcached “stats” command, as well as Facebook's production logs, to verify that the statistics we computed, such as hit rates, are consistent with the aggregated operational metrics collected by these tools.

3. WORKLOAD CHARACTERISTICS

This section describes the observed properties of each trace in terms of the requests that comprise it, their sizes, and their frequencies.

3.1 Request Composition

We begin by looking at the basic data that comprises the workload: the total number of requests in each server, broken down by request types (Fig. 1). Several observations delineate the different usage of each pool:

USR handles significantly more GET requests than any of the other pools. GET operations comprise over 99.8% of this pool's workload. One reason for this is that the pool is sized large enough to maximize hit rates, so refreshing values is rarely necessary. These values are also updated at a slower rate than some of the other pools. The overall effect is that USR is used more like RAM-based persistent storage than a cache.

APP has high GET rates too—owing to the popularity of this application—but also a large number of DELETE

[2] http://www.tcpdump.org/
[3] http://hive.apache.org/

Figure 1: Distribution of request types (GET, UPDATE, DELETE; in millions of requests) per pool (USR, APP, ETC, VAR, SYS), over exactly 7 days. UPDATE commands aggregate all non-DELETE writing operations, such as SET, REPLACE, etc.

operations. DELETE operations occur when a cached database entry is modified (but not required to be set again in the cache). SET operations occur when the Web servers add a value to the cache. The relatively high number of DELETE operations shows that this pool represents database-backed values that are affected by frequent user modifications.

ETC has similar characteristics to APP, but with an even higher rate of DELETE requests (some of which may not be currently cached). ETC is the largest and least specific of the pools, so its workloads might be the most representative to emulate. Because it is such a large and heterogeneous workload, we pay special attention to this workload throughout the paper.

VAR is the only pool sampled that is write-dominated. It stores short-term values such as browser-window size for opportunistic latency reduction. As such, these values are not backed by a database (hence, no invalidating DELETEs are required). But they change frequently, accounting for the high number of UPDATEs.

SYS is used to locate servers and services, not user data. As such, the number of requests scales with the number of servers, not the number of user requests, which is much larger. This explains why the total number of SYS requests is much smaller than the other pools'.

It is interesting to note that the ratio of GETs to UPDATEs in ETC (approximately 30:1) is significantly higher than most synthetic workloads typically assume (Sec. 7). For demand-filled caches like USR, where each miss is followed by an UPDATE, the ratios of GET to UPDATE operations mentioned above are related to hit rate in general and the sizing of the cache to the data in particular. So in theory, one could justify any synthetic GET to UPDATE mix by controlling the cache size. But in practice, not all caches or keys are demand-filled, and these caches are already sized to fit a real-world workload in a way that successfully trades off hit rates to cost.
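To make the hit-rate connection concrete (a back-of-the-envelope reading, not a calculation from the paper): in a purely demand-filled pool every GET miss triggers exactly one UPDATE, so GET/UPDATE = 1/(1 - h), where h is the hit rate. A 30:1 ratio corresponds to h ≈ 96.7%, while ETC's measured 81.4% hit rate (Table 2) would imply only about 5.4:1; the gap is one way to see that ETC is not purely demand-filled.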

3. An examination of various performance metrics over time, showing diurnal and weekly patterns (Sec. 3.3, 4.2.2, 6).

4. An analytical model that can be used to generate more realistic synthetic workloads. We found that the salient size characteristics follow power-law distributions, similar to other storage and Web-serving systems (Sec. 5).

5. An exposition of a Memcached deployment that can shed light on real-world, large-scale production usage of KV-stores (Sec. 2.2, 8).

The rest of this paper is organized as follows. We begin by describing the architecture of Memcached, its deployment at Facebook, and how we analyzed its workload. Sec. 3 presents the observed experimental properties of the trace data (from the request point of view), while Sec. 4 describes the observed cache metrics (from the server point of view). Sec. 5 presents a simple analytical model of the most representative workload. The next section brings the data together in a discussion of our results, followed by a section surveying previous efforts on analyzing cache behavior and workload analysis.

2. MEMCACHED DESCRIPTION

2.1 Architecture

Memcached [1] is a simple, open-source software package that exposes data in RAM to clients over the network. As data size grows in the application, more RAM can be added to a server, or more servers can be added to the network. Additional servers generally only communicate with clients. Clients use consistent hashing [9] to select a unique server per key, requiring only the knowledge of the total number of servers and their IP addresses. This technique presents the entire aggregate data in the servers as a unified distributed hash table, keeps servers completely independent, and facilitates scaling as data size grows.
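The client-side selection logic is compact enough to sketch. A minimal, illustrative C version (FNV-1a and 128 virtual points per server are arbitrary choices here, not the hash Facebook's clients use; real clients precompute and sort the ring rather than rescanning per lookup):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NPOINTS 128   /* virtual points per server, smooths the load */

    /* FNV-1a: an illustrative hash; any well-mixed hash works here. */
    static uint32_t fnv1a(const char *s, size_t len)
    {
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < len; i++) {
            h ^= (uint8_t)s[i];
            h *= 16777619u;
        }
        return h;
    }

    /* Pick a server: walk clockwise on the ring to the first point at
     * or after the key's hash, wrapping to the ring minimum. Adding or
     * removing a server remaps only a small fraction of the keys. */
    static int pick_server(const char *key, const char *servers[], int n)
    {
        uint32_t kh = fnv1a(key, strlen(key));
        uint32_t best_ge = UINT32_MAX; int srv_ge = -1;  /* first point >= kh */
        uint32_t best_min = UINT32_MAX; int srv_min = -1; /* wrap-around */

        for (int s = 0; s < n; s++) {
            for (int p = 0; p < NPOINTS; p++) {
                char point[80];
                int len = snprintf(point, sizeof point, "%s#%d", servers[s], p);
                uint32_t ph = fnv1a(point, (size_t)len);
                if (ph < best_min) { best_min = ph; srv_min = s; }
                if (ph >= kh && ph <= best_ge) { best_ge = ph; srv_ge = s; }
            }
        }
        return srv_ge >= 0 ? srv_ge : srv_min;
    }

    int main(void)
    {
        const char *servers[] = { "10.0.0.1:11211", "10.0.0.2:11211",
                                  "10.0.0.3:11211" };
        printf("user:1234 -> server %d\n", pick_server("user:1234", servers, 3));
        return 0;
    }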

Memcached's interface provides the basic primitives that hash tables provide—insertion, deletion, and retrieval—as well as more complex operations built atop them.

Data are stored as individual items, each including a key, a value, and metadata. Item size can vary from a few bytes to over 100KB, heavily skewed toward smaller items (Sec. 3). Consequently, a naïve memory allocation scheme could result in significant memory fragmentation. To address this issue, Memcached adopts a slab allocation technique, in which memory is divided into slabs of different sizes. The slabs in a class store items whose sizes are within the slab's specific range. A newly inserted item obtains its memory space by first searching the slab class corresponding to its size. If this search fails, a new slab of the class is allocated from the heap. Symmetrically, when an item is deleted from the cache, its space is returned to the appropriate slab, rather than the heap. Memory is allocated to slab classes based on the initial workload and its item sizes, until the heap is exhausted. Consequently, if the workload characteristics change significantly after this initial phase, we may find that the slab allocation is inappropriate for the workload, resulting in memory underutilization.

[1] http://memcached.org/

Table 1: Memcached pools sampled (in one cluster). These pools do not match their UNIX namesakes, but are used for illustrative purposes here instead of their internal names.

Pool  Size      Description
USR   few       user-account status information
APP   dozens    object metadata of one application
ETC   hundreds  nonspecific, general-purpose
VAR   dozens    server-side browser information
SYS   few       system data on service location

A new item arriving after the heap is exhausted requires the eviction of an older item in the appropriate slab. Memcached uses the Least-Recently-Used (LRU) algorithm to select the items for eviction. To this end, each slab class has an LRU queue maintaining access history on its items. Although LRU decrees that any accessed item be moved to the top of the queue, this version of Memcached coalesces repeated accesses of the same item within a short period (one minute by default) and only moves this item to the top the first time, to reduce overhead.
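The coalescing rule is a two-line change to a textbook LRU. A sketch (the struct layout is illustrative; the 60-second window is the default named in the text):

    #include <stddef.h>
    #include <time.h>

    /* Illustrative item with an intrusive LRU list per slab class. */
    struct item {
        struct item *prev, *next;   /* LRU queue links */
        time_t last_bump;           /* when we last moved it to the front */
        /* key, value, metadata ... */
    };

    struct lru { struct item *head, *tail; };

    static void lru_unlink(struct lru *q, struct item *it)
    {
        if (it->prev) it->prev->next = it->next; else q->head = it->next;
        if (it->next) it->next->prev = it->prev; else q->tail = it->prev;
        it->prev = it->next = NULL;
    }

    static void lru_push_front(struct lru *q, struct item *it)
    {
        it->prev = NULL;
        it->next = q->head;
        if (q->head) q->head->prev = it; else q->tail = it;
        q->head = it;
    }

    /* On access: LRU says "move to front", but repeated accesses within
     * a short window (60s by default, per the text) skip the move to
     * avoid constant queue manipulation on hot items. */
    static void lru_touch(struct lru *q, struct item *it, time_t now)
    {
        if (now - it->last_bump < 60)
            return;                 /* coalesce: bumped recently already */
        lru_unlink(q, it);
        lru_push_front(q, it);
        it->last_bump = now;
    }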

2.2 Deployment

Facebook relies on Memcached for fast access to frequently accessed values. Web servers typically try to read persistent values from Memcached before trying the slower backend databases. In many cases, the caches are demand-filled, meaning that generally, data is added to the cache after a client has requested it and failed.
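Demand filling is the familiar look-aside pattern. A self-contained sketch, with a toy one-entry cache standing in for Memcached and the database (all names here are hypothetical stand-ins):

    #include <stdio.h>
    #include <string.h>

    /* Toy single-entry "cache" and "database", just enough to make the
     * pattern executable; real clients would talk to Memcached and a
     * backend database here. */
    static char cached_key[64], cached_val[64];

    static const char *cache_get(const char *key)
    {
        return strcmp(key, cached_key) == 0 ? cached_val : NULL;
    }

    static void cache_set(const char *key, const char *value)
    {
        snprintf(cached_key, sizeof cached_key, "%s", key);
        snprintf(cached_val, sizeof cached_val, "%s", value);
    }

    static const char *db_query(const char *key)
    {
        (void)key;
        return "row-from-database";   /* slow path stand-in */
    }

    /* Look-aside (demand-filled) read: the cache is populated only
     * after a client has requested the value and missed. */
    static const char *read_value(const char *key)
    {
        const char *v = cache_get(key);
        if (v) return v;              /* hit: served from RAM */
        v = db_query(key);            /* miss: fall back to the database */
        cache_set(key, v);            /* fill so the next reader hits */
        return v;
    }

    int main(void)
    {
        printf("first:  %s\n", read_value("user:42"));  /* miss, then fill */
        printf("second: %s\n", read_value("user:42"));  /* hit */
        return 0;
    }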

Modifications to persistent data in the database often propagate as deletions (invalidations) to the Memcached tier. Some cached data, however, is transient and not backed by persistent storage, requiring no invalidations.

Physically, Facebook deploys front-end servers in multiple datacenters, each containing one or more clusters of varying sizes. Front-end clusters consist of both Web servers, running primarily HipHop [31], and caching servers, running primarily Memcached. These servers are further subdivided based on the concept of pools. A pool is a partition of the entire key space, defined by a prefix of the key, and typically represents a separate application or data domain. The main reason for separate domains (as opposed to one all-encompassing cache) is to ensure adequate quality of service for each domain. For example, one application with high turnover rate could evict keys of another application that shares the same server, even if the latter has high temporal locality but lower access rates. Another reason to separate domains is to facilitate application-specific capacity planning and performance analysis.

In this paper, we describe traces from five separate pools—one trace from each pool (traces from separate machines in the same pool exhibit similar characteristics). These pools represent a varied spectrum of application domains and cache usage characteristics (Table 1). One pool in particular, ETC, represents general cache usage of multiple applications, and is also the largest of the pools; the data collected from this trace may be the most applicable to general-purpose KV-stores.

The focus of this paper is on workload characteristics, patterns, and relationships to social networking, so the exact details of server count and components have little relevance

Slide annotations: USR keys are 16B or 21B; 90% of APP keys are 31B. USR values are only 2B; 90% of values are smaller than 500B.

Page 5: User-space Network Processing

Per-packet budget (from the Japanese slide): a 10GbE link at minimum frame size carries 10^10 bps / ((8 + 64 + 12) bytes x 8 bits) ≈ 15M packets/s; a 2GHz CPU then has about 2x10^9 / 15x10^6 ≈ 133 clocks per packet.

Page 6: User-space Network Processing

Diagram: each CPU core has a private L1 (64KB, 4 cycles) and L2 (256KB, 12 cycles); all cores share the LLC (12MB, 44 cycles); memory (xx GB) costs 300 cycles. Matching tables (> xx MB) are too large to fit in cache.

Page 7: User-space Network Processing

[Slide in Japanese on x86 packet processing, text unrecoverable. Copyright 2014 NTT Corporation.]

Page 8: User-space Network Processing

2010 Sep.

Per-Packet CPU Cycles for 10G

Your budget: 1,400 cycles (10G, min-sized packets, dual quad-core 2.66GHz CPUs)

Cycles needed:
IPv4:  Packet I/O (1,200) + IPv4 lookup (600) = 1,800 cycles
IPv6:  Packet I/O (1,200) + IPv6 lookup (1,600) = 2,800 cycles
IPsec: Packet I/O (1,200) + Encryption and hashing (5,400) + … = 6,600 cycles

(in x86, cycle numbers are from RouteBricks [Dobrescu09] and ours)

S. Han, et al., “PacketShader: a GPU-accelerated Software Router,” SIGCOMM 2010.
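The 1,400-cycle budget checks out arithmetically (14.88 Mpps is the standard minimum-frame packet rate of 10GbE): 2 sockets x 4 cores x 2.66x10^9 cycles/s ≈ 21.3x10^9 cycles/s, and 21.3x10^9 / 14.88x10^6 packets/s ≈ 1,430 cycles per packet.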


Page 9: User-space Network Processing

2010 Sep.

PacketShader: psio I/O Optimization

• Packet I/O: 1,200 reduced to 200 cycles per packet
• Main ideas: huge packet buffer; batch processing

(Per-function costs as on the previous slide: IPv4 lookup 600 = 1,800 total; IPv6 lookup 1,600 = 2,800; encryption and hashing 5,400 + … = 6,600; psio shrinks only the Packet I/O share.)

S. Han, et al., “PacketShader: a GPU-accelerated Software Router,” SIGCOMM 2010.
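Batching is the same idea Intel DPDK (used later in this deck) exposes directly. A minimal receive loop, assuming port and queue setup are done elsewhere (burst size 32 is an arbitrary choice):

    #include <stdlib.h>
    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>
    #include <rte_debug.h>

    #define BURST 32   /* fetch up to 32 packets per call */

    /* Minimal DPDK-style polling loop: one rte_eth_rx_burst() call pays
     * the per-poll overhead once for a whole batch of packets, instead
     * of once per packet as with interrupt-driven kernel I/O.
     * (Port and queue initialization are omitted for brevity.) */
    int main(int argc, char **argv)
    {
        if (rte_eal_init(argc, argv) < 0)
            rte_exit(EXIT_FAILURE, "EAL init failed\n");

        /* ... rte_eth_dev_configure() / rte_eth_rx_queue_setup() ... */

        struct rte_mbuf *pkts[BURST];
        for (;;) {
            uint16_t n = rte_eth_rx_burst(0 /* port */, 0 /* queue */,
                                          pkts, BURST);
            for (uint16_t i = 0; i < n; i++) {
                /* process pkts[i] ... */
                rte_pktmbuf_free(pkts[i]);
            }
        }
        return 0;
    }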

Page 10: User-space Network Processing

2010 Sep.

PacketShader: GPU Offloading

• GPU offloading for memory-intensive or compute-intensive operations (IPv4 lookup, IPv6 lookup, encryption and hashing)
• Main topic of this talk

S. Han, et al., “PacketShader: a GPU-accelerated Software Router,” SIGCOMM 2010.

Page 11: User-space Network Processing

Kernel Uses the Most CPU Cycles

CPU usage breakdown of a Web server: Lighttpd serving a 64-byte file on Linux-3.10. Kernel (without TCP/IP) 45%, TCP/IP 34%, Packet I/O 4%, Application 17%. 83% of CPU usage is spent inside the kernel!

Performance bottlenecks:
1. Shared resources
2. Broken locality
3. Per-packet processing

Bottleneck removed by mTCP:
1) Efficient use of CPU cycles for TCP/IP processing → 2.35x more CPU cycles for app
2) 3x ~ 25x better performance

E. Jeong, et al., “mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems,” NSDI 2014.

Page 12: User-space Network Processing

Inefficiencies in Kernel from Shared FD

1. Shared resources
– Shared listening queue
– Shared file descriptor space

Diagram: Receive-Side Scaling (H/W) spreads packets across per-core packet queues (Core 0 through Core 3), but all cores contend on a lock for the single shared listening queue, and finding an empty slot in the shared file descriptor space is a linear search.

E. Jeong, et al., “mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems,” NSDI 2014.
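The shared-accept-queue contention is what SO_REUSEPORT (Linux 3.9+) mitigates, and it appears as a baseline in the benchmark later in this deck. A per-worker listener sketch (port 11211 chosen for illustration):

    #define _GNU_SOURCE
    #include <netinet/in.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Each worker (ideally one per core) creates its OWN listening
     * socket on the same port with SO_REUSEPORT, so the kernel gives
     * every core a private accept queue instead of one shared,
     * lock-contended queue. */
    static int make_listener(uint16_t port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        struct sockaddr_in addr;

        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof one);

        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);
        if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
            listen(fd, 1024) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }

    int main(void)
    {
        int fd = make_listener(11211);  /* call once in each worker */
        if (fd < 0) { perror("listener"); return 1; }
        for (;;) {
            int c = accept(fd, NULL, NULL);  /* no cross-core lock here */
            if (c >= 0) close(c);            /* real work would go here */
        }
    }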

Page 13: User-space Network Processing

Inefficiencies in Kernel from Broken Locality

2. Broken locality

Diagram: Receive-Side Scaling (H/W) steers each flow to a per-core packet queue (Core 0 through Core 3), but the core that handles the interrupt is not necessarily the core where the application calls accept()/read()/write(): interrupt handling core != accepting core.

E. Jeong, et al., “mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems,” NSDI 2014.
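Restoring locality starts with pinning each worker to a core, as mTCP's per-core threads do. The basic primitive, sketched with the glibc extension pthread_setaffinity_np():

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to one core so that, combined with RSS
     * steering the flow's packets to that core's queue, interrupt
     * handling, TCP processing, and the application stay local. */
    static int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }

    static void *worker(void *arg)
    {
        int core = *(int *)arg;
        if (pin_to_core(core) != 0)
            fprintf(stderr, "failed to pin to core %d\n", core);
        /* accept()/read()/write() for flows steered to this core ... */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        int core0 = 0;
        pthread_create(&t, NULL, worker, &core0);
        pthread_join(t, NULL);
        return 0;
    }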

Page 14: User-space Network Processing

Inefficiencies in Kernel from Lack of Support for Batching

3. Per-packet, per-system-call processing
– Inefficient per-packet processing: per-packet memory allocation, frequent mode switching, cache pollution
– Inefficient per-system-call processing: accept(), read(), write()

Diagram: the application thread crosses the user/kernel boundary through the BSD socket and Linux epoll interfaces into kernel TCP and packet I/O.

E. Jeong, et al., “mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems,” NSDI 2014.
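For contrast, a conventional epoll loop: the wait itself is batched, but every ready connection still pays its own accept()/read()/write() system call, which is the per-event cost mTCP removes by batching inside its user-level stack (sketch; error handling and epoll registration omitted):

    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MAXEV 1024

    /* Conventional kernel event loop: epoll_wait() batches event
     * notification, but every ready connection still triggers its own
     * accept()/read()/write() system call (a mode switch each time). */
    void event_loop(int epfd, int listen_fd)
    {
        struct epoll_event evs[MAXEV];
        char buf[4096];

        for (;;) {
            int n = epoll_wait(epfd, evs, MAXEV, -1); /* 1 syscall, n events */
            for (int i = 0; i < n; i++) {
                int fd = evs[i].data.fd;
                if (fd == listen_fd) {
                    int c = accept(fd, NULL, NULL);  /* syscall per conn */
                    (void)c;  /* register c with epoll in a real server */
                } else {
                    ssize_t r = read(fd, buf, sizeof buf); /* syscall per fd */
                    if (r > 0)
                        write(fd, buf, (size_t)r);         /* and another */
                    else
                        close(fd);
                }
            }
        }
    }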

Page 15: User-space Network Processing

Overview of mTCP Architecture

1. Thread model: pairwise, per-core threading
2. Batching from packet I/O to application
3. mTCP API: easily portable API (BSD-like)

Diagram: on each core (Core 0, Core 1), an application thread (using mTCP socket and mTCP epoll) is paired with an mTCP thread, all at user level, on top of the user-level packet I/O library (PSIO); only the NIC device driver sits at kernel level.

• [SIGCOMM'10] PacketShader: A GPU-accelerated software router, http://shader.kaist.edu/packetshader/io_engine/index.html

E. Jeong, et al., “mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems,” NSDI 2014.

Intel DPDK
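The "BSD-like" point is visible in the API itself. A fragment in the style of the mTCP paper and repository (function names follow mtcp_api.h and mtcp_epoll.h, but treat the exact signatures as approximate; this sketch follows the paper's description rather than having been compiled against the library):

    #include <sys/socket.h>   /* AF_INET, SOCK_STREAM */
    #include <mtcp_api.h>
    #include <mtcp_epoll.h>

    #define MAXEVENTS 1024

    /* Per-core server loop in mTCP style: each call mirrors its BSD
     * counterpart but takes an explicit per-core context (mctx), so
     * all connection state stays core-local. */
    void run_core(int core)
    {
        mctx_t mctx = mtcp_create_context(core); /* spawns paired mTCP thread */
        int ep  = mtcp_epoll_create(mctx, MAXEVENTS);
        int lfd = mtcp_socket(mctx, AF_INET, SOCK_STREAM, 0);

        /* mtcp_bind() on the listening address, as with BSD sockets ... */
        mtcp_listen(mctx, lfd, 4096);

        struct mtcp_epoll_event evs[MAXEVENTS];
        for (;;) {
            int n = mtcp_epoll_wait(mctx, ep, evs, MAXEVENTS, -1);
            for (int i = 0; i < n; i++) {
                /* mtcp_accept() / mtcp_read() / mtcp_write() per event ... */
            }
        }
    }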

Page 16: User-space Network Processing

memcached and mTCP thread models (cf. SeaStar)

Diagram: in stock memcached (left), a main thread accept()s connections and dispatches them to worker threads over a pipe, with read()/write() going through the kernel. In the mTCP port (right), each worker thread is paired with a per-core mTCP thread, and each worker accept()s and read()s/write()s its own connections.

Page 17: User-space Network Processing

Evaluation setup: two machines connected back to back; experiment 1 uses the Apache benchmark, experiment 2 uses mc-benchmark against memcached.

Hardware
  CPU       Intel Xeon E5-2430L / 2.0GHz (6 cores) x 2 sockets
  Memory    48 GB PC3-12800
  Ethernet  Intel X520-SR1 (10 GbE)

Software
  OS        Debian GNU/Linux 8.1
  Kernel    Linux 3.16.0-4-amd64
  DPDK      Intel DPDK 2.0.0
  mTCP      commit 4603a1a (June 7, 2015)

Page 18: User-space Network Processing

Apache benchmark results

Figure: throughput (thousand requests/second, 0 to 180) vs. number of cores (0 to 12) for Linux, SO_REUSEPORT, and mTCP; higher is better. The 3.3x and 5.5x annotations mark mTCP's advantage over the kernel configurations.

• Apache benchmark
• 64B message
• 1000 concurrency
• 100K requests

Page 19: User-space Network Processing

memcached with mc-benchmark: requests/second (ratio to single-threaded TCP in parentheses)

       TCP w/ 1 thread   TCP w/ 3 threads   mTCP w/ 1 thread
SET    85,404            146,351 (1.71)     115,166 (1.35)
GET    115,079           139,575 (1.21)     116,838 (1.02)

• mc-benchmark
• 64B message
• 500 concurrency
• 100K requests

Page 20: User-space Network Processing

[Discussion slide in Japanese, largely unrecoverable: remarks on the mTCP port of memcached and on user-space TCP/IP.]

Page 21: User-space Network Processing

[Slide in Japanese, partially unrecoverable: notes on x86 limitations; CPU frequency reporting via cpufreq-info(1); cgroups CPU throttling; many-core alternatives such as Xeon Phi and FLARE on Tilera.]

Page 22: User-space Network Processing

Acknowledgment: Supachai Thongprasit

[1] S. Thongprasit, V. Visoottiviseh, and R. Takano, “Toward Fast and Scalable Key-Value Stores Based on User Space TCP/IP Stack,” AINTEC 2015.