
Mega-KV: A Case for GPUs to Maximize the Throughput of In-Memory Key-Value Stores

Kai Zhang1 Kaibo Wang2 Yuan Yuan2 Lei Guo2 Rubao Lee2 Xiaodong Zhang2

1University of Science and Technology of China 2The Ohio State University

ABSTRACT

In-memory key-value stores play a critical role in data processing to provide high throughput and low latency data accesses. In-memory key-value stores have several unique properties that include (1) data intensive operations demanding high memory bandwidth for fast data accesses, (2) high data parallelism and simple computing operations demanding many slim parallel computing units, and (3) a large working set. As data volume continues to increase, our experiments show that conventional and general-purpose multicore systems are increasingly mismatched to the special properties of key-value stores because they do not provide massive data parallelism and high memory bandwidth; the powerful but limited number of computing cores does not satisfy the demand of the unique data processing task; and the cache hierarchy may not benefit the large working set well.

In this paper, we make a strong case for GPUs to serve as special-purpose devices to greatly accelerate the operations of in-memory key-value stores. Specifically, we present the design and implementation of Mega-KV, a GPU-based in-memory key-value store system that achieves high performance and high throughput. Effectively utilizing the high memory bandwidth and latency hiding capability of GPUs, Mega-KV provides fast data accesses and significantly boosts overall performance. Running on a commodity PC installed with two CPUs and two GPUs, Mega-KV can process up to 160+ million key-value operations per second, which is 1.4-2.8 times as fast as the state-of-the-art key-value store system on a conventional CPU-based platform.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing [email protected]. Articles from this volume were invited to present their results at the 41st International Conference on Very Large Data Bases, August 31st - September 4th 2015, Kohala Coast, Hawaii. Proceedings of the VLDB Endowment, Vol. 8, No. 11. Copyright 2015 VLDB Endowment 2150-8097/15/07.

1. INTRODUCTION

The decreasing prices and the increasing memory densities of DRAM have made it cost effective to build commodity servers with terabytes of DRAM [30]. This also makes it possible to economically build high performance SQL and NoSQL database systems to keep all (or nearly all) their data in main memory, or at least to cache the applications' working sets [9, 25, 26]. An in-memory key-value store (IMKV) is a typical NoSQL data store that keeps data in memory for fast accesses to achieve high performance and high throughput. Representative systems include widely deployed open source systems such as Memcached [2], Redis [3], and RAMCloud [26], and recently developed high-performance prototypes such as Masstree [22] and MICA [21]. As a critical component in many Internet service systems, such as Facebook [25], YouTube, and Twitter, the IMKV system is essential to providing quality services to end users with high throughput. With the ever-increasing user populations and online activities of Internet applications, the scale of data in these systems is experiencing explosive growth. Therefore, a high performance IMKV system is in high demand.

An IMKV is a highly data-intensive system. Upon receiving a query from the network interface, it needs to locate and retrieve the object from memory through an index data structure, which generally involves several memory accesses. Since a CPU only supports a small number of outstanding memory accesses, an increasingly long delay is spent on waiting for data to be fetched from memory. Consequently, the high parallelism of query processing is hard to exploit to hide the memory access latency. Furthermore, an IMKV system has a large working set [6]. As a result, the CPU cache would not help much to reduce the memory access latency due to its small capacity. Each memory access generally takes 50-100 nanoseconds; however, the average time budget for processing one query is only 10 nanoseconds for a 100 MOPS (Million Operations Per Second) system. Consequently, the relatively slow memory access speed significantly limits the throughput of IMKV systems.

In summary, IMKVs in data processing systems have three unique properties: (1) data intensive operations demanding high memory bandwidth for fast data accesses, (2) high data parallelism and simple computing demanding many slim parallel computing units, and (3) a large working set. Unfortunately, we will later show that conventional general-purpose multicore systems are poorly matched to these unique properties of key-value stores because they do not provide massive data parallelism and high memory bandwidth; the powerful but limited number of computing cores mismatches the demand of the special data processing [12]; and the CPU cache hierarchy does not benefit the large working set. Key-value stores demand many simple computing units for massive data-parallel operations supported by high memory bandwidth. These unique properties of IMKVs exactly match the unique capability of graphics processing units (GPUs).


In this paper, we make a strong case for GPUs to serve as special-purpose devices in IMKVs to offload and dramatically accelerate the index operations. Specifically, we present the design and implementation of Mega-KV, a GPU-based in-memory key-value store system, to achieve high throughput. Effectively utilizing the high memory bandwidth and latency hiding capability of GPUs, Mega-KV provides fast data accesses and significantly boosts overall performance. Our technical contributions are fourfold:

1. We have identified that the index operations are one of the major overheads in IMKV processing, but are poorly matched to conventional multicore architectures. The best choice to break this bottleneck is to shift the task to a special architecture serving high data parallelism on high memory bandwidth.

2. We have designed an efficient IMKV called Mega-KV which offloads the index data structure and the corresponding operations to GPUs. With a GPU-optimized hash table and a set of algorithms, Mega-KV best utilizes the unique GPU hardware capability to achieve unprecedented performance.

3. We have designed a periodical scheduling mechanism to achieve predictable latency with GPU processing. Different scheduling policies are applied to different index operations to minimize the response latency and maximize the throughput.

4. We have intensively evaluated a complete in-memory key-value store system that uses GPUs as its accelerators. Mega-KV achieves up to 160+ MOPS throughput with two off-the-shelf CPUs and GPUs, which is 1.4-2.8 times as fast as the state-of-the-art key-value store system on a conventional CPU-based platform.

The roadmap of this paper is as follows. Section 2 introduces the background and motivation of this research. Section 3 outlines the overall structure of Mega-KV. Sections 4 and 5 describe the GPU-optimized cuckoo hash table and the scheduling policy, respectively. Section 6 describes the framework of Mega-KV and lists the major techniques used in the implementation. Section 7 shows performance evaluations, Section 8 introduces related work, and Section 9 concludes the paper.

2. BACKGROUND AND MOTIVATION

2.1 An Analysis of Key-Value Store Processing

2.1.1 Workflow of a Key-Value Store System

A typical in-memory key-value store system generally provides three basic commands that serve as the interface to clients: 1) GET(key): retrieve the value associated with the key. 2) SET/ADD/REPLACE(key, value): store the key-value item. 3) DELETE(key): delete the key-value item. Figure 1 shows the workflow of GET, SET, and DELETE operations. Queries are first processed in the TCP/IP stack, and then parsed to extract the semantic information. If a GET query is received, the key is searched for in the index data structure to locate its value, which will be sent to the requesting client. For a SET query, a key-value item is allocated, or an existing one is evicted if the system does not have enough memory space, to store the new item. For a DELETE query, the key-value item is removed from both the main memory and the index data structure.

Figure 1: Workflow of a Typical Key-Value Store System

In summary, there are four major operations in the workflow of a key-value store system: (I) Network Processing: including network I/O and protocol parsing. (II) Memory Management: including memory allocation and item eviction. (III) Index Operations: including Search, Insert, and Delete. (IV) Read Key-Value Item in Memory: only for GET queries.

Overheads from network processing and concurrent data accesses have been considered the major bottlenecks in a key-value store system [36]. 1) Network processing in a traditional OS kernel, which involves frequent context switches between the OS kernel and user space, causes high instruction cache misses and virtual memory translation overhead. For instance, 70% of the processing time is spent on network processing in CPHash [23], and MemC3 [11] suffers a 7x performance degradation from network processing. A set of techniques have been proposed to alleviate this overhead, such as UDP [6], Multiget [2], and bypassing the OS kernel with new drivers [14, 1]. 2) The locks for synchronization among cores in concurrent data accesses can significantly reduce the potential performance enhancement offered by multicore architectures. In recent work, techniques such as data sharding [21, 23] and optimized data structures [11, 22] have been proposed to tackle this issue. After these overheads are significantly mitigated with the proposed techniques, the performance bottleneck of an IMKV system shifts. In the following section, we will show that the performance gap between CPU and memory becomes the major factor that limits key-value store performance on multicore architectures.

2.1.2 Bottlenecks of Memory Accesses in a CPU-Based Key-Value Store

Memory accesses in a key-value store system consist of two major parts: 1) accessing the index data structure, and 2) accessing the stored key-value item. To gain an understanding of the impact of these two parts of memory accesses in an in-memory key-value store system, we have conducted experiments measuring the execution time of a GET query in MICA [21]. MICA is a CPU-based in-memory key-value store with the highest known throughput.

Figure 2: Execution Time Breakdown of a GET Query in a CPU-Based Key-Value Store System (fraction of time spent on index operations, value access & packet writing, and other costs, for the four key/value size data sets)

In the evaluation, MICA adopts its lossy index data structure and runs in EREW mode with a uniform key distribution. The following four data sets are used as the workloads: 1) 8-byte keys and 8-byte values; 2) 16-byte keys and 64-byte values; 3) 32-byte keys and 512-byte values; 4) 128-byte keys and 1,024-byte values. This evaluation is conducted on a PC equipped with two Intel Xeon E5-2650v2 CPUs and 8x8 GB of memory. As shown in Figure 2, index operations take about 75% of the processing time with the 8-byte key-value workload (data set 1), and around 70% and 65% of the time for data sets 2 and 3, respectively. For data set 4, the value size increases to 1,024 bytes and the index operation time portion decreases, but it still takes about 50% of the processing time.

With DPDK and the UDP protocol, receiving and parsing a packet takes only about 70 nanoseconds, and the cost for each key is further amortized with Multiget. For instance, eleven 128-byte keys can be packed in one packet (1,500-byte MTU) so that the amortized packet I/O and parsing cost for each key can be as low as 7 nanoseconds. MICA needs one or more memory accesses for its lossless index, and one memory access for its lossy index data structure. The key comparison in the index operation may also load the value stored next to the key, which is why the proportion of time spent accessing the value is smaller even though both take one memory access for data set 1. The CPI (cycles per instruction) of a CPU-intensive task is generally considered to be less than 0.75; for example, the CPI of Linpack on an Intel processor is about 0.42-0.59 [20]. We have measured that the CPI of MICA with data set 1 is 5.3, indicating that a key-value store is memory-intensive and that the CPU-memory gap has become the major factor limiting its performance.

We have analyzed other popular key-value store systems, and all of them show the same pattern. The huge overhead of accessing the index data structure and the key-value items is incurred by memory accesses. The index data structure commonly uses a hash table or a tree, which needs one or more memory accesses to locate an item. With a huge working set, the index data structure may take hundreds of megabytes or even several gigabytes of memory space. Consequently, it cannot be kept in a CPU cache that has a capacity of only tens of megabytes, and each access to the index data structure may result in a cache miss. Locating a value generally takes one or more random memory accesses, and each memory access fetches a fixed-size data block into a cache line in the CPU cache.

Figure 3: Sequential Memory Access Time (time in nanoseconds vs. number of consecutive memory accesses, 1-16)

For instance, with n items, each lookup in Masstree [22] needs log_4 n - 1 random memory accesses, and a cuckoo hash table with k hash functions would require (k + 1)/2 random memory accesses per index lookup in expectation. The ideal case for a linked-list hash table with a load factor of 1 is that items are evenly distributed in the hash table and each lookup requires only one memory access; however, the expected worst-case cost is O(lg n / lg lg n) [8]. Accessing the key-value item is the next step, which consists of sequential memory accesses. For instance, a 1 KB key-value item needs 16 sequential memory accesses with a cache line size of 64 bytes. However, sequential memory accesses are much faster than random memory accesses, because the processor can recognize the sequential pattern and prefetch the following cache lines.

We have conducted another experiment to show the performance of sequential memory accesses. We start from accessing one cache line (64 bytes), and continue to increase the number of cache lines to read, up to 16 cache lines (1,024 bytes). Figure 3 shows the time of 1-16 sequential memory accesses. A random memory access takes about 76 nanoseconds on our machine, while 16 sequential memory accesses take 231 nanoseconds, only about 3 times longer than one access. In conclusion, the random memory accesses involved in accessing the index data structure may result in a huge overhead in an in-memory key-value store system.
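For reference, this kind of measurement can be reproduced with a microbenchmark along the following lines. The C++ sketch below is ours, not the authors' harness, and the 76 ns / 231 ns figures above are from their machine; a dependent pointer chase defeats the hardware prefetcher for the random-access case, while the sequential case streams 1-16 consecutive cache lines per "item".

// Sketch of a random-vs-sequential memory access microbenchmark (assumed design).
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <utility>

constexpr size_t N = 1 << 23;                        // 8M cache lines = 512 MB, far beyond the LLC
struct alignas(64) Line { Line *next; char pad[56]; };

int main() {
    Line *buf = new Line[N];
    size_t *perm = new size_t[N];
    for (size_t i = 0; i < N; i++) perm[i] = i;
    // Sattolo's shuffle builds one random cycle visiting every line exactly once.
    for (size_t i = N - 1; i > 0; i--) { size_t j = rand() % i; std::swap(perm[i], perm[j]); }
    for (size_t i = 0; i < N; i++) buf[perm[i]].next = &buf[perm[(i + 1) % N]];

    Line *p = buf;                                   // random, dependent accesses
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < N; i++) p = p->next;
    auto t1 = std::chrono::steady_clock::now();
    printf("random: %.1f ns/access (%p)\n",
           std::chrono::duration<double, std::nano>(t1 - t0).count() / N, (void *)p);

    for (int k = 1; k <= 16; k++) {                  // k consecutive cache lines per item
        volatile char sink = 0;
        t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i + k <= N; i += k)
            for (int j = 0; j < k; j++) sink = sink + buf[i + j].pad[0];
        t1 = std::chrono::steady_clock::now();
        printf("%2d sequential lines: %.1f ns/item\n", k,
               std::chrono::duration<double, std::nano>(t1 - t0).count() / (double)(N / k));
    }
    delete[] perm; delete[] buf;
    return 0;
}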

Memory access overhead cannot be easily alleviated for the following three technical reasons. 1) The memory access latency hiding capability of a multicore system is limited by its CPU instruction window size and the number of Miss Status Holding Registers (MSHRs). For example, an Intel X5550 CPU is capable of handling only 4-6 cache misses [14]; therefore, it is hard to utilize inter-thread parallelism. 2) As a thread cannot proceed without the information being fetched from memory, intra-thread parallelism cannot be exploited either. 3) The working set of an IMKV system is very large [6]; therefore, the CPU cache, with its limited capacity, is of little help in reducing the memory access latency. With a huge amount of CPU time being spent on waiting for memory to return the requested data, both the CPU and the memory bandwidth are underutilized. Since accessing the key-value item is inevitable, the only way to significantly improve the performance of an in-memory key-value store system is to find a way to accelerate the random memory accesses in the index operations.

2.2 Opportunities and Challenges of GPUs

2.2.1 Advantages of GPUs for Key-Value Stores

CPUs are general-purpose processors that feature large cache sizes and high single-core processing capability. In contrast to CPUs, GPUs devote most of their die area to large arrays of Arithmetic Logic Units (ALUs), and execute code in a SIMD (Single Instruction, Multiple Data) fashion.

1228

Page 4: Mega­KV: A Case for GPUs to Maximize the Throughput of · PDF fileMega­KV: A Case for GPUs to Maximize the Throughput of In­Memory Key­Value Stores Kai Zhang1 Kaibo Wang2 Yuan

With the massive array of ALUs, GPUs offer an order of magnitude higher computational throughput than CPUs for applications with ample parallelism. A key-value store system has inherent massive parallelism, where a large volume of queries can be batched and processed simultaneously.

GPUs are capable of offering much higher data access throughput than CPUs due to the following two features. First, GPUs have very high memory bandwidth. The Nvidia GTX 780, for example, provides 288.4 GB/s of memory bandwidth, while the recent Intel Xeon E5-2680v3 processor has only 68 GB/s. Second, GPUs effectively hide memory access latency through warp switching. The warp (called a wavefront in OpenCL), the basic scheduling unit in Nvidia GPUs, benefits from zero-overhead scheduling by the GPU hardware. When one warp is blocked by memory accesses, other warps whose next instruction has its operands ready are eligible to be scheduled for execution. With enough threads, memory stalls can be minimized or even eliminated [29].

2.2.2 Challenges of Using GPUs in Key-Value Stores

GPUs have great capabilities to accelerate data-intensive applications. However, they have limitations and may incur extra overhead if utilized in an improper way.

Challenge 1: Limited Memory Capacity and Data Transfer Overhead. The capacity of GPU memory is much smaller than that of main memory [31]. For example, the memory size of a server-class Nvidia Tesla K40 GPU is only 12 GB, while that of a data center server can be hundreds of gigabytes. Since a key-value store system generally needs to store tens or hundreds of gigabytes of key-value items, it is impossible to store all the data in the GPU memory. With the low PCIe bandwidth, it is nontrivial to use GPUs in building a high performance IMKV.

Challenge 2: Tradeoffs between Throughput and Latency. To achieve high throughput, GPUs need data batching to improve resource utilization. A small batch size for a GPU will result in low throughput, while a large batch size will lead to high latency. However, a key-value store system is expected to offer a response latency of less than 1 millisecond. Therefore, tradeoffs have to be made between throughput and latency, and optimizations are needed to match IMKV workloads.

Challenge 3: Specific Data Structure and Algorithm Optimization on GPUs. Applications on GPUs require well-organized data structures and efficient GPU-specific parallel algorithms. However, an IMKV needs to process various-sized key-value pairs, which makes it hard to fully utilize the SIMD vector units and the device memory bandwidth. Furthermore, as there is no global synchronization mechanism for all threads in a GPU kernel, a big challenge is posed for algorithm design and implementation, such as for the Insert operation.

3. MEGA-KV: AN OVERVIEW

To address the memory access overhead in the IMKV system, Mega-KV offloads the index data structure and its corresponding operations to GPUs. With GPUs' architectural advantages of high memory bandwidth and massive parallelism, the performance of index operations is significantly improved, and the load on the CPU is dramatically lightened.

3.1 Major Design Choices

Figure 4: The Workflow of Mega-KV System (Receivers compress keys into key signatures with a signature algorithm; the Scheduler launches operations on the hash table in GPU device memory; Senders convert returned locations to pointers through the slab subsystem)

Decoupling Index Data Structure from Key-Value Items. Due to the limited GPU device memory size, the number of key-value items that can be stored in the GPU memory is very small. Furthermore, transferring data between GPU memory and host memory is considered to be the major bottleneck for GPU execution. Mega-KV decouples the index data structure from the key-value items, and stores only the index in the GPU memory. In this way, the expensive index operations such as Search, Insert, and Delete can be offloaded to GPUs, significantly mitigating the load on the CPUs.

GPU-optimized Cuckoo Hash Table as the Index Data Structure. Mega-KV uses a GPU-optimized cuckoo hash table [27] as its index data structure. According to GPUs' hardware characteristics, the cuckoo hash table is designed with aligned, fixed-size cells and buckets for higher parallelism and fewer memory accesses. Since keys and values have variable lengths, keys are compressed into 32-bit key signatures, and the location of the key-value item in main memory is indicated by a 32-bit location ID. The key signature and location ID serve as the input and output of the GPU-based index operations, respectively.

Periodic GPU Scheduling for Bounded Latency. A key-value store system has a stringent latency requirement for queries. For a guaranteed query processing time, Mega-KV launches GPU kernels at pre-defined time intervals. At each scheduled time point, jobs accumulated in the previous batch are launched for GPU processing. GET queries need fast responses for quality of service, while SET and DELETE queries have a less strict requirement. Therefore, Mega-KV applies different scheduling policies to different types of queries for higher throughput and lower latency.

3.2 The Workflow of Mega-KV

Figure 4 shows the workflow of Mega-KV in a CPU-GPU hybrid system. Mega-KV divides query processing into three stages: pre-processing, GPU processing, and post-processing, which are handled by three kinds of threads, respectively. Receiver threads are in charge of the pre-processing stage, which consists of packet parsing, memory allocation and eviction, and some extra work for batching, i.e., components (I) and (II) in Figure 1.

1229

Page 5: Mega­KV: A Case for GPUs to Maximize the Throughput of · PDF fileMega­KV: A Case for GPUs to Maximize the Throughput of In­Memory Key­Value Stores Kai Zhang1 Kaibo Wang2 Yuan

Figure 5: GPU Cuckoo Hash Table Data Structure (each bucket packs N key signatures and N value locations in separate arrays; hashes and signatures are the input, locations the output)

Receivers batch Search, Insert, and Delete jobs separately into three buffers. The batched input consists of 32-bit key signatures, which are calculated by applying a signature algorithm to the keys. The Scheduler, as the central commander, is a thread in charge of periodic scheduling: it launches the GPU kernel after a fixed time interval to process the query operations batched in the previous time window. Sender threads handle the post-processing stage, including locating the key-value items with the indexes received from the GPU and sending responses to clients. Mega-KV uses slab memory management, where each slab object is assigned a 32-bit location ID. After the key-value item locations for all the Search jobs are returned, Sender threads convert the locations to item pointers through the slab subsystem (see Section 6.2 for details). Because the overhead of packet I/O, query parsing, and memory management is still high, several Receiver and Sender threads are launched in pairs to form pipelines, while there is only one Scheduler per GPU.

4. GPU-OPTIMIZED CUCKOO HASH TABLE

4.1 Data Structure

Since the GPU memory capacity is small, the hash table data structure should be designed to efficiently index a large number of key-value items. Mega-KV adopts cuckoo hashing, which features a high load factor and a constant lookup time. The basic idea of cuckoo hashing is to use multiple hash functions to provide each key with multiple candidate locations instead of one. When a bucket is full, existing keys are relocated to make room for the new key.

There are two parameters affecting the load factor of a cuckoo hash table and the number of evictions during insertion: the number of hash functions and the number of cells in a bucket. Increasing either of them leads to a higher load factor [10]. Since the GPU memory size is limited and the random memory access overhead of a cuckoo eviction is high, we use a small number of hash functions (two) and a large number of cells per bucket in our hash table design.

The various-sized keys and values not only impose a huge overhead on data transfer, but also make it hard for GPU threads to locate their data. In our hash table design, a 32-bit key signature is used to identify a key, and a 32-bit location ID is used to reference the location of an item in main memory. As shown in Figure 5, a hash bucket contains N cells, each of which stores a key signature and the location of the corresponding key-value item. The key signatures and locations are packed separately for coalesced memory access.

Each key is hashed onto two buckets: a 32-bit hash value is used as the index of one bucket, and the index of the other bucket is calculated by performing an XOR operation on the signature and the hash value. The compactness of the key signatures and locations also leads to a small hash table size. For instance, for an average key-value item size of 512 bytes, a 3 GB hash table is capable of indexing 192 GB of data.

The overhead of generating key signatures and hash values is small with built-in x86 instructions such as AES, CRC, or XOR. In a 2 GB hash table with 8 cells per bucket (one cell holds one key signature and its location), there will be 2^25 buckets, addressed with 25 bits of the hash value. With 32 bits for the signature, the hash table can hold up to 2^31/8 = 2^28 elements, with a collision rate as low as 1/2^(25+32) = 1/2^57.
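As a minimal illustration of this layout, the following CUDA sketch shows one possible declaration of a bucket and of the two candidate bucket indices. The names, the power-of-two bucket count, and the masking are our assumptions; only the 32-bit signature/location split, the 8-cell bucket, and the XOR-derived alternative bucket come from the text above.

// Sketch of the bucket layout described above (names and widths assumed).
// Signatures and locations are packed separately so that the N threads of a
// processing unit read them with coalesced 32-byte memory transactions.
#define CELLS_PER_BUCKET 8              // N in Figure 5; chosen in Section 4.3.1

struct Bucket {
    unsigned int sig[CELLS_PER_BUCKET]; // 32-bit key signatures
    unsigned int loc[CELLS_PER_BUCKET]; // 32-bit location IDs (slab + offset)
};

// Candidate buckets for a key, assuming the bucket count is a power of two:
// the hash value indexes the first bucket, and XOR-ing it with the signature
// gives the second, so either index can be recomputed from the other.
__host__ __device__ inline unsigned int bucket1(unsigned int hash,
                                                unsigned int bucket_mask) {
    return hash & bucket_mask;
}
__host__ __device__ inline unsigned int bucket2(unsigned int hash,
                                                unsigned int sig,
                                                unsigned int bucket_mask) {
    return (hash ^ sig) & bucket_mask;
}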

4.2 Hash Table Operations

A straightforward implementation of the hash table operations on GPUs may result in low resource utilization, and consequently low performance. As the L2 cache of a GPU is very small (1.5 MB on the Nvidia GTX 780), only a few buckets can be operated on at the same time. When tens of thousands of threads are operating on different buckets simultaneously, a fetched memory block may be immediately evicted from the cache after one access. Consequently, if a single thread tries to match all the signatures in a bucket, the fetched memory block has a high probability of being evicted after each access and has to be fetched from memory repeatedly. To address this issue, we propose to form multiple threads into a processing unit for cooperative processing. With a bucket of N cells, N threads form a processing unit that checks all the cells in the bucket simultaneously, e.g., to find an empty one. Since the threads forming a processing unit are selected within one warp, they perform the operations simultaneously; once the operations are performed, the data is no longer needed and can be evicted from the cache. Based on this approach, this section describes the specific hash table operations optimized for GPU execution.

Search: A Search operation checks all 2xN candidate key signatures in the two buckets, and writes the corresponding location into the output buffer. When searching for a key, the threads in the processing unit compare the signatures in the bucket in parallel. After that, the threads use the built-in vote function ballot() to inform each other of the results for their corresponding cells. With the ballot() result, all threads in the processing unit know whether there is a match in the current bucket and at which position. If none of the threads finds a match, they repeat the same process on the alternative bucket. If there is a match, the corresponding thread writes the location ID to the output buffer.
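The following CUDA kernel is a condensed sketch of this cooperative Search, not the paper's code. It assumes the Bucket layout and the bucket1()/bucket2() helpers from the Section 4.1 sketch, CUDA 9+ warp-vote intrinsics (__ballot_sync), and hypothetical buffer names; batching across CUDA streams and the CPU-side key comparison are omitted.

// Each group of CELLS_PER_BUCKET lanes within a warp is one processing unit
// and serves one Search job; a warp-local ballot tells every lane of the unit
// whether (and where) the signature matched.
__global__ void search_kernel(const unsigned int *in_hash, const unsigned int *in_sig,
                              unsigned int *out_loc, const Bucket *table,
                              unsigned int bucket_mask, int n_jobs) {
    const int lane      = threadIdx.x % CELLS_PER_BUCKET;   // cell this lane checks
    const int warp_lane = threadIdx.x % 32;
    const int job = (blockIdx.x * blockDim.x + threadIdx.x) / CELLS_PER_BUCKET;
    if (job >= n_jobs) return;                               // whole unit exits together

    // Vote mask naming only the lanes of this processing unit.
    const unsigned int unit_mask =
        ((1u << CELLS_PER_BUCKET) - 1u) << (warp_lane - lane);

    const unsigned int sig = in_sig[job];
    unsigned int b = bucket1(in_hash[job], bucket_mask);

    for (int probe = 0; probe < 2; probe++) {                // first, then alternative bucket
        bool hit = (table[b].sig[lane] == sig);
        unsigned int votes = __ballot_sync(unit_mask, hit);  // every unit lane sees the result
        if (votes) {
            if (warp_lane == __ffs(votes) - 1)               // lowest matching lane writes
                out_loc[job] = table[b].loc[lane];
            return;
        }
        b = bucket2(in_hash[job], sig, bucket_mask);
    }
    if (lane == 0) out_loc[job] = 0;                         // assumed "not found" sentinel
}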

Insert: For an Insert operation, the processing unit first checks whether the same signature already exists in the two buckets, i.e., a conflict. If there is a conflict, the conflicting location is replaced with the new one; otherwise, the processing unit tries to find an empty cell in the two buckets. With ballot(), all threads learn the positions of all the empty cells. If either of the two buckets has an empty cell, the thread in charge of the corresponding cell tries to insert the key-value pair. After the insertion, synchronize() is performed to make sure all memory transactions have completed within the thread block, so that the unit can check whether the insertion was successful; if not, the processing unit tries again. When multiple units conflict on the same cell, at least one insertion will succeed.


Figure 6: Processing time of Search operation with 60K jobs and 7 CUDA streams (execution time in microseconds vs. threads per processing unit, for 4, 8, 16, and 32 cells per bucket)

If neither bucket has an empty cell, a randomly selected cell from one candidate bucket is relocated to its alternative location. Displacing that key may require kicking out another existing key, and this repeats until a vacant cell is found or a maximum number of displacements is reached.

There may be severe write-write conflicts if multiple processing units try to insert an item into the same position. To alleviate the conflict, instead of always trying to insert into the first available cell in a bucket, a preferred cell is assigned to each key. We use the highest bits of a signature to index its preferred cell; for instance, 3 bits can be used to indicate the preferred cell in a bucket with 8 cells. If multiple available cells are found in a bucket, the available cell nearest to the preferred cell is chosen for insertion (a sketch of this rule follows the Delete description below). No extra communication is needed between the threads in a processing unit, since each of them knows whether its cell is chosen from the ballot() result.

Delete: A Delete operation is almost the same as Search. When both the signature and the location match, the corresponding thread clears the signature to zero to mark the cell as available.
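Returning to the Insert path, the preferred-cell rule above can be realized with a few bit operations on the ballot result. The sketch below is one possible reading (nearest empty cell by absolute distance, ties broken toward lower indices); the exact tie-breaking is not specified in the text and is our assumption, as are the names.

// Pick the cell to insert into, given the unit-local bitmap of empty cells
// (bit c set = cell c is empty).  Assumes CELLS_PER_BUCKET == 8, so the top
// 3 signature bits name the preferred cell.
__device__ inline int choose_cell(unsigned int sig, unsigned int empty_cells) {
    const int preferred = sig >> (32 - 3);
    int best = -1, best_dist = CELLS_PER_BUCKET + 1;
    for (int c = 0; c < CELLS_PER_BUCKET; c++) {
        if (empty_cells & (1u << c)) {
            int dist = (c > preferred) ? (c - preferred) : (preferred - c);
            if (dist < best_dist) { best_dist = dist; best = c; }
        }
    }
    return best;   // -1 means the bucket has no empty cell
}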

4.3 Hash Table Optimization and Performance

In this section, we use an Nvidia GTX 780 to evaluate the hash table performance.

4.3.1 The Choice of Processing Unit and Bucket Size

To evaluate the effectiveness of the processing unit, Figure 6 shows the GPU execution time of the Search operation with different numbers of cells per bucket and different numbers of threads per processing unit. Since a memory transaction is 32 bytes in the non-caching mode of Nvidia GPUs, the bucket size is set to multiples of 32 bytes to efficiently utilize memory bandwidth. As shown in the figure, decreasing the number of cells and increasing the number of threads in a processing unit lead to a reduced execution time. We choose 8 cells per bucket, which allows a higher load factor, and correspondingly 8 threads form a processing unit.

4.3.2 Optimization for Insert

The Insert operation needs to verify whether the key-value index has been successfully inserted, and a synchronize() operation is performed after the insertion. However, this operation can only synchronize threads within a thread block; it cannot synchronize threads in different thread blocks. A naive approach to implementing the Insert operation is to launch only one thread block.

Figure 7: Speedups for Insert with Multiple Blocks (speedup vs. batch size for 2, 4, 8, 16, and 32 insert blocks)

Figure 8: GPU Hash Table Performance (panels A-F: Search, Insert, and Delete with small and large batches; throughput in MOPS and latency in microseconds vs. batch size, for 1 and 7 CUDA streams)

However, a thread block can only execute on one streaming multiprocessor, leading to resource underutilization and low performance.

We divide the hash table into several logical partitions. According to the hash value, key signatures are batched into different buffers, with each buffer belonging exclusively to one logical partition. For the alternative bucket in cuckoo hashing, a mask is used to keep it within the same partition. With each thread block exclusively processing one input buffer and one logical partition, the throughput of the Insert operation is boosted by utilizing more streaming multiprocessors. Figure 7 shows the performance improvement with multiple insert blocks. As can be seen in the figure, an average speedup of 1.2 is achieved with 2 blocks, and the throughput becomes 1.3-1.8 times higher with 16 blocks.
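A small sketch of how such a mask can keep both candidate buckets of a key inside one logical partition, so that a single thread block can own that partition; the bit layout and the PARTITION_BITS value are assumptions for illustration.

// The high bits of a bucket index select the logical partition; only the low
// bits are XOR-ed with the signature, so the alternative bucket never leaves
// the partition that the current insert thread block owns.
#define PARTITION_BITS 4   // e.g. 16 logical partitions / insert blocks (assumed)

__host__ __device__ inline unsigned int alt_bucket_in_partition(
        unsigned int bucket, unsigned int sig, unsigned int bucket_mask) {
    unsigned int intra_mask = bucket_mask >> PARTITION_BITS;   // low (intra-partition) bits
    unsigned int partition  = bucket & ~intra_mask;            // keep the partition bits
    return partition | ((bucket ^ sig) & intra_mask);          // flip only the low bits
}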

4.3.3 Hash Table Performance

Figure 8 shows the throughput and processing latency of the hash table operations with different input batch sizes.


Figure 9: Periodical Delayed Scheduling

Compared with 1 CUDA stream, a 24%-60% performance improvement is achieved with 7 CUDA streams, which effectively overlap kernel execution with data transfer. Our hash table implementation shows that the peak performance of Search, Insert, and Delete operations with large batch sizes is 303.7 MOPS, 210.3 MOPS, and 196.5 MOPS, respectively. For a small batch, the performance of 7 streams is lower than that of 1 stream. This is because GPU resources are underutilized for a small batch, which is further partitioned into smaller ones by the multiple CUDA streams.

4.4 The Order of Operations

The possibility of conflicting operations on the same key within such a microsecond-scale time interval is very small, and there is no need to guarantee their order. This is because, in multicore-based key-value store systems, the operations may be processed by different threads, which may experience context switches and compete for locks to perform the operations. Moreover, the workload of each thread is different, and the delays of packets transferred over the network also vary over a large range. For example, TCP processing may take up to thousands of milliseconds [18], and queries may wait for tens of milliseconds to be scheduled in a data processing system [25]. Therefore, the order of operations within only hundreds of microseconds can be ignored, and the order of operations within one GPU batch is not guaranteed in Mega-KV. For conflicts in accessing the same item among the CPU threads, a set of optimistic concurrency control mechanisms is applied in the implementation of Mega-KV (Section 6.4).

5. SCHEDULING POLICY

In this section, we study the GPU scheduling policy to balance throughput and latency.

5.1 Periodical Scheduling

To achieve a bounded latency for GPU processing, we propose a periodical scheduling mechanism based on the following three observations. First, the majority of hash table operations are Search in a typical workload, while Insert and Delete operations account for a small fraction; for example, a 30:1 GET/SET ratio is reported for Facebook's Memcached workload [6]. Second, as shown in Section 4.3, the processing time of Insert and Delete operations increases very slowly when the batch size is small. Third, SET queries have a less strict latency requirement than GET queries.

Different scheduling cycles are applied to Search, Insert, and Delete operations. Search operations are launched for GPU processing after a query batching time C.

Figure 10: Fitting Lines for Search and Insert/Delete Operations (95% GET, 5% SET); latency in microseconds vs. Insert/Delete batch size, with fitted lines y = 5.7x10^-2 x + 37.4 (y = 3.0x10^-3 x' + 37.4) for Search and y = 5.6x10^-3 x + 136.9 for Insert+Delete

Insert and Delete operations, however, are processed every n x C. We define the GPU execution time for the Search operations batched in C as T_S. In Mega-KV, we assume that the GPUs are capable of handling the input queries, so T_S < C. Figure 9 shows an example with n = 2. In the example, Search operations accumulated in the last time interval C are processed in the next time interval, while Insert and Delete operations are launched for execution every 2xC. We define the sum of T_S and the time for the Insert and Delete operations batched in n x C as T_max. To guarantee that Search operations batched in a time interval C can be processed within the next time interval, the following rule should be satisfied:

T_S + T_max <= 2 x C    (1)

Including the time for reading the value and sending the response, a maximum response time of 3 x C is expected for GET queries.

5.2 Lower Bound of Scheduling Cycle

Figure 10 shows the fitting lines for the relation between processing time and batch size for Search and Insert & Delete operations on the Nvidia GTX 780. The workload has 95% GET and 5% SET queries with a uniform distribution. Both lines are drawn against the same horizontal axis, which is the batch size for Insert. With an Insert batch size x, the corresponding batch size for Search is 19x. As shown in the figure, the processing time of Search operations increases more quickly than that of Insert/Delete operations as the batch size grows, which also indicates that Insert and Delete operations are well suited to being batched for a longer time in read-heavy IMKV systems.

Since the processing time T_S of the Search operation increases almost linearly with the batch size x_S, we define it as T_S = k_S * x_S + b_S, where k_S = 3.0x10^-3 and b_S = 37.4. Similarly, for the Insert/Delete operations, we define the relation between batch size x_I and processing time T_I as T_I = k_I * x_I + b_I, where k_I = 5.6x10^-3 and b_I = 136.9. With a fixed input speed V, the relations between the batching time t and the processing time are T_S = k_S * p_S * V * t + b_S and T_I = k_I * p_I * V * t + b_I, where p_S is the proportion of Search operations and p_I = 1 - p_S is the proportion of Insert operations, which are 95% and 5% in the figure, respectively.

The maximum processing time is T_max = (k_S * p_S * V * C + b_S) + (k_I * (1 - p_S) * V * C * n + b_I). To satisfy the inequality T_S + T_max <= 2 x C, we get

C >= (2 * b_S + b_I) / (2 - 2 * k_S * p_S * V - n * k_I * (1 - p_S) * V)    (2)


Figure 11: Latency Improvement with Delayed Scheduling (minimum scheduling cycle C in microseconds vs. input speed in MOPS, with and without 1-batch delayed scheduling, and the improvement ratio)

Figure 12: The Framework of Mega-KV System (Receiver/Sender thread pairs with three buffers per pipeline for pre- and post-processing, the Scheduler, and the hash table in GPU device memory connected over PCIe)

where V < 2/(2 * k_S * p_S + n * k_I * (1 - p_S)) and n >= 2. From the formula, we learn that an increasing n leads to a larger lower bound of C. Therefore, with the same input speed, we get the minimum C with n = 2.

Without the delayed scheduling, C should satisfy C >= T_max, and we get C >= (b_S + b_I)/(1 - k_S * p_S * V - k_I * (1 - p_S) * V), where V < 1/(k_S * p_S + k_I * (1 - p_S)). Figure 11 shows the reduction in the lower bound of the scheduling cycle C with delayed scheduling, where an average reduction of 65% is achieved. The delayed scheduling policy not only offers a reduced overall latency, but also makes the system capable of achieving a higher throughput with the same latency.
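To make inequality (2) concrete, here is a worked instance using the fitted constants from above (k_S = 3.0x10^-3, b_S = 37.4, k_I = 5.6x10^-3, b_I = 136.9, all in microseconds), p_S = 0.95, n = 2, and an assumed input speed of V = 120 MOPS (i.e., 120 operations per microsecond):

C >= (2*37.4 + 136.9) / (2 - 2*(3.0x10^-3)*0.95*120 - 2*(5.6x10^-3)*0.05*120) = 211.7 / (2 - 0.684 - 0.0672) ~= 169.5 microseconds.

That is, at an assumed 120 MOPS per GPU, the scheduling cycle must be at least about 170 microseconds for Search batches to finish within the next cycle.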

6. SYSTEM IMPLEMENTATION AND OPTIMIZATION

To build a high performance key-value store that is capable of processing hundreds of millions of queries per second, the overheads from data copy operations, memory management, and locks should be addressed. In this section, we illustrate the framework of Mega-KV and the major techniques used to alleviate these overheads.

6.1 Zero-Copy Pipelining

Data copy is known to be a big overhead in a high-speed networking system and may limit overall system performance. To avoid expensive memory copy operations between pipeline stages, each pipeline is assigned three buffers, and data between the stages is transferred by passing the buffers. Figure 12 shows the zero-copy pipelining framework of Mega-KV.

At any time, each Receiver works on one buffer to batch incoming queries. When the GPU kernel launch time arrives, the Scheduler uses an available buffer to swap out the one the Receiver is working on. In a system configured with N Receivers, N CUDA streams are launched to process the buffer of each Receiver. After the GPU kernel completes execution, the Scheduler hands the buffer to a Sender for post-processing. The Sender marks its buffer as available after it completes the post-processing. With this technique, the overhead of transferring data between pipeline stages is significantly mitigated.

6.2 Memory Management

Slab Memory Management. Mega-KV uses slab allocation. The 32-bit location of an item is used in the hash table to reference where the item resides in main memory. Through the slab data structure, a mapping is made between the location and the corresponding item: the highest bits of the location indicate which slab it belongs to, and the remaining bits give the offset within the slab.
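A minimal sketch of this 32-bit location encoding follows. The paper does not give the exact split between slab-ID bits and offset bits, so the 8/24 split and the helper names below are assumptions.

#include <cstdint>

constexpr int SLAB_ID_BITS = 8;                      // up to 256 slabs (assumed)
constexpr int OFFSET_BITS  = 32 - SLAB_ID_BITS;

inline uint32_t make_location(uint32_t slab_id, uint32_t offset) {
    return (slab_id << OFFSET_BITS) | (offset & ((1u << OFFSET_BITS) - 1u));
}
inline uint32_t location_slab(uint32_t loc)   { return loc >> OFFSET_BITS; }
inline uint32_t location_offset(uint32_t loc) { return loc & ((1u << OFFSET_BITS) - 1u); }

// A Sender thread would resolve a location roughly as:
//   item = slab_base[location_slab(loc)] + location_offset(loc) * item_size_of_slab;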

CLOCK Eviction. Each slab adopts a bitmap-based CLOCK eviction mechanism, where an item's offset in the bitmap is the same as its offset in the slab. A walker pointer traverses the bitmap and performs the CLOCK algorithm for eviction. By tightly associating the location with the slab and bitmap data structures, both locating an item and updating the CLOCK bitmap can be performed with extremely low overhead.
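The per-slab CLOCK walker can be pictured with the following single-threaded sketch (ours, not the implementation): one reference flag per slab offset, set on access and cleared by the walker, which evicts the first unreferenced item it meets. A real bitmap would pack the flags into bits and run concurrently with accesses.

#include <cstdint>
#include <vector>

struct ClockSlab {
    std::vector<uint8_t> ref;   // one "recently used" flag per item offset
    size_t hand = 0;            // walker position

    explicit ClockSlab(size_t n_items) : ref(n_items, 0) {}

    void touch(uint32_t offset) { ref[offset] = 1; }   // called when the item is accessed

    uint32_t evict() {                                  // returns the victim's offset
        for (;;) {
            if (ref[hand] == 0) {                       // not referenced since the last pass
                uint32_t victim = static_cast<uint32_t>(hand);
                hand = (hand + 1) % ref.size();
                return victim;
            }
            ref[hand] = 0;                              // second chance, keep walking
            hand = (hand + 1) % ref.size();
        }
    }
};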

If there is a conflict when inserting an item into the hash table, the conflicting location is replaced with the new one (Section 4.2), and the conflicting item stored in main memory should be evicted so that its memory can be reused. With the CLOCK eviction mechanism, the conflicting item in main memory will be evicted once it has not been accessed for a period. Therefore, no further action needs to be taken on conflicting items. This also applies to items that are randomly evicted when the maximum number of displacements is reached during cuckoo hash insertion.

Batched Lock. With a memory design shared among all threads, synchronization is needed for memory allocation and eviction. Since acquiring a lock on the critical path of query processing has a huge impact on overall performance, batched allocation and eviction are adopted to mitigate this overhead. Each allocation or eviction returns a memory chunk containing a list of fixed-size items. Correspondingly, each Receiver thread maintains a local slab list for storing the allocated and evicted items. By amortizing the lock overhead across hundreds of items, the performance of the memory management subsystem is dramatically improved.

6.3 APIs: get and getk

Like Memcached, Mega-KV has two APIs for GET: get and getk. When a get query is received, Mega-KV is responsible for making sure that the value sent to the client matches the key. Therefore, before sending a found value to the client, the key stored in main memory is compared to confirm the match. If the keys are the same, the value is sent to the client; otherwise NOT FOUND is sent to indicate that the key-value item is not stored in Mega-KV.

Since a getk query asks the key-value store to send the key along with the value, the client is capable of matching the key with its value.


Therefore, Mega-KV does not compare the keys to confirm the match, and requires the client to do this job. Our design choice is mainly based on two factors: 1) the false positive/conflict rate is very low; and 2) the key comparison cost is comparatively high. Therefore, avoiding the key comparison operation for each query leads to higher performance.

6.4 Optimistic Concurrent Accesses

To avoid locks in the critical path of query processing, the following optimistic concurrent access mechanisms are applied in Mega-KV.

Eviction. An item cannot be evicted in the following two situations. First, a free slab item that has not yet been allocated should not be evicted. Second, if an item has been deleted and recycled to a free list, it should not be evicted. To handle these two situations, we assign a status tag to each item. Items in the free lists are marked as free, and allocated slab items are marked as using. The eviction process checks an item's status and will only evict items whose status tag is using.

Reading vs. Writing. An item may be evicted while other threads are reading its value. In such a scenario, the reading thread checks the status tag of the item after it finishes reading the value. If the status tag is no longer using, the item has already been evicted. Since the value read may be wrong, a NOT FOUND response will be sent to the client.

Buffer Swapping. A Receiver does not know when the Scheduler swaps its buffer. Therefore, a query may not be successfully batched into the buffer if the buffer is swapped at the last moment. To address this issue without using locks, we record the buffer ID before a query is added into the buffer, and check whether the buffer has been swapped after the insertion. If the buffer has been swapped during the process, the Receiver cannot be sure whether the query was successfully inserted, and the query is added into the new buffer again.

For the Search operation, if the buffer has been swapped, the total number of queries in the buffer is not increased, so the new query, whether or not it has been added into the buffer, will not be processed by the Sender. For the Insert operation, the object can be inserted twice, as the latter insertion will overwrite the former one; therefore correctness is not affected, and the same holds for the Delete operation.
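A host-side sketch of this lock-free check for a Search job is shown below; the types, capacities, and memory-ordering choices are our assumptions, and only the "record the buffer, write the job, re-check, retry if swapped" pattern comes from the text.

#include <atomic>
#include <cstdint>

struct JobBuffer {
    uint32_t n_jobs = 0;        // read by the Sender after GPU processing
    uint32_t hashes[65536];     // capacity checks omitted for brevity
    uint32_t sigs[65536];
};

struct Pipeline {
    std::atomic<JobBuffer *> current;   // swapped by the Scheduler each cycle
};

// Called by the single Receiver that owns this pipeline.
void batch_search_job(Pipeline &p, uint32_t hash, uint32_t sig) {
    for (;;) {
        JobBuffer *buf = p.current.load(std::memory_order_acquire);
        uint32_t slot = buf->n_jobs;                 // tentative slot
        buf->hashes[slot] = hash;
        buf->sigs[slot]   = sig;
        if (p.current.load(std::memory_order_acquire) == buf) {
            buf->n_jobs = slot + 1;                  // publish only if no swap happened
            return;
        }
        // Swapped mid-write: the count was never bumped, so the half-written
        // job is invisible to the Sender; redo the job in the new buffer.
    }
}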

7. EXPERIMENTS

In this section, we evaluate the performance and latency of Mega-KV under a variety of workloads and configurations. We show that Mega-KV achieves extremely high throughput with reasonably low latency.

7.1 Experimental Methodology

Hardware. We conduct the experiments on a PC equipped with two Intel Xeon E5-2650v2 octa-core processors running at 2.6 GHz. Each processor has an integrated memory controller installed with 8x8 GB of 1600 MHz DDR3 memory, and supports a maximum of 59.7 GB/s of memory bandwidth. Each socket is installed with an Nvidia GTX 780 GPU as our accelerator. The GTX 780 has 12 streaming multiprocessors and a total of 2,304 cores. The device memory on each GPU is 3 GB of GDDR5, and the maximum data transfer rates between main memory and device memory are 11.5 GB/s (host-to-device) and 12.9 GB/s (device-to-host). The operating system is 64-bit Ubuntu Server 14.04 with Linux kernel version 3.13.0-35.

Each socket is also installed with an Intel 82599 dual-port 10 GbE card, and the open-source DPDK [1] is used as the driver for high-speed I/O. We also use two other PCs as clients in the experiments.

Mega-KV Configuration. For the octa-core CPU on each socket, one physical core is assigned a Scheduler thread. Each Scheduler controls all the other threads on its socket and launches kernels on its local GPU. Since the Receiver is compute-intensive while the Sender is memory-intensive, we enable hyper-threading and assign each of the remaining 7 physical cores one Receiver and one Sender thread, forming a pipeline on the same physical core. Therefore, there are 7 pipelines per socket in our system. The AES instruction (AES-NI) is used in calculating the key signature and hash value.

Data sharding is adopted for the NUMA system. By partitioning data across the sockets, each GPU only indexes the key-value items located on its local socket. This avoids remote memory accesses, which are considered a big overhead. In this way, Mega-KV achieves good scalability with multiple CPUs and GPUs.

Comparison with a CPU-based IMKV system. We take the open-source in-memory key-value store MICA as our baseline for comparison. Following MICA's experimental environment, hyper-threading is disabled. We modify the microbenchmark in MICA for the evaluation, which includes loading different-size key-value items from local memory and writing values to packet buffers. All the experiments are performed in its EREW mode with MICA's lossy index data structure. On the same hardware platform as Mega-KV, the performance of MICA is measured and shown in Section 7.2.3.

Workloads. We use four datasets in the evaluation: a) 8-byte key and 8-byte value; b) 16-byte key and 64-byte value; c) 32-byte key and 512-byte value; and d) 128-byte key and 1024-byte value. In the following experiments, workload a is evaluated by feeding queries via the network. To allow a high query rate over the network, clients batch requests and Mega-KV batches responses into an Ethernet frame as much as possible. For the large key-value pair workloads b, c, and d, the network becomes the bottleneck; thus, keys and values are generated in local memory to evaluate the performance.

Both uniform and skewed workloads are used in the evaluation. The uniform workload uses the same key popularity for all queries. The key popularity in the skewed workload follows a Zipf distribution with skewness 0.99, the same as the YCSB workload [7]. Our clients use the approximate Zipf distribution generation described in [13] for fast workload generation. For all the workloads, the working set size is 32 GB, and a 2 GB GPU hash table is used to index the key-value items. The key-value items are preloaded in the evaluation, and the hit ratio of GET queries is higher than 99.9%.
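For reference, the following sketch shows a Zipf(0.99) key sampler in the style of the approximation of Gray et al. [13] as it is commonly implemented in YCSB-like generators. The constants and structure are reproduced from memory and should be treated as illustrative, not as the exact generator used by our clients.

```c
/* Approximate Zipf sampler (theta = 0.99); seed with srand48() first. */
#include <math.h>
#include <stdint.h>
#include <stdlib.h>

struct zipf_gen {
    uint64_t n;        /* number of distinct keys */
    double theta;      /* skewness, 0.99 in our experiments */
    double zetan;      /* generalized harmonic number H_{n,theta} */
    double alpha, eta;
};

static double zeta(uint64_t n, double theta) {
    double sum = 0.0;
    for (uint64_t i = 1; i <= n; i++)          /* O(n) one-time setup */
        sum += 1.0 / pow((double)i, theta);
    return sum;
}

void zipf_init(struct zipf_gen *g, uint64_t n, double theta) {
    g->n = n;
    g->theta = theta;
    g->zetan = zeta(n, theta);
    g->alpha = 1.0 / (1.0 - theta);
    g->eta = (1.0 - pow(2.0 / (double)n, 1.0 - theta)) /
             (1.0 - zeta(2, theta) / g->zetan);
}

uint64_t zipf_next(struct zipf_gen *g) {
    double u  = drand48();                     /* uniform in [0, 1) */
    double uz = u * g->zetan;
    if (uz < 1.0) return 0;
    if (uz < 1.0 + pow(0.5, g->theta)) return 1;
    return (uint64_t)((double)g->n *
                      pow(g->eta * u - g->eta + 1.0, g->alpha));
}
```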

7.2 System Throughput

7.2.1 Theoretical Maximum System Throughput

As discussed in Section 5, the system throughput V is closely related to the scheduling cycle C. Since our system needs to guarantee that the GPU processing time for each batch is less than the scheduling cycle C, the theoretical maximum throughput should be known in advance before the evaluation.



Figure 13: Theoretical Maximum Speed for One GPU. [Plot: maximum throughput (MOPS) vs. scheduling cycle C (microseconds); curves: throughput without delayed scheduling and throughput with 1-batch delayed scheduling.]

Figure 14: System Throughput with Two CPUs and GPUs (8 Bytes Key & Value). [Plot: throughput (MOPS) vs. scheduling cycle C (us); panels: (a) Mega-KV w/ getk, (b) Mega-KV w/ get; workloads: 100% GET uniform, 95% GET / 5% SET uniform, 100% GET skewed, 95% GET / 5% SET skewed.]

The major mission of the periodic scheduling policy is to satisfy inequality (1), TS + Tmax ≤ 2 × C. With the delayed scheduling, we get

\[
V \le \frac{2 \cdot C - 2 \cdot b_S - b_I}{2 \cdot k_S \cdot p_S \cdot C + 2 \cdot k_I \cdot (1 - p_S) \cdot C} \tag{3}
\]

where C > (2 · bS + bI)/2. Without the delayed scheduling, V and C satisfy 0 < V ≤ (C − bS − bI)/(kS · pS · C + kI · (1 − pS) · C), where C > bS + bI. For the 95% GET workload, we get C > 105.8 with the delayed scheduling, and C > 174.3 without the delayed scheduling.

With the same parameters kS, bS, kI, and bI listed in Section 5.2, Figure 13 demonstrates the relation between the allowed maximum throughput and the scheduling cycle C for the workload with 95% GET and 5% SET. The theoretical maximum throughput of Mega-KV increases with the scheduling cycle, and reaches an upper bound, the maximum GPU throughput, once the scheduling cycle is large enough. Compared with scheduling all operations with the same scheduling cycle, system performance is improved by 14%-400% with delayed Insert and Delete scheduling. It is worth noting that for C ≤ 174.3 microseconds, the system cannot work without the delayed scheduling. This is because the total GPU execution and data transfer overhead for the Search, Insert, and Delete kernels is higher than the scheduling cycle when the batch size is small. With the delayed scheduling, the scheduling cycle C is only required to be greater than 105.8 microseconds.
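The two lower bounds on C quoted above follow from requiring a positive throughput in the corresponding formulas; the numerical cutoffs of 105.8 and 174.3 microseconds then come from substituting the Section 5.2 parameters:

\[
\frac{2C - 2 b_S - b_I}{2 k_S p_S C + 2 k_I (1 - p_S) C} > 0
\;\Longleftrightarrow\; C > \frac{2 b_S + b_I}{2}
\qquad\text{(with delayed scheduling)}
\]
\[
\frac{C - b_S - b_I}{k_S p_S C + k_I (1 - p_S) C} > 0
\;\Longleftrightarrow\; C > b_S + b_I
\qquad\text{(without delayed scheduling)}
\]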

7.2.2 System Throughput with Network Processing

To measure the maximum throughput of Mega-KV, we use workload a, where both keys and values are 8 bytes, so that the network is capable of transferring a huge number of queries. This experiment is performed with network processing, where Mega-KV receives requests from clients and sends back responses through the NIC ports. In the experiments, we find that the throughput offered by the GPUs is much higher than that of the CPUs in the Mega-KV pipeline. After the expensive index operations are offloaded to the GPUs, the memory accesses to the key-value items become the major factor limiting the system performance.

Figure 14 plots the throughput of Mega-KV with workload a. The performance of both getk and get queries is measured. With getk queries, Mega-KV achieves a maximum throughput of 160 MOPS for 95% GET and 5% SET, and 166 MOPS for the 100% GET skewed workload, respectively. With get queries, the throughput of Mega-KV is 120 MOPS (95% GET and 5% SET) and 124 MOPS (100% GET) for the skewed workload. For the uniform workload, the throughput is 6%-10% lower. In a real-world scenario where clients send a mix of get and getk queries, the performance of Mega-KV should lie between the maximum throughputs of get and getk, i.e., 120-166 MOPS for the skewed workload and 100-150 MOPS for the uniform workload. With the 95% GET workload, Mega-KV is 1.6-2.2 times as fast as MICA (76.3 MOPS), the CPU-based IMKV system with the highest known throughput. To address the limitations from memory accesses, these experiments are performed with a GET hit ratio of more than 99.9%.

The skewed workload shows better performance than the uniform workload. This is because our system applies a shared-memory design within each NUMA node; thus the skewed workload does not result in an imbalanced load on the CPU cores. As accessing key-value items in memory is still the major bottleneck, the most frequently visited key-value items may benefit from the CPU cache, and consequently the skewed workload leads to a higher throughput.

7.2.3 System Throughput with Large Key-Value Sizes

Because the network bandwidth may become a bottleneck for large key-value items, the performance for keys and values of various sizes is measured by loading generated key-value items from local memory; values are written into packet buffers as responses.

Figure 15 shows the performance of Mega-KV with workloads b, c, and d by loading data from memory. For the 95% GET 5% SET skewed workload, Mega-KV achieves 110 MOPS, 55 MOPS, and 34 MOPS with get queries, and 139 MOPS, 60 MOPS, and 30 MOPS with getk queries. For the 100% GET skewed workload, Mega-KV achieves 107 MOPS, 56 MOPS, and 36 MOPS with get queries, and 144 MOPS, 62 MOPS, and 33 MOPS with getk queries, respectively. As shown in the figure, the key comparison operation involved in get query processing degrades system throughput by 20%-30% with the small key-value workload. With larger key-value sizes, the overhead of accessing the large values in memory becomes dramatically higher, so the relative cost of key comparison is much smaller and even negligible.

For comparison, we also measure the performance of CPU-based MICA as the baseline. The throughput of Mega-KV is 1.3-2.9 times as high as MICA with 95% GET 5% SET,



Figure 15: Mega-KV Performance with Large Key-Value Size. [Chart: throughput (MOPS) of MICA, Mega-KV w/ getk, and Mega-KV w/ get; panels: (1) 16 B key, 64 B value; (2) 32 B key, 512 B value; (3) 128 B key, 1024 B value; workloads: 100% GET uniform, 95% GET / 5% SET uniform, 100% GET skewed, 95% GET / 5% SET skewed.]

Figure 16: System Latency. [Plot: CDF of query round trip time (microseconds) for scheduling cycles C = 120, 160, 200, 240, 280, and 320.]

and is 1.3-2.6 times as high as MICA with 100% GET.

7.3 Response Time Distribution

We evaluate system processing latency by measuring the time elapsed from sending a getk query to receiving its response. The client keeps sending queries with a 95% GET and 5% SET workload, and the client IP address is increased by 1 for each packet. By sample-logging the IP addresses and the sending/receiving times, the round trip time can be calculated as the time elapsed between queries and responses with matched IP addresses.
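A minimal sketch of this measurement is shown below. The sampling rate, ring size, and the send/receive hooks on the client's DPDK path are hypothetical names of ours, not the actual client code.

```c
/* Illustrative client-side RTT sampling: the source IP carried by each
 * query is incremented per packet; a sample of (ip, send time) pairs is
 * logged, and a response whose IP matches a logged entry yields one RTT. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define SAMPLE_EVERY 1024          /* log one of every 1024 queries */
#define LOG_SLOTS    4096          /* ring of outstanding samples   */

struct sample { uint32_t ip; struct timespec sent; };
static struct sample log_ring[LOG_SLOTS];

static double elapsed_us(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
}

/* Called for every query the client transmits (ip increases by 1). */
void on_send(uint32_t ip) {
    if (ip % SAMPLE_EVERY == 0) {
        struct sample *s = &log_ring[(ip / SAMPLE_EVERY) % LOG_SLOTS];
        s->ip = ip;
        clock_gettime(CLOCK_MONOTONIC, &s->sent);
    }
}

/* Called for every response; dst_ip echoes the query's source IP. */
void on_receive(uint32_t dst_ip) {
    if (dst_ip % SAMPLE_EVERY != 0)
        return;                    /* not a sampled query */
    struct sample *s = &log_ring[(dst_ip / SAMPLE_EVERY) % LOG_SLOTS];
    if (s->ip != dst_ip)
        return;                    /* slot already reused */
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    printf("rtt %.1f us\n", elapsed_us(s->sent, now));
}
```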

The scheduling cycle has an impact on both the system performance and the processing latency. Figure 16 shows the CDF (Cumulative Distribution Function) of query round trip latency with different scheduling cycles. We fix the scheduling cycle C to 120, 160, 200, 240, 280, and 320 microseconds, respectively, and use the maximum allowed input speed for each scheduling cycle. As shown in the figure, the round trip time of queries is effectively controlled within 3 × C, which demonstrates the effectiveness of our GPU scheduling policy in achieving predictable latency.

The latency of Mega-KV is comparatively higher than that of a fast CPU-based implementation. Specifically, if an application requires a latency strictly lower than 317.4 (105.8 × 3) microseconds, Mega-KV cannot guarantee that the deadline will be met by all queries, because the minimum scheduling cycle C for Mega-KV is 105.8 microseconds (Section 7.2.1). MICA, which adopts DPDK as its NIC driver, achieves a latency of 24-52 microseconds. As shown in Figure 13, Mega-KV achieves a maximum throughput of 204 MOPS for the 95% GET and 5% SET workload with C = 160 microseconds. With this configuration, the average and 95th percentile latencies of Mega-KV are about 280 microseconds and 410 microseconds (Figure 16). However, at Facebook, the average and 95th percentile latencies of production web servers getting keys are about 300 microseconds and 1,200 microseconds, respectively [25]. Therefore, although the latency of Mega-KV is higher than that of a fast CPU-based implementation, it is still capable of meeting the requirements of current data processing systems.

8. RELATED WORK

CPU-based in-memory key-value stores [11, 22, 24, 23] have focused on designing efficient index data structures and optimizing network processing to achieve higher performance. MICA has compared itself with RAMCloud, MemC3, Memcached, and Masstree in its paper, and shown at least a 4 times higher throughput than the next best system, which is why we choose MICA for performance comparison. Systems such as Chronos [19], Pilaf [24], and RAMCloud [26] focus on low-latency processing, achieving latencies of less than 100 microseconds. Specifically, by taking advantage of InfiniBand, RAMCloud and Pilaf achieve average latencies as low as 5 microseconds and 11.3 microseconds, respectively.

Rhythm [4] proposes to use GPUs to accelerate web server workloads; it also batches requests to trade response time for higher throughput. The latency of requests in Rhythm is above 10 milliseconds, which is hundreds of times higher than that of a CPU-based web server, but its throughput is about 10 times higher.

Although previous work has adopted GPUs in a key-value store system [17], it only ports the existing Memcached implementation to an OpenCL version, and does not exploit the GPUs' hardware characteristics for higher performance.

Graphics applications do not need a persistent hash table; instead, they need hash tables that can be constructed quickly, such as [5]. The methods adopted in building such hash tables may result in construction failures, and do not fit key-value store systems.

There is already a set of research papers on adopting GPUs in database systems [35, 33, 32, 28, 15, 16, 34]. The techniques we developed in Mega-KV can also be utilized to



accelerate relational database query processing. For example, a cuckoo hash table is adopted in implementing GPU-based hash join [35]. Therefore, Mega-KV's GPU-optimized cuckoo hash table with its corresponding operations is a good candidate for accelerating the hash join operation.

9. CONCLUSION

Having conducted thorough experiments and analyses, we have identified the bottleneck of IMKVs running on multicore processors, which is a mismatch between the unique properties of IMKVs for increasingly large data processing and the CPU-based architecture. We have made a strong case, by designing and implementing Mega-KV, for GPUs to serve as special-purpose devices to address the bottleneck that multicore architectures cannot break. Our evaluation results show that Mega-KV advances the state of the art of IMKVs by significantly boosting performance up to 160+ MOPS.

Mega-KV is open source software. The source code can be downloaded from http://kay21s.github.io/megakv.

10. ACKNOWLEDGMENT

We thank the reviewers for their feedback. This work was supported in part by the National Science Foundation under grants CCF-0913050, OCI-1147522, and CNS-1162165.

11. REFERENCES

[1] Intel DPDK. http://dpdk.org/.
[2] Memcached. http://memcached.org/.
[3] Redis. http://redis.io/.
[4] S. R. Agrawal, V. Pistol, J. Pang, J. Tran, D. Tarjan, and A. R. Lebeck. Rhythm: Harnessing data parallel hardware for server workloads. In ASPLOS, pages 19–34, 2014.
[5] D. A. Alcantara, A. Sharf, F. Abbasinejad, S. Sengupta, M. Mitzenmacher, J. D. Owens, and N. Amenta. Real-time parallel hashing on the GPU. In SIGGRAPH Asia, pages 154:1–154:9, 2009.
[6] B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny. Workload analysis of a large-scale key-value store. In SIGMETRICS, pages 53–64, 2012.
[7] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In SoCC, pages 143–154, 2010.
[8] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Third Edition. The MIT Press, 2009.
[9] J. DeBrabant, A. Pavlo, S. Tu, M. Stonebraker, and S. Zdonik. Anti-caching: A new approach to database management system architecture. In PVLDB, 2013.
[10] U. Erlingsson, M. Manasse, and F. McSherry. A cool and practical alternative to traditional hash tables. In WDAS, pages 1–6, 2006.
[11] B. Fan, D. G. Andersen, and M. Kaminsky. MemC3: Compact and concurrent memcache with dumber caching and smarter hashing. In NSDI, pages 371–384, 2013.
[12] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. Popescu, A. Ailamaki, and B. Falsafi. A case for specialized processors for scale-out workloads. In Micro, pages 31–42, 2014.
[13] J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger. Quickly generating billion-record synthetic databases. In SIGMOD, pages 243–252, 1994.
[14] S. Han, K. Jang, K. Park, and S. Moon. PacketShader: A GPU-accelerated software router. In SIGCOMM, pages 195–206, 2010.
[15] B. He and J. X. Yu. High-throughput transaction executions on graphics processors. In PVLDB, 2011.
[16] M. Heimel, M. Saecker, H. Pirk, S. Manegold, and V. Markl. Hardware-oblivious parallelism for in-memory column-stores. In PVLDB, pages 709–720, 2013.
[17] T. Hetherington, T. Rogers, L. Hsu, M. O'Connor, and T. Aamodt. Characterizing and evaluating a key-value store application on heterogeneous CPU-GPU systems. In ISPASS, pages 88–98, 2012.
[18] E. Y. Jeong, S. Woo, M. Jamshed, H. Jeong, S. Ihm, D. Han, and K. Park. mTCP: A highly scalable user-level TCP stack for multicore systems. In NSDI, 2014.
[19] R. Kapoor, G. Porter, M. Tewari, G. M. Voelker, and A. Vahdat. Chronos: Predictable low latency for data center applications. In SoCC, pages 9:1–9:14, 2012.
[20] T. Leng, R. Ali, J. Hsieh, V. Mashayekhi, and R. Rooholamini. An empirical study of hyper-threading in high performance computing clusters. Linux HPC Revolution, 2002.
[21] H. Lim, D. Han, D. G. Andersen, and M. Kaminsky. MICA: A holistic approach to fast in-memory key-value storage. In NSDI, pages 429–444, 2014.
[22] Y. Mao, E. Kohler, and R. T. Morris. Cache craftiness for fast multicore key-value storage. In EuroSys, pages 183–196, 2012.
[23] Z. Metreveli, N. Zeldovich, and M. F. Kaashoek. CPHash: A cache-partitioned hash table. In PPoPP, pages 319–320, 2012.
[24] C. Mitchell, Y. Geng, and J. Li. Using one-sided RDMA reads to build a fast, CPU-efficient key-value store. In USENIX ATC, pages 103–114, 2013.
[25] R. Nishtala, H. Fugal, S. Grimm, M. Kwiatkowski, H. Lee, H. C. Li, R. McElroy, M. Paleczny, D. Peek, P. Saab, D. Stafford, T. Tung, and V. Venkataramani. Scaling Memcache at Facebook. In NSDI, pages 385–398, 2013.
[26] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazieres, S. Mitra, A. Narayanan, G. Parulkar, M. Rosenblum, S. M. Rumble, E. Stratmann, and R. Stutsman. The case for RAMClouds: Scalable high-performance storage entirely in DRAM. SIGOPS Oper. Syst. Rev., pages 92–105, 2010.
[27] R. Pagh and F. F. Rodler. Cuckoo hashing. Journal of Algorithms, pages 122–144, 2003.
[28] H. Pirk, S. Manegold, and M. Kersten. Waste not... efficient co-processing of relational data. In ICDE, pages 508–519, 2014.
[29] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In PPoPP, pages 73–82, 2008.
[30] S. Tu, W. Zheng, E. Kohler, B. Liskov, and S. Madden. Speedy transactions in multicore in-memory databases. In SOSP, 2013.
[31] K. Wang, X. Ding, R. Lee, S. Kato, and X. Zhang. GDM: Device memory management for GPGPU computing. In SIGMETRICS, pages 533–545, 2014.
[32] K. Wang, Y. Huai, R. Lee, F. Wang, X. Zhang, and J. H. Saltz. Accelerating pathology image data cross-comparison on CPU-GPU hybrid systems. In PVLDB, pages 1543–1554, 2012.
[33] K. Wang, K. Zhang, Y. Yuan, S. Ma, R. Lee, X. Ding, and X. Zhang. Concurrent analytical query processing with GPUs. In PVLDB, pages 1011–1022, 2014.
[34] H. Wu, G. Diamos, S. Cadambi, and S. Yalamanchili. Kernel Weaver: Automatically fusing database primitives for efficient GPU computation. In MICRO, pages 107–118, 2012.
[35] Y. Yuan, R. Lee, and X. Zhang. The yin and yang of processing data warehousing queries on GPU devices. In PVLDB, pages 817–828, 2013.
[36] H. Zhang, B. M. Tudor, G. Chen, and B. C. Ooi. Efficient in-memory data management: An analysis. In PVLDB, 2014.
