
  • High-Performance Key-Value Store on OpenSHMEM

    Huansong Fu*, Manjunath Gorentla Venkata†, Ahana Roy Choudhury*, Neena Imam†, Weikuan Yu*

    *Florida State University    †Oak Ridge National Laboratory

  • S-2

    Outline

    • Background
    • SHMEMCache
      – Overview
      – Challenges
      – Design
    • Experiments
    • Conclusion

  • S-3

    Distributed In-memory Key-Value Store

    • Caches popular key-value (KV) pairs in memory in a distributed manner for fast access (mainly reads).
      – E.g., Memcached used by Facebook serves more than 1 billion requests per second.
    • Client-server architecture.
      – Client and server communicate over TCP/IP.

    [Figure: the client issues key-value operations SET(key, value) and GET(key) to the servers; server memory is fast, while the backend database on HDD is slow.]

  • S-4

    Memcached Structure

    • Two-stage hashing to locate a KV pair
      – The 1st stage hashes to a server and the 2nd stage to a hash entry in that server (sketched after this slide).
      – A hash entry contains pointers to locate KV pairs.
      – The server chases chained pointers to find the desired KV pair.
    • Storage and eviction policy
      – Memory storage is divided into KV blocks of various sizes.
      – A KV pair is stored in the smallest possible KV block.
      – Least Recently Used (LRU) eviction of KV pairs.

    [Figure: the client hashes to a server (1st stage) and to a hash entry (2nd stage); the hash entry chains to KV blocks; evicted KV pairs leave the store.]
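    To make the two-stage lookup concrete, here is a small, hypothetical C sketch (not Memcached's actual code): one hash of the key picks the server, and a second hash (here, the upper bits of the same value) picks the hash entry inside that server. The hash function, server count and table size are illustrative assumptions.

    /* Hypothetical two-stage hashing sketch; NUM_SERVERS and
     * ENTRIES_PER_SERVER are assumed values, and FNV-1a stands in
     * for whatever hash a real deployment uses. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_SERVERS         4
    #define ENTRIES_PER_SERVER  1024

    static uint64_t fnv1a(const char *key)
    {
        uint64_t h = 14695981039346656037ULL;
        for (; *key; key++) {
            h ^= (uint8_t)*key;
            h *= 1099511628211ULL;
        }
        return h;
    }

    int main(void)
    {
        const char *key = "user:42";
        uint64_t h = fnv1a(key);

        int server = (int)(h % NUM_SERVERS);                 /* 1st stage: pick a server     */
        int entry  = (int)((h >> 32) % ENTRIES_PER_SERVER);  /* 2nd stage: pick a hash entry */

        printf("key %s -> server %d, hash entry %d\n", key, server, entry);
        return 0;
    }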

  • S-5

    PGAS and OpenSHMEM

    • Partitioned Global Address Space (PGAS) enables a large memory address space for parallel programming.
      – E.g., UPC, CAF, SHMEM.
    • OpenSHMEM is a recent standardization effort of SHMEM.
      – Participating processes are called Processing Elements (PEs).
      – A symmetric memory space that is globally visible.
        • Variables have the same name and the same address on different PEs.
      – Communication through one-sided operations (sketched after this slide).
        • Put: write to symmetric memory.
        • Get: read from symmetric memory.
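    A minimal, hedged OpenSHMEM sketch of these ideas in C, assuming OpenSHMEM 1.3-era calls (shmem_malloc, shmem_long_put/get, shmem_quiet): PE 0 writes a value into PE 1's symmetric memory with a one-sided Put and reads it back with a Get, without PE 1 participating.

    /* Minimal OpenSHMEM sketch (illustrative; OpenSHMEM 1.3-style API):
     * symmetric allocation plus one-sided Put/Get between two PEs.
     * Run with at least 2 PEs, e.g. "oshrun -np 2 ./a.out". */
    #include <shmem.h>
    #include <stdio.h>

    int main(void)
    {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* Symmetric: same name and offset on every PE. */
        long *slot = shmem_malloc(sizeof(long));
        *slot = -1;
        shmem_barrier_all();

        if (me == 0 && npes > 1) {
            long v = 42, back = 0;
            shmem_long_put(slot, &v, 1, 1);    /* Put: write into PE 1's memory */
            shmem_quiet();                     /* wait for remote completion    */
            shmem_long_get(&back, slot, 1, 1); /* Get: read it back from PE 1   */
            printf("PE 0 read %ld from PE 1\n", back);
        }

        shmem_barrier_all();
        shmem_free(slot);
        shmem_finalize();
        return 0;
    }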

  • S-6

    Our Objective

    • OpenSHMEM's feature set is highly suitable for building a distributed in-memory key-value store.
      – Similar memory-style addressing: the client "sees" the memory storage.
      – One-sided communication relieves the server of participation in communication.
      – Portability on both commodity servers and HPC systems.
    • We propose SHMEMCache, a high-performance key-value store built on top of OpenSHMEM.

  • S-7

    Outline

    • Background
    • SHMEMCache
      – Overview
      – Challenges
      – Design
    • Experiments
    • Conclusion

  • S-8

    Structure of SHMEMCache

    • The server stores KV pairs in symmetric memory.
    • The client performs two types of KV operations.
      – Direct KV operation: the client directly accesses KV pairs.
      – Active KV operation: the client asks the server to conduct the KV operation by writing a message into the server's message chunks (sketched after this slide).
    • There are three major challenges in realizing this structure.

    [Figure: PE 0 (client) issues SET/GET; direct KV operations use OpenSHMEM one-sided communication to reach the hash table and KV blocks in PE 1's (server's) symmetric memory, active KV operations are written into message chunks, and the client keeps a pointer directory.]
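    A hedged sketch of the active path referenced above: the client deposits a small request into a symmetric message chunk on the server with a one-sided Put, and the server later polls and applies it. The message layout, one-slot-per-client scheme and opcode encoding are illustrative assumptions, not the SHMEMCache message format.

    #include <shmem.h>
    #include <string.h>

    #define MAX_CLIENTS 64              /* assumption: at most 64 client PEs  */

    typedef struct {
        int  op;                        /* assumed encoding: 0 = SET, 1 = GET */
        char key[32];
        char value[92];
    } kv_msg_t;

    /* Symmetric message chunks: one slot per client PE (simplified). */
    static kv_msg_t msg_chunks[MAX_CLIENTS];

    /* Client side: write an active SET request into the server's chunk. */
    void active_set(int server_pe, const char *key, const char *value)
    {
        kv_msg_t req = { .op = 0 };
        strncpy(req.key, key, sizeof(req.key) - 1);
        strncpy(req.value, value, sizeof(req.value) - 1);

        /* One-sided write; the server polls msg_chunks[] and applies it. */
        shmem_putmem(&msg_chunks[shmem_my_pe()], &req, sizeof(req), server_pe);
        shmem_quiet();
    }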

  • S-9

    Challenges #1

    • Concurrent reads/writes cause read-write and write-write races.
      – Solution: read-friendly exclusive write.

    [Figure: client and server communicate on the same KV block (1. read-write & write-write races).]

  • S-10

    Read-Friendly Exclusive Write

    • Key issue: a concurrent write can invalidate both a read and another write.
      – Writes must be conducted exclusively.
    • Solving the write-write race: use locks so that writes are exclusive.
      – Locks are expensive, so they are not used for the much more frequent reads.
    • Solving the read-write race: a new optimistic concurrency control.
      – Existing solutions are heavyweight: checksums, cacheline versioning, etc.
      – We propose an efficient head-tail versioning mechanism.
    • Locking and versioning operations are combined to minimize overhead.

  • S-11

    Read-Friendly Exclusive Write (cont.)

    • Structure of a KV block: head version | KV pair | tail version + tag + lock bit | unused space.
    • Write operation (sketched after this slide):
      – (1) an atomic CAS for locking and writing the tail version; (2) a Put for writing the KV content and unlocking; (3) a Put for writing the head version.
    • Read operation:
      – A single Get for reading the whole KV block.

    [Figure: timeline of two writers and two readers. Writer A locks, writes v1 and unlocks; Writer B fails to lock, then locks and writes v2 (exclusive write); readers observe v0 until the write completes and v1 afterwards.]
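    A minimal C sketch of this write/read protocol with OpenSHMEM 1.3-style calls, as referenced above. The block layout, payload size and the placement of the lock bit inside the tail word are assumptions for illustration (the tag is omitted), not the exact SHMEMCache format.

    #include <shmem.h>
    #include <stddef.h>
    #include <string.h>

    #define KV_PAYLOAD 56
    #define LOCK_BIT   (1L << 62)     /* assumed: lock flag in a high bit of the tail  */

    typedef struct {
        long head_version;            /* written last by the writer                    */
        char kv[KV_PAYLOAD];          /* key + value content                           */
        long tail;                    /* tail version (+ lock bit; tag omitted)        */
    } kv_block_t;

    static kv_block_t block;          /* symmetric: same address on every PE           */

    /* Writer: (1) CAS locks and publishes the tail version, (2) a Put writes
     * the content and clears the lock, (3) a Put publishes the head version. */
    int direct_set(int server_pe, const char *kv, long old_ver, long new_ver)
    {
        long seen = shmem_long_cswap(&block.tail, old_ver,
                                     new_ver | LOCK_BIT, server_pe);
        if (seen != old_ver)
            return -1;                /* another writer holds the block                */

        kv_block_t tmp;
        memcpy(tmp.kv, kv, KV_PAYLOAD);
        tmp.tail = new_ver;           /* lock bit cleared                              */
        size_t off = offsetof(kv_block_t, kv);
        shmem_putmem((char *)&block + off, (char *)&tmp + off,
                     sizeof(kv_block_t) - off, server_pe);
        shmem_fence();                /* keep the two Puts to this PE ordered          */

        shmem_long_put(&block.head_version, &new_ver, 1, server_pe);
        shmem_quiet();
        return 0;
    }

    /* Reader: one Get of the whole block; consistent iff head == tail
     * and the lock bit is clear (retry otherwise). */
    int direct_get(int server_pe, kv_block_t *out)
    {
        shmem_getmem(out, &block, sizeof(block), server_pe);
        return (out->head_version == out->tail) ? 0 : -1;
    }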

  • S-12

    Challenges #2

    • Concurrent reads/writes cause read-write and write-write races.
    • Remotely locating a KV pair incurs costly remote pointer chasing.
      – Solution: set-associative hash table and pointer directory.

    [Figure: 1. read-write & write-write races; 2. remote pointer chasing (the client communicates with a hash entry that chains to KV blocks).]

  • S-13

    Mitigating Remote Pointer Chasing

    • Set-associative hash table
      – Multiple pointers are fetched with one hash lookup.
      – If the pointer is not found in a hash entry, only the server chases pointers.
    • Pointer directory (both structures are sketched after this slide)
      – A new pointer is stored after the first read/write.
      – Recency and count indicate how recently and how frequently a KV pair has been accessed; they help decide which pointers to keep in the directory.

    [Figure: hash table with a main entry and sub-entries, each holding a tag and a pointer into the K-V store; client-side pointer directory entries hold tag, pointer, recency, count and unused bits.]
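    A hedged C sketch of the two lookup structures; the entry counts, tag width and field sizes are illustrative assumptions rather than the exact SHMEMCache layout.

    #include <stdint.h>

    #define SUB_ENTRIES 8            /* pointers per set-associative bucket   */
    #define DIR_ENTRIES 256          /* client-side pointer directory size    */

    /* One hash-table bucket: a single Get fetches every sub-entry, so the
     * client matches the tag locally instead of chasing chained pointers.   */
    typedef struct {
        struct {
            uint16_t tag;            /* short fingerprint of the key          */
            uint64_t ptr;            /* offset of the KV block                */
        } sub[SUB_ENTRIES];
    } hash_entry_t;

    /* Client-side pointer directory entry: caches a remote pointer plus
     * recency and count, which decide whether the pointer stays cached.     */
    typedef struct {
        uint16_t tag;
        uint64_t ptr;
        uint32_t recency;            /* coarse time range of the last access  */
        uint32_t count;              /* how often the KV pair was accessed    */
    } dir_entry_t;

    static dir_entry_t pointer_directory[DIR_ENTRIES];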

  • S-14

    Challenges #3

    • Concurrent reads/writes cause read-write and write-write races.
    • Remotely locating a KV pair incurs costly remote pointer chasing.
    • The server and the client are unaware of each other's access and eviction actions.
      – Solution: coarse-grained cache management.

    [Figure: 1. read-write & write-write races; 2. remote pointer chasing; 3. unawareness of access and eviction (the server is unaware of client accesses and the client is unaware of evicted KV blocks).]

  • S-15

    Coarse-grained Cache Management

    • Recency is relaxed from a time point to a time range.
    • Non-intrusive recency update (sketched after this slide)
      – When a KV pair's recency changes, the client updates it directly with an atomic CAS.
    • Batch eviction and lazy invalidation
      – The server evicts the KV pairs with the oldest recency in batches and notifies clients of the new expiration bar. KV pairs with higher recency than the evicted batch are kept and updated, and new KV pairs are inserted at the top.
      – Clients lazily invalidate their cached pointers.

    [Figure: server-side recency list from 0 to 1000 with an expiration bar at 100; batch eviction removes the oldest entries, new pairs are inserted at the top, and the client's pointer directory receives non-intrusive updates and performs lazy invalidation.]
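    A minimal sketch of the non-intrusive recency update and lazy invalidation, assuming each KV pair has a symmetric recency word and the client holds the latest expiration bar announced by the server; the encoding and helper names are assumptions, not the SHMEMCache implementation.

    #include <shmem.h>

    static long recency_word;        /* symmetric: recency field of one KV pair    */
    static long expiration_bar;      /* newest recency evicted in the last batch   */

    /* Client: bump the pair's recency to the current time range with one
     * atomic CAS; the server CPU never gets involved.  Losing the race just
     * means another client already refreshed it. */
    void touch_kv(int server_pe, long observed_range, long current_range)
    {
        if (observed_range != current_range)
            (void)shmem_long_cswap(&recency_word, observed_range,
                                   current_range, server_pe);
    }

    /* Client: lazy invalidation of a cached pointer once the server has
     * announced a new expiration bar after a batch eviction. */
    int pointer_still_valid(long cached_recency)
    {
        return cached_recency > expiration_bar;
    }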

  • S-16

    Outline

    • Background
    • SHMEMCache
      – Overview
      – Challenges
      – Design
    • Experiments
    • Conclusion

  • S-17

    Experimental Setup

    • Innovation cluster
      – An in-house cluster with 21 dual-socket server nodes, each featuring 10 Intel Xeon cores and 64 GB of memory. All nodes are connected through an FDR InfiniBand interconnect with ConnectX-3 NICs.
    • Titan supercomputer
      – Titan is a hybrid-architecture Cray XK7 system consisting of 18,688 nodes, each equipped with a 16-core AMD Opteron CPU and 32 GB of DDR3 memory.
    • Benchmarks: microbenchmark and YCSB.
    • OpenSHMEM versions
      – In-house cluster: Open MPI v1.10.3
      – Titan: Cray SHMEM v7.4.0

  • S-18

    Latency

    • Innovation cluster
      – SHMEMCache outperforms Memcached by 25x and 20x in SET and GET latency, respectively.
      – SHMEMCache outperforms HiBD by 128% and 16% in SET and GET latency, respectively.

    [Fig. 1: SET latency; Fig. 2: GET latency. Time (µs) vs. value size (1 B to 512 KB) for Memcached, HiBD and SHMEMCache.]

  • S-19

    Latency (cont.)

    • Titan supercomputer
      – Direct KV operations are consistently faster than active ones.
      – SHMEMCache achieves performance on Titan comparable to the Innovation cluster.

    [Fig. 1: SET latency; Fig. 2: GET latency. Time (µs) vs. value size (1 B to 512 KB) for active and direct KV operations.]

  • S-20

    Throughput

    • Innovation cluster
      – SHMEMCache outperforms Memcached by 19x and 33x for 32-byte and 4-KB SET, and by 14x and 30x for 32-byte and 4-KB GET.

    [Fig. 1: SET throughput; Fig. 2: GET throughput. Throughput (Kops/s) vs. number of clients (1 to 16) for Memcached and SHMEMCache with 32-byte and 4-KB values.]

  • S-21

    Throughput (cont.)

    • Titan supercomputer
      – SHMEMCache scales well on Titan, up to 1,024 machines.

    [Fig. 1: SET throughput; Fig. 2: GET throughput. Throughput (Kops/s) vs. number of clients (1 to 1,024) for 32-byte and 4-KB values.]

  • S-22

    Hit Ratio of Pointer Directory

    • Varying YCSB workloads and number of entries in the pointer directory.
    • A larger pointer directory increases the hit ratio.
      – The marginal benefit shrinks because the extra entries cover less popular KV pairs.
      – There is no need to maximize its size at the cost of low space efficiency.

    [Figure: pointer directory hit ratio with 128, 256 and 512 entries under workloads of 100% SET, 50% SET + 50% GET, 5% SET + 95% GET, and 100% GET.]

  • S-23

    Conclusion

    • SHMEMCache: a high-performance distributed in-memory KV store built on top of OpenSHMEM.
    • Novel designs tackle three major challenges.
      – Read-friendly exclusive write for resolving read-write and write-write races.
      – Set-associative hash table and pointer directory for mitigating remote pointer chasing.
      – Coarse-grained cache management for reducing the high cost of informing the server/client about access/eviction operations.
    • Evaluation results show that SHMEMCache achieves high performance at a scale of 1,024 nodes.

  • S-24

    Acknowledgment

  • S-25

    Thank You and Questions?