
  • High-Performance Key-Value Store on OpenSHMEM

    Huansong Fu*, Manjunath Gorentla Venkata†, Ahana Roy Choudhury*, Neena Imam†, Weikuan Yu*

    *Florida State University    †Oak Ridge National Laboratory

  • S-2

    Outline

    • Background
    • SHMEMCache
      – Overview
      – Challenges
      – Design
    • Experiments
    • Conclusion

  • S-3

    Distributed In-memory Key-Value Store

    • Caches popular key-value (KV) pairs in memory in a distributed manner for fast access (mainly reads).
      – E.g., Memcached used by Facebook serves more than 1 billion requests per second.
    • Client-server architecture.
      – Client and server communicate over TCP/IP.

    [Figure: the client issues key-value operations SET(key, value) and GET(key) to the servers; server memory is fast, while the backend database on HDD is slow.]

  • S-4

    Memcached Structure

    • Two-stage hashing to locate a KV pair
      – The 1st stage hashes to a server and the 2nd stage to a hash entry in that server (sketched after this slide).
      – A hash entry contains pointers to locate KV pairs.
      – The server chases chained pointers to find the desired KV pair.
    • Storage and eviction policy
      – Memory storage is divided into KV blocks of various sizes.
      – A KV pair is stored in the smallest possible KV block.
      – Least Recently Used (LRU) eviction of KV pairs.

    [Figure: the client hashes to a server (1st stage) and to a hash entry (2nd stage); the hash entry chains to KV blocks; evicted KV pairs leave the store.]
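    To make the two-stage lookup concrete, here is a small, hypothetical C sketch (not Memcached's actual code): one hash of the key picks the server, and a second hash (here, the upper bits of the same value) picks the hash entry inside that server. The hash function, server count and table size are illustrative assumptions.

    /* Hypothetical two-stage hashing sketch; NUM_SERVERS and
     * ENTRIES_PER_SERVER are assumed values, and FNV-1a stands in
     * for whatever hash a real deployment uses. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_SERVERS         4
    #define ENTRIES_PER_SERVER  1024

    static uint64_t fnv1a(const char *key)
    {
        uint64_t h = 14695981039346656037ULL;
        for (; *key; key++) {
            h ^= (uint8_t)*key;
            h *= 1099511628211ULL;
        }
        return h;
    }

    int main(void)
    {
        const char *key = "user:42";
        uint64_t h = fnv1a(key);

        int server = (int)(h % NUM_SERVERS);                 /* 1st stage: pick a server     */
        int entry  = (int)((h >> 32) % ENTRIES_PER_SERVER);  /* 2nd stage: pick a hash entry */

        printf("key %s -> server %d, hash entry %d\n", key, server, entry);
        return 0;
    }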

  • S-5

    PGAS and OpenSHMEM

    • Partitioned Global Address Space (PGAS) enables a large memory address space for parallel programming.
      – E.g., UPC, CAF, SHMEM.
    • OpenSHMEM is a recent standardization effort of SHMEM.
      – Participating processes are called Processing Elements (PEs).
      – A symmetric memory space that is globally visible.
        • Variables have the same name and the same address on different PEs.
      – Communication through one-sided operations (sketched after this slide).
        • Put: write to symmetric memory.
        • Get: read from symmetric memory.
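    A minimal, hedged OpenSHMEM sketch of these ideas in C, assuming OpenSHMEM 1.3-era calls (shmem_malloc, shmem_long_put/get, shmem_quiet): PE 0 writes a value into PE 1's symmetric memory with a one-sided Put and reads it back with a Get, without PE 1 participating.

    /* Minimal OpenSHMEM sketch (illustrative; OpenSHMEM 1.3-style API):
     * symmetric allocation plus one-sided Put/Get between two PEs.
     * Run with at least 2 PEs, e.g. "oshrun -np 2 ./a.out". */
    #include <shmem.h>
    #include <stdio.h>

    int main(void)
    {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* Symmetric: same name and offset on every PE. */
        long *slot = shmem_malloc(sizeof(long));
        *slot = -1;
        shmem_barrier_all();

        if (me == 0 && npes > 1) {
            long v = 42, back = 0;
            shmem_long_put(slot, &v, 1, 1);    /* Put: write into PE 1's memory */
            shmem_quiet();                     /* wait for remote completion    */
            shmem_long_get(&back, slot, 1, 1); /* Get: read it back from PE 1   */
            printf("PE 0 read %ld from PE 1\n", back);
        }

        shmem_barrier_all();
        shmem_free(slot);
        shmem_finalize();
        return 0;
    }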

  • S-6

    Our Objective

    • OpenSHMEM's feature set is highly suitable for building a distributed in-memory key-value store.
      – Similar memory-style addressing: the client "sees" the memory storage.
      – One-sided communication relieves the server of participation in communication.
      – Portability on both commodity servers and HPC systems.
    • We propose SHMEMCache, a high-performance key-value store built on top of OpenSHMEM.

  • S-7

    Outline

    • Background
    • SHMEMCache
      – Overview
      – Challenges
      – Design
    • Experiments
    • Conclusion

  • S-8

    Structure of SHMEMCache

    • The server stores KV pairs in symmetric memory.
    • The client performs two types of KV operations.
      – Direct KV operation: the client directly accesses KV pairs.
      – Active KV operation: the client asks the server to conduct the KV operation by writing a message into the server's message chunks (sketched after this slide).
    • There are three major challenges in realizing this structure.

    [Figure: PE 0 (client) issues SET/GET; direct KV operations use OpenSHMEM one-sided communication to reach the hash table and KV blocks in PE 1's (server's) symmetric memory, active KV operations are written into message chunks, and the client keeps a pointer directory.]
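    A hedged sketch of the active path referenced above: the client deposits a small request into a symmetric message chunk on the server with a one-sided Put, and the server later polls and applies it. The message layout, one-slot-per-client scheme and opcode encoding are illustrative assumptions, not the SHMEMCache message format.

    #include <shmem.h>
    #include <string.h>

    #define MAX_CLIENTS 64              /* assumption: at most 64 client PEs  */

    typedef struct {
        int  op;                        /* assumed encoding: 0 = SET, 1 = GET */
        char key[32];
        char value[92];
    } kv_msg_t;

    /* Symmetric message chunks: one slot per client PE (simplified). */
    static kv_msg_t msg_chunks[MAX_CLIENTS];

    /* Client side: write an active SET request into the server's chunk. */
    void active_set(int server_pe, const char *key, const char *value)
    {
        kv_msg_t req = { .op = 0 };
        strncpy(req.key, key, sizeof(req.key) - 1);
        strncpy(req.value, value, sizeof(req.value) - 1);

        /* One-sided write; the server polls msg_chunks[] and applies it. */
        shmem_putmem(&msg_chunks[shmem_my_pe()], &req, sizeof(req), server_pe);
        shmem_quiet();
    }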

  • S-9

    Challenges #1

    • Concurrent reads/writes cause read-write and write-write races.
      – Solution: read-friendly exclusive write.

    [Figure: client and server communicate on the same KV block (1. read-write & write-write races).]

  • S-10

    Read-Friendly Exclusive Write

    • Key issue: a concurrent write can invalidate both a read and another write.
      – Writes must be conducted exclusively.
    • Solving the write-write race: use locks so that writes are exclusive.
      – Locks are expensive, so they are not used for the much more frequent reads.
    • Solving the read-write race: a new optimistic concurrency control.
      – Existing solutions are heavyweight: checksums, cacheline versioning, etc.
      – We propose an efficient head-tail versioning mechanism.
    • Locking and versioning operations are combined to minimize overhead.

  • S-11

    Read-Friendly Exclusive Write (cont.)

    • Structure of a KV block: head version | KV pair | tail version + tag + lock bit | unused space.
    • Write operation (sketched after this slide):
      – (1) an atomic CAS for locking and writing the tail version; (2) a Put for writing the KV content and unlocking; (3) a Put for writing the head version.
    • Read operation:
      – A single Get for reading the whole KV block.

    [Figure: timeline of two writers and two readers. Writer A locks, writes v1 and unlocks; Writer B fails to lock, then locks and writes v2 (exclusive write); readers observe v0 until the write completes and v1 afterwards.]
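    A minimal C sketch of this write/read protocol with OpenSHMEM 1.3-style calls, as referenced above. The block layout, payload size and the placement of the lock bit inside the tail word are assumptions for illustration (the tag is omitted), not the exact SHMEMCache format.

    #include <shmem.h>
    #include <stddef.h>
    #include <string.h>

    #define KV_PAYLOAD 56
    #define LOCK_BIT   (1L << 62)     /* assumed: lock flag in a high bit of the tail  */

    typedef struct {
        long head_version;            /* written last by the writer                    */
        char kv[KV_PAYLOAD];          /* key + value content                           */
        long tail;                    /* tail version (+ lock bit; tag omitted)        */
    } kv_block_t;

    static kv_block_t block;          /* symmetric: same address on every PE           */

    /* Writer: (1) CAS locks and publishes the tail version, (2) a Put writes
     * the content and clears the lock, (3) a Put publishes the head version. */
    int direct_set(int server_pe, const char *kv, long old_ver, long new_ver)
    {
        long seen = shmem_long_cswap(&block.tail, old_ver,
                                     new_ver | LOCK_BIT, server_pe);
        if (seen != old_ver)
            return -1;                /* another writer holds the block                */

        kv_block_t tmp;
        memcpy(tmp.kv, kv, KV_PAYLOAD);
        tmp.tail = new_ver;           /* lock bit cleared                              */
        size_t off = offsetof(kv_block_t, kv);
        shmem_putmem((char *)&block + off, (char *)&tmp + off,
                     sizeof(kv_block_t) - off, server_pe);
        shmem_fence();                /* keep the two Puts to this PE ordered          */

        shmem_long_put(&block.head_version, &new_ver, 1, server_pe);
        shmem_quiet();
        return 0;
    }

    /* Reader: one Get of the whole block; consistent iff head == tail
     * and the lock bit is clear (retry otherwise). */
    int direct_get(int server_pe, kv_block_t *out)
    {
        shmem_getmem(out, &block, sizeof(block), server_pe);
        return (out->head_version == out->tail) ? 0 : -1;
    }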

  • S-12

    Challenges #2

    • Concurrent reads/writes cause read-write and write-write races.
    • Remotely locating a KV pair incurs costly remote pointer chasing.
      – Solution: set-associative hash table and pointer directory.

    [Figure: 1. read-write & write-write races; 2. remote pointer chasing (the client communicates with a hash entry that chains to KV blocks).]

  • S-13

    Mitigating Remote Pointer Chasing

    • Set-associative hash table
      – Multiple pointers are fetched with one hash lookup.
      – If the pointer is not found in a hash entry, only the server chases pointers.
    • Pointer directory (both structures are sketched after this slide)
      – A new pointer is stored after the first read/write.
      – Recency and count indicate how recently and how frequently a KV pair has been accessed; they help decide which pointers to keep in the directory.

    [Figure: hash table with a main entry and sub-entries, each holding a tag and a pointer into the K-V store; client-side pointer directory entries hold tag, pointer, recency, count and unused bits.]
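    A hedged C sketch of the two lookup structures; the entry counts, tag width and field sizes are illustrative assumptions rather than the exact SHMEMCache layout.

    #include <stdint.h>

    #define SUB_ENTRIES 8            /* pointers per set-associative bucket   */
    #define DIR_ENTRIES 256          /* client-side pointer directory size    */

    /* One hash-table bucket: a single Get fetches every sub-entry, so the
     * client matches the tag locally instead of chasing chained pointers.   */
    typedef struct {
        struct {
            uint16_t tag;            /* short fingerprint of the key          */
            uint64_t ptr;            /* offset of the KV block                */
        } sub[SUB_ENTRIES];
    } hash_entry_t;

    /* Client-side pointer directory entry: caches a remote pointer plus
     * recency and count, which decide whether the pointer stays cached.     */
    typedef struct {
        uint16_t tag;
        uint64_t ptr;
        uint32_t recency;            /* coarse time range of the last access  */
        uint32_t count;              /* how often the KV pair was accessed    */
    } dir_entry_t;

    static dir_entry_t pointer_directory[DIR_ENTRIES];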

  • S-14

    Challenges #3

    • Concurrent reads/writes cause read-write and write-write races.
    • Remotely locating a KV pair incurs costly remote pointer chasing.
    • The server and the client are unaware of each other's access and eviction actions.
      – Solution: coarse-grained cache management.

    [Figure: 1. read-write & write-write races; 2. remote pointer chasing; 3. unawareness of access and eviction (the server is unaware of client accesses and the client is unaware of evicted KV blocks).]

  • S-15

    Coarse-grained Cache Management

    • Recency is relaxed from a time point to a time range.
    • Non-intrusive recency update (sketched after this slide)
      – When a KV pair's recency changes, the client updates it directly with an atomic CAS.
    • Batch eviction and lazy invalidation
      – The server evicts the KV pairs with the oldest recency in batches and notifies clients of the new expiration bar. KV pairs with higher recency than the evicted batch are kept and updated, and new KV pairs are inserted at the top.
      – Clients lazily invalidate their cached pointers.

    [Figure: server-side recency list from 0 to 1000 with an expiration bar at 100; batch eviction removes the oldest entries, new pairs are inserted at the top, and the client's pointer directory receives non-intrusive updates and performs lazy invalidation.]
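    A minimal sketch of the non-intrusive recency update and lazy invalidation, assuming each KV pair has a symmetric recency word and the client holds the latest expiration bar announced by the server; the encoding and helper names are assumptions, not the SHMEMCache implementation.

    #include <shmem.h>

    static long recency_word;        /* symmetric: recency field of one KV pair    */
    static long expiration_bar;      /* newest recency evicted in the last batch   */

    /* Client: bump the pair's recency to the current time range with one
     * atomic CAS; the server CPU never gets involved.  Losing the race just
     * means another client already refreshed it. */
    void touch_kv(int server_pe, long observed_range, long current_range)
    {
        if (observed_range != current_range)
            (void)shmem_long_cswap(&recency_word, observed_range,
                                   current_range, server_pe);
    }

    /* Client: lazy invalidation of a cached pointer once the server has
     * announced a new expiration bar after a batch eviction. */
    int pointer_still_valid(long cached_recency)
    {
        return cached_recency > expiration_bar;
    }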

  • S-16

    Outline

    • Background
    • SHMEMCache
      – Overview
      – Challenges
      – Design
    • Experiments
    • Conclusion

  • S-17

    Experimental Setup

    • Innovation cluster
      – An in-house cluster with 21 dual-socket server nodes, each featuring 10 Intel Xeon cores and 64 GB of memory. All nodes are connected through an FDR InfiniBand interconnect with ConnectX-3 NICs.
    • Titan supercomputer
      – Titan is a hybrid-architecture Cray XK7 system consisting of 18,688 nodes, each equipped with a 16-core AMD Opteron CPU and 32 GB of DDR3 memory.
    • Benchmarks: microbenchmark and YCSB.
    • OpenSHMEM versions
      – In-house cluster: Open MPI v1.10.3
      – Titan: Cray SHMEM v7.4.0

  • S-18

    Latency

    • Innovation cluster
      – SHMEMCache outperforms Memcached by 25x and 20x in SET and GET latency, respectively.
      – SHMEMCache outperforms HiBD by 128% and 16% in SET and GET latency, respectively.

    [Fig. 1: SET latency; Fig. 2: GET latency. Time (µs) vs. value size (1 B to 512 KB) for Memcached, HiBD and SHMEMCache.]

  • S-19

    Latency (cont.)

    • Titan supercomputer
      – Direct KV operations are consistently faster than active ones.
      – SHMEMCache achieves performance on Titan comparable to the Innovation cluster.

    [Fig. 1: SET latency; Fig. 2: GET latency. Time (µs) vs. value size (1 B to 512 KB) for active and direct KV operations.]

  • S-20

    Throughput

    • Innovation cluster
      – SHMEMCache outperforms Memcached by 19x and 33x for 32-byte and 4-KB SET, and by 14x and 30x for 32-byte and 4-KB GET.

    [Fig. 1: SET throughput; Fig. 2: GET throughput. Throughput (Kops/s) vs. number of clients (1 to 16) for Memcached and SHMEMCache with 32-byte and 4-KB values.]

  • S-21

    Throughput (cont.)

    • Titan supercomputer
      – SHMEMCache scales well on Titan, up to 1,024 machines.

    [Fig. 1: SET throughput; Fig. 2: GET throughput. Throughput (Kops/s) vs. number of clients (1 to 1,024) for 32-byte and 4-KB values.]

  • S-22

    Hit Ratio of Pointer Directory

    • Varying YCSB workloads and number of entries in the pointer directory.
    • A larger pointer directory increases the hit ratio.
      – The marginal benefit shrinks because the extra entries cover less popular KV pairs.
      – There is no need to maximize its size at the cost of low space efficiency.

    [Figure: pointer directory hit ratio with 128, 256 and 512 entries under workloads of 100% SET, 50% SET + 50% GET, 5% SET + 95% GET, and 100% GET.]

  • S-23

    Conclusion

    • SHMEMCache: a high-performance distributed in-memory KV store built on top of OpenSHMEM.
    • Novel designs tackle three major challenges.
      – Read-friendly exclusive write for resolving read-write and write-write races.
      – Set-associative hash table and pointer directory for mitigating remote pointer chasing.
      – Coarse-grained cache management for reducing the high cost of informing the server/client about access/eviction operations.
    • Evaluation results show that SHMEMCache achieves high performance at a scale of 1,024 nodes.

  • S-24

    Acknowledgment

  • S-25

    Thank You and Questions?