Page 1

Cache Craftiness for Fast Multicore Key-Value Storage

Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)

Page 2

Let’s build a fast key-value store

• KV store systems are important
  – Google Bigtable, Amazon Dynamo, Yahoo! PNUTS

• Single-server KV performance matters
  – Reduce cost
  – Easier management

• Goal: fast KV store for single multi-core server
  – Assume all data fits in memory
  – Redis, VoltDB

Page 3

Feature wish list

• Clients send queries over network

• Persist data across crashes

• Range query

• Perform well on various workloads
  – Including hard ones!

Page 4

Hard workloads

• Skewed key popularity
  – Hard! (Load imbalance)

• Small key-value pairs
  – Hard!

• Many puts
  – Hard!

• Arbitrary keys
  – String (e.g. www.wikipedia.org/...) or integer
  – Hard!

Page 5

First try: fast binary tree

Figure: binary tree throughput (req/sec, millions); 140M short KV, put-only, @16 cores.

• Network/disk not bottlenecks
  – High-BW NIC
  – Multiple disks

• 3.7 million queries/second!

• Better?
  – What bottleneck remains?
  – DRAM!

Page 6

Cache craftiness goes 1.5X farther

Figure: throughput (req/sec, millions) of Binary vs. Masstree; 140M short KV, put-only, @16 cores.

Cache-craftiness: careful use of cache and memory

Page 7

Contributions

• Masstree achieves millions of queries per second across various hard workloads
  – Skewed key popularity
  – Various read/write ratios
  – Variable, relatively long keys
  – Data >> on-chip cache

• New ideas
  – Trie of B+trees, permuter, etc.

• Full system
  – New ideas + best practices (network, disk, etc.)

Page 8

Experiment environment

• A 16-core server
  – three active DRAM nodes

• Single 10Gb Network Interface Card (NIC)

• Four SSDs

• 64 GB DRAM

• A cluster of load generators

Page 9

Potential bottlenecks in Masstree

Diagram: a single multi-core server with its potential bottlenecks labeled: network, disk (logs), and DRAM.

Page 10

NIC bottleneck can be avoided

• Single 10Gb NIC
  – Multiple queues, scale to many cores
  – Target: 100B KV pair => 10M req/sec

• Use network stack efficiently
  – Pipeline requests
  – Avoid copying cost
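The 10M req/sec target can be sanity-checked with bandwidth arithmetic. This is a minimal sketch, assuming each request moves exactly one 100-byte KV pair and ignoring packet headers (both simplifications that are not stated on the slide):

```python
# Back-of-envelope bound on request rate for a single 10Gb NIC.
# Assumption: one 100-byte KV pair per request, no protocol overhead.
NIC_BITS_PER_SEC = 10 * 10**9
KV_PAIR_BYTES = 100


def max_requests_per_sec(nic_bits_per_sec: int, bytes_per_request: int) -> float:
    """Upper bound on the requests/sec that the link can carry."""
    return nic_bits_per_sec / 8 / bytes_per_request


bound = max_requests_per_sec(NIC_BITS_PER_SEC, KV_PAIR_BYTES)
print(f"{bound / 1e6:.1f}M req/sec")  # prints "12.5M req/sec"
```

Under these assumptions the raw link bound is 12.5M req/sec, so a 10M req/sec target leaves some headroom for headers and batching overhead.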

Page 11

Disk bottleneck can be avoided

• 10M puts/sec => 1GB logs/sec!

• Single disk:

                          Write throughput    Cost
  Mainstream disk         100-300 MB/sec      1 $/GB
  High-performance SSD    up to 4.4 GB/sec    > 40 $/GB

• Multiple disks: split log
  – See paper for details
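The log-bandwidth figure and the case for splitting the log can be reproduced with a short calculation. This sketch assumes each put logs about 100 bytes (the KV pair size used for the NIC target) and that disks deliver their quoted write throughput; the function names are illustrative:

```python
import math

PUTS_PER_SEC = 10_000_000
LOG_BYTES_PER_PUT = 100  # assumption: one 100B KV pair logged per put


def log_bytes_per_sec(puts_per_sec: int, bytes_per_put: int) -> int:
    """Log bandwidth needed to persist every put."""
    return puts_per_sec * bytes_per_put


def disks_needed(log_rate: int, disk_write_rate: int) -> int:
    """Number of disks required if the log is split evenly across them."""
    return math.ceil(log_rate / disk_write_rate)


rate = log_bytes_per_sec(PUTS_PER_SEC, LOG_BYTES_PER_PUT)
slow_disks = disks_needed(rate, 100 * 10**6)  # mainstream disk at 100 MB/sec
fast_disks = disks_needed(rate, 300 * 10**6)  # mainstream disk at 300 MB/sec
print(rate, slow_disks, fast_disks)  # prints "1000000000 10 4"
```

With mainstream disks at 100-300 MB/sec, the 1GB/sec log needs roughly 4 to 10 disks, which is why a single disk cannot keep up.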

Page 12

DRAM bottleneck – hard to avoid

Figure: throughput (req/sec, millions) of Binary vs. Masstree; 140M short KV, put-only, @16 cores.

Cache-craftiness goes 1.5X farther, including the cost of:
• Network
• Disk

Page 13

DRAM bottleneck – w/o network/disk

Figure: throughput (req/sec, millions) of Binary, 4-tree, B+tree, +Prefetch, +Permuter, and Masstree; 140M short KV, put-only, @16 cores.

Cache-craftiness goes 1.7X farther!

Page 14

DRAM latency – binary tree

Figure: binary tree throughput (req/sec, millions); 140M short KV, put-only, @16 cores.

Diagram: a binary tree (B over A and C, ..., Y over X and Z) annotated "serial DRAM latencies!" and "VoltDB".

10M keys => 2.7 us/lookup, 380K lookups/core/sec
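The lookup cost above can be related to tree depth with a rough model. This sketch assumes a perfectly balanced binary tree in which every level costs one dependent memory access; in practice the top levels are likely cached, so the per-level figure is only indicative:

```python
import math

KEYS = 10_000_000
LOOKUP_SECONDS = 2.7e-6  # 2.7 us/lookup, from the slide

# Assumption: balanced binary tree, one serial memory access per level.
levels = math.log2(KEYS)
ns_per_level = LOOKUP_SECONDS / levels * 1e9
lookups_per_core_per_sec = 1 / LOOKUP_SECONDS

print(f"levels: {levels:.1f}")
print(f"implied cost per level: {ns_per_level:.0f} ns")
print(f"lookups/core/sec: {lookups_per_core_per_sec:,.0f}")
```

The implied cost per level lands in the range of a typical DRAM access, and the lookup rate agrees with the slide's 380K lookups/core/sec up to rounding of the 2.7 us figure.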

Page 15

DRAM latency – Lock-free 4-way tree

• Concurrency: same as binary tree

• One cache line per node => 3 KV / 4 children

Diagram: root node X Y Z with child nodes A B …, …, …

½ as many levels as a binary tree

½ as many DRAM latencies as a binary tree
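The "half the levels" claim follows from log base 4 being half of log base 2. A minimal sketch, assuming perfectly full and balanced trees (`min_levels` is an illustrative helper, not a name from the talk):

```python
def min_levels(num_keys: int, fanout: int) -> int:
    """Levels needed by a full, balanced search tree with the given fanout.

    A node with `fanout` children stores `fanout - 1` keys, so a tree with
    h levels stores fanout**h - 1 keys.
    """
    levels, capacity = 0, 0
    while capacity < num_keys:
        capacity = capacity * fanout + (fanout - 1)
        levels += 1
    return levels


N = 140_000_000  # key count used in the benchmark slides
binary_levels = min_levels(N, 2)
four_way_levels = min_levels(N, 4)
print(binary_levels, four_way_levels)  # prints "28 14"
```

If every level costs one serial DRAM access, halving the levels halves the dependent misses per lookup.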

Page 16

4-tree beats binary tree by 40%

Figure: throughput (req/sec, millions) of Binary, 4-tree, B+tree, +Prefetch, +Permuter, and Masstree; 140M short KV, put-only, @16 cores.

Page 17

4-tree may perform terribly!

• Unbalanced: serial DRAM latencies
  – e.g. sequential inserts

• Want balanced tree w/ wide fanout

Diagram: a chain of nodes A B C -> D E F -> G H I: O(N) levels!
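The O(N) behavior is easy to reproduce with a toy model. The sketch below is a simplified, single-threaded 4-way tree that never rebalances (3 keys per node, children created only once a node is full); it is an assumption-laden illustration, not the lock-free 4-tree from the talk:

```python
import bisect
import random


class Node:
    def __init__(self):
        self.keys = []               # up to 3 sorted keys
        self.children = [None] * 4   # created lazily, never rebalanced


def insert(root: Node, key: int) -> int:
    """Insert key and return the depth (1 = root) at which it was stored."""
    node, depth = root, 1
    while True:
        if key in node.keys:
            return depth
        if len(node.keys) < 3:
            # Nodes get children only after they are full, so sorting in place is safe.
            bisect.insort(node.keys, key)
            return depth
        i = bisect.bisect_left(node.keys, key)
        if node.children[i] is None:
            node.children[i] = Node()
        node, depth = node.children[i], depth + 1


def height(keys) -> int:
    root = Node()
    return max(insert(root, k) for k in keys)


sequential_height = height(range(3000))

random.seed(1)
shuffled = list(range(3000))
random.shuffle(shuffled)
shuffled_height = height(shuffled)
print(sequential_height, shuffled_height)
```

With sequential inserts every new key falls into the rightmost child, so 3000 keys produce 1000 levels, while the same keys in random order give a far shallower tree.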

Page 18

B+tree – Wide and balanced

• Balanced!

• Concurrent main memory B+tree [OLFIT]
  – Optimistic concurrency control: version technique
  – Lookup/scan is lock-free
  – Puts hold ≤ 3 per-node locks
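The version technique can be illustrated with a small single-threaded simulation. This is a hedged sketch in the spirit of a seqlock: the even/odd version convention, `VersionedNode`, and `optimistic_read` are assumptions made for illustration and are not taken from [OLFIT] or from Masstree's code. A concurrent writer is simulated by a read callback that modifies the node in the middle of the first read.

```python
class VersionedNode:
    def __init__(self, keys):
        self.keys = list(keys)
        self.version = 0  # assumed convention: even = stable, odd = write in progress

    def begin_write(self):
        self.version += 1

    def end_write(self):
        self.version += 1


def optimistic_read(node, read_fn):
    """Run read_fn without locking; retry if the node changed underneath it."""
    attempts = 0
    while True:
        attempts += 1
        before = node.version
        if before % 2 == 1:
            continue  # a writer is active (only reachable with real concurrency)
        result = read_fn(node)
        if node.version == before:
            return result, attempts


node = VersionedNode(["A", "C", "D"])
interfered = []


def racy_snapshot(n):
    snapshot = list(n.keys)
    if not interfered:
        # Simulate a writer inserting "B" while the first read is in flight.
        interfered.append(True)
        n.begin_write()
        n.keys.insert(1, "B")
        n.end_write()
    return snapshot


result, attempts = optimistic_read(node, racy_snapshot)
print(result, attempts)  # prints "['A', 'B', 'C', 'D'] 2"
```

The first attempt sees a stale snapshot, detects the version change, and retries; readers never take a lock, but they pay for retries when writers are active.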

Page 19

Wide fanout B+tree is 11% slower!

Figure: throughput (req/sec, millions) of Binary, 4-tree, B+tree, +Prefetch, +Permuter, and Masstree; 140M short KV, put-only.

Fanout=15, fewer levels than 4-tree, but:

• # cache lines from DRAM >= 4-tree
  – 4-tree: each internal node is full
  – B+tree: nodes are ~75% full

• Serial DRAM latencies >= 4-tree

Page 20

B+tree – Software prefetch

• Same as [pB+-trees]

• Masstree: B+tree w/ fanout 15 => 4 cache lines

• Always prefetch whole node when accessed

• Result: one DRAM latency per node vs. 2, 3, or 4

Diagram: with prefetch, 4 lines cost the same latency as 1 line.
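A rough latency model shows why the wide B+tree only wins once whole nodes are prefetched. The sketch assumes 140M keys, a 4-tree that touches one cache line per level, and a fanout-15 B+tree whose nodes are about 75% full; it counts serial DRAM latencies per lookup and ignores caching of the upper levels, so the numbers are estimates only:

```python
import math

N = 140_000_000


def depth(num_keys: int, effective_fanout: float) -> float:
    """Approximate number of levels for a balanced tree."""
    return math.log(num_keys) / math.log(effective_fanout)


# 4-tree: one cache line, hence one serial DRAM latency, per level.
four_tree_latencies = depth(N, 4)

# B+tree: fanout 15 at ~75% occupancy; each node spans 4 cache lines.
bptree_levels = depth(N, 0.75 * 15)
without_prefetch = (2 * bptree_levels, 4 * bptree_levels)  # 2 to 4 lines per node
with_prefetch = bptree_levels                              # 1 latency per node

print(f"4-tree:               {four_tree_latencies:.1f}")
print(f"B+tree, no prefetch:  {without_prefetch[0]:.1f} to {without_prefetch[1]:.1f}")
print(f"B+tree, prefetch:     {with_prefetch:.1f}")
```

Under this model the B+tree without prefetch pays at least as many serial latencies as the 4-tree, while prefetching the whole node brings the count well below it.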

Page 21

B+tree with prefetch

Figure: throughput (req/sec, millions) of Binary, 4-tree, B+tree, +Prefetch, +Permuter, and Masstree; 140M short KV, put-only, @16 cores.

Beats 4-tree by 9%

Balanced beats unbalanced!

Page 22

Concurrent B+tree problem

• Lookups retry in case of a concurrent insert

• Lock-free 4-tree: not a problem
  – keys do not move around
  – but unbalanced

Diagram: insert(B) changes node A C D into A B C D by moving C and D; a lookup can observe the intermediate state!

Page 23

B+tree optimization - Permuter

• Keys are stored unsorted; the permuter defines their order within a tree node

• A concurrent lookup does not need to retry
  – Lookup uses permuter to search keys
  – Insert appears atomic to lookups

Diagram: insert(B) writes B into a free slot (A C D -> A C D B), then updates the permuter from 0 1 2 to 0 3 1 2.

Permuter: 64-bit integer
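The permuter idea can be sketched as a packed integer. The layout below (low 4 bits hold the key count, followed by 4-bit slot indices in sorted order) is an assumed encoding chosen for illustration; the slide only states that the permuter is a 64-bit integer, and all function names are hypothetical:

```python
MAX_KEYS = 15  # node width from the slides; 15 indices + 1 count nibble = 64 bits


def make_permuter(order):
    """Pack slot indices (listed in sorted-key order) into one integer."""
    p = len(order)
    for pos, slot in enumerate(order):
        p |= slot << (4 * (pos + 1))
    return p


def size(p: int) -> int:
    return p & 0xF


def slot_at(p: int, pos: int) -> int:
    return (p >> (4 * (pos + 1))) & 0xF


def insert_slot(p: int, pos: int, slot: int) -> int:
    """Return a new permuter with `slot` placed at sorted position `pos`."""
    n = size(p)
    assert n < MAX_KEYS and 0 <= pos <= n
    shift = 4 * (pos + 1)
    below = p & ((1 << shift) - 1)  # count nibble plus entries before pos
    above = p >> shift              # entries at pos and after
    q = (above << (shift + 4)) | (slot << shift) | below
    return (q & ~0xF) | (n + 1)


def sorted_view(keys, p: int):
    return [keys[slot_at(p, i)] for i in range(size(p))]


keys = ["A", "C", "D", None]
p = make_permuter([0, 1, 2])

keys[3] = "B"                    # step 1: write B into a free slot
before = sorted_view(keys, p)    # readers still see A C D
p = insert_slot(p, 1, 3)         # step 2: one integer store publishes the new order
after = sorted_view(keys, p)
print(before, after)
```

Because the new order becomes visible through a single integer update, a lookup sees either the old node contents or the new ones, never a half-shifted key array.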

Page 24

B+tree with permuter

Figure: throughput (req/sec, millions) of Binary, 4-tree, B+tree, +Prefetch, +Permuter, and Masstree; 140M short KV, put-only, @16 cores.

Improves throughput by 4%

Page 25

Performance drops dramatically when key length increases

Figure: throughput (req/sec, millions) vs. key length (8 to 48 bytes); short values, 50% updates, @16 cores, no logging. Keys differ in last 8B.

Why? The B+tree stores key suffixes indirectly, so each key comparison
• compares the full key
• incurs an extra DRAM fetch

Page 26

Masstree – Trie of B+trees

• Trie: a tree where each level is indexed by fixed-length key fragment

• Masstree: a trie with fanout 2^64, but each trie node is a B+tree

• Compress key prefixes!

Diagram: B+tree, indexed by k[0:7] -> B+tree, indexed by k[8:15] -> B+tree, indexed by k[16:23] -> …
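The layering can be sketched in a few lines. In this simplified illustration a Python dict stands in for each B+tree, a new layer is created for every 8-byte slice, and entries are tagged with whether the key ends in that layer; these are assumptions made for brevity, and the names `Layer`, `put`, and `get` are hypothetical rather than Masstree's API:

```python
SLICE = 8  # bytes of key consumed per trie layer


class Layer:
    """One trie node. A dict stands in for the B+tree (assumption for brevity)."""

    def __init__(self):
        self.entries = {}


def put(root: Layer, key: bytes, value) -> None:
    layer = root
    while len(key) > SLICE:
        head, key = key[:SLICE], key[SLICE:]
        # (head, False): the key continues into a deeper layer.
        layer = layer.entries.setdefault((head, False), Layer())
    # (key, True): the key ends in this layer.
    layer.entries[(key, True)] = value


def get(root: Layer, key: bytes):
    layer = root
    while len(key) > SLICE:
        layer = layer.entries.get((key[:SLICE], False))
        if layer is None:
            return None
        key = key[SLICE:]
    return layer.entries.get((key, True))


def layers_visited(key: bytes) -> int:
    """Number of layers a lookup for this key walks through."""
    return max(1, -(-len(key) // SLICE))


root = Layer()
put(root, b"www.wikipedia.org/a", 1)
put(root, b"www.wikipedia.org/b", 2)
put(root, b"www.wiki", 3)
print(get(root, b"www.wikipedia.org/b"), layers_visited(b"www.wikipedia.org/b"))
```

Keys that share a prefix share the upper layers, so the prefix is stored once and each layer only ever compares 8-byte fragments.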

Page 27

Case Study: Keys share P byte prefix – Better than single B+tree

• P/8 trie levels (one per 8-byte slice of the shared prefix)
• each has one node only

A single B+tree with 8B keys

                 Complexity    DRAM access
  Masstree       O(log N)      O(log N)
  Single B+tree  O(P log N)    O(P log N)

Page 28

Masstree performs better for long keys with prefixes

Figure: throughput (req/sec, millions) vs. key length (8 to 48 bytes) for Masstree and B+tree; short values, 50% updates, @16 cores, no logging.

8B key comparison vs. full key comparison

Page 29

Does trie of B+trees hurt short key performance?

Figure: throughput (req/sec, millions) of Binary, 4-tree, B+tree, +Prefetch, +Permuter, and Masstree; 140M short KV, put-only, @16 cores.

8% faster! More efficient code – internal nodes handle 8B keys only

Page 30

Evaluation

• How does Masstree compare to other systems?

• How does Masstree compare to partitioned trees?
  – How much do we pay for handling skewed workloads?

• How does Masstree compare with a hash table?
  – How much do we pay for supporting range queries?

• How does Masstree scale on many cores?

Page 31

Masstree performs well even with persistence and range queries

Figure: throughput (req/sec, millions) of MongoDB, VoltDB, Redis, Memcached, and Masstree; 20M short KV, uniform dist., read-only, @16 cores, w/ network. Labeled values: 0.04, 0.22.

Unfair: both have a richer data and query model

Memcached: not persistent and no range queries

Redis: no range queries

Page 32

Multi-core – Partition among cores?

• Multiple instances, one unique set of keys per inst.
  – Memcached, Redis, VoltDB

• Masstree: a single shared tree
  – each core can access all keys
  – reduced imbalance

Diagram: two copies of the example tree (B over A and C, ..., Y over X and Z).

Page 33

A single Masstree performs better for skewed workloads

Figure: throughput (req/sec, millions) vs. δ (0 to 9) for a single Masstree and for 16 partitioned Masstrees; 140M short KV, read-only, @16 cores, w/ network. One partition receives δ times more queries.

Annotations on the partitioned curve:
• No remote DRAM access, no concurrency control
• Partition: 80% idle time (1 partition: 40%, 15 partitions: 4%)
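The 40% / 4% split can be reproduced with a short calculation. This sketch assumes that "δ times more queries" means the hot partition receives (1 + δ) times the load of each other partition, an interpretation chosen because it matches the slide's numbers at δ = 9; `query_shares` is an illustrative helper:

```python
def query_shares(delta: float, partitions: int = 16):
    """Fraction of all queries hitting the hot partition and each other partition."""
    hot_weight = 1 + delta            # assumed meaning of "delta times more"
    total = hot_weight + (partitions - 1)
    return hot_weight / total, 1 / total


hot, cold = query_shares(9)
print(f"hot: {hot:.0%}, each other partition: {cold:.0%}")  # prints "hot: 40%, each other partition: 4%"
```

With one core handling 40% of the traffic, the other 15 cores spend most of their time waiting, which is the imbalance a single shared tree avoids.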

Page 34

Cost of supporting range queries

• Without range query? One can use hash table
  – No resize cost: pre-allocate a large hash table
  – Lock-free: update with cmpxchg
  – Only support 8B keys: efficient code
  – 30% full, each lookup = 1.1 hash probes

• Measured in the Masstree framework
  – 2.5X the throughput of Masstree

• Range query costs 2.5X in performance
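A pre-allocated, fixed-size table of the kind described above can be sketched as follows. The probing scheme (linear probing) and hash function (multiplicative hashing) are assumptions, and this single-threaded version does not model the cmpxchg-based updates; the probe count it measures depends on those choices and will not exactly match the 1.1 figure from the authors' implementation:

```python
import random


class FixedHashTable:
    """Pre-allocated open-addressing table for 64-bit integer keys (never resized)."""

    def __init__(self, capacity_bits: int):
        self.bits = capacity_bits
        self.mask = (1 << capacity_bits) - 1
        self.keys = [None] * (1 << capacity_bits)
        self.values = [None] * (1 << capacity_bits)

    def _slot(self, key: int) -> int:
        # Multiplicative hashing: multiply mod 2**64 and keep the top bits.
        return ((key * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF) >> (64 - self.bits)

    def put(self, key: int, value) -> None:
        i = self._slot(key)
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) & self.mask
        # A lock-free table would claim the slot with a compare-and-swap here.
        self.keys[i], self.values[i] = key, value

    def get(self, key: int):
        """Return (value, probes); value is None if the key is absent."""
        i, probes = self._slot(key), 1
        while self.keys[i] is not None:
            if self.keys[i] == key:
                return self.values[i], probes
            i = (i + 1) & self.mask
            probes += 1
        return None, probes


random.seed(0)
table = FixedHashTable(16)
sample = random.sample(range(1 << 62), int(0.3 * (1 << 16)))  # fill to 30%
for k in sample:
    table.put(k, k + 1)

avg_probes = sum(table.get(k)[1] for k in sample) / len(sample)
print(f"average probes per lookup: {avg_probes:.2f}")
```

At low occupancy most lookups finish on the first probe, which is why such a table outruns an ordered tree; the price is that it cannot answer range queries.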

Page 35

Scale to 12X on 16 cores

Figure: per-core Get throughput (req/sec/core, axis 0 to 700,000) vs. number of cores (1, 2, 4, 8, 16), compared with perfect scalability; short KV, w/o logging.

• Scale to 12X

• Put scales similarly

• Limited by the shared memory system

Page 36

Related work

• [OLFIT]: Optimistic Concurrency Control
• [pB+-trees]: B+tree with software prefetch
• [pkB-tree]: store fixed # of diff. bits inline
• [PALM]: lock-free B+tree, 2.3X as fast as [OLFIT]

• Masstree: first system to combine them, w/ new optimizations
  – Trie of B+trees, permuter

Page 37

Summary

• Masstree: a general-purpose high-performance persistent KV store

• 5.8 million puts/sec, 8 million gets/sec
  – More comparisons with other systems in paper

• Using cache-craftiness improves performance by 1.5X

Page 38

Thank you!