Design Patterns for Tunable and Efficient SSD-based Indexes
Ashok Anand, Aaron Gember-Jacobson, Collin Engstrom, Aditya Akella
1
Large hash-based indexes
2
WAN optimizers [Anand et al. SIGCOMM '08]
De-duplication systems [Quinlan et al. FAST '02]
Video proxy [Anand et al. HotNets '12]
≈20K lookups and inserts per second (1Gbps link)
≥ 32GB hash table
Use of large hash-based indexes
3
WAN optimizers
De-duplication systems
Video proxy
Where to store the indexes?
4
[Comparison of storage media: SSD is 8x less …, 25x less …]
What’s the problem?
• Need domain/workload-specific optimizations for SSD-based index with ↑ performance and ↓ overhead
• Existing designs have…
  – Poor flexibility: target a specific point in the cost-performance spectrum
  – Poor generality: only apply to specific workloads or data structures
5
False assumption!
Our contributions
• Design patterns that ensure:
  – High performance
  – Flexibility
  – Generality
• Indexes based on these principles:
  – SliceHash
  – SliceBloom
  – SliceLSH
6
Outline
• Problem statement
• Limitations of state-of-the-art
• SSD architecture
• Parallelism-friendly design patterns
  – SliceHash (streaming hash table)
• Evaluation
7
State-of-the-art SSD-based index
8
• BufferHash [Anand et al. NSDI '10]
  – Designed for high throughput
[Diagram: BufferHash design: inserted (K,V) pairs go into an in-memory incarnation; full incarnations are flushed to flash; each flushed incarnation keeps an in-memory Bloom filter that tells a lookup which incarnations' pages to read]
4 bytes per K/V pair!
16 page reads in worst case! (average: ≈1)
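To make the lookup path concrete, here is a minimal Python sketch of the incarnation-plus-Bloom-filter scheme described above; the bloom_filter, read_page, and num_slots members are illustrative assumptions, not BufferHash's actual interfaces.

    # Sketch of a BufferHash-style lookup (illustrative only).
    # Every flushed incarnation keeps an in-memory Bloom filter; a lookup
    # reads at most one flash page per incarnation whose filter matches.
    def lookup(key, in_memory_incarnation, flash_incarnations):
        if key in in_memory_incarnation:           # 1. in-memory buffer, no flash I/O
            return in_memory_incarnation[key]
        for inc in flash_incarnations:             # 2. flushed incarnations, newest first
            if inc.bloom_filter.may_contain(key):  # false positives keep the average at ≈1 page read
                page = inc.read_page(hash(key) % inc.num_slots)  # one flash page read
                if key in page:
                    return page[key]
        return None  # worst case: one page read per incarnation (16 here)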
State-of-the-art SSD-based index
9
• SILT [Lim et al. SOSP '11]
  – Designed for low memory + high throughput
[Diagram: SILT design: a write-friendly log store and a hash store absorb inserts and are periodically merged into a sorted store with a compact in-memory index]
≈0.7 bytes per K/V pair
33 page reads in worst case! (average: 1)
High CPU usage!
Target specific workloads and objectives → poor flexibility and generality
Do not leverage internal parallelism
SSD Architecture
10
[Diagram: the SSD controller connects over multiple channels (Channel 1 … Channel 32) to flash memory packages (e.g., 128 of them); each package contains dies, each die contains planes, each plane has a data register and holds blocks, and each block holds pages]
How does the SSD architecture inform our design patterns?
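As a rough mental model of this hierarchy, a small Python sketch follows; the counts mirror the example values in the diagram (32 channels, 128 packages) and are not the geometry of any particular drive.

    # Illustrative model of the SSD hierarchy shown above:
    # controller -> channels -> packages -> dies -> planes -> blocks -> pages.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Block:
        pages: List[bytes]             # page: unit of reads

    @dataclass
    class Plane:
        data_register: bytearray       # staging buffer for page transfers
        blocks: List[Block]            # block: unit of erases

    @dataclass
    class Die:
        planes: List[Plane]

    @dataclass
    class Package:
        dies: List[Die]

    @dataclass
    class SSD:
        channels: List[List[Package]]  # e.g., 32 channels x 4 packages = 128 packages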
Four design principles
I. Store related entries on the same page
II. Write to the SSD at block granularity
III. Issue large reads and large writes
IV. Spread small reads across channels
11
SliceHash
I. Store related entries on the same page
• Many hash table incarnations, like BufferHash
12
[Diagram: without slicing, each flash page holds sequential slots from a specific incarnation, so a key that hashes to slot 5 may require a page read in every incarnation]
Multiple page reads per lookup!
I. Store related entries on the same page
• Many hash table incarnations, like BufferHash
• Slicing: store the same hash slot from all incarnations on the same page
13
[Diagram: with slicing, a slice (a specific slot from all incarnations) is stored on a single page]
Only 1 page read per lookup!
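A minimal Python sketch of the slicing idea, assuming a toy page layout in which the page for slot s is a list with one (key, value) entry per incarnation; read_slice_page and num_slots are illustrative names, not SliceHash's actual code.

    # Slicing: the page for slot s holds slot s from ALL incarnations,
    # so one page read answers a lookup regardless of incarnation count.
    def lookup(key, num_slots, read_slice_page):
        slot = hash(key) % num_slots
        page = read_slice_page(slot)      # exactly one flash page read
        for entry in reversed(page):      # newest incarnation last, so scan it first
            if entry is not None and entry[0] == key:
                return entry[1]
        return None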
II. Write to the SSD at block granularity
14
• Insert into a hash table incarnation in RAM
• Divide the hash table so all slices fit into one block
[Diagram: keys KA–KF accumulate in the in-memory incarnation; on the SSD, the corresponding slices form a SliceTable that occupies exactly one block]
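A back-of-the-envelope sketch of the "all slices fit into one block" constraint; the 128KB block and 16B slot sizes are assumed example values (matching the parameters used in the theoretical analysis later), not fixed by the design.

    # How many slots can one SliceTable (one flash block) hold?
    BLOCK_SIZE   = 128 * 1024   # bytes per block, assumed
    SLOT_SIZE    = 16           # bytes per key/value slot, assumed
    INCARNATIONS = 32           # one slice entry per incarnation

    slots_per_slicetable = BLOCK_SIZE // (SLOT_SIZE * INCARNATIONS)  # = 256
    # The hash table is partitioned into SliceTables of this size, so flushing
    # an in-memory incarnation touches exactly one block per partition.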
III. Issue large reads and large writes
15
[Diagram: flash packages with page registers attached to multiple channels. Chart: read throughput (MB/second) vs. read size (1KB–128KB); callouts: page size, channel parallelism, package parallelism]
III. Issue large reads and large writes
16
SSD assigns consecutive chunks (4 pages / 8KB) to different channels
[Chart: write throughput (MB/second) vs. number of threads for 128KB, 256KB, and 512KB writes; callouts: block size, channel parallelism]
III. Issue large reads and large writes
17
• Read entire SliceTable into RAM
• Write entire SliceTable onto SSD
[Diagram: when the in-memory incarnation fills, the whole SliceTable (one block) is read into RAM, the buffered entries (KA, KD, KF) are added as the newest incarnation's slices, and the whole SliceTable is written back to the SSD]
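A hedged Python sketch of this read-modify-write path, assuming dict-like incarnations and read_block/write_block helpers; evicting the oldest incarnation on each flush is an assumption for illustration.

    # Flush the in-memory incarnation for one partition of the table.
    def flush_partition(in_memory_incarnation, block_id, read_block, write_block):
        slicetable = read_block(block_id)                    # one large, block-sized read
        for slot, entries in enumerate(slicetable):          # entries = the slice for this slot
            entries.pop(0)                                   # drop the oldest incarnation (assumption)
            entries.append(in_memory_incarnation.get(slot))  # newest incarnation goes last
        write_block(block_id, slicetable)                    # one large, block-sized write
        in_memory_incarnation.clear()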
IV. Spread small reads across channels
• Recall: SSD writes consecutive chunks (4 pages) of a block to different channels
  – Use existing techniques to reverse engineer [Chen et al. HPCA '11]
  – SSD uses write-order mapping
18
channel for chunk i = i modulo (# channels)
IV. Spread small reads across channels
19
• Estimate the channel using the slot # and the chunk size:
  (slot # × pages per slot) modulo (# channels × pages per chunk)
  gives the slot's page offset within one stripe of chunks; dividing by pages per chunk yields the channel
• Attempt to schedule 1 read per channel
[Diagram: pending lookups for different slots are mapped to their estimated channels (Channel 0–3) so that, where possible, one read is issued per channel in parallel]
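A minimal sketch of the channel estimate and the one-read-per-channel scheduling under the write-order-mapping assumption above; the constants are the slides' example values (4-page / 8KB chunks), and one page per slot is assumed.

    # Estimate which channel serves a slot's page, assuming consecutive
    # chunks (4 pages / 8KB) of a block are striped across the channels.
    NUM_CHANNELS    = 32   # example value
    PAGES_PER_CHUNK = 4
    PAGES_PER_SLOT  = 1    # assumed: one slice page per slot

    def channel_for_slot(slot):
        offset = (slot * PAGES_PER_SLOT) % (NUM_CHANNELS * PAGES_PER_CHUNK)
        return offset // PAGES_PER_CHUNK   # equals chunk index mod # channels

    def schedule_reads(pending_slots):
        # Greedily pick at most one pending read per estimated channel.
        batch, used, remaining = [], set(), []
        for slot in pending_slots:
            ch = channel_for_slot(slot)
            (remaining if ch in used else batch).append(slot)
            used.add(ch)
        return batch, remaining   # issue batch in parallel; retry remaining next round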
SliceHash summary
20
[Diagram: the complete SliceHash design: inserts go to the in-memory incarnation; on the SSD, each SliceTable occupies one block and each of its pages holds a slice (a specific slot from all incarnations); the whole SliceTable is read and written when updating]
Evaluation: throughput vs. overhead
21
Setup: 128GB Crucial M4 SSD, 2.26GHz 4-core CPU; 8B keys, 8B values; 50% insert / 50% lookup
[Charts: throughput (K ops/sec), memory (bytes/entry), and CPU utilization (%) for SliceHash, BufferHash, and SILT; callouts: ↑6.6x, ↓12%, ↑2.8x, ↑15%]
See paper for theoretical analysis
Evaluation: flexibility
• Trade-off memory for throughput
22
[Charts: throughput (K ops/sec) and memory (bytes/entry) for SliceHash with 64, 48, 32, and 16 incarnations vs. BufferHash and SILT; 50% insert / 50% lookup]
Use multiple SSDs for even ↓ memory use and ↑ throughput
Evaluation: generality
• Workload may change
23
[Charts: throughput (K ops/sec), memory (bytes/entry), and CPU utilization (%) for SliceHash (SH), BufferHash (BH), and SILT under lookup-only, mixed, and insert-only workloads; callouts: Decreasing! Constantly low!]
Summary
• Present design practices for low-cost and high-performance SSD-based indexes
• Introduce slicing to co-locate related entries and leverage multiple levels of SSD parallelism
• SliceHash achieves 69K lookups/sec (≈12% better than prior works), with consistently low memory (0.6 B/entry) and CPU (12%) overhead
24
Evaluation: theoretical analysis
• Parameters
  – 16B key/value pairs
  – 80% table utilization
  – 32 incarnations
  – 4GB of memory
  – 128GB SSD
  – 0.31ms to read a block
  – 0.83ms to write a block
  – 0.15ms to read a page
25
SliceHash:
– Memory overhead: 0.6 B/entry
– Insert cost: avg ≈5.7μs, worst 1.14ms
– Lookup cost: avg & worst 0.15ms
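As a sanity check on the ≈5.7μs average insert cost, a short calculation under an assumed 128KB block size (the remaining numbers are the parameters listed above):

    # A flush costs one block read + one block write and serves every entry
    # buffered for that partition since the previous flush.
    BLOCK_READ_MS, BLOCK_WRITE_MS = 0.31, 0.83
    BLOCK_SIZE, ENTRY_SIZE        = 128 * 1024, 16   # bytes; block size assumed
    INCARNATIONS, UTILIZATION     = 32, 0.80

    entries_per_flush = BLOCK_SIZE / INCARNATIONS / ENTRY_SIZE * UTILIZATION      # ≈ 205
    avg_insert_us   = (BLOCK_READ_MS + BLOCK_WRITE_MS) * 1000 / entries_per_flush # ≈ 5.6 μs
    worst_insert_ms = BLOCK_READ_MS + BLOCK_WRITE_MS                              # = 1.14 ms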
Evaluation: theoretical analysis
26
SliceHash: 0.6 B/entry memory; insert avg ≈5.7μs, worst 1.14ms; lookup avg & worst 0.15ms
BufferHash: 4 B/entry memory; insert avg ≈0.2μs, worst 0.83ms; lookup avg ≈0.15ms, worst 4.8ms