Design Patterns for Tunable and Efficient SSD-based Indexes
Ashok Anand, Aaron Gember-Jacobson, Collin Engstrom, Aditya Akella
1
Large hash-based indexes
2
WAN optimizers [Anand et al. SIGCOMM '08]
De-duplication systems [Quinlan et al. FAST '02]
Video proxy [Anand et al. HotNets '12]
≈20K lookups and inserts per second (1Gbps link)
≥ 32GB hash table
Use of large hash-based indexes
3
WAN optimizers
De-duplication systems
Video proxy
Where to store the indexes?
4
[Comparison of storage media: SSD is 8x less …, 25x less …]
What’s the problem?
• Need domain/workload-specific optimizations for SSD-based index with ↑ performance and ↓ overhead
• Existing designs have…
  – Poor flexibility: target a specific point in the cost-performance spectrum
  – Poor generality: only apply to specific workloads or data structures
5
False assumption!
Our contributions
• Design patterns that ensure:
  – High performance
  – Flexibility
  – Generality
• Indexes based on these principles:
  – SliceHash
  – SliceBloom
  – SliceLSH
6
Outline
• Problem statement
• Limitations of state-of-the-art
• SSD architecture
• Parallelism-friendly design patterns
  – SliceHash (streaming hash table)
• Evaluation
7
State-of-the-art SSD-based index
8
• BufferHash [Anand et al. NSDI '10]
  – Designed for high throughput
[Diagram: BufferHash design: inserted (K,V) pairs go into an in-memory incarnation; full incarnations are flushed to flash; each flushed incarnation keeps an in-memory Bloom filter that tells a lookup which incarnations' pages to read]
4 bytes per K/V pair!
16 page reads in worst case! (average: ≈1)
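To make the lookup path concrete, here is a minimal Python sketch of the incarnation-plus-Bloom-filter scheme described above; the bloom_filter, read_page, and num_slots members are illustrative assumptions, not BufferHash's actual interfaces.

    # Sketch of a BufferHash-style lookup (illustrative only).
    # Every flushed incarnation keeps an in-memory Bloom filter; a lookup
    # reads at most one flash page per incarnation whose filter matches.
    def lookup(key, in_memory_incarnation, flash_incarnations):
        if key in in_memory_incarnation:           # 1. in-memory buffer, no flash I/O
            return in_memory_incarnation[key]
        for inc in flash_incarnations:             # 2. flushed incarnations, newest first
            if inc.bloom_filter.may_contain(key):  # false positives keep the average at ≈1 page read
                page = inc.read_page(hash(key) % inc.num_slots)  # one flash page read
                if key in page:
                    return page[key]
        return None  # worst case: one page read per incarnation (16 here)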
State-of-the-art SSD-based index
9
• SILT [Lim et al. SOSP '11]
  – Designed for low memory + high throughput
[Diagram: SILT design: a write-friendly log store and a hash store absorb inserts and are periodically merged into a sorted store with a compact in-memory index]
≈0.7 bytes per K/V pair
33 page reads in worst case! (average: 1)
High CPU usage!
Target specific workloads and objectives → poor flexibility and generality
Do not leverage internal parallelism
SSD Architecture
10
[Diagram: the SSD controller connects over multiple channels (Channel 1 … Channel 32) to flash memory packages (e.g., 128 of them); each package contains dies, each die contains planes, each plane has a data register and holds blocks, and each block holds pages]
How does the SSD architecture inform our design patterns?
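As a rough mental model of this hierarchy, a small Python sketch follows; the counts mirror the example values in the diagram (32 channels, 128 packages) and are not the geometry of any particular drive.

    # Illustrative model of the SSD hierarchy shown above:
    # controller -> channels -> packages -> dies -> planes -> blocks -> pages.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Block:
        pages: List[bytes]             # page: unit of reads

    @dataclass
    class Plane:
        data_register: bytearray       # staging buffer for page transfers
        blocks: List[Block]            # block: unit of erases

    @dataclass
    class Die:
        planes: List[Plane]

    @dataclass
    class Package:
        dies: List[Die]

    @dataclass
    class SSD:
        channels: List[List[Package]]  # e.g., 32 channels x 4 packages = 128 packages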
Four design principles
I. Store related entries on the same page
II. Write to the SSD at block granularity
III. Issue large reads and large writes
IV. Spread small reads across channels
11
SliceHash
I. Store related entries on the same page
• Many hash table incarnations, like BufferHash
12
[Diagram: without slicing, each flash page holds sequential slots from a specific incarnation, so a key that hashes to slot 5 may require a page read in every incarnation]
Multiple page reads per lookup!
I. Store related entries on the same page
• Many hash table incarnations, like BufferHash
• Slicing: store the same hash slot from all incarnations on the same page
13
[Diagram: with slicing, a slice (a specific slot from all incarnations) is stored on a single page]
Only 1 page read per lookup!
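A minimal Python sketch of the slicing idea, assuming a toy page layout in which the page for slot s is a list with one (key, value) entry per incarnation; read_slice_page and num_slots are illustrative names, not SliceHash's actual code.

    # Slicing: the page for slot s holds slot s from ALL incarnations,
    # so one page read answers a lookup regardless of incarnation count.
    def lookup(key, num_slots, read_slice_page):
        slot = hash(key) % num_slots
        page = read_slice_page(slot)      # exactly one flash page read
        for entry in reversed(page):      # newest incarnation last, so scan it first
            if entry is not None and entry[0] == key:
                return entry[1]
        return None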
II. Write to the SSD at block granularity
14
• Insert into a hash table incarnation in RAM
• Divide the hash table so all slices fit into one block
[Diagram: keys KA–KF accumulate in the in-memory incarnation; on the SSD, the corresponding slices form a SliceTable that occupies exactly one block]
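A back-of-the-envelope sketch of the "all slices fit into one block" constraint; the 128KB block and 16B slot sizes are assumed example values (matching the parameters used in the theoretical analysis later), not fixed by the design.

    # How many slots can one SliceTable (one flash block) hold?
    BLOCK_SIZE   = 128 * 1024   # bytes per block, assumed
    SLOT_SIZE    = 16           # bytes per key/value slot, assumed
    INCARNATIONS = 32           # one slice entry per incarnation

    slots_per_slicetable = BLOCK_SIZE // (SLOT_SIZE * INCARNATIONS)  # = 256
    # The hash table is partitioned into SliceTables of this size, so flushing
    # an in-memory incarnation touches exactly one block per partition.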
III. Issue large reads and large writes
15
[Diagram: flash packages with page registers attached to multiple channels. Chart: read throughput (MB/second) vs. read size (1KB–128KB); callouts: page size, channel parallelism, package parallelism]
III. Issue large reads and large writes
16
SSD assigns consecutive chunks (4 pages / 8KB) to different channels
[Chart: write throughput (MB/second) vs. number of threads for 128KB, 256KB, and 512KB writes; callouts: block size, channel parallelism]
III. Issue large reads and large writes
17
• Read entire SliceTable into RAM
• Write entire SliceTable onto SSD
[Diagram: when the in-memory incarnation fills, the whole SliceTable (one block) is read into RAM, the buffered entries (KA, KD, KF) are added as the newest incarnation's slices, and the whole SliceTable is written back to the SSD]
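A hedged Python sketch of this read-modify-write path, assuming dict-like incarnations and read_block/write_block helpers; evicting the oldest incarnation on each flush is an assumption for illustration.

    # Flush the in-memory incarnation for one partition of the table.
    def flush_partition(in_memory_incarnation, block_id, read_block, write_block):
        slicetable = read_block(block_id)                    # one large, block-sized read
        for slot, entries in enumerate(slicetable):          # entries = the slice for this slot
            entries.pop(0)                                   # drop the oldest incarnation (assumption)
            entries.append(in_memory_incarnation.get(slot))  # newest incarnation goes last
        write_block(block_id, slicetable)                    # one large, block-sized write
        in_memory_incarnation.clear()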
IV. Spread small reads across channels
• Recall: SSD writes consecutive chunks (4 pages) of a block to different channels
  – Use existing techniques to reverse engineer [Chen et al. HPCA '11]
  – SSD uses write-order mapping
18
channel for chunk i = i modulo (# channels)
IV. Spread small reads across channels
19
• Estimate the channel using the slot # and the chunk size:
  (slot # × pages per slot) modulo (# channels × pages per chunk)
  gives the slot's page offset within one stripe of chunks; dividing by pages per chunk yields the channel
• Attempt to schedule 1 read per channel
[Diagram: pending lookups for different slots are mapped to their estimated channels (Channel 0–3) so that, where possible, one read is issued per channel in parallel]
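A minimal sketch of the channel estimate and the one-read-per-channel scheduling under the write-order-mapping assumption above; the constants are the slides' example values (4-page / 8KB chunks), and one page per slot is assumed.

    # Estimate which channel serves a slot's page, assuming consecutive
    # chunks (4 pages / 8KB) of a block are striped across the channels.
    NUM_CHANNELS    = 32   # example value
    PAGES_PER_CHUNK = 4
    PAGES_PER_SLOT  = 1    # assumed: one slice page per slot

    def channel_for_slot(slot):
        offset = (slot * PAGES_PER_SLOT) % (NUM_CHANNELS * PAGES_PER_CHUNK)
        return offset // PAGES_PER_CHUNK   # equals chunk index mod # channels

    def schedule_reads(pending_slots):
        # Greedily pick at most one pending read per estimated channel.
        batch, used, remaining = [], set(), []
        for slot in pending_slots:
            ch = channel_for_slot(slot)
            (remaining if ch in used else batch).append(slot)
            used.add(ch)
        return batch, remaining   # issue batch in parallel; retry remaining next round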
SliceHash summary
20
[Diagram: the complete SliceHash design: inserts go to the in-memory incarnation; on the SSD, each SliceTable occupies one block and each of its pages holds a slice (a specific slot from all incarnations); the whole SliceTable is read and written when updating]
Evaluation: throughput vs. overhead
21
Setup: 128GB Crucial M4 SSD, 2.26GHz 4-core CPU; 8B keys, 8B values; 50% insert / 50% lookup
[Charts: throughput (K ops/sec), memory (bytes/entry), and CPU utilization (%) for SliceHash, BufferHash, and SILT; callouts: ↑6.6x, ↓12%, ↑2.8x, ↑15%]
See paper for theoretical analysis
Evaluation: flexibility
• Trade-off memory for throughput
22
[Charts: throughput (K ops/sec) and memory (bytes/entry) for SliceHash with 64, 48, 32, and 16 incarnations vs. BufferHash and SILT; 50% insert / 50% lookup]
Use multiple SSDs for even ↓ memory use and ↑ throughput
Evaluation: generality
• Workload may change
23
[Charts: throughput (K ops/sec), memory (bytes/entry), and CPU utilization (%) for SliceHash (SH), BufferHash (BH), and SILT under lookup-only, mixed, and insert-only workloads; callouts: Decreasing! Constantly low!]
Summary
• Present design practices for low-cost and high-performance SSD-based indexes
• Introduce slicing to co-locate related entries and leverage multiple levels of SSD parallelism
• SliceHash achieves 69K lookups/sec (≈12% better than prior works), with consistently low memory (0.6 B/entry) and CPU (12%) overhead
24
Evaluation: theoretical analysis
• Parameters
  – 16B key/value pairs
  – 80% table utilization
  – 32 incarnations
  – 4GB of memory
  – 128GB SSD
  – 0.31ms to read a block
  – 0.83ms to write a block
  – 0.15ms to read a page
25
SliceHash:
– Memory overhead: 0.6 B/entry
– Insert cost: avg ≈5.7μs, worst 1.14ms
– Lookup cost: avg & worst 0.15ms
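As a sanity check on the ≈5.7μs average insert cost, a short calculation under an assumed 128KB block size (the remaining numbers are the parameters listed above):

    # A flush costs one block read + one block write and serves every entry
    # buffered for that partition since the previous flush.
    BLOCK_READ_MS, BLOCK_WRITE_MS = 0.31, 0.83
    BLOCK_SIZE, ENTRY_SIZE        = 128 * 1024, 16   # bytes; block size assumed
    INCARNATIONS, UTILIZATION     = 32, 0.80

    entries_per_flush = BLOCK_SIZE / INCARNATIONS / ENTRY_SIZE * UTILIZATION      # ≈ 205
    avg_insert_us   = (BLOCK_READ_MS + BLOCK_WRITE_MS) * 1000 / entries_per_flush # ≈ 5.6 μs
    worst_insert_ms = BLOCK_READ_MS + BLOCK_WRITE_MS                              # = 1.14 ms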
Evaluation: theoretical analysis
26
SliceHash: 0.6 B/entry memory; insert avg ≈5.7μs, worst 1.14ms; lookup avg & worst 0.15ms
BufferHash: 4 B/entry memory; insert avg ≈0.2μs, worst 0.83ms; lookup avg ≈0.15ms, worst 4.8ms