Design Tradeoffs for SSD Reliability
Bryan S. Kim, Jongmoo Choi, Sang Lyul Min
Seoul National University, Dankook University
High-level objectives
Understand the SSD-internal mechanisms behind fail-slow symptoms
• H. Gunawi et al, “Fail-slow at scale: evidence of hardware performance faults in large production systems”, FAST 2018
High-level objectives
Examine SSD-internal reliability enhancement techniques
• Images from Google searches
High-level objectives
Think about system- and device-level approaches for handling errors
• Images from Google searches
How bad is it?
• L. Grupp et al, “Characterizing flash memory: anomalies, observations, and applications”, MICRO 2009
[Figure: error rate measurement (log scale, 1E-8 to 1E-1) vs. year published (2008 to 2020)]
How bad is it?
• H. Sun et al, “Quantifying reliability of solid-state storage from multiple aspects”, SNAPI 2011
• Y. Cai et al, “Error patterns in MLC NAND flash memory: measurement, characterization, and analysis”, DATE 2012
[Figure: error rate measurement (log scale, 1E-8 to 1E-1) vs. year published (2008 to 2020)]
How bad is it?
• Y. Cai et al, “Data retention in MLC NAND flash memory: characterization, optimization, and recovery”, HPCA 2015
• Data from an industry partner, 2018
[Figure: error rate measurement (log scale, 1E-8 to 1E-1) vs. year published (2008 to 2020)]
SSD’s reliability issue
Error-prone memory (RBER: 10^-4 ~ 10^-2) → Reliable SSD (UBER: < 10^-15)
• How to make SSD reliable?
• Performance overhead?
• Across different chips and wear states?
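To get a feel for the gap between the two numbers above, the following is a minimal sketch that estimates how often a t-bit ECC fails for a given RBER. It assumes independent bit errors (a binomial model) and approximates UBER as the codeword failure probability divided by the codeword length; neither assumption comes from the slides, and the function names and the 9216-bit codeword size are hypothetical.

```python
import math


def codeword_failure_prob(n_bits: int, rber: float, t: int) -> float:
    """P(more than t of n_bits are in error), assuming independent bit errors."""
    log_p, log_q = math.log(rber), math.log1p(-rber)
    total = 0.0
    for k in range(t + 1, n_bits + 1):
        # log of C(n, k) * p^k * (1 - p)^(n - k), computed in log space
        # so that large binomial coefficients do not overflow.
        log_term = (math.lgamma(n_bits + 1) - math.lgamma(k + 1)
                    - math.lgamma(n_bits - k + 1)
                    + k * log_p + (n_bits - k) * log_q)
        term = math.exp(log_term)
        total += term
        # At or past the peak of the distribution the terms only shrink,
        # so stop once they no longer change the sum.
        if k > n_bits * rber and term < total * 1e-17:
            break
    return total


def uber_estimate(n_bits: int, rber: float, t: int) -> float:
    """Rough UBER estimate: codeword failures per bit read (an approximation)."""
    return codeword_failure_prob(n_bits, rber, t) / n_bits


if __name__ == "__main__":
    # Hypothetical codeword of 9216 bits at RBER = 1E-3.
    for t in (25, 50, 75, 100):
        print(f"{t}-bit ECC: UBER ~ {uber_estimate(9216, 1e-3, t):.3e}")
```

Under this simplified model, a stronger ECC lowers the estimate sharply, while a higher RBER (for example, at a later wear state) raises it.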
Flash memory errors
[Diagram: three flash cells, each with a control gate (CG) and a floating gate (FG)]
• Wear-out: Vprog
• Retention loss: 0V
• Disturbance: Vpass
Flash memory error modeling
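$$\mathrm{RBER}(\mathit{cycles}, \mathit{time}, \mathit{reads}) = \varepsilon + \alpha \cdot \mathit{cycles}^{k} + \beta \cdot \mathit{cycles}^{m} \cdot \mathit{time}^{n} + \gamma \cdot \mathit{cycles}^{p} \cdot \mathit{reads}^{q}$$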
• N. Mielke et al, “Reliability of solid-state drives based on NAND flash memory”, Proceedings of the IEEE, 2017
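The model above can be evaluated directly once its coefficients are known. The sketch below is one way to write it; the class name `RberModel` and every coefficient value are hypothetical placeholders, not fitted values from the cited measurements, and the units of `time` are left to whoever fits the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RberModel:
    eps: float    # baseline error rate
    alpha: float  # wear coefficient
    k: float      # wear exponent on cycles
    beta: float   # retention coefficient
    m: float      # retention exponent on cycles
    n: float      # retention exponent on time
    gamma: float  # disturbance coefficient
    p: float      # disturbance exponent on cycles
    q: float      # disturbance exponent on reads

    def rber(self, cycles: float, time: float, reads: float) -> float:
        wear = self.alpha * cycles ** self.k
        retention = self.beta * cycles ** self.m * time ** self.n
        disturbance = self.gamma * cycles ** self.p * reads ** self.q
        return self.eps + wear + retention + disturbance


# Hypothetical coefficients, chosen only to make the example run.
example = RberModel(eps=1e-6, alpha=1e-8, k=1.0,
                    beta=1e-9, m=0.5, n=1.0,
                    gamma=1e-12, p=0.5, q=1.0)

if __name__ == "__main__":
    print(example.rber(cycles=10_000, time=0, reads=0))       # wear only
    print(example.rber(cycles=10_000, time=365, reads=0))     # wear + retention
    print(example.rber(cycles=10_000, time=0, reads=10_000))  # wear + disturbance
```

With positive exponents, the retention and disturbance terms vanish when `time` and `reads` are zero, leaving only the baseline and wear terms.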
From measurements to model
• H. Sun et al, “Quantifying reliability of solid-state storage from multiple aspects”, SNAPI 2011
• Y. Cai et al, “Data retention in MLC NAND flash memory: characterization, optimization, and recovery”, HPCA 2015
• Y. Cai et al, “Read disturb errors in MLC NAND flash memory: characterization, mitigation, and recovery”, DSN 2015
• Data from an industry partner, 2018
Measurement (data) → Model
• 3x-nm MLC (2011)
• 2y-nm MLC (2015)
• 3D TLC (2018)
Error model: 3x-nm MLC (2011)
[Figure: raw bit error rate (log scale, 1E-7 to 1E0) for Wear, Retention, and Disturbance]
• Wear up to 10K P/E cycles
• 10K P/E cycles + up to 10K reads or up to 1 year
Error model: 2y-nm MLC (2015)
[Figure: raw bit error rate (log scale, 1E-7 to 1E0) for Wear, Retention, and Disturbance]
• Wear up to 10K P/E cycles
• 10K P/E cycles + up to 10K reads or up to 1 year
Error model: 3D TLC (2018)
[Figure: raw bit error rate (log scale, 1E-7 to 1E0) for Wear, Retention, and Disturbance]
• Wear up to 10K P/E cycles
• 10K P/E cycles + up to 10K reads or up to 1 year
SSD reliability enhancements
• Error correction code
• Data re-reads
• Intra-SSD redundancy
• Background relocation
Error correction code
[Diagram: ECC encoder, ECC decoder, and flash memory]
Error correction code
[Diagram: Data → ECC encoder → Data + P → flash memory]
Error correction code
[Diagram: flash memory → Data + P → ECC decoder → Data]
Data re-reads
[Diagram: ECC encoder, ECC decoder, and flash memory; Data + P at the ECC decoder]
Data re-reads
[Diagram: ECC encoder, ECC decoder, and flash memory; Data + P at the ECC decoder, with steps labeled 1 and 2]
Data re-reads
[Diagram: ECC encoder, ECC decoder, and flash memory; Data + P at the ECC decoder, which outputs Data]
Summary: ECC and data re-reads
• Error correction code
– Predictable performance
– Is fixed at design-time
• Data re-read
– Is much more powerful than ECC
– Increases latency for correcting errors
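The trade-off summarized above can be sketched as a read loop: decode with a fixed-strength ECC, and re-read whenever the raw bit errors exceed what the ECC can correct. The model below is a toy under stated assumptions: the function name, the 100 µs read time, and the rule that each re-read halves the raw bit errors are all hypothetical and not taken from the slides.

```python
def read_with_rereads(raw_bit_errors: int, ecc_strength: int,
                      read_us: float = 100.0, error_scale: float = 0.5,
                      max_rereads: int = 8):
    """Return (latency_us, reads_issued, corrected) for one page read.

    Assumption: each re-read scales the raw bit errors by error_scale.
    """
    errors = raw_bit_errors
    reads = 1  # the initial read
    while errors > ecc_strength and reads <= max_rereads:
        errors = int(errors * error_scale)  # re-read lowers the error count
        reads += 1
    return reads * read_us, reads, errors <= ecc_strength


if __name__ == "__main__":
    print(read_with_rereads(20, 75))   # within ECC strength: one read
    print(read_with_rereads(200, 75))  # needs re-reads: higher latency
```

The ECC strength is a fixed parameter here, mirroring the design-time choice, while the number of reads (and so the latency) varies with the error count.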
Evaluation: data re-read
For the 3D TLC (2018)
[Figure: Norm. avg. RT (0 to 3.5) vs. ECC correction strength (25-bit, 50-bit, 75-bit, 100-bit) at 1K, 3K, and 5K cycles]
Why is data re-read bad?
For the 3D TLC (2018)
[Figure: cumulative probability (0.5 to 1) vs. number of raw bit errors (0 to 175), with series for 25-bit, 50-bit, 75-bit, 100-bit, and inf-bit ECC]
Observations
• Repeated data re-reads make it worse
– 75-bit: ~30% increased latency at end-of-life
Intra-SSD redundancy
[Diagram: data D0 to D6 and parity P stored across the flash memory chips]
Intra-SSD redundancy
[Diagram: D0, D1, D3 to D6, and P read from the flash memory chips to recover D2]
Summary: intra-SSD redundancy
• Error correction code
– Is fixed at design-time
• Data re-read
– Increases latency for correcting errors
• Intra-SSD redundancy
– Protects against random and sporadic errors
– Increases write amplification
– Increases read amplification on errors
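As a rough illustration of the redundancy costs listed above, the sketch below builds a parity chunk over a stripe and rebuilds one lost chunk from the survivors. It assumes simple XOR parity, which the slides do not specify; the function names and the 7-chunk stripe are hypothetical.

```python
from functools import reduce


def xor_chunks(chunks):
    """Byte-wise XOR of equally sized chunks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def make_parity(stripe):
    """Parity written alongside the data: one extra chunk per stripe."""
    return xor_chunks(stripe)


def reconstruct(stripe, parity, lost_index):
    """Rebuild the chunk at lost_index from all other chunks plus parity."""
    survivors = [c for i, c in enumerate(stripe) if i != lost_index]
    return xor_chunks(survivors + [parity])


if __name__ == "__main__":
    stripe = [bytes([i] * 4) for i in range(7)]   # D0..D6, 4 bytes each
    parity = make_parity(stripe)                  # extra write per stripe
    recovered = reconstruct(stripe, parity, lost_index=2)
    print(recovered == stripe[2])
```

Writing the parity chunk adds one write per stripe, and rebuilding a single chunk requires reading every other chunk in the stripe plus the parity, which is where the write and read amplification come from in this model.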
Evaluation: redundancy
For the 3D TLC (2018) with 75-bit ECC
[Figure, panel 1: Norm. avg. RT (0 to 2) vs. stripe size (s = 15, s = 7) at 1K, 3K, and 5K cycles]
[Figure, panel 2: Norm. 3 nines QoS (0 to 4) vs. stripe size (s = 15, s = 7) at 1K, 3K, and 5K cycles]
Observations
• Repeated data re-reads make it worse
• Overheads of redundancy outweigh its benefits
– +56% latency at end-of-life
Observations
• Repeated data re-reads make it worse
• Overheads of redundancy outweigh its benefits
• Scrubbing reduces error-induced latency, but increases internal traffic
– +25% latency at end-of-life
– Highly dependent on accuracy of error prediction
Observations
• Repeated data re-reads make it worse
• Overheads of redundancy outweigh its benefits
• Scrubbing reduces error-induced latency, but increases internal traffic
• We need to consider data characteristics and compositionally combine reliability enhancements
Holistic reliability management
• Cold data
– Need protection against retention errors
– Least write amplification with redundancy
– Likely to be identified by GC
Holistic reliability management
• Cold data
– Selective redundancy for GC-ed data
• Read-hot data
– Need protection against disturbance errors
– # of data re-reads can be used as proxy
– Likely to be identified by scrubber
Holistic reliability management
• Cold data
– Selective redundancy for GC-ed data
• Read-hot data
– Cost-benefit scrubbing
• Write-hot data
– No special attention required
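The three data classes above suggest a simple dispatch, sketched below. All names (`Action`, `PageStats`, `choose_action`) and the re-read threshold are hypothetical, and the fixed threshold is only a stand-in: a real cost-benefit scrubber would weigh relocation traffic against the expected re-read latency.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "no special attention"
    ADD_REDUNDANCY = "selective redundancy"
    SCRUB = "cost-benefit scrubbing"


@dataclass
class PageStats:
    relocated_by_gc: bool  # survived garbage collection, so likely cold
    reread_count: int      # data re-reads seen, used as a disturbance proxy


def choose_action(stats: PageStats, reread_threshold: int = 2) -> Action:
    # Read-hot data: the number of data re-reads serves as the proxy.
    if stats.reread_count >= reread_threshold:
        return Action.SCRUB
    # Cold data: identified when garbage collection relocates it.
    if stats.relocated_by_gc:
        return Action.ADD_REDUNDANCY
    # Write-hot data: rewritten before errors accumulate.
    return Action.NONE


if __name__ == "__main__":
    print(choose_action(PageStats(relocated_by_gc=True, reread_count=0)))
    print(choose_action(PageStats(relocated_by_gc=False, reread_count=5)))
    print(choose_action(PageStats(relocated_by_gc=False, reread_count=0)))
```

Checking the re-read count first is a design choice of this sketch, so that data that is both GC-relocated and frequently re-read is scrubbed rather than only given redundancy.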
Evaluation
For the 3D TLC (2018) with 75-bit ECC @ end-of-life state
[Figure: Norm. avg. RT (axis 0 to 4) for ECC + re-read, Oracle scrub, and HRM; bar labels 9.0 and 5.4]
• ECC + re-read: Rely on ECC and data re-reads
• Oracle scrub: Scrub based on oracle knowledge
• HRM: Holistic reliability management
The bright side of flash memory
• S. Lee, “Emerging Challenges in NAND Flash Technology”, Flash Summit 2011
The dark side of flash memory
[Figure: error rate measurement (log scale, 1E-8 to 1E-1) vs. year published (2008 to 2020)]