![Page 1: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/1.jpg)
Unblinding the OS to Optimize User-Perceived Flash SSD Latency
Woong Shin*, Jaehyun Park**, Heon Y. Yeom*
*Seoul National University, **Arizona State University
USENIX HotStorage 2016, Jun. 21, 2016
![Page 2: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/2.jpg)
OS I/O Path Optimizations: Reducing S/W Overheads
Simplified I/O path (figure): application, OS I/O path (issue side and completion side, both in user thread context), storage device
OS I/O path overhead: from 13 us down to 0.2 ~ 2.5 us
Under-20 us technology: memory technology with ultra-low (nanoseconds), predictable latencies
“Block CPU (poll) while waiting for the I/O to complete”
![Page 3: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/3.jpg)
Higher latency & high variability (i.e., read: 20 us ~ 150 us)
Cannot use polling: wastes CPU cycles, harms system parallelism
Figure: issue side (user thread context), OS I/O path, completion side (user thread context), hard IRQ and deferred processing
To block the CPU (sync) or to yield the CPU (async)
H/W context switch := 4 us ~ 5 us
OS involved switch := 7 us ~ 8 us
“Impact of Modern SSDs”: more contexts on a CPU core, larger scheduling delays
![Page 4: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/4.jpg)
Impact of Modern SSDs
700,000 IOPS NVM-e SSD
Bandwidth: 4 kB x 700 kIOPS = 2.8 GB/s
Latency: 1 sec / 700 kIOPS = 1.43 µs ?!?
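The two figures on this slide are plain arithmetic, sketched below in Python. The sketch assumes "4kB" means 4,000 bytes (which is what reproduces 2.8 GB/s), and it reads the 1 sec / 700 kIOPS value as the mean interval between completions rather than the latency of any single request, which is presumably why the slide marks it with "?!?".

```python
# Back-of-the-envelope figures for a 700 kIOPS device (values from the slide).
IOPS = 700_000        # advertised I/O operations per second
BLOCK_BYTES = 4_000   # assumption: "4kB" taken as 4,000 bytes

# Aggregate bandwidth in GB/s (decimal gigabytes).
bandwidth_gb_s = IOPS * BLOCK_BYTES / 1e9

# Mean time between two completions in microseconds.
# This is a throughput-derived interval, not a per-request latency.
interval_us = 1e6 / IOPS

print(f"bandwidth: {bandwidth_gb_s:.1f} GB/s")
print(f"mean completion interval: {interval_us:.2f} us")
```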
![Page 5: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/5.jpg)
Impact of Modern SSDs
700,000 IOPS NVM-e SSD
Single NAND die: approx. 14,285 8 kB IOPS (i.e., 70 us read latency)
Requires more than 49 NAND dies to achieve 700,000 IOPS
Multi-channel, multi-way, high die count NAND array
Large DRAM
Powerful controllers
![Page 6: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/6.jpg)
Impact of Modern SSDs
High count of I/O contexts (threads, state machines) required for high IOPS
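A rough way to see where the die count and the context count come from is Little's law (requests in flight = throughput x latency). The sketch below uses only the 700,000 IOPS target and the 70 us per-die read latency from the slides, and assumes each die serves one read at a time.

```python
import math

TARGET_IOPS = 700_000      # device-level target from the slide
DIE_READ_LATENCY_US = 70   # per-die read latency from the slide

# One die serving one read at a time completes 1 / latency reads per second.
per_die_iops = 1e6 / DIE_READ_LATENCY_US

# Little's law: reads that must be in flight to sustain the target rate.
# With one read per die, this is also the minimum number of busy dies,
# and a lower bound on the I/O contexts the host has to keep alive.
in_flight = math.ceil(TARGET_IOPS * DIE_READ_LATENCY_US / 1_000_000)

print(f"per-die IOPS: {int(per_die_iops)}")
print(f"reads in flight (>= busy dies, >= host I/O contexts): {in_flight}")
```

With the slide's truncated 14,285 IOPS per die, 49 dies fall just short of 700,000 IOPS, hence "more than 49".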
![Page 7: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/7.jpg)
Higher Context Multiplexing Cost: “Scheduling Delays”
SSD side (figure): multi-channel, multi-way, high die count NAND array; large DRAM; powerful controllers
NAND die count >>> CPU core count
i.e., four PCI-e 3.0 4-lane slots in a chassis, more SSDs, redundancy, more capacity ...
Host side (figure): a worker thread with multiple I/O contexts on a CPU core (async. I/O), or a scheduler time-sharing a CPU core (sync. I/O); both switch from context to context
![Page 8: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/8.jpg)
The OS is Blind: Conservative Strategies
Figure: issue side (user thread context), OS I/O path, completion side (user thread context); SSD controller, DRAM
![Page 9: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/9.jpg)
The OS is Blind: Conservative Strategies
Externally experienced latency: 20 us ~ 10 ms (higher latency, higher variance)
Figure: issue side (user thread context), OS I/O path, completion side (user thread context), hard IRQ and deferred processing
Figure: physical destination within the SSD; R, W, and E operations queued at the DRAM and at individual NAND dies behind the SSD controller
![Page 10: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/10.jpg)
This Work: Unblinding the OS
Figure: issue side (user thread context), OS I/O path, completion side (user thread context), hard IRQ and deferred processing
H/W context switch := 4 us ~ 5 us
OS involved switch := 7 us ~ 8 us
More contexts on a CPU core (higher)
“Predictable SSD Latency”
![Page 11: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/11.jpg)
This Work: Unblinding the OS
“Predictable SSD Latency”
Figure labels: issue side (user thread context), OS I/O path, completion side (user thread context), hard IRQ and deferred processing; between OS and SSD: I/O destination, latency
![Page 12: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/12.jpg)
This Work: Unblinding the OS
“Predictable SSD Latency”
Latency by I/O destination x I/O type, as internal knowledge (SSD) and as exposed knowledge (OS), “a simplified model”:

| Class | I/O destination | I/O type | Latency |
|---|---|---|---|
| A | DRAM | Small R/W | Predictable |
| B | NAND | Small READ | Predictable |
| C | NAND | Else | Unpredictable |

Extended interface between OS and SSD: better interaction
Figure: issue side (user thread context), OS I/O path, completion side (user thread context), hard IRQ and deferred processing
![Page 13: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/13.jpg)
This Work: Unblinding the OS
• Exploiting SSD internal information
• Towards a “Predictable SSD”
• New optimization opportunities
![Page 14: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/14.jpg)
Exploiting SSD Internal Information
The Unpredictable SSD
Figure: OS and SSD; inside the SSD: DRAM, SSD controller, and a multi-channel, multi-way, high die count NAND array
![Page 15: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/15.jpg)
Exploiting SSD Internal Information: “Decomposition & Classification”
Figure: OS and SSD; inside the SSD: DRAM, SSD controller, and a multi-channel, multi-way, high die count NAND array
Decomposition by destination, classification by I/O request:

| Destination | Small R | Small W | Large R | Large W |
|---|---|---|---|---|
| DRAM | Cache hit: predictable | Buffer hit: predictable | N/A | N/A |
| NAND | Cache miss: predictable | Buffer miss: unpredictable | Interleaved reads: no benefit | Interleaved writes: no benefit |
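The decomposition and classification table can be read as a small decision function. The following is a minimal sketch, not the authors' implementation: the `classify` function, the `Latency` enum, and the 4,096-byte `SMALL_LIMIT_BYTES` threshold are all hypothetical names and values, since the slides do not define what counts as a "small" request.

```python
from enum import Enum


class Latency(Enum):
    PREDICTABLE = "predictable"
    UNPREDICTABLE = "unpredictable"
    NO_BENEFIT = "no benefit"


# Hypothetical threshold: the slides do not say where "small" ends.
SMALL_LIMIT_BYTES = 4096


def classify(op: str, size_bytes: int, hits_dram: bool) -> Latency:
    """Classify one request.

    op is "R" or "W"; hits_dram means a cache hit for a read
    or a buffer hit for a write.
    """
    if op not in ("R", "W"):
        raise ValueError(f"unknown operation: {op!r}")
    if size_bytes > SMALL_LIMIT_BYTES:
        # Large requests are interleaved across dies: no benefit.
        return Latency.NO_BENEFIT
    if hits_dram:
        # Small request served from DRAM (cache hit or buffer hit).
        return Latency.PREDICTABLE
    if op == "R":
        # Small read that misses the cache and goes to NAND.
        return Latency.PREDICTABLE
    # Small write that misses the buffer.
    return Latency.UNPREDICTABLE


print(classify("R", 4096, hits_dram=False).value)
print(classify("W", 4096, hits_dram=False).value)
```

In this sketch every large request maps to "no benefit", including the DRAM cells the table marks as N/A.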
![Page 16: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/16.jpg)
Exploiting SSD Internal Information: “Decomposition & Classification”
Figure: OS and SSD; inside the SSD: DRAM, SSD controller, and a multi-channel, multi-way, high die count NAND array
Decomposition by destination, classification by I/O request:

| I/O request | DRAM | NAND |
|---|---|---|
| Small R | Cache hit: predictable | Cache miss: predictable |
| Small W | Buffer hit: predictable | Buffer miss: unpredictable |
![Page 17: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/17.jpg)
![Page 18: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/18.jpg)
Mitigating the Impact of Scheduling Delays
Predictable NAND read (figure): issue side (user thread context), OS I/O path, completion side (user thread context), hard IRQ and deferred processing
![Page 19: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/19.jpg)
Mitigating the Impact of Scheduling Delays
Predictable buffer hits (figure): issue side (user thread context), completion side (user thread context)
![Page 20: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/20.jpg)
Accurate Latency Prediction: Remaining I/O Time
Timeline (figure): total SSD I/O time covers I/O processing & queuing delays, then flash I/O + ECC + DMA transfer
Remaining I/O time: predicted “only for NAND reads”
Issue side (user thread context, OS I/O path): classification “Small Read”, latency predictor
![Page 21: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/21.jpg)
Precompletions: Overlapping I/O & Scheduling Delay
Timeline (figure): total SSD I/O time covers I/O processing & queuing delays, then flash I/O + ECC + DMA transfer (“only for NAND reads”)
Issue side (user thread context): classification “Small Read”, latency predictor, remaining I/O time
Precompletion IRQ: raised after the precompletion wait period, one precompletion window ahead of the actual completion
Actual completion: flag update, no IRQ
Completion side (user thread context): hard IRQ and deferred processing, then busy wait for flag (waiting on L1)
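The timing relationship on this slide can be sketched as follows. This is an illustrative model only: the function names are hypothetical, time zero is taken as the point where the remaining I/O time starts counting, and the 7.5 us scheduling delay is an assumed value in the range of the OS-involved switch cost quoted earlier. The 352 us figure is the NAND latency reported in the evaluation slides.

```python
def precompletion_wait_us(predicted_remaining_us: float, window_us: float) -> float:
    """How long the device waits before raising the early (precompletion) IRQ."""
    return max(0.0, predicted_remaining_us - window_us)


def outcome(actual_remaining_us: float, predicted_remaining_us: float,
            window_us: float, sched_delay_us: float) -> tuple[float, float]:
    """Return (busy_wait_us, exposed_delay_us) for one request."""
    irq_at = precompletion_wait_us(predicted_remaining_us, window_us)
    # The user thread runs again once IRQ handling, deferred processing,
    # and the context switch are done.
    thread_ready_at = irq_at + sched_delay_us
    # Window too large: the thread is back early and polls the completion flag.
    busy_wait = max(0.0, actual_remaining_us - thread_ready_at)
    # Window too small: part of the scheduling delay lands after the I/O is done.
    exposed = max(0.0, thread_ready_at - actual_remaining_us)
    return busy_wait, exposed


# Assumed 7.5 us scheduling delay; 352 us NAND latency from the evaluation.
for window in (4.0, 8.0, 20.0):
    busy, exposed = outcome(352.0, 352.0, window, 7.5)
    print(f"window={window:>4} us  busy wait={busy} us  exposed delay={exposed} us")
```

A window close to the scheduling delay keeps both numbers small. A larger window trades exposed delay for busy waiting, and a smaller one does the opposite.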
![Page 22: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/22.jpg)
OS & SSD Interaction: Simple Behavioral Models
![Page 23: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/23.jpg)
In-band Communication Channel: “Piggybacked Information”
Figure labels: OS I/O path, I/O classifier, buffer monitor, previous I/O completion, I/O request, current blocks, blocks to consume
![Page 24: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/24.jpg)
Implementation & Evaluation
Host system
• Application: FIO
• Custom block driver
• OS: Linux 3.5.40
Connection
• External PCI-e Gen 2 four-lane connection (operating in Gen2 one lane) through a PCI-e external adaptor
• NVM-e like I/F, MSI IRQ
SSD (http://www.openssd-project.org/wiki/The_OpenSSD_Project)
• Zynq-7000 FPGA + ARM Cortex A9 dual core
• FTL, NAND controller
• 128GB NAND module x 1 (4ch, 4way each)
• DDR3 DRAM 512MB x 2
![Page 25: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/25.jpg)
Implementation & Evaluation
Limitation
• Single I/O depth
• Unoptimized FPGA NAND controller (higher latencies)
• Fixed latency
• Slow DMA transfers (low freq. bus)
• PCI-e Gen2 one lane
Latency prediction
Impact of precompletion
![Page 26: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/26.jpg)
Predicting SSD Latency (Small NAND Read)
Flash: NAND I/O + ECC
DMA: device to host transfer (4 kB)
Prediction: three-value moving average
Low variance, low error: “Predictable”
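A three-value moving average is simple enough to sketch in a few lines. The class and method names below are hypothetical, and the latency samples are made-up illustrative values, not measurements from the evaluation.

```python
from collections import deque
from typing import Optional


class MovingAveragePredictor:
    """Predict the next latency as the mean of the last `window` observations."""

    def __init__(self, window: int = 3):
        # A bounded deque drops the oldest sample automatically.
        self.samples = deque(maxlen=window)

    def observe(self, latency_us: float) -> None:
        self.samples.append(latency_us)

    def predict(self) -> Optional[float]:
        if not self.samples:
            return None  # nothing observed yet
        return sum(self.samples) / len(self.samples)


predictor = MovingAveragePredictor()
for sample in (350.0, 352.0, 354.0):  # illustrative values only
    predictor.observe(sample)
print(predictor.predict())  # mean of 350, 352, 354

predictor.observe(356.0)    # 350.0 falls out of the window
print(predictor.predict())  # mean of 352, 354, 356
```

Such a short window only works when the measured component has low variance, which is the property this slide reports for the flash and DMA portions.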
![Page 27: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/27.jpg)
The Impact of Precompletion: fio I/O thread vs background threads
Non-NAND avg. latency = measured avg. latency - 352 us (NAND latency)
![Page 28: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/28.jpg)
The Impact of Precompletion: fio I/O thread vs background threads
Non-NAND avg. latency = measured avg. latency - 352 us (NAND latency)
I/O vs CPU requires a priority boost for polling (priority degradation)
Polling damages system parallelism
![Page 29: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/29.jpg)
The Impact of Precompletion: fio I/O thread vs background threads
Non-NAND avg. latency = measured avg. latency - 352 us (NAND latency)
Precompletion requires an adequate precompletion window
• Overshoot harms parallelism (busy wait)
• Undershoot will expose scheduling delays
![Page 30: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/30.jpg)
The Impact of Precompletion: fio I/O thread vs background threads
Non-NAND avg. latency = measured avg. latency - 352 us (NAND latency)
Approx. 20% degradation of system parallelism
IRQ vs precompletion
• vsCPU: 7.16 us gain
• vsIO: 7.52 us gain
• With no degradation in system parallelism
![Page 31: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/31.jpg)
Summary
• Unblinding the OS - cross layer optimization
  • Achieved a partially predictable SSD: “decomposition / classification”
  • Exploit SSD internal information - “Remaining I/O time”
  • Protecting SSD proprietary internals - “Abstracted behavioral models”
• Mitigating scheduling delays
  • Exploiting predictability of certain I/O requests
  • Pre-completion - projection (1 I/O depth vs other threads)

| | IRQ | Polling | This work |
|---|---|---|---|
| Latency | Bad | Good | Good |
| Parallelism | Good | Bad | Good |
![Page 32: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/32.jpg)
Future Work
• Future implementation & evaluation
  • Full blown SSD
  • Projection (1 I/O depth, this work) → simulation (varying tech latency, etc.) → real implementation
• Cross layer optimization
  • More models
  • More use cases
  • More backend technologies other than flash
![Page 33: Unblindingthe+OS+to+Optimize+ … · Cache+hit Cache+miss Predictable Predictable Small+W Buffer+hit Buffer+miss Predictable ... Actual+Completion (Flag+update:No+IRQ) h Completion#Side](https://reader034.vdocuments.site/reader034/viewer/2022042710/5f67f5ccd04c8871931c50a5/html5/thumbnails/33.jpg)