TRANSCRIPT
Unblinding the OS to Optimize User-Perceived Flash SSD Latency
Woong Shin*, Jaehyun Park**, Heon Y. Yeom*
*Seoul National University, **Arizona State University
USENIX HotStorage 2016, Jun. 21, 2016
OS I/O Path Optimizations: Reducing S/W Overheads
Simplified I/O Path: Application → OS I/O Path → Storage Device
Issue Side: user thread context
Completion Side: user thread context
Overhead: from 13 µs down to 0.2 ~ 2.5 µs
Under-20 µs Technology
Memory technology with ultra-low (nanoseconds), predictable latencies.
"Block the CPU (poll) while waiting for the I/O to complete"
Flash SSDs: higher latency & high variability (i.e., read: 20 µs ~ 150 µs)
Cannot use polling: wastes CPU cycles, harms system parallelism
Issue Side: user thread context
Completion Side: user thread context → Hard IRQ → deferred processing
To block the CPU (sync) or to yield the CPU (async)
H/W context switch: 4 µs ~ 5 µs; OS-involved switch: 7 µs ~ 8 µs
More contexts on a CPU core → larger scheduling delays
Impact of Modern SSDs
700,000 IOPS NVMe SSD
Bandwidth: 4 kB × 700 kIOPS = 2.8 GB/s
Latency: 1 sec / 700 kIOPS = 1.42 µs
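The slide's throughput and per-I/O time figures follow from simple arithmetic; a quick sketch to check them (using decimal kB, which is what the 2.8 GB/s figure implies):

```python
# Check the slide's 700 kIOPS NVMe SSD arithmetic.
iops = 700_000
request_bytes = 4_000  # 4 kB (decimal), matching the slide's 2.8 GB/s

bandwidth_gb_s = request_bytes * iops / 1e9
mean_gap_us = 1e6 / iops  # average time budget per I/O at full rate

print(f"{bandwidth_gb_s:.1f} GB/s")  # 2.8 GB/s
print(f"{mean_gap_us:.2f} us")       # 1.43 us (the slide truncates to 1.42)
```

At full rate one I/O must complete every ~1.4 µs, which is why µs-scale context-switch costs matter at all.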
Single NAND die: approx. 14,285 8 kB IOPS (i.e., 70 µs read latency)
Requires more than 49 NAND dies to achieve 700,000 IOPS
Multi-channel, multi-way, high-die-count NAND array
Large DRAM
Powerful controllers
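The die-count claim can be reproduced with the same back-of-the-envelope math (a sketch; the 70 µs per 8 kB read figure is from the slide):

```python
import math

# One NAND die: an 8 kB read takes approx. 70 us,
# so a single die sustains about 14,285 IOPS.
die_iops = int(1 / 70e-6)   # 14285
target_iops = 700_000

dies_needed = math.ceil(target_iops / die_iops)
# 49 dies deliver only 49 * 14,285 = 699,965 IOPS, just short of the
# target - hence the slide's "more than 49 NAND dies".
print(die_iops, dies_needed)  # 14285 50
```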
A high count of I/O contexts (threads, state machines) is required for high IOPS
Higher context multiplexing cost → "scheduling delays"
NAND die count >>> CPU core count
i.e., four PCIe 3.0 4-lane slots in a chassis → more SSDs (redundancy, more capacity ...)
[Figure: two ways to multiplex I/O contexts on a CPU core - a worker thread holding multiple I/O contexts (async. I/O) vs. time-sharing the core across threads through the scheduler (sync. I/O); either way, contexts switch on every I/O.]
The OS is Blind: Conservative Strategies
SSD: controller + DRAM
Issue Side: user thread context
Completion Side: user thread context → Hard IRQ → deferred processing
20 µs ~ 10 ms: higher latency, higher variance
[Figure: the externally experienced latency depends on the physical destination within the SSD - a request may hit the SSD controller's DRAM or land on the NAND dies, where reads, writes, and erases interleave.]
This Work: Unblinding the OS
Issue Side: user thread context
Completion Side: user thread context → Hard IRQ → deferred processing
H/W context switch: 4 µs ~ 5 µs; OS-involved switch: 7 µs ~ 8 µs; more contexts on a CPU core (higher)
"Predictable SSD Latency"
I/O destination → Latency (OS ↔ SSD)
Latency = f(I/O destination × I/O type):
A: DRAM, small R/W → predictable
B: NAND, small read → predictable
C: NAND, else → unpredictable
Better interaction through an extended OS ↔ SSD interface:
internal knowledge → exposed knowledge, "a simplified model"
Exploiting SSD Internal Information: Towards a "Predictable SSD"
New optimization opportunities
The unpredictable SSD: multi-channel, multi-way, high-die-count NAND array; DRAM; SSD controller
"Decomposition & Classification"
OS ↔ SSD: each I/O request is decomposed by destination (DRAM vs. NAND) and classified:
Small R: cache hit → predictable; cache miss → predictable
Small W: buffer hit → predictable; buffer miss → unpredictable
Large R: interleaved reads → no benefit
Large W: interleaved writes → no benefit
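The classification table can be sketched as a function (hypothetical names; the real classifier lives in the driver, and the "small" threshold is not given in the talk):

```python
# Sketch of the talk's decomposition & classification table.
# An I/O is decomposed by destination (DRAM cache/buffer vs. NAND)
# and classified by type and size; only small reads and small-write
# buffer hits have predictable latency.

SMALL = 8 * 1024  # hypothetical "small" threshold; not specified in the talk

def classify(op: str, size: int, in_dram: bool) -> str:
    """Return the predictability class for one request."""
    if size > SMALL:
        return "no benefit"          # large R/W: interleaved NAND accesses
    if op == "R":
        # Small read: DRAM cache hit or a single NAND die read -
        # both predictable.
        return "predictable"
    if op == "W":
        # Small write: predictable only while it fits the DRAM write buffer.
        return "predictable" if in_dram else "unpredictable"
    raise ValueError(op)

print(classify("R", 4096, in_dram=False))     # predictable (cache miss)
print(classify("W", 4096, in_dram=True))      # predictable (buffer hit)
print(classify("W", 4096, in_dram=False))     # unpredictable (buffer miss)
print(classify("R", 1 << 20, in_dram=False))  # no benefit (large read)
```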
Mitigating the Impact of Scheduling Delays
Predictable NAND Read: Issue Side (user thread context) → OS I/O Path → Completion Side (user thread context, Hard IRQ, deferred processing)
Predictable Buffer Hits: Issue Side (user thread context) → Completion Side (user thread context)
Accurate Latency Prediction: Remaining I/O Time
Total SSD I/O time = I/O processing & queuing delays + flash I/O + ECC + DMA transfer
Remaining I/O time: the portion of the total SSD I/O time still outstanding at prediction time
On the issue side, requests classified as "small read" feed a latency predictor - "only for NAND reads".
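The evaluation slides say the NAND-read portion is predicted with a three-value moving average; a minimal sketch of such a predictor, with hypothetical names:

```python
from collections import deque

class RemainingTimePredictor:
    """Three-value moving average over observed small-NAND-read service
    times (flash I/O + ECC + DMA transfer), as reported in the talk.
    Class and method names are illustrative, not from the paper."""

    def __init__(self, window: int = 3):
        self.samples = deque(maxlen=window)

    def observe(self, service_time_us: float) -> None:
        self.samples.append(service_time_us)

    def predict_total_us(self) -> float:
        return sum(self.samples) / len(self.samples)

    def remaining_us(self, elapsed_us: float) -> float:
        # Remaining I/O time = predicted total SSD I/O time minus the
        # time already spent in I/O processing & queuing.
        return max(0.0, self.predict_total_us() - elapsed_us)

p = RemainingTimePredictor()
for t in (352.0, 350.0, 354.0):  # example measured NAND read latencies (us)
    p.observe(t)
print(p.predict_total_us())              # 352.0
print(p.remaining_us(elapsed_us=100.0))  # 252.0
```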
Precompletions: Overlapping I/O & Scheduling Delay
A precompletion window is carved out of the total SSD I/O time: a precompletion IRQ fires early, and the actual completion is only a flag update (no IRQ).
Issue Side: user thread context → classification ("small read") → latency predictor → remaining I/O time ("only for NAND reads": flash I/O + ECC + DMA transfer, after I/O processing & queuing delays)
Completion Side: user thread context, Hard IRQ, deferred processing, then a busy wait for the flag (waiting on L1) during the precompletion wait period
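The precompletion flow can be simulated in a few lines: the device raises the IRQ a window early, the woken thread then busy-waits on a completion flag that is updated without a second IRQ (a sketch with hypothetical names; the real mechanism lives in the block driver's IRQ path):

```python
import threading
import time

flag = {"done": False}          # completion flag in host memory (no second IRQ)
woken_early = threading.Event()

def device(total_io_us: float, window_us: float):
    """Simulated SSD: raise the precompletion IRQ `window_us` before the
    true completion, then just update the flag when the I/O finishes."""
    time.sleep((total_io_us - window_us) / 1e6)
    woken_early.set()           # precompletion IRQ -> thread is rescheduled
    time.sleep(window_us / 1e6)
    flag["done"] = True         # actual completion: flag update only

dev = threading.Thread(target=device, args=(3000.0, 500.0))
dev.start()

woken_early.wait()              # scheduling delay overlaps the window...
while not flag["done"]:         # ...then a short busy wait on the flag
    pass
dev.join()
print("completed:", flag["done"])
```

The scheduling delay of waking the user thread is paid inside the precompletion window, so the thread is already spinning on the flag when the data actually arrives.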
OS & SSD Interaction: Simple Behavioral Models
In-band communication channel: "piggybacked information"
The OS I/O path hosts an I/O Classifier and a Buffer Monitor; each previous I/O completion piggybacks the blocks to consume, so the Buffer Monitor can track the SSD's current blocks when the next I/O request is issued.
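The Buffer Monitor idea can be sketched as a small host-side model fed by the piggybacked completion data (names and the exact accounting are hypothetical):

```python
class BufferMonitor:
    """Host-side model of the SSD write buffer, refreshed by the
    free-block count piggybacked on each I/O completion."""

    def __init__(self):
        self.current_blocks = 0  # last value reported by the SSD

    def on_completion(self, piggybacked_blocks: int) -> None:
        # Each completion carries the SSD's current free-buffer count.
        self.current_blocks = piggybacked_blocks

    def on_issue(self, write_blocks: int) -> bool:
        """Predict a buffer hit and account for the blocks we consume."""
        hit = self.current_blocks >= write_blocks
        self.current_blocks = max(0, self.current_blocks - write_blocks)
        return hit

mon = BufferMonitor()
mon.on_completion(piggybacked_blocks=4)
print(mon.on_issue(2))  # True  -> predictable (buffer hit)
print(mon.on_issue(2))  # True  -> predictable (buffer hit)
print(mon.on_issue(1))  # False -> unpredictable (buffer miss)
```

This keeps the SSD's proprietary internals hidden: the OS sees only an abstracted behavioral model, not the FTL itself.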
Implementation & Evaluation
Host system: application (FIO), OS: Linux 3.5.40, custom block driver
SSD prototype: Zynq-7000 FPGA + dual-core ARM Cortex-A9; 128 GB NAND module × 1 (4ch, 4-way each); DDR3 DRAM 512 MB × 2; FTL; NAND controller; MSI IRQ; NVMe-like I/F
External PCIe Gen2 four-lane connection (operating in Gen2 one lane) through a PCIe external adaptor
http://www.openssd-project.org/wiki/The_OpenSSD_Project
Limitations
• Single I/O depth
• Unoptimized FPGA NAND controller (higher latencies): fixed latency; slow DMA transfers (low-frequency bus); PCIe Gen2 one lane
Evaluation: Latency Prediction and the Impact of Precompletion
Predicting SSD Latency (small NAND read)
Flash (NAND I/O + ECC): predicted with a three-value moving average
DMA (device-to-host transfer, 4 kB): low variance, low error
"Predictable"
The Impact of Precompletion: fio I/O thread vs. background threads
Non-NAND avg. latency = measured avg. latency − 352 µs (NAND latency)
Polling damages system parallelism: with I/O vs. CPU threads it requires a priority boost for the polling thread (priority degradation otherwise).
Precompletion requires an adequate precompletion window: overshoot harms parallelism (busy wait); undershoot exposes scheduling delays.
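The window trade-off suggests a feedback rule: grow the window when the flag beats the thread (undershoot), shrink it when the thread spins too long (overshoot). This policy is a plausible sketch, not something the talk specifies:

```python
def adjust_window(window_us: float, busy_wait_us: float,
                  late_us: float, step_us: float = 1.0) -> float:
    """Hypothetical feedback rule for sizing the precompletion window.
    busy_wait_us: time spent spinning on the flag (overshoot cost).
    late_us: how long the flag had been set on arrival (undershoot)."""
    if late_us > 0:              # undershot: scheduling delay was exposed
        return window_us + step_us
    if busy_wait_us > step_us:   # overshot: spun too long, reclaim CPU
        return max(0.0, window_us - step_us)
    return window_us

w = 8.0
w = adjust_window(w, busy_wait_us=0.0, late_us=2.0)  # undershoot -> grow
w = adjust_window(w, busy_wait_us=5.0, late_us=0.0)  # overshoot  -> shrink
print(w)  # 8.0
```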
Polling: approx. 20% degradation of system parallelism.
Precompletion vs. IRQ: 7.16 µs gain vs. CPU threads, 7.52 µs gain vs. I/O threads, with no degradation in system parallelism.
Summary
• Unblinding the OS - cross-layer optimization
  • Achieved a partially predictable SSD: "decomposition / classification"
  • Exploit SSD internal information: "remaining I/O time"
  • Protect SSD proprietary internals: "abstracted behavioral models"
• Mitigating scheduling delays
  • Exploiting the predictability of certain I/O requests
  • Pre-completion - projection (1 I/O depth vs. other threads)

             IRQ   Polling  This work
Latency      Bad   Good     Good
Parallelism  Good  Bad      Good
Future Work
• Future implementation & evaluation
  • Full-blown SSD: simulation (varying tech latency, etc.) → real implementation
  • Projection (1 I/O depth - this work)
• Cross-layer optimization
  • More models
  • More use cases
  • More backend technologies beyond flash