TRANSCRIPT
Unblinding the OS to Optimize User-Perceived Flash SSD Latency
Woong Shin*, Jaehyun Park**, Heon Y. Yeom*
*Seoul National University, **Arizona State University
USENIX HotStorage 2016, Jun. 21, 2016
OS I/O Path Optimizations: Reducing S/W Overheads
Simplified I/O Path: Application → OS I/O Path → Storage Device
Issue Side: user thread context
Completion Side: user thread context
Overhead: from 13 µs down to 0.2 ~ 2.5 µs
Under-20 µs Technology
Memory technology with ultra-low (nanoseconds), predictable latencies.
"Block the CPU (poll) while waiting for the I/O to complete"
Flash SSDs: higher latency & high variability (i.e., read: 20 µs ~ 150 µs)
Cannot use polling: wastes CPU cycles, harms system parallelism
Issue Side: user thread context
Completion Side: user thread context → Hard IRQ → deferred processing
To block the CPU (sync) or to yield the CPU (async)
H/W context switch: 4 µs ~ 5 µs; OS-involved switch: 7 µs ~ 8 µs
More contexts on a CPU core → larger scheduling delays
Impact of Modern SSDs
700,000 IOPS NVMe SSD
Bandwidth: 4 kB × 700 kIOPS = 2.8 GB/s
Latency: 1 sec / 700 kIOPS = 1.42 µs
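The slide's throughput and per-I/O time figures follow from simple arithmetic; a quick sketch to check them (using decimal kB, which is what the 2.8 GB/s figure implies):

```python
# Check the slide's 700 kIOPS NVMe SSD arithmetic.
iops = 700_000
request_bytes = 4_000  # 4 kB (decimal), matching the slide's 2.8 GB/s

bandwidth_gb_s = request_bytes * iops / 1e9
mean_gap_us = 1e6 / iops  # average time budget per I/O at full rate

print(f"{bandwidth_gb_s:.1f} GB/s")  # 2.8 GB/s
print(f"{mean_gap_us:.2f} us")       # 1.43 us (the slide truncates to 1.42)
```

At full rate one I/O must complete every ~1.4 µs, which is why µs-scale context-switch costs matter at all.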
Single NAND die: approx. 14,285 8 kB IOPS (i.e., 70 µs read latency)
Requires more than 49 NAND dies to achieve 700,000 IOPS
Multi-channel, multi-way, high-die-count NAND array
Large DRAM
Powerful controllers
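The die-count claim can be reproduced with the same back-of-the-envelope math (a sketch; the 70 µs per 8 kB read figure is from the slide):

```python
import math

# One NAND die: an 8 kB read takes approx. 70 us,
# so a single die sustains about 14,285 IOPS.
die_iops = int(1 / 70e-6)   # 14285
target_iops = 700_000

dies_needed = math.ceil(target_iops / die_iops)
# 49 dies deliver only 49 * 14,285 = 699,965 IOPS, just short of the
# target - hence the slide's "more than 49 NAND dies".
print(die_iops, dies_needed)  # 14285 50
```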
A high count of I/O contexts (threads, state machines) is required for high IOPS
Higher context multiplexing cost → "scheduling delays"
NAND die count >>> CPU core count
i.e., four PCIe 3.0 4-lane slots in a chassis → more SSDs (redundancy, more capacity ...)
[Figure: two ways to multiplex I/O contexts on a CPU core - a worker thread holding multiple I/O contexts (async. I/O) vs. time-sharing the core across threads through the scheduler (sync. I/O); either way, contexts switch on every I/O.]
The OS is Blind: Conservative Strategies
SSD: controller + DRAM
Issue Side: user thread context
Completion Side: user thread context → Hard IRQ → deferred processing
20 µs ~ 10 ms: higher latency, higher variance
[Figure: the externally experienced latency depends on the physical destination within the SSD - a request may hit the SSD controller's DRAM or land on the NAND dies, where reads, writes, and erases interleave.]
This Work: Unblinding the OS
Issue Side: user thread context
Completion Side: user thread context → Hard IRQ → deferred processing
H/W context switch: 4 µs ~ 5 µs; OS-involved switch: 7 µs ~ 8 µs; more contexts on a CPU core (higher)
"Predictable SSD Latency"
I/O destination → Latency (OS ↔ SSD)
Latency = f(I/O destination × I/O type):
A: DRAM, small R/W → predictable
B: NAND, small read → predictable
C: NAND, else → unpredictable
Better interaction through an extended OS ↔ SSD interface:
internal knowledge → exposed knowledge, "a simplified model"
Exploiting SSD Internal Information: Towards a "Predictable SSD"
New optimization opportunities
The unpredictable SSD: multi-channel, multi-way, high-die-count NAND array; DRAM; SSD controller
"Decomposition & Classification"
OS ↔ SSD: each I/O request is decomposed by destination (DRAM vs. NAND) and classified:
Small R: cache hit → predictable; cache miss → predictable
Small W: buffer hit → predictable; buffer miss → unpredictable
Large R: interleaved reads → no benefit
Large W: interleaved writes → no benefit
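The classification table can be sketched as a function (hypothetical names; the real classifier lives in the driver, and the "small" threshold is not given in the talk):

```python
# Sketch of the talk's decomposition & classification table.
# An I/O is decomposed by destination (DRAM cache/buffer vs. NAND)
# and classified by type and size; only small reads and small-write
# buffer hits have predictable latency.

SMALL = 8 * 1024  # hypothetical "small" threshold; not specified in the talk

def classify(op: str, size: int, in_dram: bool) -> str:
    """Return the predictability class for one request."""
    if size > SMALL:
        return "no benefit"          # large R/W: interleaved NAND accesses
    if op == "R":
        # Small read: DRAM cache hit or a single NAND die read -
        # both predictable.
        return "predictable"
    if op == "W":
        # Small write: predictable only while it fits the DRAM write buffer.
        return "predictable" if in_dram else "unpredictable"
    raise ValueError(op)

print(classify("R", 4096, in_dram=False))     # predictable (cache miss)
print(classify("W", 4096, in_dram=True))      # predictable (buffer hit)
print(classify("W", 4096, in_dram=False))     # unpredictable (buffer miss)
print(classify("R", 1 << 20, in_dram=False))  # no benefit (large read)
```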
Mitigating the Impact of Scheduling Delays
Predictable NAND Read: Issue Side (user thread context) → OS I/O Path → Completion Side (user thread context, Hard IRQ, deferred processing)
Predictable Buffer Hits: Issue Side (user thread context) → Completion Side (user thread context)
Accurate Latency Prediction: Remaining I/O Time
Total SSD I/O time = I/O processing & queuing delays + flash I/O + ECC + DMA transfer
Remaining I/O time: the portion of the total SSD I/O time still outstanding at prediction time
On the issue side, requests classified as "small read" feed a latency predictor - "only for NAND reads".
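The evaluation slides say the NAND-read portion is predicted with a three-value moving average; a minimal sketch of such a predictor, with hypothetical names:

```python
from collections import deque

class RemainingTimePredictor:
    """Three-value moving average over observed small-NAND-read service
    times (flash I/O + ECC + DMA transfer), as reported in the talk.
    Class and method names are illustrative, not from the paper."""

    def __init__(self, window: int = 3):
        self.samples = deque(maxlen=window)

    def observe(self, service_time_us: float) -> None:
        self.samples.append(service_time_us)

    def predict_total_us(self) -> float:
        return sum(self.samples) / len(self.samples)

    def remaining_us(self, elapsed_us: float) -> float:
        # Remaining I/O time = predicted total SSD I/O time minus the
        # time already spent in I/O processing & queuing.
        return max(0.0, self.predict_total_us() - elapsed_us)

p = RemainingTimePredictor()
for t in (352.0, 350.0, 354.0):  # example measured NAND read latencies (us)
    p.observe(t)
print(p.predict_total_us())              # 352.0
print(p.remaining_us(elapsed_us=100.0))  # 252.0
```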
Precompletions: Overlapping I/O & Scheduling Delay
A precompletion window is carved out of the total SSD I/O time: a precompletion IRQ fires early, and the actual completion is only a flag update (no IRQ).
Issue Side: user thread context → classification ("small read") → latency predictor → remaining I/O time ("only for NAND reads": flash I/O + ECC + DMA transfer, after I/O processing & queuing delays)
Completion Side: user thread context, Hard IRQ, deferred processing, then a busy wait for the flag (waiting on L1) during the precompletion wait period
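The precompletion flow can be simulated in a few lines: the device raises the IRQ a window early, the woken thread then busy-waits on a completion flag that is updated without a second IRQ (a sketch with hypothetical names; the real mechanism lives in the block driver's IRQ path):

```python
import threading
import time

flag = {"done": False}          # completion flag in host memory (no second IRQ)
woken_early = threading.Event()

def device(total_io_us: float, window_us: float):
    """Simulated SSD: raise the precompletion IRQ `window_us` before the
    true completion, then just update the flag when the I/O finishes."""
    time.sleep((total_io_us - window_us) / 1e6)
    woken_early.set()           # precompletion IRQ -> thread is rescheduled
    time.sleep(window_us / 1e6)
    flag["done"] = True         # actual completion: flag update only

dev = threading.Thread(target=device, args=(3000.0, 500.0))
dev.start()

woken_early.wait()              # scheduling delay overlaps the window...
while not flag["done"]:         # ...then a short busy wait on the flag
    pass
dev.join()
print("completed:", flag["done"])
```

The scheduling delay of waking the user thread is paid inside the precompletion window, so the thread is already spinning on the flag when the data actually arrives.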
OS & SSD Interaction: Simple Behavioral Models
In-band communication channel: "piggybacked information"
The OS I/O path hosts an I/O Classifier and a Buffer Monitor; each previous I/O completion piggybacks the blocks to consume, so the Buffer Monitor can track the SSD's current blocks when the next I/O request is issued.
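The Buffer Monitor idea can be sketched as a small host-side model fed by the piggybacked completion data (names and the exact accounting are hypothetical):

```python
class BufferMonitor:
    """Host-side model of the SSD write buffer, refreshed by the
    free-block count piggybacked on each I/O completion."""

    def __init__(self):
        self.current_blocks = 0  # last value reported by the SSD

    def on_completion(self, piggybacked_blocks: int) -> None:
        # Each completion carries the SSD's current free-buffer count.
        self.current_blocks = piggybacked_blocks

    def on_issue(self, write_blocks: int) -> bool:
        """Predict a buffer hit and account for the blocks we consume."""
        hit = self.current_blocks >= write_blocks
        self.current_blocks = max(0, self.current_blocks - write_blocks)
        return hit

mon = BufferMonitor()
mon.on_completion(piggybacked_blocks=4)
print(mon.on_issue(2))  # True  -> predictable (buffer hit)
print(mon.on_issue(2))  # True  -> predictable (buffer hit)
print(mon.on_issue(1))  # False -> unpredictable (buffer miss)
```

This keeps the SSD's proprietary internals hidden: the OS sees only an abstracted behavioral model, not the FTL itself.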
Implementation & Evaluation
Host system: application (FIO), OS: Linux 3.5.40, custom block driver
SSD prototype: Zynq-7000 FPGA + dual-core ARM Cortex-A9; 128 GB NAND module × 1 (4ch, 4-way each); DDR3 DRAM 512 MB × 2; FTL; NAND controller; MSI IRQ; NVMe-like I/F
External PCIe Gen2 four-lane connection (operating in Gen2 one lane) through a PCIe external adaptor
http://www.openssd-project.org/wiki/The_OpenSSD_Project
Limitations
• Single I/O depth
• Unoptimized FPGA NAND controller (higher latencies): fixed latency; slow DMA transfers (low-frequency bus); PCIe Gen2 one lane
Evaluation: Latency Prediction and the Impact of Precompletion
Predicting SSD Latency (small NAND read)
Flash (NAND I/O + ECC): predicted with a three-value moving average
DMA (device-to-host transfer, 4 kB): low variance, low error
"Predictable"
The Impact of Precompletion: fio I/O thread vs. background threads
Non-NAND avg. latency = measured avg. latency − 352 µs (NAND latency)
Polling damages system parallelism: with I/O vs. CPU threads it requires a priority boost for the polling thread (priority degradation otherwise).
Precompletion requires an adequate precompletion window: overshoot harms parallelism (busy wait); undershoot exposes scheduling delays.
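The window trade-off suggests a feedback rule: grow the window when the flag beats the thread (undershoot), shrink it when the thread spins too long (overshoot). This policy is a plausible sketch, not something the talk specifies:

```python
def adjust_window(window_us: float, busy_wait_us: float,
                  late_us: float, step_us: float = 1.0) -> float:
    """Hypothetical feedback rule for sizing the precompletion window.
    busy_wait_us: time spent spinning on the flag (overshoot cost).
    late_us: how long the flag had been set on arrival (undershoot)."""
    if late_us > 0:              # undershot: scheduling delay was exposed
        return window_us + step_us
    if busy_wait_us > step_us:   # overshot: spun too long, reclaim CPU
        return max(0.0, window_us - step_us)
    return window_us

w = 8.0
w = adjust_window(w, busy_wait_us=0.0, late_us=2.0)  # undershoot -> grow
w = adjust_window(w, busy_wait_us=5.0, late_us=0.0)  # overshoot  -> shrink
print(w)  # 8.0
```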
Polling: approx. 20% degradation of system parallelism.
Precompletion vs. IRQ: 7.16 µs gain vs. CPU threads, 7.52 µs gain vs. I/O threads, with no degradation in system parallelism.
Summary
• Unblinding the OS - cross-layer optimization
  • Achieved a partially predictable SSD: "decomposition / classification"
  • Exploit SSD internal information: "remaining I/O time"
  • Protect SSD proprietary internals: "abstracted behavioral models"
• Mitigating scheduling delays
  • Exploiting the predictability of certain I/O requests
  • Pre-completion - projection (1 I/O depth vs. other threads)

             IRQ   Polling  This work
Latency      Bad   Good     Good
Parallelism  Good  Bad      Good
Future Work
• Future implementation & evaluation
  • Full-blown SSD: simulation (varying tech latency, etc.) → real implementation
  • Projection (1 I/O depth - this work)
• Cross-layer optimization
  • More models
  • More use cases
  • More backend technologies beyond flash