FlashIX - platformlab.stanford.edu
TRANSCRIPT
Flash in Datacenters
• Flash is replacing hard drives
  – 1000x higher throughput, 20x lower latency
  – GB and $ are approaching parity
• NVM Express (NVMe) enables scalable and efficient access to high-performance Flash:
  – Low latency: 10s of µs
  – High throughput: 100,000s of IOPS
  – Multi-queue interface
  – Event-based, non-blocking
Resource Disaggregation
• Applications have different storage needs
• Remote access to disaggregated Flash enables independent and elastic resource scaling
• Improves resource utilization
Server Resource Utilization
• Flash and CPU utilization vary over time and scale separately.
[Figure: utilization over time, sampled from Facebook servers hosting a Flash-based KVS service]
• Flash and CPU utilization vary with separate trends
• Imbalanced resource utilization
• Flash is overprovisioned for long periods of time
Resource Disaggregation
• Applications have different storage needs
• Remote access to disaggregated Flash:
  – Enables independent and elastic resource scaling
  – Increases HW utilization, decreases fragmentation
  – Centralized snapshotting, checkpointing
  – Increases parallelism, striping, RAID
  – Decreases storage costs
• Successful approach for hard drives (SANs)
Remote Flash Overhead
• The iSCSI network storage protocol is CPU-intensive
[Figure: p99 read latency (µs) vs. IOPS (thousands) for 4kB random reads; Linux-local-p99 vs. Linux-iSCSI-p99]
• 75% throughput drop
• 3x latency
Remote Flash Cost
[Figure: peak IOPS (thousands) vs. # CPU cores on Flash server for 4kB random reads; local baseline; 1 client connection with 1 server core, then 10 connections]
• More server cores and client connections → more IOPS
• Cost = # cores on the server needed for protocol processing
Remote Flash Requirements
1. High performance (at the tail)
2. Isolation and resource management (use the OS)
3. Low cost (use commodity HW)
Traditional Approaches

                                               Performance  Protection/Isolation  Cost
Linux + iSCSI + Commodity Ethernet                  ✗               ✓               ✓
Linux + RDMA                                        ✓               ✓               ✗
User-level + Commodity Ethernet (e.g., mTCP)        ✓               ✗               ✓
FlashIX + Commodity Ethernet                        ✓               ✓               ✓
FlashIX Design
1. Dataplane – high-performance datapath
2. Control plane – resource management (cores, Ethernet, Flash), storage capacity allocation, Quality of Service
FlashIX Architecture
[Diagram: App A and App B run on libIX at Ring 3; per-core FlashIX dataplanes with TX/RX and SQ/CQ queues at Guest Ring 0, via Dune; the Linux kernel and the FlashIX Control Plane at Host Ring 0; one dataplane per core]
Execution Model
[Diagram: an event-driven app (Flashcached server) on libIX at Ring 3; RX/TX, TCP/IP, and NVMe processing with event conditions and batched syscalls at Guest Ring 0; NVMe SQ/CQ, a re-order stage, and the Flash device below]
Walkthrough of a PUT request on a Flashcached server:
1. Receive PUT request
2. Event triggers RX receive callback
3. Write syscall
4. NVMe write
5. NVMe completion
6. NVMe completion event
7. Send TCP syscall
8. Send PUT ACK
Execution Model
Key design points:
1. Event-driven API: supports 100,000s of requests, 1 thread per core, 1000s of connections per thread
2. Process to completion
3. Zero-copy: RDMA-like performance
4. One queue per connection: avoids HOL blocking; enables QoS and scheduling
Evaluation Methodology
• Flashcached: persistent KV-store application
  – 4KB values
  – Memcached binary protocol
  – Mutilate load generator clients
• [email protected]
• Intel 750 PCIe Flash SSD
• Intel X520 10GbE
FlashIX Performance
[Figure: p99 read latency (µs) vs. IOPS (thousands) for 4kB random reads, 1 server thread; Linux-local-p99, SPDK-local-p99, Linux-iSCSI-p99]
• Baseline Linux and user-space SPDK performance
FlashIX Performance
[Figure: same plot with the FlashIX curve added]
• Remote Flash with FlashIX ~= local Flash with Linux
• 4KB * 300K IOPS = 10 Gbit/s
Future Work
• Remove NIC bottleneck → 2x 10GbE
• Utilize new Samsung SSDs
• What API should we expose to applications?
  – Block level? Filesystem?
  – Local vs. remote storage API?
• Applications and use-cases for Flash storage
  – Latency-sensitive
  – High throughput (saturate local PCIe Flash IOPS)
  – Scale out