
FlashIX: High Performance Remote Flash

Ana Klimovic, Heiner Litz, Christos Kozyrakis

February 2, 2016

Flash in Datacenters

•  Flash is replacing hard drives
   –  1000x higher throughput, 20x lower latency
   –  GB and $ are approaching parity

•  NVM Express (NVMe) enables scalable and efficient access to high-performance Flash (sketched below):
   –  Low latency: 10s of µs
   –  High throughput: 100,000s of IOPS
   –  Multi-queue interface
   –  Event-based, non-blocking
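To ground the multi-queue, non-blocking bullets: NVMe exposes pairs of submission and completion queues in host memory, and the host posts commands and rings a doorbell rather than sleeping on the I/O. The C sketch below shows that shape only; it is illustrative, not a driver (real NVMe SQEs are 64-byte structures, doorbells are MMIO registers, and all struct and function names here are invented):

```c
#include <stdint.h>

#define QDEPTH 1024

/* Grossly simplified queue entries (real NVMe SQEs are 64 B, CQEs 16 B). */
struct sqe { uint8_t opcode; uint16_t cmd_id; uint64_t lba; uint32_t nblocks; };
struct cqe { uint16_t cmd_id; uint16_t status; };

/* One of many queue pairs; NVMe allows thousands, so each core
 * (or connection) can own its own pair without locking.        */
struct nvme_qpair {
    struct sqe sq[QDEPTH];  /* host produces commands here      */
    struct cqe cq[QDEPTH];  /* device produces completions here */
    uint16_t sq_tail;       /* host-side producer index         */
    uint16_t cq_head;       /* host-side consumer index         */
};

/* Non-blocking submit: enqueue the command and return immediately;
 * the caller discovers completion later by polling the CQ.        */
static uint16_t submit_read(struct nvme_qpair *qp, uint16_t id,
                            uint64_t lba, uint32_t nblocks)
{
    qp->sq[qp->sq_tail] = (struct sqe){
        .opcode = 0x02, .cmd_id = id,   /* 0x02 = NVMe Read */
        .lba = lba, .nblocks = nblocks
    };
    qp->sq_tail = (qp->sq_tail + 1) % QDEPTH;
    /* a real driver would now MMIO-write sq_tail to the SQ doorbell */
    return id;
}
```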

Accessing Flash

•  Local PCIe access vs. remote access over network

Why remote Flash?

Resource Disaggregation

•  Applications have different storage needs

•  Remote access to disaggregated Flash enables independent and elastic resource scaling

•  Improves resource utilization

Server Resource Utilization

•  Flash and CPU utilization vary over time and scale separately, with separate trends.

•  Flash is overprovisioned for long periods of time → imbalanced resource utilization.

[Figure: Flash and CPU utilization over time. Data sampled from Facebook servers hosting a Flash-based KVS service.]

Resource Disaggregation

•  Applications have different storage needs

•  Remote access to disaggregated Flash:
   –  Enables independent and elastic resource scaling
   –  Increases HW utilization, decreases fragmentation
   –  Centralized snapshotting, checkpointing
   –  Increases parallelism, striping, RAID
   –  Decreases storage costs

•  Successful approach for hard drives (SANs)

Remote Flash Overhead

•  The iSCSI network storage protocol is CPU-intensive: compared to local Flash, remote access over iSCSI shows a 75% throughput drop and 3x higher latency.

[Figure: p99 read latency (µs) vs. IOPS (thousands) for 4 kB random reads, Linux-local-p99 vs. Linux-iSCSI-p99.]

Remote Flash Cost

•  More server cores and client connections → more IOPS.

•  Cost = number of CPU cores on the Flash server needed for protocol processing (as sketched below).

[Figure: peak IOPS (thousands) vs. number of CPU cores on the Flash server, 4 kB random reads, comparing local Flash with remote access over 1 and 10 client connections (1 client connection, 1 server core baseline).]
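One way to make this cost metric concrete (my framing, not the slide's exact model): the cores a Flash server must dedicate to remote-access protocol processing scale roughly as

$$\text{server cores} \;\approx\; \left\lceil \frac{\text{target remote IOPS}}{\text{IOPS one core can process}} \right\rceil$$

For instance, a stack that sustained 50K IOPS per core would need 6 dedicated cores to keep up with a 300K-IOPS device (numbers purely illustrative).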

Remote Flash Requirements

1. High performance (at the tail)
2. Isolation and resource management (use the OS)
3. Low cost (use commodity HW)

Traditional Approaches

                                                Performance   Protection/Isolation   Cost
Linux + iSCSI + Commodity Ethernet                   X                 √               √
Linux + RDMA                                         √                 √               X
User-level + Commodity Ethernet (e.g., mTCP)         √                 X               √
FlashIX + Commodity Ethernet                         √                 √               √

How should we architect the operating system for high-performance storage I/O?

FlashIX Design

1.  Dataplane
    –  High-performance data path

2.  Control plane
    –  Resource management: cores, Ethernet, Flash
    –  Storage capacity allocation
    –  Quality of Service

FlashIX Architecture

[Diagram: Apps A and B run in Ring 3 on top of libIX. Each FlashIX dataplane instance runs in guest Ring 0 on its own cores, with dedicated network RX/TX queues and NVMe submission/completion queues (SQ/CQ). The Linux kernel and the FlashIX control plane run in host Ring 0; Dune provides the virtualization layer between them.]

Execution Model

[Diagram: an event-driven app (the Flashcached server) runs in Ring 3 on libIX; event conditions flow up and batched syscalls flow down to guest Ring 0, where FlashIX runs TCP/IP RX/TX processing and the NVMe submission/completion queues (SQ/CQ), with a re-order stage in front of the Flash device.]

Walkthrough of a Flashcached PUT request (see the sketch after this list):
1.  Receive PUT request (network RX).
2.  Event triggers the RX receive callback in the app.
3.  App issues a write syscall (batched).
4.  FlashIX issues the NVMe write to the device.
5.  NVMe completion arrives on the CQ.
6.  NVMe completion event is delivered to the app.
7.  App issues a send-TCP syscall.
8.  FlashIX sends the PUT ACK (network TX).
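As a concrete but hypothetical rendering of steps 1-8: with a libIX-style event API, the Flashcached PUT path could look like the sketch below. Every ix_* name and the put_ctx type are invented for illustration; they are not FlashIX's actual interface.

```c
#include <stddef.h>
#include <stdint.h>

struct put_ctx {
    int        conn;     /* TCP connection the PUT arrived on    */
    uint64_t   lba;      /* Flash location chosen for this value */
    const void *ack;     /* pre-built PUT ACK reply              */
    size_t     ack_len;
};

/* Invented libIX-style calls, declared so the sketch is complete. */
extern struct put_ctx *parse_put(int conn, const void *buf, size_t len);
extern void ix_nvme_write(uint64_t lba, const void *buf, size_t len,
                          void *cookie);                  /* steps 3-4 */
extern void ix_tcp_send(int conn, const void *buf, size_t len); /* 7-8 */

/* Step 2: an RX event condition invokes the app's receive callback
 * (step 1, the PUT arriving on the wire, happened in FlashIX's TCP/IP). */
void on_tcp_recv(int conn, const void *buf, size_t len)
{
    struct put_ctx *ctx = parse_put(conn, buf, len);
    /* Steps 3-4: batched write syscall becomes an NVMe write; the
     * buffer is passed through without copying. (A real server would
     * write just the value bytes, not the whole request.)           */
    ix_nvme_write(ctx->lba, buf, len, ctx);
}

/* Steps 5-6: the NVMe completion is reaped from the CQ and delivered
 * to the app as an event condition. */
void on_nvme_done(void *cookie, int status)
{
    struct put_ctx *ctx = cookie;
    if (status == 0)
        /* Steps 7-8: send-TCP syscall; FlashIX TX emits the PUT ACK. */
        ix_tcp_send(ctx->conn, ctx->ack, ctx->ack_len);
}
```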

Execution Model: Design Features

1.  Event-driven API:
    –  Supports 100,000s of requests
    –  1 thread per core
    –  1000s of connections per thread

2.  Process to completion (see the per-core loop sketched after this list).

3.  Zero-copy: RDMA-like performance.

4.  One queue per connection:
    –  Avoids HOL blocking
    –  QoS
    –  Scheduling
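A rough sketch of the per-core loop that features 1 and 2 imply (again with invented ix_* names): one thread per core polls for event conditions, runs each handler to completion, then flushes syscalls in a single batch.

```c
#include <stddef.h>

#define MAX_EVENTS 64

struct ix_event { int type; void *data; };

/* Invented polling/batching primitives for illustration. */
extern int  ix_poll_events(struct ix_event *evs, int max); /* RX + NVMe CQ */
extern void dispatch(struct ix_event *ev);   /* runs handler to completion */
extern void ix_flush_syscall_batch(void);    /* one user/kernel crossing   */

void core_main_loop(void)
{
    struct ix_event evs[MAX_EVENTS];
    for (;;) {
        /* Gather event conditions from this core's own queues:
         * no locks, no HOL blocking across connections.         */
        int n = ix_poll_events(evs, MAX_EVENTS);
        for (int i = 0; i < n; i++)
            dispatch(&evs[i]);           /* process to completion */
        ix_flush_syscall_batch();        /* amortize crossings    */
    }
}
```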

Evaluation Methodology

•  Flashcached: persistent KV-store application
   –  4KB values
   –  Memcached binary protocol
   –  Mutilate load generator clients

•  [email protected]
•  Intel 750 PCIe Flash SSD
•  Intel X520 10GbE

FlashIX Performance

•  Baselines: local Flash under Linux and user-space SPDK, plus remote Flash over Linux iSCSI.

•  Remote Flash with FlashIX ≈ local Flash with Linux.

•  FlashIX saturates the 10GbE NIC: 4KB × 300K IOPS = 10 Gbit/s (worked out below).

[Figure: p99 read latency (µs) vs. IOPS (thousands), 4 kB random reads with 1 server thread, comparing Linux-local-p99, SPDK-local-p99, Linux-iSCSI-p99, and FlashIX.]
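A quick check of that arithmetic (taking 1 kB = 1000 B and 8 bits per byte):

$$4\,\mathrm{kB} \times 300{,}000\ \mathrm{IOPS} = 1.2\ \mathrm{GB/s} = 9.6\ \mathrm{Gbit/s} \approx 10\ \mathrm{Gbit/s}$$

That is, at this point the 10GbE link, not the Flash device, is the bottleneck, which is why Future Work targets 2x 10GbE.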

Future Work

•  Remove NIC bottleneck → 2x 10GbE
•  Utilize new Samsung SSDs
•  What API should we expose to applications?
   –  Block level? Filesystem?
   –  Local vs. remote storage API?

•  Applications and use-cases for Flash storage
   –  Latency-sensitive
   –  High throughput (saturate local PCIe Flash IOPS)
   –  Scale-out

Conclusion

•  Remote ≈ local Flash access latency
•  Requires rethinking the OS & storage SW stack
•  FlashIX is a storage dataplane OS designed for:
   –  Low tail latency and high throughput access to Flash over commodity 10Gb Ethernet
   –  Isolation and resource management

•  Enables storage disaggregation for Flash