

ORION: A Distributed File System for Non-Volatile Main Memories and RDMA-Capable Networks

Jian Yang, Joseph Izraelevitz, Steven Swanson

Non-Volatile Systems Laboratory

Department of Computer Science & Engineering

University of California, San Diego

Accessing NVMM as Remote Storage

Non-Volatile Main Memory (NVMM)

• PCM, STT-RAM, ReRAM, Intel 3D XPoint

• Performant: DRAM-class latency/BW

• Byte-addressable

• Persistent over power failures

Remote Direct Memory Access (RDMA)

• DMA from/to remote memory

• Two-sided verbs (Send/Recv)

• One-sided verbs (Read/Write)

• Bypasses remote CPU

• Byte-addressable

[Figure: several nodes, each running an Application on top of memory (DRAM and NVMM), connected by an RDMA network.]

Accessing Local NVMM vs. Remote NVMM

[Figure: two access paths. Access local NVMM: Application, NVMM FS, NVMM. Access remote NVMM over Dist. FS: Application, Distributed FS, RDMA, Distributed FS, NVMM.]

[Figure: Latency of write(4KB) + fsync() for NVMM FS (NOVA), iSCSI/RDMA, and Ceph/RDMA. Annotated values: 3.86 us, 19x, 241x.]

[Figure: Throughput of fileserver workload for NVMM FS (NOVA), Ceph/RDMA, and iSCSI/RDMA. Annotated values: 6653 MB/s, 19%, 5%.]
Issue #1: Existing Dist. FSs are Slow on NVMM

• Layered Design

• Indirection overhead

• Expensive to persist (e.g., fsync())

[Figure: layered I/O path between a storage client and a storage server connected by RPC. Labels: Application, File System/FUSE, File System, NVMM; User/Kernel boundary; File Request, File Access, Block Access.]
NVMM is Faster than RDMA

                 Harddrive (NVMe)          RDMA Network    NVMM
Latency          70 μs                     3 μs            300 ns
Bandwidth        1.3-3.2 GB/s              5 GB/s          2-16 GB/s
Access Size      4 KB (Page) (@ Max BW)    2 KB (MTU)      64 Byte (Cacheline)

• Networking is faster than storage

• NVMM is faster than networking
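As a quick back-of-the-envelope check, the latencies quoted on this slide can be turned into ratios. The sketch below uses only those three numbers; the dictionary and function names are illustrative and not part of ORION.

```python
# Latencies quoted on the slide, converted to nanoseconds.
LATENCY_NS = {
    "NVMe": 70_000,  # 70 us
    "RDMA": 3_000,   # 3 us
    "NVMM": 300,     # 300 ns
}


def slowdown(slower: str, faster: str) -> float:
    """Return how many times longer one access takes than the other."""
    return LATENCY_NS[slower] / LATENCY_NS[faster]


# The network is much faster than NVMe storage ...
nvme_vs_rdma = slowdown("NVMe", "RDMA")
# ... but NVMM is faster than the network.
rdma_vs_nvmm = slowdown("RDMA", "NVMM")

print(f"NVMe / RDMA latency: {nvme_vs_rdma:.1f}x")
print(f"RDMA / NVMM latency: {rdma_vs_nvmm:.1f}x")
```

With NVMe, the network round trip is a small fraction of the device latency; with NVMM, the network becomes the dominant cost, which is the motivation for redesigning the file system around it.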

Issue #2: Lack of Support for Local NVMM

• Use case of converged storage

– Local NVMM supports Direct Access (DAX)

• Existing systems do not store data locally

• Run Local FS and Dist. FS

– Expensive to move data

[Figure: storage client and storage server connected by RDMA. Labels: Application, File System, NVMM FS, Page Cache, Local NVMM (DAX), Remote NVMM.]

ORION: A Distributed File System for NVMM and RDMA-Capable Networks

• A clean slate design for NVMM and RDMA

• A unified layer: kernel FS + networking

• Pooled NVMM storage

• Accessing metadata/data directly over Direct Access (DAX) and RDMA

• Designed for rack-scale scalability

[Figure: applications issue POSIX I/O to ORION, which reaches pooled NVMM through DAX access and RDMA access.]


Outline

• Background

• Design overview

• Metadata and data management

• Replication

• RDMA persistence

• Evaluation

• Conclusion

ORION: Cluster Overview

• Metadata Server (MDS): Runs ORIONFS, keeps authoritative metadata of the whole FS

• Client: Runs ORIONFS, keeps active metadata and cached data. Accesses local NVMM

• Data Store (DS): Pooled NVMM data

• Metadata Access: Clients <=> MDS (Two-sided)

• Data Access: Clients => DSs (One-sided)

[Figure: cluster diagram with clients, the MDS, and data stores. Arrow labels: Sync/Update, DAX Read/Write, RDMA Read/Write. Storage labels: NVMM (Metadata, Data), DRAM (Data $).]

The ORION File System

• Inherited from NOVA [Xu, FAST 16]

– Per-inode metadata (operation) log

– Build in-DRAM data structures on recovery

– Atomic log append

• Metadata:

– DMA (Physical) memory region (MR)

– RDMA-able metadata structures

• Data: Globally partitioned

[Figure: the MDS and a client each hold VFS structures (inodes, dentries) and per-inode logs for files a, b, c; the client also holds a data cache (Data $) and local data [X]; a data store holds data [Y].]

Operations in ORION File System

• Open(a)

– Allocate inode (Client)

– Issue an RPC via RDMA_Send (Client)

– RDMA_Write to allocated space (MDS)

– RDMA_Read the rest of the log (Client)

RPC message: Open, Path: a, WAddr: &alloc

[Figure: the client's RPC arrives in the MDS RecvQ; the inode log for file a is copied from the MDS into the client, next to the inode logs for b and c, the data cache ($), and local data [X].]
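The Open(a) steps can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: the `MDS` and `Client` classes, the `first_n` parameter, and the idea that the RPC reply carries the log length are hypothetical, and Python lists stand in for RDMA-registered NVMM regions.

```python
from dataclasses import dataclass, field


@dataclass
class MDS:
    # Authoritative per-inode logs keyed by path (hypothetical layout).
    logs: dict = field(default_factory=dict)

    def handle_open(self, path, client_buf, first_n=1):
        """Serve the Open RPC: 'RDMA_Write' the head of the log into the
        buffer the client allocated, and report the total log length."""
        log = self.logs[path]
        client_buf.extend(log[:first_n])
        return len(log)

    def rdma_read(self, path, start, end):
        """Stand-in for a one-sided RDMA_Read of log entries [start, end)."""
        return self.logs[path][start:end]


@dataclass
class Client:
    inodes: dict = field(default_factory=dict)

    def open(self, mds, path):
        buf = []                            # 1. allocate inode/log space locally
        total = mds.handle_open(path, buf)  # 2-3. RPC via send; MDS writes into buf
        # 4. pull the remaining log entries with one-sided reads
        buf.extend(mds.rdma_read(path, len(buf), total))
        self.inodes[path] = buf
        return buf


mds = MDS(logs={"a": ["create", "write@0", "write@4096"]})
client = Client()
opened_log = client.open(mds, "a")
print(opened_log)
```

After the call, the client holds a full copy of the inode log for `a` and can rebuild its in-DRAM structures from it without further help from the MDS.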

Operations in ORION File System

• Write(c)

– Allocate & CoW to client-owned pages (Client)

– Append log entry (Client)

– Commit log entry via RDMA_Send (Client)

– Append log entry (MDS)

– Update tail pointers atomically (MDS, Client)

Log entry: FileWrite, Addr=(X,addr), Size=4096

[Figure: the client writes new data to local data [X], appends an entry to its inode log for c, and sends it to the MDS RecvQ; the MDS appends the entry to its own inode log for c.]
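A minimal sketch of the Write(c) sequence, assuming a log that becomes visible only when its tail pointer advances. `InodeLog`, `client_write`, and the dictionary layout of the log entry are hypothetical names for illustration; on real NVMM the tail update would be an atomic store rather than a Python assignment.

```python
from dataclasses import dataclass, field


@dataclass
class InodeLog:
    entries: list = field(default_factory=list)
    tail: int = 0  # number of entries that are committed and visible

    def append(self, entry):
        self.entries.append(entry)  # written, but not visible until publish()

    def publish(self):
        self.tail = len(self.entries)  # one pointer update exposes the entry

    def committed(self):
        return self.entries[:self.tail]


def client_write(local_pages, local_log, mds_log, data, node_id="X"):
    # 1. Copy-on-write: put the new data in a freshly allocated local page.
    addr = len(local_pages)
    local_pages.append(bytes(data))
    # 2. Append a log entry describing the write to the client's log copy.
    entry = {"type": "FileWrite", "addr": (node_id, addr), "size": len(data)}
    local_log.append(entry)
    # 3-4. Model the send to the MDS, which appends the entry to its own log.
    mds_log.append(dict(entry))
    # 5. Both sides advance their tail pointers to commit the entry.
    mds_log.publish()
    local_log.publish()
    return entry


pages, local_log, mds_log = [], InodeLog(), InodeLog()
written = client_write(pages, local_log, mds_log, b"\0" * 4096)
print(written)
```

Because the data lands in a new page first and the log entry is published last, a reader that follows the committed log never observes a partially written page.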

Operations in ORION File System

• Tailcheck(b)

– Log commit from another client (MDS)

– RDMA_Read remote log tail (Client)

– Read from MDS if Len(Local) < Len(Remote) (Client)

[Figure: the MDS inode log for b has grown after a commit from another client; the client compares its local log tail with the MDS tail and reads the missing entries.]
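The tailcheck comparison can be sketched in a few lines. Here plain Python lists stand in for the local and MDS copies of one inode log, and `tailcheck` is a hypothetical helper, not ORION's actual function.

```python
def tailcheck(local_log, mds_log):
    """Bring local_log up to date with mds_log.

    Returns the number of entries fetched from the MDS.
    """
    remote_tail = len(mds_log)   # stand-in for a small one-sided read of the tail
    local_tail = len(local_log)
    if local_tail < remote_tail:
        # Only the missing suffix is read; the existing prefix is reused.
        local_log.extend(mds_log[local_tail:remote_tail])
    return max(0, remote_tail - local_tail)


mds_b = ["create", "write@0", "write@4096"]  # another client committed the last entry
local_b = ["create", "write@0"]

fetched_first = tailcheck(local_b, mds_b)
fetched_again = tailcheck(local_b, mds_b)    # nothing new the second time
print(fetched_first, fetched_again, local_b)
```

When the tails already match, the check costs a single small read and touches no log data, which keeps the common case cheap.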

Operations in ORION File System

• Read(b)

– Tailcheck (async)

– RDMA_Read from data store

– In-place update to log entry

• Data locality

– Future reads will hit DRAM cache

– Future writes will go to local NVMM

Log entry before the read: Type=FileWrite, Addr=(Y,3), Size=4096

Log entry after the in-place update: Type=FileWrite, Addr=($,0), Size=4096

[Figure: the client reads data block 3 from data store [Y] into its data cache ($) and updates the inode log entry for b to point at the cached copy.]
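The in-place log update on read can be sketched as follows. The entry layout mirrors the `Addr=(Y,3)` and `Addr=($,0)` values shown on the slide; `read_block`, the `data_stores` mapping, and the list-based cache are hypothetical stand-ins for illustration.

```python
def read_block(entry, data_stores, dram_cache):
    """Return the data a FileWrite log entry points at, caching remote data."""
    node, addr = entry["addr"]
    if node == "$":
        return dram_cache[addr]       # already cached: served from DRAM
    data = data_stores[node][addr]    # stand-in for RDMA_Read from the data store
    slot = len(dram_cache)
    dram_cache.append(data)
    entry["addr"] = ("$", slot)       # in-place update: entry now names the cache
    return data


entry = {"type": "FileWrite", "addr": ("Y", 3), "size": 4096}
data_stores = {"Y": {3: b"b" * 4096}}
dram_cache = []

first = read_block(entry, data_stores, dram_cache)   # goes to data store Y
second = read_block(entry, data_stores, dram_cache)  # hits the DRAM cache
print(entry["addr"], first == second)
```

Rewriting the address inside the existing entry, rather than appending a new one, keeps the log length unchanged, so the read does not look like a new commit to other clients.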

Accelerating Metadata Accesses

• Observations:

– RDMA prefers inbound operations [Su, EuroSys 17]

– RDMA prefers small operations [Kalia, ATC 16]

• MDS request handling:

– Tailcheck (8B RDMA_Read): MDS-bypass

– Log Commit (~128B RDMA_Send): Single-inode operations

– RPC (Varies): Other operations, less common

[Figure: latency and throughput of RDMA operations, comparing one-sided RDMA with RDMA Send, 8-byte with 512-byte messages, and Read [Inbound] with Write [Outbound]. Annotated values: 1.3 us, 3.3 us, 1.9 Mop/s, 15 Mop/s.]

Optimizing Log Commits

• Speculative log commit:

– Return when RDMA_Send verb is signaled

– Tailcheck before send

– Rebuild inode from log when necessary

– RPCs for complex operations (e.g., O_APPEND)

• Log commit + Persist: ~500 CPU cycles

[Figure: log commits #1, #2, #3 from Client 1 and Client 2 pass through the MDS RecvQ into the inode logs. Labels: MDS Tail, Local Tail, (memcpy), (flush+fence), Rebuild.]
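One way to picture speculative log commit is the toy race below. It is a sketch under assumptions, not ORION's actual algorithm: lists stand in for logs, the MDS is assumed to append entries in arrival order, and the divergence test in `rebuild_if_diverged` is a hypothetical way to decide when a rebuild is necessary.

```python
def tailcheck(local_log, mds_log):
    """Fetch entries the MDS has that the local copy lacks."""
    local_log.extend(mds_log[len(local_log):])


def speculative_commit(local_log, mds_log, entry):
    """Append locally and 'send' to the MDS without waiting for a reply."""
    local_log.append(entry)  # local append (memcpy, then flush+fence on NVMM)
    mds_log.append(entry)    # models the MDS handling the send in arrival order


def rebuild_if_diverged(local_log, mds_log):
    """Adopt the MDS order if the local log disagrees with it."""
    if local_log != mds_log[:len(local_log)]:
        local_log[:] = mds_log
        return True
    return False


mds_log = ["#1"]
client1, client2 = [], []
tailcheck(client1, mds_log)  # both clients see only "#1"
tailcheck(client2, mds_log)
speculative_commit(client1, mds_log, "#2")  # client 1 reaches the MDS first
speculative_commit(client2, mds_log, "#3")  # client 2 committed on a stale tail

rebuilt1 = rebuild_if_diverged(client1, mds_log)
rebuilt2 = rebuild_if_diverged(client2, mds_log)
print(rebuilt1, rebuilt2, client2)
```

In this toy run only the second client has to rebuild, which illustrates why the optimistic path is attractive when concurrent commits to the same inode are rare.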


Evaluation

ORION Prototype

• ORION kernel modules (~15K LOC)

• Linux Kernel 4.10

• RDMA Stack: MLNX_OFED 4.3

• Bind to 1 core for each client

Networking

• 12 Nodes connected to a switch

• InfiniBand Switch (QLogic 12300)

Hardware

• 2x Intel Westmere-EP CPU

• 16GB DRAM as DRAM

• 32GB DRAM as NVMM

• RNIC: Mellanox ConnectX-2 VPI (40Gbps)

Evaluation: File Operations

[Figure: access latency (us) for Create, Mkdir, Unlink, Rmdir, FIO 4K Read, and FIO 4K Write; lower is better. Panel 1: Orion vs. Distributed File Systems (Orion, Ceph, Gluster), y-axis 0 to 100, off-scale values 582/932 and 417. Panel 2: Orion vs. Local File Systems (Orion, NOVA, Ext4-DAX), y-axis 0 to 30. Annotations: 1.3x, 162x, 3x, 1.3x.]

Evaluation: Applications

[Figure: Filebench workloads (varmail, fileserver, webserver) and MongoDB (YCSB-A) throughput, relative to NOVA, for Orion, NOVA, Ext4-DAX, Ceph, and Gluster; y-axis 0 to 1, higher is better. Annotations: 68%, 85%.]

Evaluation: Metadata Accesses

[Figure, panel 1: Relative throughput of metadata operations. x-axis: # Clients (1 to 8); y-axis: throughput relative to 1 client (0 to 8); series: Tailcheck, Log Commit, RPC.]

[Figure, panel 2: Throughput of metadata operations with 8 clients. y-axis: throughput (Mop/s, 0 to 15); series: Tailcheck, Log Commit, RPC.]


Conclusion

• Existing distributed file systems lack NVMM support and have significant software overhead

• ORION unifies the NVMM file system and the networking layer

• ORION provides fast metadata accesses

• ORION allows DAX to local NVMM data

• Performance comparable to local NVMM file systems