spin - usenix · you are here acsl - technion 8 summary background observations objective spin...

32
Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs Shai Bergman | Tanya Brokhman | Tzachi Cohen | Mark Silberstein ACSL - Technion SPIN:

Upload: others

Post on 21-Jan-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUsShai Bergman | Tanya Brokhman | Tzachi Cohen | Mark Silberstein

ACSL - Technion

SPIN:

Page 2: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

2ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

Summary What do we do?

Enable efficient file I/O for GPUs

Why?

Support diverse I/O workloads involving GPUs

How?

Make P2P a first class citizen within the file I/O stack

Results

Better throughputStandard file API

cross-GPU portability

Page 3: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

3ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

Fast data transfers

Bounded by extra copy

Data resides in SSD

BackgroundCPU mediated data

transfers introduce extra latency with lower

throughput

CPUIO - CPU mediated transfer

PCIe

Page 4: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

4ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

Eliminates redundant copies

Not involved in data path

Better power efficiency

Higher throughput & lower latency

Background

[1] J. Zhang, D. Donofrio, J. Shalf, M. T. Kandemir, and M. Jung, “NVMMU: A Non-volatile Memory Management Unit for Heterogeneous GPU-SSD Architectures,”

[2] H.-W. Tseng, Y. Liu, M. Gahagan, J. Li, Y. Jin, and S. Swanson, “Gullfoss: Accelerating and Simplifying Data Movement Among Heterogeneous Computing and Storage Resources,”

[3] M. Shihab, K. Taht, and M. Jung, “GPUDrive: Reconsidering Storage Accesses for GPU Acceleration,”[4] H.-W. Tseng, Q. Zhao, Y. Zhou, M. Gahagan, and S. Swanson, “Morpheus: creating application objects efficiently for heterogeneous

computing,”[5]“Project Donard.” https://github.com/sbates130272/donard, 2015.

GPU vendors support P2P

&

PCIe

Page 5: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

5ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

0.00

0.50

1.00

1.50

2.00

2.50

3.00

512B53

4K262

16k472

32k675

64K846

512K2570

1M2692

Spee

du

p

Block size

CPUIO P2P

Observations

P2P is not a silver bullet…

P2P is great!*

* But wait – only for certain data access patterns

33x 7.6x

Sequential reads (e.g. grep)

P2P is not a silver bullet!

**data is not preloaded to the page cache

4.2x

CPUIO - CPU mediated transfer

Large sequential reads: P2P ~1.4x SpeedupShort sequential reads: P2P ~33x Slower than CPUIO?

Page 6: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

6ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

Hard to utilize

Non-standard API | No misaligned accesses | LVM/MDADM incompatible

No Page Cache Integration

No read ahead | Cannot utilize P$ for data reuse

ObservationsWhat went wrong?

P2P bypasses the kernel!

No file consistency

Can read stale data | Requires explicit flushes to SSD

Page 7: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

7ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

ObjectiveWhat do we want?

Regular file I/O to GPU memory

int fd;

...

//open file

...

pread64(fd,gpu_buffer,20*1024,0);

...

Page 8: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

8ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

SPIN:Contributions

Combine Page Cache and P2PInterleave system memory and SSD when possible

GPU Read AheadActivate read ahead mechanism when determined beneficial. Nested page cache within CPU memory for GPU Use

Standard File API• Underlying block device

support (RAID, LVM)

Data Consistency + POSIX file semanticsKeep POSIX file semantics + data consistency, even when CPU + GPU work on the same file

• Activate P2P when beneficial

Page 9: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

9ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

NVMe SSD

SPIN

P-routerP-Read Ahead policy

P-cache checker

P2PDMA

VFSPage cache

BlockLayer

NVMeDriver

GPU

GPU buffer

GPU addr. extraction

Current FS API

file

2.a 2.b

3.b

3.a

4.b

4.a

PCIe

P2P

DM

A t

ran

sfe

r

5.a

fd GPU buffer

1

pread( , )

5.b

GPURA

CPU PC

SPIN:High level overview

RQ

From Compatible SSD?Destined to GPU?Part of a sequential read?Data resides in page cache?

Page 10: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

10ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

NVMe SSD

SPIN

P-routerP-Read Ahead policy

P-cache checker

P2PDMA

VFSPage cache

BlockLayer

NVMeDriver

GPU

GPU buffer

GPU addr. extraction

Current FS API

file

2.a 2.b

3.b

3.a

4.b

4.a

PCIe

P2P

DM

A t

ran

sfe

r

5.a

fd GPU buffer

1

pread( , )

5.b

GPURA

CPU PC

SPIN:High level overview RQ

Page 11: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

11ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

NVMe SSD

SPIN

P-routerP-Read Ahead policy

P-cache checker

P2PDMA

VFSPage cache

BlockLayer

NVMeDriver

GPU

GPU buffer

GPU addr. extraction

Current FS API

file

2.a 2.b

3.b

3.a

4.b

4.a

PCIe

P2P

DM

A t

ran

sfe

r

5.a

fd GPU buffer

1

pread( , )

5.b

GPURA

CPU PC

SPIN:High level overview

RQ

Page 12: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

12ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

NVMe SSD

SPIN

P-routerP-Read Ahead policy

P-cache checker

P2PDMA

VFSPage cache

BlockLayer

NVMeDriver

GPU

GPU buffer

GPU addr. extraction

Current FS API

file

2.a 2.b

3.b

3.a

4.b

4.a

PCIe

P2P

DM

A t

ran

sfe

r

5.a

fd GPU buffer

1

pread( , )

5.b

GPURA

CPU PC

SPIN:High level overview RQ

Page 13: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

13ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

NVMe SSD

SPIN

P-routerP-Read Ahead policy

P-cache checker

P2PDMA

VFSPage cache

BlockLayer

NVMeDriver

GPU

GPU buffer

GPU addr. extraction

Current FS API

file

2.a 2.b

3.b

3.a

4.b

4.a

PCIe

P2P

DM

A t

ran

sfe

r

5.a

fd GPU buffer

1

pread( , )

5.b

GPURA

CPU PC

SPIN:High level overview

RQ

Page 14: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

14ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

NVMe SSD

SPIN

P-routerP-Read Ahead policy

P-cache checker

P2PDMA

VFSPage cache

BlockLayer

NVMeDriver

GPU

GPU buffer

GPU addr. extraction

Current FS API

file

2.a 2.b

3.b

3.a

4.b

4.a

PCIe

P2P

DM

A t

ran

sfe

r

5.a

fd GPU buffer

1

pread( , )

5.b

GPURA

CPU PC

SPIN:High level overview RQRQ

Page 15: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

15ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

NVMe SSD

SPIN

P-routerP-Read Ahead policy

P-cache checker

P2PDMA

VFSPage cache

BlockLayer

NVMeDriver

GPU

GPU buffer

GPU addr. extraction

Current FS API

file

2.a 2.b

3.b

3.a

4.b

4.a

PCIe

P2P

DM

A t

ran

sfe

r

5.a

fd GPU buffer

1

pread( , )

5.b

GPURA

CPU PC

SPIN:High level overview

P-cache checkerP-cache checker

Page 16: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

16ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

SPIN:Combine page cache and P2P

Sometimes the requested data resides in the P$

e.g due to previous usage of the data by CPU

0

500

1000

1500

2000

2500

0 500 1000 1500 2000Tr

ansf

er T

ime

[use

c]

Request Size [KiB]

P2P

P$

Transferring data from P$ is faster!

Transfer time from P$ and SSD P2P vs request size

Lower is better

Page 17: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

17ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

SPIN:Combine page cache and P2P

pread64(fd,gpu_dest,5*4096,0); //5 pages of 4KiB

Page cache is empty:

PCIe

Memory Bus

TransferP2P!

P-cache checker

Page Cache

EMPTY

Page #0 Page #1 Page #2 Page #3 Page #4

Page 18: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

18ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

Page #0 Page #1 Page #2 Page #3 Page #4

SPIN:Combine page cache and P2P

pread64(fd,gpu_destk,5*4096,0); //5 pages of 4KiB

All pread64 contents in P$:

PCIe

Memory Bus

Transferfrom memory (CPUIO)

P-cache checker

Page #0 Page #1 Page #2 Page #3 Page #4

Page Cache

EMPTY

Page Cache

Page 19: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

19ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

Page #0 Page #1 Page #2 Page #3 Page #4

SPIN:Combine page cache and P2P

pread64(fd,gpu_dest,5*4096,0); //5 pages of 4KiB

Some of the data is in P$?

PCIe

Memory Bus

Page #0 Page #1 Page #2 Page #3 Page #4

Page Cache

EMPTY

Page Cache

?

?#p1#p3

#p0#p2#p4

Fine grained interleaving is a bad idea!

Page resides in SSD only

Page resides in SSD and P$

Page 20: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

20ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

Page #0 Page #1 Page #2 Page #3 Page #4

SPIN:Combine page cache and P2P

pread64(fd,gpu_dest,5*4096,0); //5 pages of 4KiB

Page #0 Page #1 Page #2 Page #3 Page #4

Page resides in SSD only

Page resides in SSD and P$

3 transfers of 4KiB via P2P: 120.3us

Single transfer of 20KiB via P2P: 74.3us

40.1us 40.1us 40.1us

Page 21: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

21ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

SPIN:Combine page cache and P2P SSDs:

- Short IO requests are less efficient (low parallelism)- Invocation overhead per request

Fine grained interleaving = poor performance!

Optimization Problem: Find the transfer schedule to minimize transfer time

Page 22: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

22ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

0

500

1000

1500

2000

2500

32 512 992 1472 1952

Tran

sfer

Tim

e [s

ec]

Request Size [KiB]

P2P

P$

15361024 204851232

SPIN:Combine page cache and P2P

We model the SSD and RAM performance characteristics:- Assume P2P transfer time as piece-wise linear- Assume RAM transfer time as linear

To solve the problem & get an optimal schedule we need:

𝑇𝑝2𝑝 𝑠 - P2P transfer time for a given request size

𝑇𝑃$ 𝑠 - P$ transfer time for a given request size

Page 23: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

23ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

SPIN:Combine page cache and P2P

Solution is polynomial in number of blocks

We apply a greedy heuristic:- Examine every 3 consecutive data chunks

Costly to calculate for every transfer

Chunk #n Chunk #n+1 Chunk #n+2

Page resides in SSD only

Page resides in SSD and P$

𝑇𝑝2𝑝 𝑛 + 𝑛 + 1 + |𝑛 + 2|

𝑣𝑠.𝑇𝑝2𝑝 |𝑛| + 𝑇𝑃$ |𝑛 + 1| + 𝑇𝑝2𝑝 |𝑛 + 2|

Calculate:

Page 24: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

24ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

SPIN:Combine page cache and P2P

Solution is polynomial in number of blocks

We apply a greedy heuristic:- Examine every 3 consecutive data chunks

Costly to calculate for every transfer

Chunk #n Chunk #n+1 Chunk #n+2

Page resides in SSD only

Page resides in SSD and P$

𝑇𝑝2𝑝 𝑛 + 𝑛 + 1 + |𝑛 + 2|

𝑣𝑠.𝑇𝑝2𝑝 |𝑛| + 𝑇𝑃$ |𝑛 + 1| + 𝑇𝑝2𝑝 |𝑛 + 2|

Calculate:

Greedy Heuristic is only 1.6% slower than optimal

scheduling

Page 25: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

25ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

SPIN:Implementation:P2P & P$ Transfers

NVMe SSD

SPIN

P-routerP-Read Ahead policy

P-cache checker

P2PDMA

VFSPage cache

BlockLayer

NVMeDriver

GPU

GPU buffer

GPU addr. extraction

Current FS API

file

2.a 2.b

3.b

3.a

4.b

4.a

PCIe

P2P

DM

A t

ran

sfe

r

5.a

fd GPU buffer

1

pread( , )

5.b

GPURA

CPU PC

P2P: Address tunneling mechanism

P$: Memcpy from P$ to GPU mapped memory

Page 26: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

26ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

SPIN:Implementation& Evaluation

SPIN is implemented as a kernel module, patched

NVME module & an LD_PRELOAD library

No kernel modifications are required

System Specs:

- Intel P3700 NVME SSD

- AMD Radeon R9 Fury & NVIDIA Tesla K40c

- Ubuntu + Linux kernel 3.19

- Intel Core i7-5930K (6 Phys Cores) & X99 Chipset

- 24GB DDR4 RAM

Page 27: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

27ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

SPIN:Implementation& Evaluation

We have evaluated the following:

Threaded IO (TIOtest) Benchmark (1-4 threads):

- Sequential reads (including software RAID)

- Random reads/writes

- Effects of P$ residency on read throughput

- Effects CPU & I/O stress on read throughput

Application Benchmarks

- Aerial imagery rendering

- GPU accelerated log server

- Image collage utilizing GPUFS

Page 28: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

28ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

SPIN:Implementation& Evaluation

Effect of P$ on Read Throughput

Potential performance gains for producer-consumer workloads

0

20

40

60

80

100

120

07

108

208

3010

4011

5014

6017

7022

8032

9058

100239

Rel

ativ

e th

rou

ghp

ut

%

% of file in page cache

SPIN P2PDMA CPUIO

Thpt [MB/sec]:

*512B reads

No data in P$, less than 5%

overhead

All data in P$, less than 5%

overhead

Page 29: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

29ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

SPIN:Implementation& Evaluation

GPU Accelerated Log Server

- Store a log into SSD

- Analyze log using GPU acceleration for string matching

- Similar to fail2ban

Real time configuration:- Log arrives to server- Server stores logs in SSD- GPU analyzes logs by reading file

Offline configuration:- Log is already in SSD

- GPU analyzes logs by reading file

Page 30: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

30ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

SPIN:Implementation& Evaluation

GPU Accelerated Log Server

- Store a log into SSD

- Analyze log using GPU acceleration for string matching

- Similar to fail2ban

Real time configuration:- Log arrives to server- Server stores logs in SSD- GPU analyzes logs by reading file

Offline configuration:- Log is already in SSD

- GPU analyzes logs by reading file

We want our application to work efficiently in any

configuration

Page 31: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

31ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

SPIN:Implementation& Evaluation

GPU Accelerated Log Server

Real Time configuration

0

500

1000

1500

2000

2500

P2P CPUIO SPIN

BW

[M

BP

S]

Transfer Mechanism

Offline configuration

0

500

1000

1500

2000

2500

3000

P2P CPUIO SPIN

BW

[M

BP

S]

Transfer Mechanism

Data resides in p$ and SSDSPIN reads data from P$

Data resides in SSD onlySPIN utilizes P2P

I/O

Co

nte

nti

on

Ad

dit

ion

al “

ho

p”

Page 32: SPIN - USENIX · You are here ACSL - Technion 8 Summary Background Observations Objective SPIN Conclusion SPIN: Contributions Combine Page Cache and P2P Interleave system memory and

32ACSL - TechnionYou are here

Summary

Background

Observations

Objective

SPIN

Conclusion

SPIN:Conclusion

Thank you!

[email protected] github.com/acsl-technion/spin

- SPIN seamlessly integrates P2P as a first class citizen

into the file I/O stack

- SPIN utilizes several mechanisms to speed up data

transfers transparently

- With SPIN, the same code performs well under all

setups