ofvwg: gpudirect and peerdirect - openfabrics alliance › ofv › ofv_presentation_gpu.pdf ·...

18
OFVWG: GPUDirect and PeerDirect Connecting co-processors to the fabric Shachar Raindel and Davide Rossetti

Upload: others

Post on 07-Jun-2020

19 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: OFVWG: GPUDirect and PeerDirect - OpenFabrics Alliance › ofv › ofv_presentation_GPU.pdf · OFVWG: GPUDirect and PeerDirect Connecting co-processors to the fabric Shachar Raindel

OFVWG: GPUDirect and PeerDirect

Connecting co-processors to the fabric

Shachar Raindel and Davide Rossetti

Page 2: OFVWG: GPUDirect and PeerDirect - OpenFabrics Alliance › ofv › ofv_presentation_GPU.pdf · OFVWG: GPUDirect and PeerDirect Connecting co-processors to the fabric Shachar Raindel

Outline

•  Revisit GPUDirect and PeerDirect family of technologies

•  Next level of peripheral integration – GPUDirect Async and PeerDirect Sync

•  Early benchmark results

Page 3: OFVWG: GPUDirect and PeerDirect - OpenFabrics Alliance › ofv › ofv_presentation_GPU.pdf · OFVWG: GPUDirect and PeerDirect Connecting co-processors to the fabric Shachar Raindel

GPUDirect and PeerDirect family

•  GPUDirect Shared GPU-Sysmem for inter-node copy optimization

•  GPUDirect P2P for intra-node, between GPUs in the node

•  GPUDirect RDMA1 for inter-node copy optimization –  Based on PeerDirect technology from Mellanox –  Available in Mellanox OFED –  Submitted to upstream for review2

[1] http://docs.nvidia.com/cuda/gpudirect-rdma

[2] http://comments.gmane.org/gmane.linux.drivers.rdma/22093

Page 4: OFVWG: GPUDirect and PeerDirect - OpenFabrics Alliance › ofv › ofv_presentation_GPU.pdf · OFVWG: GPUDirect and PeerDirect Connecting co-processors to the fabric Shachar Raindel

GPUDirect RDMA capabilities & limitations •  GPUDirect RDMA

–  direct HCA access to GPU memory

•  CPU still driving computing + communication –  Fast CPU needed –  Implications: power, latency, TCO –  Risks: limited scaling …

Page 5: OFVWG: GPUDirect and PeerDirect - OpenFabrics Alliance › ofv › ofv_presentation_GPU.pdf · OFVWG: GPUDirect and PeerDirect Connecting co-processors to the fabric Shachar Raindel

Moving data around with GPUDirect

Dat

a pla

ne

Control plane

GPUDirect RDMA GPU$

CPU$

IOH$

HCA

CPU prepares and queues communication tasks on HCA CPU synchronizes with GPU tasks HCA directly accesses GPU memory

CPU

GPU

Page 6: OFVWG: GPUDirect and PeerDirect - OpenFabrics Alliance › ofv › ofv_presentation_GPU.pdf · OFVWG: GPUDirect and PeerDirect Connecting co-processors to the fabric Shachar Raindel

GPU$

Accelerating the control plane D

ata

pla

ne

Control plane

GPUDirect Async CPU$

IOH$

CPU prepares and queues compute and communication tasks on GPU

GPU triggers communication on HCA HCA directly accesses GPU memory

HCA

CPU GPU

GPU

GPUDirect RDMA

Page 7: OFVWG: GPUDirect and PeerDirect - OpenFabrics Alliance › ofv › ofv_presentation_GPU.pdf · OFVWG: GPUDirect and PeerDirect Connecting co-processors to the fabric Shachar Raindel

CPU off the critical path

•  CPU prepares work plan –  hardly parallelizable, branch intensive

•  GPU orchestrates flow –  Runs on optimized front-end unit –  Same one scheduling GPU work –  Now also scheduling network

communications

Page 8: OFVWG: GPUDirect and PeerDirect - OpenFabrics Alliance › ofv › ofv_presentation_GPU.pdf · OFVWG: GPUDirect and PeerDirect Connecting co-processors to the fabric Shachar Raindel

Kernel+Send Normal flow a_kernel<<<…,stream>>>(buf); cudaStreamSynchronize(stream); ibv_post_send(buf); while (!done) ibv_poll_cq(txcq); b_kernel<<<…,stream>>>(buf);

GPU CPU HCA

Page 9: OFVWG: GPUDirect and PeerDirect - OpenFabrics Alliance › ofv › ofv_presentation_GPU.pdf · OFVWG: GPUDirect and PeerDirect Connecting co-processors to the fabric Shachar Raindel

Kernel+Send GPUDirect Async

a_kernel<<<…,stream>>>(buf);

gds_stream_queue_send(stream,qp,buf);

gds_stream_wait_cq(stream,txcq);

b_kernel<<…,stream>>(buf); $

GPU CPU HCA

CPU is free

Kernel launch latency is hidden

Page 10: OFVWG: GPUDirect and PeerDirect - OpenFabrics Alliance › ofv › ofv_presentation_GPU.pdf · OFVWG: GPUDirect and PeerDirect Connecting co-processors to the fabric Shachar Raindel

Receive+Kernel Normal flow

while$(!done)$ibv_poll_cq();$

a_kernel<<<…,stream>>>(buf);$cuStreamSynchronize(stream);$$

GPU CPU HCA

incoming message

GPU kernel execution triggered

Page 11: OFVWG: GPUDirect and PeerDirect - OpenFabrics Alliance › ofv › ofv_presentation_GPU.pdf · OFVWG: GPUDirect and PeerDirect Connecting co-processors to the fabric Shachar Raindel

Receive+Kernel GPUDirect Async

gds_stream_wait_cq(stream,rx_cq);$

a_kernel<<<…,stream>>>(buf);$

cuStreamSynchronize(stream);$$

GPU CPU HCA

kernel queued to GPU

incoming message

Kernel launch moved way

earlier latency is hidden!!!

CPU is idle deep sleep state!!!

GPU kernel execution triggered

Page 12: OFVWG: GPUDirect and PeerDirect - OpenFabrics Alliance › ofv › ofv_presentation_GPU.pdf · OFVWG: GPUDirect and PeerDirect Connecting co-processors to the fabric Shachar Raindel

Use case scenarios

Performance mode (~ Top500)

–  enable batching

–  increase performance

–  CPU available, additional GFlops

Economy mode (~ Green500)

–  enable GPU IRQ waiting mode

–  free more CPU cycles

–  Optionally slimmer CPU

Page 13: OFVWG: GPUDirect and PeerDirect - OpenFabrics Alliance › ofv › ofv_presentation_GPU.pdf · OFVWG: GPUDirect and PeerDirect Connecting co-processors to the fabric Shachar Raindel

0.00#

5.00#

10.00#

15.00#

20.00#

25.00#

30.00#

35.00#

40.00#

45.00#

50.00#

4096# 8192# 16384# 32768# 65536#

Latency((us)(

compute(buffer(size((Bytes)(

RDMA#only#

+Async#TX#only#

+Async#

Performance Mode

[*] modified ud_pingpong test: recv+GPU kernel+send on each side. 2 nodes: Ivy Bridge Xeon + K40 + Connect-IB + MLNX switch, 10000 iterations, message size: 128B, batch size: 20

40% faster

Page 14: OFVWG: GPUDirect and PeerDirect - OpenFabrics Alliance › ofv › ofv_presentation_GPU.pdf · OFVWG: GPUDirect and PeerDirect Connecting co-processors to the fabric Shachar Raindel

0"

10"

20"

30"

40"

50"

60"

70"

80"

2" 4"

Average'(m

e'pe

r'itera(o

n'

Number'of'nodes'

2D'stencil'benchmark'

RDMA"only"

+Async"

2D stencil benchmark

•  weak scaling

•  256^2 local lattice

•  2x1, 2x2 node grids

•  1 GPU per node

27% faster 23% faster

[*] 4 nodes: Ivy Bridge Xeon + K40 + Connect-IB + MLNX switch

Page 15: OFVWG: GPUDirect and PeerDirect - OpenFabrics Alliance › ofv › ofv_presentation_GPU.pdf · OFVWG: GPUDirect and PeerDirect Connecting co-processors to the fabric Shachar Raindel

Economy Mode

0.00#

20.00#

40.00#

60.00#

80.00#

100.00#

120.00#

140.00#

160.00#

180.00#

200.00#

#size=16384#

RDMA#only# 39.28#

RDMA#w/IRQ# 178.62#

+Async# 29.46#

latency((us)(

Round0trip(latency(

0%#

10%#

20%#

30%#

40%#

50%#

60%#

70%#

80%#

90%#

100%#

%#load#of#single#CPU#core#

CPU$u&liza&on$

RDMA#only# RDMA#w/IRQ# +Async#

25% faster

45% less CPU load

[*] modified ud_pingpong test, HW same as in previous slide

Page 16: OFVWG: GPUDirect and PeerDirect - OpenFabrics Alliance › ofv › ofv_presentation_GPU.pdf · OFVWG: GPUDirect and PeerDirect Connecting co-processors to the fabric Shachar Raindel

Summary

•  Meet Async, next generation of GPUDirect •  GPU orchestrates network operations •  CPU off the critical path •  40% faster, 45% less CPU load

Page 17: OFVWG: GPUDirect and PeerDirect - OpenFabrics Alliance › ofv › ofv_presentation_GPU.pdf · OFVWG: GPUDirect and PeerDirect Connecting co-processors to the fabric Shachar Raindel

OFVWG

Thank You

Page 18: OFVWG: GPUDirect and PeerDirect - OpenFabrics Alliance › ofv › ofv_presentation_GPU.pdf · OFVWG: GPUDirect and PeerDirect Connecting co-processors to the fabric Shachar Raindel

Performance vs Economy

[*] modified ud_pingpong test, HW same as in previous slide, NUMA binding to socket0/core0, SBIOS power-saving profile

Performance mode Economy mode