GPUDIRECT: INTEGRATING THE GPU WITH A NETWORK INTERFACE

DAVIDE ROSSETTI, SW COMPUTE TEAM

TRANSCRIPT

Page 1: GPUDIRECT: INTEGRATING THE GPU WITH A NETWORK INTERFACE

DAVIDE ROSSETTI, SW COMPUTE TEAM

GPUDIRECT: INTEGRATING THE GPU WITH A NETWORK INTERFACE

Page 2:

GPUDIRECT FAMILY [1]

GPUDirect Shared GPU-Sysmem for inter-node copy optimization

GPUDirect P2P for intra-node, accelerated GPU-GPU memcpy

GPUDirect P2P for intra-node, inter-GPU LD/ST access

GPUDirect RDMA [2] for inter-node copy optimization

[1] developer info: https://developer.nvidia.com/gpudirect

[2] http://docs.nvidia.com/cuda/gpudirect-rdma
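The two P2P entries above can be illustrated with the standard CUDA runtime peer-access calls; a minimal sketch, assuming two peer-capable GPUs (devices 0 and 1) in one node:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int can01 = 0, can10 = 0;
    // Check that the two GPUs can reach each other's memory directly
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) { printf("no P2P path\n"); return 1; }

    // Enable inter-GPU LD/ST access in both directions (GPUDirect P2P)
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);

    size_t bytes = 1 << 20;
    void *buf0, *buf1;
    cudaSetDevice(0); cudaMalloc(&buf0, bytes);
    cudaSetDevice(1); cudaMalloc(&buf1, bytes);

    // Accelerated GPU-GPU memcpy: data moves directly between the
    // devices instead of being staged through system memory
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();
    return 0;
}
```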

Page 3:

GPUDIRECT RDMA CAPABILITIES & LIMITATIONS

GPUDirect RDMA: direct HCA access to GPU memory

CPU still driving computing + communication

Fast CPU needed

Implications: power, latency, TCO

Risks: limited scaling …

Page 4:

MOVING DATA AROUND

[Diagram: GPUDirect RDMA — data plane and control plane across GPU, CPU, IOH, HCA]

CPU prepares and queues communication tasks on HCA

CPU synchronizes with GPU tasks

HCA directly accesses GPU memory

Page 5:

MEET NEXT THING

[Diagram: GPUDirect Async alongside GPUDirect RDMA — data plane and control plane across GPU, CPU, IOH, HCA]

CPU prepares and queues compute and communication tasks on GPU

GPU triggers communication on HCA

HCA directly accesses GPU memory

Page 6:

CPU OFF THE CRITICAL PATH

CPU prepares the work plan: hardly parallelizable, branch-intensive

GPU orchestrates the flow

Runs on an optimized scheduling unit, the same one scheduling GPU work, now also scheduling network communications

Page 7:

KERNEL+SEND NORMAL FLOW

a_kernel<<<…,stream>>>(buf);

cudaStreamSynchronize(stream);

ibv_post_send(buf);

while (!done) ibv_poll_cq(txcq);

b_kernel<<<…,stream>>>(buf);

[Timeline: GPU / CPU / HCA]
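Fleshed out, the normal flow above looks roughly like this. This is a sketch, not a complete program: qp, txcq, and the signaled send work request `wr` (pointing at `buf`, GPU memory registered with the HCA via GPUDirect RDMA) are assumed to be set up by the usual ibverbs boilerplate, omitted here.

```cuda
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

extern struct ibv_qp *qp;        // assumed created elsewhere
extern struct ibv_cq *txcq;      // send completion queue
extern struct ibv_send_wr wr;    // signaled send, SGE points at buf

__global__ void a_kernel(void *buf);
__global__ void b_kernel(void *buf);

void kernel_plus_send(void *buf, cudaStream_t stream) {
    struct ibv_send_wr *bad_wr;
    struct ibv_wc wc;

    a_kernel<<<1, 256, 0, stream>>>(buf);

    // CPU blocks: the send cannot be posted until the kernel finishes
    cudaStreamSynchronize(stream);

    ibv_post_send(qp, &wr, &bad_wr);

    // CPU spins again until the HCA reports the send completion
    while (ibv_poll_cq(txcq, 1, &wc) == 0)
        ;

    b_kernel<<<1, 256, 0, stream>>>(buf);
}
```

Note that the CPU sits on the critical path twice: once waiting for the kernel, once waiting for the HCA.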

Page 8:

KERNEL+SEND GPUDIRECT ASYNC

a_kernel<<<…,stream>>>(buf);

gds_stream_queue_send(stream,qp,buf);

gds_stream_wait_cq(stream,txcq);

b_kernel<<<…,stream>>>(buf);

[Timeline: GPU / CPU / HCA]

CPU is free

Kernel launch latency is hidden
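The same sequence under GPUDirect Async, sketched with the gds_* calls named on the slide. The handle types and exact signatures below are assumptions made for illustration; the point is that every step is queued on the CUDA stream up front and the CPU returns immediately.

```cuda
#include <cuda_runtime.h>

__global__ void a_kernel(void *buf);
__global__ void b_kernel(void *buf);

// Hypothetical stream-ordered wrappers around the ibverbs objects,
// following the slide's gds_* naming
struct gds_qp; struct gds_cq;
extern struct gds_qp *qp;
extern struct gds_cq *txcq;
extern void gds_stream_queue_send(cudaStream_t s, struct gds_qp *qp, void *buf);
extern void gds_stream_wait_cq(cudaStream_t s, struct gds_cq *cq);

void kernel_plus_send_async(void *buf, cudaStream_t stream) {
    a_kernel<<<1, 256, 0, stream>>>(buf);

    // Queued, not executed: the GPU rings the HCA doorbell only
    // after a_kernel has finished
    gds_stream_queue_send(stream, qp, buf);

    // Queued wait: b_kernel starts only after the TX completion lands
    gds_stream_wait_cq(stream, txcq);

    b_kernel<<<1, 256, 0, stream>>>(buf);
    // CPU returns here immediately: it is off the critical path
}
```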

Page 9:

RECEIVE+KERNEL NORMAL FLOW

while (!done) ibv_poll_cq();

a_kernel<<<…,stream>>>(buf);

cuStreamSynchronize(stream);

[Timeline: GPU / CPU / HCA]

incoming message

GPU kernel execution triggered
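The receive side of the normal flow can be sketched the same way; rxcq and the pre-posted receive buffers are again assumed to come from standard ibverbs setup, omitted here.

```cuda
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

extern struct ibv_cq *rxcq;   // receive completion queue, set up elsewhere

__global__ void a_kernel(void *buf);

void receive_plus_kernel(void *buf, cudaStream_t stream) {
    struct ibv_wc wc;

    // CPU spins until the incoming message arrives
    while (ibv_poll_cq(rxcq, 1, &wc) == 0)
        ;

    // Launched only now, so kernel launch latency sits on the critical
    // path between message arrival and kernel execution
    a_kernel<<<1, 256, 0, stream>>>(buf);
    cudaStreamSynchronize(stream);
}
```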

Page 10:

RECEIVE+KERNEL GPUDIRECT ASYNC

gds_stream_wait_cq(stream,rx_cq);

a_kernel<<<…,stream>>>(buf);

cuStreamSynchronize(stream);

[Timeline: GPU / CPU / HCA]

kernel queued to GPU

incoming message

Kernel launch moved way earlier: launch latency is hidden!

CPU is idle, in a deep sleep state

GPU kernel execution triggered

Page 11:

USE CASE SCENARIOS

Performance mode (~ Top500)

enable batching

increase performance

CPU available, additional GFlops

Economy mode (~ Green500)

enable GPU IRQ waiting mode

free more CPU cycles

Optionally slimmer CPU

Page 12:

PERFORMANCE MODE

[Chart: latency (us) vs. compute buffer size (bytes), 4096 to 65536; series: RDMA only, +Async TX only, +Async]

40% faster

[*] modified ud_pingpong test: recv + GPU kernel + send on each side. 2 nodes: Ivy Bridge Xeon + K40 + Connect-IB + MLNX switch, 10000 iterations, message size: 128B, batch size: 20

Page 13:

2D STENCIL BENCHMARK

[Chart: average time per iteration vs. number of nodes (2, 4); series: RDMA only, +Async]

weak scaling

256^2 local lattice

2x1, 2x2 node grids

1 GPU per node

27% faster, 23% faster

[*] 4 nodes: Ivy Bridge Xeon + K40 + Connect-IB + MLNX switch

Page 14:

ECONOMY MODE

[Chart: CPU utilization, % load of a single CPU core; series: RDMA only, RDMA w/ IRQ, +Async]

Round-trip latency, size = 16384:

RDMA only: 39.28 us
RDMA w/ IRQ: 178.62 us
+Async: 29.46 us

25% faster

45% less CPU load

[*] modified ud_pingpong test, HW same as in previous slide

Page 15:

SUMMARY

Meet Async, next generation of GPUDirect

GPU orchestrates network operations

CPU off the critical path

40% faster, 45% less CPU load

Excited about these topics?

collaborations & jobs @NVIDIA

Page 16:

NVIDIA REGISTERED DEVELOPER PROGRAMS

Everything you need to develop with NVIDIA products

Membership is your first step in establishing a working relationship with NVIDIA Engineering

Exclusive access to pre-releases

Submit bugs and feature requests

Stay informed about latest releases and training opportunities

Access to exclusive downloads

Exclusive activities and special offers

Interact with other developers in the NVIDIA Developer Forums

REGISTER FOR FREE AT: developer.nvidia.com

Page 17:

THANK YOU

Page 18:

PERFORMANCE VS ECONOMY

[*] modified ud_pingpong test, HW same as in previous slide, NUMA binding to socket0/core0, SBIOS power-saving profile

[Chart: Performance mode vs. Economy mode]