TRANSCRIPT
MVAPICH2: A High Performance MPI Library for NVIDIA GPU Clusters with InfiniBand
Presentation at GTC 2013 by
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Current and Next Generation HPC Systems and Applications
• Growth of High Performance Computing (HPC)
  – Growth in processor performance
    • Chip density doubles every 18 months
  – Growth in commodity networking
    • Increase in speed/features + reducing cost
  – Growth in accelerators (NVIDIA GPUs)
Trends for Commodity Computing Clusters in the Top500 Supercomputer List (http://www.top500.org)
[Chart: number of clusters and percentage of clusters in the Top500 list over time]
Large-scale InfiniBand Installations
• 224 IB clusters (44.8%) in the November 2012 Top500 list (http://www.top500.org)
• Installations in the Top 20 (9 systems); two use NVIDIA GPUs
  – 147,456 cores (SuperMUC) in Germany (6th)
  – 204,900 cores (Stampede) at TACC (7th)
  – 77,184 cores (Curie thin nodes) at France/CEA (11th)
  – 120,640 cores (Nebulae) at China/NSCS (12th)
  – 72,288 cores (Yellowstone) at NCAR (13th)
  – 125,980 cores (Pleiades) at NASA/Ames (14th)
  – 70,560 cores (Helios) at Japan/IFERC (15th)
  – 73,278 cores (Tsubame 2.0) at Japan/GSIC (17th)
  – 138,368 cores (Tera-100) at France/CEA (20th)
• 54 of the InfiniBand clusters in the Top500 house accelerators/co-processors, and 42 of them have NVIDIA GPUs
Outline
• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU
  – Internode Communication
    • Point-to-point Communication
    • Collective Communication
    • MPI Datatype Processing
    • Using GPUDirect RDMA
  – Multi-GPU Configurations
• MPI and OpenACC
• Conclusion
InfiniBand + GPU systems (Past)
• Many applications today want to use systems that have both GPUs and high-speed networks such as InfiniBand
• Problem: lack of a common memory registration mechanism
  – Each device has to pin the host memory it will use
  – Many operating systems do not allow multiple devices to register the same memory pages
• Previous solution: use a different buffer for each device and copy data between them
GPU-Direct
• Collaboration between Mellanox and NVIDIA to converge on one memory registration technique
• Both devices register a common host buffer
  – The GPU copies data to this buffer, and the network adapter can directly read from this buffer (or vice versa)
• Note that GPU-Direct does not allow you to bypass host memory
Sample Code - Without MPI integration
[Diagram: data staged from the GPU through host memory over PCIe to the NIC and switch]

At Sender:
  cudaMemcpy(sbuf, sdev, . . .);
  MPI_Send(sbuf, size, . . .);

At Receiver:
  MPI_Recv(rbuf, size, . . .);
  cudaMemcpy(rdev, rbuf, . . .);

• Naïve implementation with standard MPI and CUDA
• High Productivity and Poor Performance
Sample Code – User Optimized Code
[Diagram: pipelined staging of data from the GPU through host memory to the NIC and switch]

At Sender:
  for (j = 0; j < pipeline_len; j++)
    cudaMemcpyAsync(sbuf + j * blk_sz, sdev + j * blk_sz, . . .);
  for (j = 0; j < pipeline_len; j++) {
    while (result != cudaSuccess) {
      result = cudaStreamQuery(…);
      if (j > 0) MPI_Test(…);
    }
    MPI_Isend(sbuf + j * blk_sz, blk_sz, . . .);
  }
  MPI_Waitall(…);

• Pipelining at user level with non-blocking MPI and CUDA interfaces
• Code at the Sender side (and repeated at the Receiver side)
• User-level copying may not match with internal MPI design
• High Performance and Poor Productivity
Can this be done within the MPI Library?
• Support GPU-to-GPU communication through standard MPI interfaces
  – e.g., enable MPI_Send, MPI_Recv from/to GPU memory
• Provide high performance without exposing low-level details to the programmer
  – Pipelined data transfer that automatically provides optimizations inside the MPI library without user tuning
• A new design was incorporated in MVAPICH2 to support this functionality
MVAPICH2/MVAPICH2-X Software
• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Used by more than 2,000 organizations (HPC centers, industry and universities) in 70 countries
  – More than 160,000 downloads from the OSU site directly
  – Empowering many Top500 clusters
    • 7th-ranked 204,900-core cluster (Stampede) at TACC
    • 14th-ranked 125,980-core cluster (Pleiades) at NASA
    • 17th-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
    • 75th-ranked 16,896-core cluster (Keeneland) at GaTech
    • and many others
  – Available with the software stacks of many IB, HSE and server vendors, including Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
Outline
• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU
  – Internode Communication
    • Point-to-point Communication
    • Collective Communication
    • MPI Datatype Processing
    • Using GPUDirect RDMA
  – Multi-GPU Configurations
• MPI and OpenACC
• Conclusion
Sample Code – MVAPICH2-GPU

At Sender:
  MPI_Send(s_device, size, …);

At Receiver:
  MPI_Recv(r_device, size, …);

(Pipelining handled inside MVAPICH2)

• MVAPICH2-GPU: standard MPI interfaces used
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers
• High Performance and High Productivity
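To make the pattern above concrete, here is a minimal, self-contained sketch (not taken from the slides) of a two-rank exchange in which the buffers live in GPU memory and the device pointers are passed straight to MPI; it assumes a CUDA-aware MPI build such as MVAPICH2 with GPU support enabled, and the buffer size is arbitrary.

  /* Minimal sketch: MPI_Send/MPI_Recv directly on GPU device pointers.
   * Assumes a CUDA-aware MPI (e.g., MVAPICH2 with GPU support enabled). */
  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv)
  {
      int rank, n = 1 << 20;                 /* 1M floats, arbitrary size for illustration */
      float *d_buf;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      cudaMalloc((void **)&d_buf, n * sizeof(float));    /* buffer in GPU device memory */

      if (rank == 0)
          MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   /* device pointer passed directly */
      else if (rank == 1)
          MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      cudaFree(d_buf);
      MPI_Finalize();
      return 0;
  }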
MPI Two-sided Communication
[Chart: internode GPU-to-GPU latency (us) vs. message size (32K-4M bytes) for Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU; lower is better]
• 45% improvement compared with a naïve user-level implementation (Memcpy+Send), for 4MB messages
• 24% improvement compared with an advanced user-level implementation (MemcpyAsync+Isend), for 4MB messages

H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters, ISC '11
Application-Level Evaluation (LBM and AWP-ODC)
• LBM-CUDA (courtesy: Carlos Rosale, TACC)
  – Lattice Boltzmann Method for multiphase flows with large density ratios
  – 1D LBM-CUDA: one process/GPU per node, 16 nodes, 4 groups data grid
• AWP-ODC (courtesy: Yifeng Cui, SDSC)
  – A seismic modeling code, Gordon Bell Prize finalist at SC 2010
  – 128x256x512 data grid per process, 8 nodes
• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory
[Chart: 1D LBM-CUDA step time (s) for domain sizes 256x256x256 through 512x512x512, MPI vs. MPI-GPU; MPI-GPU improves step time by 13.7%, 12.0%, 11.8% and 9.4% across the domain sizes]
[Chart: AWP-ODC total execution time (sec) with 1 and 2 GPUs/processes per node, MPI vs. MPI-GPU; MPI-GPU improves execution time by 11.1% and 7.9%]
Outline
• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU
  – Internode Communication
    • Point-to-point Communication
    • Collective Communication
    • MPI Datatype Processing
    • Using GPUDirect RDMA
  – Multi-GPU Configurations
• MPI and OpenACC
• Conclusion
Optimizing Collective Communication
[Diagram: MPI_Alltoall decomposed into N^2 point-to-point transfers, each requiring a DMA from device to host, an RDMA transfer to the remote node over the network, and a DMA from host back to device]
• Pipelined point-to-point communication optimizes each individual transfer
• Need for optimization at the algorithm level (see the sketch below)
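As a rough illustration of what algorithm-level optimization means here, the following conceptual sketch (not MVAPICH2's internal code) overlaps the device-to-host staging for one peer with the network transfers for others; d_sendbuf, h_sendbuf, h_recvbuf, d_recvbuf, chunk, streams, sreq and rreq are all illustrative names.

  /* Conceptual sketch of pipelining an alltoall from GPU buffers across peers. */
  for (int peer = 0; peer < nprocs; peer++)        /* stage each peer's block to the host asynchronously */
      cudaMemcpyAsync(h_sendbuf + peer * chunk, d_sendbuf + peer * chunk,
                      chunk, cudaMemcpyDeviceToHost, streams[peer]);

  for (int peer = 0; peer < nprocs; peer++) {
      cudaStreamSynchronize(streams[peer]);        /* wait only for this peer's block */
      MPI_Isend(h_sendbuf + peer * chunk, chunk, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &sreq[peer]);
      MPI_Irecv(h_recvbuf + peer * chunk, chunk, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &rreq[peer]);
  }

  for (int peer = 0; peer < nprocs; peer++) {      /* move each block back to the device as it arrives */
      MPI_Wait(&rreq[peer], MPI_STATUS_IGNORE);
      cudaMemcpyAsync(d_recvbuf + peer * chunk, h_recvbuf + peer * chunk,
                      chunk, cudaMemcpyHostToDevice, streams[peer]);
  }
  MPI_Waitall(nprocs, sreq, MPI_STATUSES_IGNORE);
  cudaDeviceSynchronize();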
Alltoall Latency Performance (Large Messages)
[Chart: MPI_Alltoall latency (us) vs. message size (128K-2M) with no MPI-level optimization vs. collective-level optimization; up to 46% improvement; lower is better]
• 8-node Westmere cluster with NVIDIA Tesla C2050 and IB QDR

A. Singh, S. Potluri, H. Wang, K. Kandalla, S. Sur and D. K. Panda, MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits, Workshop on Parallel Programming on Accelerator Clusters (PPAC '11), held in conjunction with Cluster '11, Sept. 2011
Outline
• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU
  – Internode Communication
    • Point-to-point Communication
    • Collective Communication
    • MPI Datatype Processing
    • Using GPUDirect RDMA
  – Multi-GPU Configurations
• MPI and OpenACC
• Conclusion
Non-contiguous Data Exchange
• Multi-dimensional data
  – Row-based organization
  – Contiguous in one dimension
  – Non-contiguous in the other dimensions
• Halo data exchange
  – Duplicate the boundary
  – Exchange the boundary in each iteration
[Figure: halo data exchange between neighboring sub-domains]
Datatype Support in MPI
• Native datatype support in MPI
  – Operate on customized datatypes to improve productivity
  – Enable the MPI library to optimize non-contiguous data

At Sender:
  MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
  MPI_Type_commit(&new_type);
  …
  MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);

• What will happen if the non-contiguous data is in GPU device memory?
• Enhanced MVAPICH2
  – Uses datatype-specific CUDA kernels to pack data in chunks
  – Pipelines pack/unpack, CUDA copies, and RDMA transfers

H. Wang, S. Potluri, M. Luo, A. Singh, X. Ouyang, S. Sur and D. K. Panda, Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2, IEEE Cluster '11, Sept. 2011
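As an illustrative sketch (not from the slides) of the vector-datatype pattern above applied to a halo exchange from GPU memory: d_grid, NY, NX and neighbor are hypothetical names, and a CUDA-aware MPI with GPU datatype support, as described above, is assumed.

  /* Exchange a non-contiguous column boundary of an NY x NX row-major grid
   * that lives in GPU device memory (d_grid is a device pointer). */
  MPI_Datatype column;
  MPI_Type_vector(NY,            /* number of blocks: one element per row */
                  1,             /* elements per block */
                  NX,            /* stride between blocks, in elements */
                  MPI_DOUBLE, &column);
  MPI_Type_commit(&column);

  /* The library packs the column on the GPU and pipelines the copies with the transfer. */
  MPI_Sendrecv(d_grid + NX - 1, 1, column, neighbor, 0,   /* send the rightmost column */
               d_grid,          1, column, neighbor, 0,   /* receive into the leftmost column */
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);

  MPI_Type_free(&column);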
Application-Level Evaluation (LBMGPU-3D)
• LBM-CUDA (courtesy: Carlos Rosale, TACC)
  – Lattice Boltzmann Method for multiphase flows with large density ratios
  – 3D LBM-CUDA: one process/GPU per node, 512x512x512 data grid, up to 64 nodes
• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory
[Chart: 3D LBM-CUDA total execution time (sec) on 8, 16, 32 and 64 GPUs, MPI vs. MPI-GPU; MPI-GPU improves execution time by 5.6%, 8.2%, 13.5% and 15.5%]
MVAPICH2 1.8 and 1.9 Series
• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High-performance intra-node point-to-point communication for nodes with multiple GPU adapters (GPU-GPU, GPU-Host and Host-GPU)
• Takes advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication with multiple GPU adapters per node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
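As a usage note: to the best of my knowledge, GPU device-buffer support in these releases is enabled at run time through the MV2_USE_CUDA environment variable; the exact invocation below is an assumption and should be checked against the MVAPICH2 user guide.

  # Assumed invocation: enable GPU support at run time for a CUDA-aware MPI job.
  mpirun_rsh -np 2 node001 node002 MV2_USE_CUDA=1 ./app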
OSU MPI Micro-Benchmarks (OMB) 3.5 – 3.9 Releases
• A comprehensive suite of benchmarks to compare the performance of different MPI stacks and networks
• Enhancements to measure MPI performance on GPU clusters
  – Latency, Bandwidth, Bi-directional Bandwidth
• Flexible selection of data movement between CPU (H) and GPU (D): D->D, D->H and H->D
• Extensions for OpenACC added in the 3.9 release
• Available from http://mvapich.cse.ohio-state.edu/benchmarks
• Available in an integrated manner with the MVAPICH2 stack

D. Bureddy, H. Wang, A. Venkatesh, S. Potluri and D. K. Panda, OMB-GPU: A Micro-benchmark Suite for Evaluating MPI Libraries on GPU Clusters, EuroMPI 2012, September 2012.
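For reference, the buffer placement in the GPU-enabled benchmarks is typically selected with trailing H/D arguments; the exact syntax below is an assumption and may differ across OMB releases.

  # Assumed OMB usage: D = GPU device buffer, H = host buffer.
  mpirun_rsh -np 2 node001 node002 MV2_USE_CUDA=1 ./osu_latency D D   # device-to-device latency
  mpirun_rsh -np 2 node001 node002 MV2_USE_CUDA=1 ./osu_bw D H        # device-to-host bandwidth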
Outline
• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU
  – Internode Communication
    • Point-to-point Communication
    • Collective Communication
    • MPI Datatype Processing
    • Using GPUDirect RDMA
  – Multi-GPU Configurations
• MPI and OpenACC
• Conclusion
GPU-Direct RDMA with CUDA 5.0
• Fastest possible communication between the GPU and other PCI-E devices
• Network adapter can directly read/write data from/to GPU device memory
• Avoids copies through the host
• Allows for better asynchronous communication
[Diagram: InfiniBand adapter accessing GPU memory directly across the chipset, bypassing system memory]
Initial Design of MVAPICH2 with GPU-Direct-RDMA
• A preliminary driver for GPU-Direct RDMA is being worked on by NVIDIA and Mellanox
• OSU has done an initial design of MVAPICH2 with the latest GPU-Direct-RDMA driver
Preliminary Performance Evaluation of OSU-MVAPICH2 with GPU-Direct-RDMA
• Performance evaluation has been carried out on four platform configurations:
  – Sandy Bridge, IB FDR, K20C
  – WestmereEP, IB FDR, K20C
  – Sandy Bridge, IB QDR, K20C
  – WestmereEP, IB QDR, K20C
Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA
GPU-GPU Internode MPI Latency - Sandy Bridge + K20C + IB FDR
Based on MVAPICH2-1.9b; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch
[Chart: small-message latency (us), 1 byte - 4K, MV2 vs. MV2-GDR-Hybrid; GDR-Hybrid reduces latency by 69%, 6.54 us vs. 20.92 us; lower is better]
[Chart: large-message latency (us), 8K - 2M, MV2 vs. MV2-GDR-Hybrid; lower is better]
Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA
GPU-GPU Internode MPI Uni-directional Bandwidth - Sandy Bridge + K20C + IB FDR
Based on MVAPICH2-1.9b; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch
[Chart: small-message bandwidth (MB/s), 1 byte - 4K, MV2 vs. MV2-GDR-Hybrid; up to 3x improvement; higher is better]
[Chart: large-message bandwidth (MB/s), 8K - 2M, MV2 vs. MV2-GDR-Hybrid; up to 26% improvement; higher is better]
Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA
GPU-GPU Internode MPI Bi-directional Bandwidth - Sandy Bridge + K20C + IB FDR
Based on MVAPICH2-1.9b; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch
[Chart: small-message bi-directional bandwidth (MB/s), 1 byte - 4K, MV2 vs. MV2-GDR-Hybrid; up to 3.25x improvement; higher is better]
[Chart: large-message bi-directional bandwidth (MB/s), 8K - 2M, MV2 vs. MV2-GDR-Hybrid; up to 31% improvement; higher is better]
Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA
GPU-GPU Internode MPI Latency - WestmereEP + K20C + IB FDR
Based on MVAPICH2-1.9b; WestmereEP (E5645) node with 12 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch
[Chart: small-message latency (us), 1 byte - 4K, MV2 vs. MV2-GDR-Hybrid; GDR-Hybrid reduces latency by 60%, 9.93 us vs. 24.92 us; lower is better]
[Chart: large-message latency (us), 8K - 2M, MV2 vs. MV2-GDR-Hybrid; lower is better]
Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA
GPU-GPU Internode MPI Uni-directional Bandwidth - WestmereEP + K20C + IB FDR
Based on MVAPICH2-1.9b; WestmereEP (E5645) node with 12 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch
[Chart: small-message bandwidth (MB/s), 1 byte - 4K, MV2 vs. MV2-GDR-Hybrid; up to 5.8x improvement; higher is better]
[Chart: large-message bandwidth (MB/s), 8K - 2M, MV2 vs. MV2-GDR-Hybrid; up to 30% improvement; higher is better]
Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA
GPU-GPU Internode MPI Bi-directional Bandwidth - WestmereEP + K20C + IB FDR
Based on MVAPICH2-1.9b; WestmereEP (E5645) node with 12 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch
[Chart: small-message bi-directional bandwidth (MB/s), 1 byte - 4K, MV2 vs. MV2-GDR-Hybrid; up to 6x improvement; higher is better]
[Chart: large-message bi-directional bandwidth (MB/s), 8K - 2M, MV2 vs. MV2-GDR-Hybrid; up to 90% improvement; higher is better]
Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA
GPU-GPU Internode MPI Latency - Sandy Bridge + K20C + IB QDR
Based on MVAPICH2-1.9b; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-2 QDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch
[Chart: small-message latency (us), 1 byte - 4K, MV2 vs. MV2-GDR-Hybrid; GDR-Hybrid reduces latency by 68%, 6.85 us vs. 21.46 us; lower is better]
[Chart: large-message latency (us), 8K - 2M, MV2 vs. MV2-GDR-Hybrid; lower is better]
Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA
GPU-GPU Internode MPI Uni-directional Bandwidth - Sandy Bridge + K20C + IB QDR
Based on MVAPICH2-1.9b; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-2 QDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch
[Chart: small-message bandwidth (MB/s), 1 byte - 4K, MV2 vs. MV2-GDR-Hybrid; up to 3x improvement; higher is better]
[Chart: large-message bandwidth (MB/s), 8K - 2M, MV2 vs. MV2-GDR-Hybrid; up to 40% improvement; higher is better]
Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA
GPU-GPU Internode MPI Latency - WestmereEP + K20C + IB QDR
Based on MVAPICH2-1.9b; WestmereEP (E5645) node with 12 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-2 QDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch
[Chart: small-message latency (us), 1 byte - 4K, MV2 vs. MV2-GDR-Hybrid; GDR-Hybrid reduces latency by 60%, 10.43 us vs. 25.75 us; lower is better]
[Chart: large-message latency (us), 8K - 2M, MV2 vs. MV2-GDR-Hybrid; lower is better]
Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA
GPU-GPU Internode MPI Uni-directional Bandwidth - WestmereEP + K20C + IB QDR
Based on MVAPICH2-1.9b; WestmereEP (E5645) node with 12 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-2 QDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch
[Chart: small-message bandwidth (MB/s), 1 byte - 4K, MV2 vs. MV2-GDR-Hybrid; up to 6x improvement; higher is better]
[Chart: large-message bandwidth (MB/s), 8K - 2M, MV2 vs. MV2-GDR-Hybrid; up to 30% improvement; higher is better]
MVAPICH2 Release with GPUDirect RDMA Hybrid
• Further tuning and optimizations (such as collectives) to be done
• GPUDirect RDMA support in the OpenFabrics Enterprise Distribution (OFED) is expected during Q2 '13 (according to Mellanox)
• The MVAPICH2 release with GPUDirect RDMA support will be timed accordingly
Outline
• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU
  – Internode Communication
    • Point-to-point Communication
    • Collective Communication
    • MPI Datatype Processing
    • Using GPUDirect RDMA
  – Multi-GPU Configurations
• MPI and OpenACC
• Conclusion
Multi-GPU Configurations
[Diagram: a node with two GPUs, an HCA and an I/O hub, with two processes each driving one GPU]
• Multi-GPU node architectures are becoming common
• Until CUDA 3.2
  – Communication between processes staged through the host
  – Shared memory (pipelined)
  – Network loopback (asynchronous)
• CUDA 4.0
  – Inter-Process Communication (IPC)
  – Host bypass
  – Handled by a DMA engine
  – Low latency and asynchronous
  – Requires creation, exchange and mapping of memory handles - overhead (see the sketch below)
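For concreteness, here is a minimal sketch (not from the slides) of the CUDA IPC steps referred to above; rank, N, d_local and d_peer are illustrative names, and MVAPICH2 performs the equivalent steps internally so the application only calls MPI_Send/MPI_Recv.

  /* One process exports a device allocation via an IPC handle; the other maps it
   * and copies directly GPU-to-GPU, bypassing the host. */
  cudaIpcMemHandle_t handle;
  float *d_local, *d_peer;

  if (rank == 0) {
      cudaMalloc((void **)&d_local, N * sizeof(float));
      cudaIpcGetMemHandle(&handle, d_local);                              /* handle creation */
      MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);  /* handle exchange */
  } else if (rank == 1) {
      MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      cudaIpcOpenMemHandle((void **)&d_peer, handle,                      /* handle mapping */
                           cudaIpcMemLazyEnablePeerAccess);
      cudaMalloc((void **)&d_local, N * sizeof(float));
      cudaMemcpy(d_local, d_peer, N * sizeof(float), cudaMemcpyDefault);  /* direct GPU-to-GPU copy */
      cudaIpcCloseMemHandle(d_peer);
  }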
Comparison of Costs
[Chart: copy latency (usec) - CUDA IPC copy: 3 usec; copy via host: 49 usec; CUDA IPC copy + handle creation & mapping overhead: 228 usec]
• Comparison of bare copy costs between two processes on one node, each using a different GPU (outside MPI)
• MVAPICH2 takes advantage of CUDA IPC while hiding the handle creation and mapping overheads from the user
Two-sided Communication Performance
[Charts: intra-node GPU-to-GPU small-message latency (usec), large-message latency (usec) and bandwidth (MBps), SHARED-MEM vs. CUDA IPC; CUDA IPC shows improvements of 70%, 46% and 78%]
• Already available in MVAPICH2 1.8 and 1.9
One-sided Communication Performance (get + active synchronization vs. send/recv)
[Charts: intra-node GPU-to-GPU small-message latency (usec), large-message latency (usec) and bandwidth (MBps) for SHARED-MEM-1SC, CUDA-IPC-1SC and CUDA-IPC-2SC; one-sided semantics show improvements of up to 30% and 27%]
• One-sided semantics harness better performance compared to two-sided semantics (see the sketch below)
• Support for one-sided communication from GPUs will be available in future releases of MVAPICH2
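For reference, the following is a minimal sketch (not from the slides) of the "get + active synchronization" pattern being compared, written with standard MPI-2 one-sided calls on host buffers (per the slide, this path from GPU buffers is planned for future MVAPICH2 releases); win_buf, local_buf, N and rank are illustrative names.

  /* Expose win_buf in a window, then rank 0 reads rank 1's data with MPI_Get
   * inside a fence epoch (active synchronization). */
  MPI_Win win;
  MPI_Win_create(win_buf, N * sizeof(double), sizeof(double),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &win);

  MPI_Win_fence(0, win);                       /* open the access epoch */
  if (rank == 0)
      MPI_Get(local_buf, N, MPI_DOUBLE,        /* read N doubles from rank 1's window */
              1, 0, N, MPI_DOUBLE, win);
  MPI_Win_fence(0, win);                       /* close the epoch; data is now valid in local_buf */

  MPI_Win_free(&win);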
Outline
• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU
  – Internode Communication
    • Point-to-point Communication
    • Collective Communication
    • MPI Datatype Processing
    • Using GPUDirect RDMA
  – Multi-GPU Configurations
• MPI and OpenACC
• Conclusion
OpenACC
• OpenACC is gaining popularity
• Several sessions during GTC
• A set of compiler directives (#pragma)
• Offload specific loops or parallelizable sections in code onto accelerators:

    #pragma acc region
    {
      for (i = 0; i < size; i++) {
        A[i] = B[i] + C[i];
      }
    }

• Routines to allocate/free memory on accelerators:

    buffer = acc_malloc(MYBUFSIZE);
    acc_free(buffer);

• Supported for C, C++ and Fortran
• Huge list of modifiers: copy, copyout, private, independent, etc.
Using MVAPICH2 with OpenACC 1.0
• acc_malloc to allocate device memory
  – No changes to MPI calls
  – MVAPICH2 detects the device pointer and optimizes data movement
  – Delivers the same performance as with CUDA

    A = acc_malloc(sizeof(int) * N);
    …
    #pragma acc parallel loop deviceptr(A) . . .
    //compute for loop
    MPI_Send(A, N, MPI_INT, 0, 1, MPI_COMM_WORLD);
    …
    acc_free(A);
Using MVAPICH2 with the new OpenACC 2.0
• acc_deviceptr to get the device pointer (in OpenACC 2.0)
  – Enables MPI communication from memory allocated by the compiler, when available in OpenACC 2.0 implementations
  – MVAPICH2 will detect the device pointer and optimize communication
  – Expected to deliver the same performance as with CUDA

    A = malloc(sizeof(int) * N);
    …
    #pragma acc data copyin(A) . . .
    {
      #pragma acc parallel loop . . .
      //compute for loop
      MPI_Send(acc_deviceptr(A), N, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }
    …
    free(A);
Conclusions
• MVAPICH2 optimizes MPI communication on InfiniBand clusters with GPUs
• Point-to-point communication, collective communication and datatype processing are addressed
• Takes advantage of CUDA features like IPC and GPUDirect RDMA
• Optimizations under the hood of MPI calls, hiding all the complexity from the user
• High productivity and high performance
Web Pointers
http://www.cse.ohio-state.edu/~panda
http://nowlab.cse.ohio-state.edu
MVAPICH Web Page
http://mvapich.cse.ohio-state.edu