
Page 1: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

MVAPICH2-GDR: Pushing the Frontier of MPI Libraries Enabling GPUDirect Technologies

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

by

Hari Subramoni
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~subramon

GPU Technology Conference (GTC 2018)

Page 2: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Outline

• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect RDMA (GDR)
• Current Features
  • Multi-stream Communication for IPC
  • CMA-based Intra-node Host-to-Host Communication Support
  • Maximal Overlap in MPI Datatype Processing
  • Efficient Support for Managed Memory
  • Streaming Support with InfiniBand Multicast and GDR
  • Support for Deep Learning
  • Support for OpenPOWER with NVLink
  • Support for Containers
• Upcoming Features
  • CMA-based Intra-node Collective Communication Support
  • XPMEM-based Collective Communication Support
  • Optimized Collectives for Deep Learning
  • Out-of-core Processing for Deep Learning
• Conclusions

Page 3: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 2,875 organizations in 86 countries
  – More than 460,000 (> 0.46 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '17 ranking)
    • 1st: 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China
    • 9th: 556,104-core Oakforest-PACS in Japan
    • 12th: 368,928-core Stampede2 at TACC
    • 17th: 241,108-core Pleiades at NASA
    • 48th: 76,032-core Tsubame 2.5 at Tokyo Institute of Technology
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade

Page 4: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

MVAPICH2 Release Timeline and Downloads

[Figure: cumulative number of downloads from the OSU site (0 to ~500,000) plotted from Sep 2004 through Jan 2018, annotated with release milestones including MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4 through MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-GDR 2.3a, MV2-X 2.3b, MV2-Virt 2.2, MV2 2.3rc1, and OSU INAM 0.9.3]

Page 5: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

MVAPICH2 Architecture

High Performance Parallel Programming Models:
• Message Passing Interface (MPI)
• PGAS (UPC, OpenSHMEM, CAF, UPC++)
• Hybrid (MPI + X): MPI + PGAS + OpenMP/Cilk

High Performance and Scalable Communication Runtime, with diverse APIs and mechanisms:
Point-to-point primitives, collective algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis

Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi (MIC, KNL), NVIDIA GPGPU):
• Transport protocols: RC, XRC, UD, DC
• Modern features: UMR, ODP*, SR-IOV, Multi-Rail
• Transport mechanisms: Shared Memory, CMA, IVSHMEM
• Modern features: MCDRAM*, NVLink*, CAPI*

* Upcoming

Page 6: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

MVAPICH2 Software Family

High-Performance Parallel Programming Libraries:
• MVAPICH2 – Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
• MVAPICH2-X – Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
• MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs
• MVAPICH2-Virt – High-performance and scalable MPI for hypervisor- and container-based HPC cloud
• MVAPICH2-EA – Energy-aware and high-performance MPI
• MVAPICH2-MIC – Optimized MPI for clusters with Intel KNC

Microbenchmarks:
• OMB – Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs

Tools:
• OSU INAM – Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
• OEMT – Utility to measure the energy consumption of MPI applications

Page 7: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

MVAPICH2-GDR: Optimizing MPI Data Movement on GPU Clusters

[Diagram: two nodes, each with CPUs and GPUs connected over PCIe/QPI and an IB adapter, with memory buffers at each endpoint]

Communication paths on GPU clusters:
1. Intra-GPU
2. Intra-socket GPU-GPU
3. Inter-socket GPU-GPU
4. Inter-node GPU-GPU
5. Intra-socket GPU-Host
6. Inter-socket GPU-Host
7. Inter-node GPU-Host
8. Inter-node GPU-GPU with the IB adapter on the remote socket
... and more

• GPUs are connected as PCIe devices: flexibility, but complexity
• NVLink is leading to even more paths
• For each path there are different schemes: shared memory, IPC, GPUDirect RDMA, pipelining, ...
• It is critical for runtimes to optimize data movement while hiding this complexity

Page 8: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU

At Sender:
    MPI_Send(s_devbuf, size, ...);

At Receiver:
    MPI_Recv(r_devbuf, size, ...);

(GPU data movement is handled inside MVAPICH2)

• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from the GPU with RDMA transfers

High Performance and High Productivity
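The slide above relies on the application passing GPU device pointers straight to standard MPI calls. A minimal sketch of that usage pattern follows (illustrative only: buffer name, element count, and tag are made up; error checking is omitted). With MVAPICH2-GDR the job is typically launched with MV2_USE_CUDA=1, as in the HOOMD-blue settings shown later in this deck.

    /* Minimal CUDA-aware MPI point-to-point sketch: rank 0 sends a device buffer,
       rank 1 receives into a device buffer, with no explicit staging through host
       memory in the application code. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int count = 1 << 20;                      /* number of floats */
        float *devbuf;
        cudaMalloc((void **)&devbuf, count * sizeof(float));

        if (rank == 0) {
            cudaMemset(devbuf, 0, count * sizeof(float));
            MPI_Send(devbuf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   /* s_devbuf */
        } else if (rank == 1) {
            MPI_Recv(devbuf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                                /* r_devbuf */
        }

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }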

Page 9: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases

• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
• Unified memory

Page 10: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Using the MVAPICH2-GPUDirect Version

• MVAPICH2-2.3 with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
• System software requirements
  • Mellanox OFED 3.2 or later
  • NVIDIA Driver 367.48 or later
  • NVIDIA CUDA Toolkit 7.5/8.0/9.0 or later
  • Plugin for GPUDirect RDMA: http://www.mellanox.com/page/products_dyn?product_family=116
• Strongly recommended: GDRCOPY module from NVIDIA, https://github.com/NVIDIA/gdrcopy
• Contact the MVAPICH help list with any questions related to the package: [email protected]

Page 11: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

MVAPICH2-GDR 2.3a

• Released on 11/09/2017
• Major Features and Enhancements
  – Based on MVAPICH2 2.2
  – Support for CUDA 9.0
  – Add support for Volta (V100) GPU
  – Support for OpenPOWER with NVLink
  – Efficient multiple CUDA stream-based IPC communication for multi-GPU systems with and without NVLink
  – Enhanced performance of GPU-based point-to-point communication
  – Leverage the Linux Cross Memory Attach (CMA) feature for enhanced host-based communication
  – Enhanced performance of MPI_Allreduce for GPU-resident data
  – InfiniBand Multicast (IB-MCAST) based designs for GPU-based broadcast and streaming applications
    • Basic support for IB-MCAST designs with GPUDirect RDMA
    • Advanced support for zero-copy IB-MCAST designs with GPUDirect RDMA
    • Advanced reliability support for IB-MCAST designs
  – Efficient broadcast designs for Deep Learning applications
  – Enhanced collective tuning on Xeon, OpenPOWER, and NVIDIA DGX-1 systems

Page 12: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Optimized MVAPICH2-GDR Design

[Figure: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size (1 B-8 KB for latency, up to 4 KB shown for bandwidth), comparing MV2 (no GDR) with MV2-GDR 2.3a; the GDR-enabled design shows ~11X lower latency (down to 1.88 us) and roughly 9x-10x higher bandwidth/bi-bandwidth for small messages]

Platform: MVAPICH2-GDR 2.3a on an Intel Haswell (E5-2687W @ 3.10 GHz, 20 cores) node with an NVIDIA Volta V100 GPU and a Mellanox ConnectX-4 EDR HCA; CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA

Page 13: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

RoCE and Optimized Collectives Support

• RoCE v1 and v2 support
• RDMA_CM connection support
• CUDA-Aware Collective Tuning
  – Point-to-point tuning (available since MVAPICH2-GDR 2.0)
    • Tuned thresholds for the different communication patterns and features
    • Depending on the system configuration (CPU, HCA and GPU models)
  – Tuning framework for GPU-based collectives
    • Selects the best algorithm depending on message size, system size, and system configuration
    • Support for Bcast and Gather operations for different GDR-enabled systems
    • Available since the MVAPICH2-GDR 2.2RC1 release

Page 14: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Application-Level Evaluation (HOOMD-blue)

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

[Figure: average time steps per second (TPS) vs. number of processes (4-32) for 64K-particle and 256K-particle runs, comparing MV2 with MV2+GDR; MV2+GDR delivers about 2X higher TPS in both cases]

Page 15: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Application-Level Evaluation (COSMO) and Weather Forecasting in Switzerland

[Figure: normalized execution time vs. number of GPUs on the Wilkes GPU cluster (4-32 GPUs) and the CSCS GPU cluster (16-96 GPUs), comparing Default, Callback-based, and Event-based designs]

• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
• On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the COSMO application
• COSMO model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16

Page 16: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Outline (repeated)

Page 17: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Multi-stream Communication using CUDA IPC on OpenPOWER and DGX-1

• Up to 16% higher device-to-device (D2D) bandwidth on OpenPOWER with the NVLink interconnect
• Up to 30% higher D2D bandwidth on DGX-1 with NVLink

[Figure: point-to-point device-to-device bandwidth (MB/s) vs. message size (16 KB-4 MB), comparing the 1-stream and 4-stream CUDA IPC designs; the multi-stream design is up to 16% better on OpenPOWER and up to 30% better on DGX-1]

Available with MVAPICH2-GDR 2.3a
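For readers unfamiliar with the mechanism, the sketch below illustrates the CUDA IPC primitives that intra-node GPU-to-GPU designs such as this one build on; it is not MVAPICH2 internal code, and the helper function names are made up. The library's multi-stream design additionally splits large transfers into chunks and pipelines them across several CUDA streams.

    /* CUDA IPC illustration: one process exports a handle to its device buffer,
       another process maps it and copies device-to-device on a CUDA stream. */
    #include <cuda_runtime.h>

    /* Exporter: create a handle that can be shipped to a peer process (e.g., via MPI). */
    cudaError_t export_buffer(void *d_buf, cudaIpcMemHandle_t *handle) {
        return cudaIpcGetMemHandle(handle, d_buf);
    }

    /* Importer: map the peer allocation and pull 'bytes' of data on the given stream. */
    cudaError_t import_and_copy(cudaIpcMemHandle_t handle, void *d_local,
                                size_t bytes, cudaStream_t stream) {
        void *d_peer = NULL;
        cudaError_t err = cudaIpcOpenMemHandle(&d_peer, handle,
                                               cudaIpcMemLazyEnablePeerAccess);
        if (err != cudaSuccess) return err;
        cudaMemcpyAsync(d_local, d_peer, bytes, cudaMemcpyDeviceToDevice, stream);
        cudaStreamSynchronize(stream);
        return cudaIpcCloseMemHandle(d_peer);
    }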

Page 18: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

CMA-based Intra-node Host-to-Host Communication Support

• Up to 30% lower host-to-host (H2H) latency and 30% higher H2H bandwidth

[Figure: intra-node point-to-point H2H latency (us) and bandwidth (MBps) vs. message size, comparing MV2-GDR without CMA and with CMA; the CMA-based design is about 30% better in both metrics]

Platform: MVAPICH2-GDR 2.3a on an Intel Broadwell (E5-2680 v4 @ 2.40 GHz, 28 cores) node with an NVIDIA Tesla K-80 GPU and a Mellanox ConnectX-4 EDR HCA; CUDA 8.0, Mellanox OFED 4.0 with GPUDirect RDMA
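CMA here refers to the Linux Cross Memory Attach system calls (process_vm_readv/process_vm_writev), which let one process copy data directly out of another process's address space in a single kernel-assisted copy. The sketch below is only an illustration of that primitive, not MVAPICH2's implementation; the helper name is made up.

    /* Single-copy read of 'len' bytes from a peer process's address space using CMA.
       Requires appropriate ptrace permission (same user, or CAP_SYS_PTRACE). */
    #define _GNU_SOURCE
    #include <sys/uio.h>
    #include <unistd.h>

    ssize_t cma_read(pid_t peer_pid, void *local_buf, void *remote_addr, size_t len) {
        struct iovec local, remote;
        local.iov_base  = local_buf;    local.iov_len  = len;
        remote.iov_base = remote_addr;  remote.iov_len = len;
        return process_vm_readv(peer_pid, &local, 1, &remote, 1, 0);
    }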

Page 19: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Non-contiguous Data Exchange

• Multi-dimensional data
  • Row-based organization
  • Contiguous in one dimension
  • Non-contiguous in the other dimensions
• Halo data exchange
  • Duplicate the boundary
  • Exchange the boundary in each iteration

[Figure: halo data exchange between neighboring sub-domains]

Page 20: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

MPI Datatype Support in MVAPICH2

• Datatype support in MPI
  – Operate on customized datatypes to improve productivity
  – Enables the MPI library to optimize non-contiguous data

At Sender:
    MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
    MPI_Type_commit(&new_type);
    ...
    MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);

• Inside MVAPICH2
  – Uses datatype-specific CUDA kernels to pack data in chunks
  – Efficiently moves data between nodes using RDMA
  – In progress: currently optimizes vector and hindexed datatypes
  – Transparent to the user

H. Wang, S. Potluri, D. Bureddy, C. Rosales and D. K. Panda, GPU-aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation, IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014.
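A self-contained variant of the snippet above, showing how a strided column of a matrix that lives in GPU memory can be sent with a derived datatype under a CUDA-aware build (the function name, tag, and layout are illustrative):

    /* Send column 'col' of an n x n row-major matrix stored in GPU memory.
       The MPI library packs the non-contiguous elements internally (with CUDA
       kernels in MVAPICH2-GDR, per the slide above) and moves them with RDMA. */
    #include <mpi.h>

    void send_column(double *d_matrix, int n, int col, int dest, MPI_Comm comm) {
        MPI_Datatype column_t;
        MPI_Type_vector(n, 1, n, MPI_DOUBLE, &column_t);  /* n blocks, stride n */
        MPI_Type_commit(&column_t);

        MPI_Send(d_matrix + col, 1, column_t, dest, /*tag=*/7, comm);

        MPI_Type_free(&column_t);
    }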

Page 21: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

MPI Datatype Processing (Computation Optimization)

• Comprehensive support
  • Targeted kernels for regular datatypes: vector, subarray, indexed_block
  • Generic kernels for all other irregular datatypes
• Separate non-blocking stream for kernels launched by the MPI library
  • Avoids stream conflicts with application kernels
• Flexible set of parameters for users to tune kernels
  • Vector
    • MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE
    • MV2_CUDA_KERNEL_VECTOR_YSIZE
  • Subarray
    • MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE
    • MV2_CUDA_KERNEL_SUBARR_XDIM
    • MV2_CUDA_KERNEL_SUBARR_YDIM
    • MV2_CUDA_KERNEL_SUBARR_ZDIM
  • Indexed_block
    • MV2_CUDA_KERNEL_IDXBLK_XDIM
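To make the "targeted kernels" idea concrete, the sketch below shows what a pack kernel for a vector (strided) datatype does: gather strided elements into a contiguous staging buffer that can then be pushed over RDMA. This is an illustration only, not MVAPICH2's actual kernel; the thread-block geometry here plays the role that the MV2_CUDA_KERNEL_VECTOR_* tuning parameters play for the library's own kernels.

    #include <cuda_runtime.h>

    /* Each thread copies one element from the strided source layout into a
       contiguous packed buffer. */
    __global__ void pack_vector(const double *src, double *packed,
                                int n_blocks, int n_elements, int stride)
    {
        int blk = blockIdx.y * blockDim.y + threadIdx.y;   /* which strided block  */
        int elt = blockIdx.x * blockDim.x + threadIdx.x;   /* element within block */
        if (blk < n_blocks && elt < n_elements)
            packed[blk * n_elements + elt] = src[blk * stride + elt];
    }

    /* Launch on a dedicated stream so packing does not conflict with application kernels. */
    void launch_pack(const double *d_src, double *d_packed,
                     int n_blocks, int n_elements, int stride, cudaStream_t s)
    {
        dim3 tb(32, 8);
        dim3 grid((n_elements + tb.x - 1) / tb.x, (n_blocks + tb.y - 1) / tb.y);
        pack_vector<<<grid, tb, 0, s>>>(d_src, d_packed, n_blocks, n_elements, stride);
    }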

Page 22: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

MPI Datatype Processing (Communication Optimization)

Common scenario: waste of computing resources on CPU and GPU

    MPI_Isend(A, ..., datatype, ...);
    MPI_Isend(B, ..., datatype, ...);
    MPI_Isend(C, ..., datatype, ...);
    MPI_Isend(D, ..., datatype, ...);
    ...
    MPI_Waitall(...);

*A, B, ... contain non-contiguous MPI datatypes

Page 23: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Outline (repeated)

Page 24: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Initial (Basic) Support for GPU Managed Memory

• With CUDA 6.0, NVIDIA introduced CUDA Managed (or Unified) Memory, allowing a common memory allocation for GPU or CPU through the cudaMallocManaged() call
• Significant productivity benefits due to abstraction of explicit allocation and cudaMemcpy()
• Extended MVAPICH2 to perform communication directly from managed buffers (available since MVAPICH2-GDR 2.2b)
• OSU Micro-benchmarks extended to evaluate the performance of point-to-point and collective communication using managed buffers (available since OMB 5.2)

D. S. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop, held in conjunction with PPoPP '16

[Figure: 2D stencil halo exchange time (ms) vs. total dimension size (1 B-16 KB) for halo width = 1, comparing Device and Managed buffers]
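A minimal sketch of what "communication directly from managed buffers" looks like from the application side (illustrative names and sizes; error checking omitted): the same cudaMallocManaged() pointer is touched on the host and handed to MPI, with no explicit cudaMemcpy().

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;
        float *buf;                                   /* valid on both CPU and GPU */
        cudaMallocManaged((void **)&buf, n * sizeof(float), cudaMemAttachGlobal);

        if (rank == 0) {
            for (int i = 0; i < n; i++) buf[i] = 1.0f;   /* initialize on the host */
            MPI_Send(buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* buf can now be consumed by a GPU kernel without an explicit copy */
        }

        cudaFree(buf);
        MPI_Finalize();
        return 0;
    }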

Page 25: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Enhanced Support for Intra-node Managed Memory

• CUDA Managed memory => no memory pin-down
  • No IPC support for intra-node communication
  • No GDR support for inter-node communication
• Initial and basic support in MVAPICH2-GDR
  • For both intra- and inter-node transfers, "pipeline through" host memory
• Enhanced intra-node managed-memory support uses IPC
  • Double-buffering, pair-wise IPC-based scheme
  • Brings IPC performance to managed memory
  • High performance and high productivity
  • 2.5X improvement in bandwidth
• Available since MVAPICH2-GDR 2.2RC1

[Figure: intra-node latency (us, 4 KB-4 MB) and bandwidth (MB/s, 32 KB-2 MB) vs. message size, comparing the Enhanced design with MV2-GDR 2.2b; the enhanced design delivers up to 2.5X higher bandwidth]

Page 26: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Outline (repeated)

Page 27: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Streaming Applications

• Streaming applications on HPC systems involve two parts:
  1. Communication (MPI)
     • Broadcast-type operations
  2. Computation (CUDA)
     • Multiple GPU nodes as workers

[Diagram: a data source performs real-time streaming to a sender that uses HPC resources for real-time analytics; data-streaming-like broadcast operations distribute the data to multiple worker nodes, each with CPUs and GPUs]

Page 28: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Hardware Multicast-based Broadcast

• For GPU-resident data, the design uses
  – GPUDirect RDMA (GDR)
  – InfiniBand Hardware Multicast (IB-MCAST)
• Overheads
  – IB UD limit
  – GDR limit

[Diagram: the source node performs (1) IB gather + GDR read, (2) IB hardware multicast through the IB switch, and (3) IB scatter + GDR write into GPU memory at destinations 1..N; each node's HCA, CPU, and GPU handle the header and data portions]

A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," HiPC 2014, Dec 2014.

Page 29: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Optimized Broadcast Send: MPI_Bcast(d_out, ...)

• Preparing an intermediate buffer (im_buf)
  – Page-locked (pinned) host buffer → fast device-to-host data movement
  – Allocated at initialization phase → low overhead
• Streaming data through the host
  – Fine-tuned chunked data
  – Asynchronous copy operations
  → Three-stage pipeline

[Diagram: at the source, (1) data preparation copies chunks from the GPU buffer d_out into im_buf, (2) IB gather, and (3) IB hardware multicast through the IB switch]

C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP 2017, Aug 14-17, 2017.
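From the application's point of view, the streaming pattern above is just a loop of broadcasts on GPU-resident chunks; MVAPICH2-GDR maps each MPI_Bcast onto the pipeline of data preparation, IB gather, and IB hardware multicast described on this slide. A hedged sketch of that caller-side pattern (names and the producer/consumer hooks are illustrative):

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Root repeatedly broadcasts GPU-resident chunks (d_out on the root, d_in on
       workers in the slides); workers run their analytics kernels on each chunk. */
    void stream_broadcast(float *d_chunk, int chunk_elems, int n_chunks,
                          int root, MPI_Comm comm) {
        int rank;
        MPI_Comm_rank(comm, &rank);
        for (int i = 0; i < n_chunks; i++) {
            /* root: fill d_chunk with the next piece of the incoming stream (not shown) */
            MPI_Bcast(d_chunk, chunk_elems, MPI_FLOAT, root, comm);
            /* workers: launch the computation kernel on d_chunk (not shown) */
        }
    }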

Page 30: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Optimized Broadcast Receive: MPI_Bcast(d_in, ...)

• Zero-copy broadcast receive
  – Pre-posted user buffer (d_in)
  – Avoids additional data movement
  – Leverages IB Scatter and GDR features
  → Low latency
  → Frees up PCIe resources for applications

[Diagram: IB hardware multicast through the IB switch, followed by IB scatter (GDR write) directly into the GPU buffer d_in at destinations 1..N]

C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP 2017, Aug 14-17, 2017.

Page 31: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Broadcast on Multi-GPU Systems

• Proposed intra-node topology-aware broadcast
  – CUDA Inter-Process Communication (IPC)
  – Multicast steps deliver the data to each node; within a node, data moves GPU-to-GPU with cudaMemcpy (Device ↔ Device)

[Diagram: the source CPU/GPU broadcasts through the IB switch to nodes 1..N; inside each node the data propagates from GPU 0 to GPU 1 ... GPU N]

C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct. 26-28, 2016.

Available in MVAPICH2-GDR 2.3a

Page 32: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Streaming Benchmark @ CSCS (88 GPUs)

• IB-MCAST + GDR + topology-aware IPC-based schemes
  – Up to 58% and 79% latency reduction for small and large messages, respectively

[Figure: broadcast latency (us) vs. message size (1 B-16 KB and 32 KB-4 MB), comparing MCAST-GDR-OPT with MCAST-GDR; MCAST-GDR-OPT is up to 58% better for small messages and 79% better for large messages]

C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct. 26-28, 2016.

Page 33: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Application-based Evaluation: CUDA-Aware CNTK

• @ RI2 cluster, 16 GPUs, 1 GPU/node, with the CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK) [2]
• Reduces latency by up to 24%, 16%, and 18% for the AlexNet, VGG, and ResNet-50 models, respectively
• Higher improvement can be observed for larger system sizes

[Figure: speedup (higher is better) of MV2-GDR-Knomial, MV2-GDR-Ring, and MV2-MCAST-GDR-Opt for AlexNet, VGG, and ResNet-50 on 8 and 16 GPUs]

[1] C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning, ICPP '17.
[2] D. S. Banerjee, K. Hamidouche, and D. K. Panda, Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters, IEEE CloudCom '16.

Page 34: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Outline (repeated)

Page 35: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Deep Learning: New Challenges for MPI Runtimes

• Deep Learning frameworks are a different game altogether
  – Unusually large message sizes (order of megabytes)
  – Most communication based on GPU buffers
• Existing state-of-the-art
  – cuDNN, cuBLAS, NCCL → scale-up performance
  – NCCL2, CUDA-Aware MPI → scale-out performance (for small and medium message sizes only!)
• Proposed: can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
  – Efficient overlap of computation and communication
  – Efficient large-message communication (reductions)
  – What application co-designs are needed to exploit communication-runtime co-designs?

[Figure: scale-up vs. scale-out performance landscape, positioning cuDNN, cuBLAS, NCCL, and NCCL2 toward scale-up, gRPC and Hadoop toward scale-out, MPI in between, and the proposed co-designs targeting both]

A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17).

Page 36: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Large-Message Optimized Collectives for Deep Learning

• MV2-GDR provides optimized collectives for large message sizes
• Optimized Reduce, Allreduce, and Bcast
• Good scaling with a large number of GPUs
• Available since MVAPICH2-GDR 2.2GA

[Figure: latency (ms) of Reduce (192 GPUs, 2-128 MB messages; 64 MB on 128-192 GPUs), Allreduce (64 GPUs, 2-128 MB; 128 MB on 16-64 GPUs), and Bcast (64 GPUs, 2-128 MB; 128 MB on 16-64 GPUs)]

Page 37: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

MVAPICH2: Allreduce Comparison with Baidu and OpenMPI

• Initial evaluation shows promising performance gains for MVAPICH2-GDR 2.3a*
• 8 GPUs (2 nodes): MVAPICH2-GDR vs. Baidu-allreduce and OpenMPI 3.0

[Figure: Allreduce latency (us) vs. message size across small (4 B-256 KB), medium (512 KB-4 MB), and large (8 MB-512 MB) message ranges; annotations: MVAPICH2 is ~10X better, ~3.5X better, and ~20% better than Baidu, while OpenMPI is ~4X slower than Baidu]

*Available with MVAPICH2-GDR 2.3a
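The measurements above correspond to the deep-learning use of Allreduce: summing (then averaging) a gradient buffer that lives in GPU memory across all ranks. A minimal caller-side sketch with a CUDA-aware MPI library (the buffer name is illustrative; the final scale-by-1/N step would normally be a small CUDA kernel and is omitted):

    #include <mpi.h>

    /* Sum a GPU-resident gradient buffer across all ranks, in place. */
    void allreduce_gradients(float *d_grad, int count, MPI_Comm comm) {
        MPI_Allreduce(MPI_IN_PLACE, d_grad, count, MPI_FLOAT, MPI_SUM, comm);
    }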

Page 38: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

MVAPICH2: Allreduce Comparison with Baidu and OpenMPI

• 16 GPUs (4 nodes): MVAPICH2-GDR vs. Baidu-allreduce and OpenMPI 3.0

[Figure: Allreduce latency (us) vs. message size across small, medium (512 KB-4 MB), and large (8 MB-512 MB) message ranges; annotations: MVAPICH2 is ~30X, ~10X, and ~4X better, ~2X better than Baidu, while OpenMPI is ~5X slower than Baidu]

*Available with MVAPICH2-GDR 2.3a

Page 39: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

OSU-Caffe: Scalable Deep Learning

• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – Limited scale-out
• OSU-Caffe: MPI-based parallel training
  – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
  – Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
  – Scale-out on 128 GPUs for training GoogLeNet on the ImageNet dataset

[Figure: training time (seconds) for GoogLeNet (ImageNet) on 8-128 GPUs, comparing Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); default Caffe cannot run the 128-GPU case (invalid use case)]

OSU-Caffe is publicly available from http://hidl.cse.ohio-state.edu/
Support on OpenPOWER will be available soon

Page 40: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Outline (repeated)

Page 41: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Point-to-Point Host-level Performance on OpenPOWER

[Figure: MVAPICH2-GDR 2.3a intra-node and inter-node host-level latency, bandwidth, and bi-directional bandwidth vs. message size; approximate results: intra-node latency ~0.5 us, inter-node latency ~2.3 us, intra-node bandwidth ~30 GB/s and bi-bandwidth ~60 GB/s, inter-node bandwidth ~12 GB/s and bi-bandwidth ~24 GB/s]

Platform: OpenPOWER (Power8, ppc64le) CPU using a Mellanox EDR (MT4115) HCA

Page 42: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Device-to-Device Performance on OpenPOWER (NVLink + Pascal)

[Figure: intra-node and inter-node device-to-device latency (small and large messages) and bandwidth vs. message size, comparing the intra-socket (NVLink) and inter-socket paths]

• Intra-node bandwidth: 33.9 GB/sec (NVLink); intra-node latency: 14.6 us (without GPUDirect RDMA)
• Inter-node bandwidth: 11.9 GB/sec (EDR); inter-node latency: 23.8 us (without GPUDirect RDMA)
• Available in MVAPICH2-GDR 2.3a

Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and an EDR InfiniBand interconnect

Page 43: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Outline (repeated)

Page 44: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Container Support

• Increasing trend to provide container support for MPI libraries
  • Ease of build
  • Portability
  • Reproducibility
• MVAPICH2-GDR 2.3a provides container (Docker) support
• More details are available in the MVAPICH2-GDR User Guide: http://mvapich.cse.ohio-state.edu/userguide/gdr/2.3a/
• Synergistic with NVIDIA's HPC Container Maker (hpccm) effort (https://github.com/NVIDIA/hpc-container-maker)

Page 45: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

MVAPICH2-GDR on Containers with Negligible Overhead

[Figure: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size (1 B-4 MB), comparing Docker and native runs; the Docker and native curves are nearly identical]

Platform: MVAPICH2-GDR 2.3a on an Intel Haswell (E5-2687W @ 3.10 GHz, 20 cores) node with an NVIDIA Volta V100 GPU and a Mellanox ConnectX-4 EDR HCA; CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA

Page 46: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Outline (repeated)

Page 47: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Scalable Host-based Collectives with CMA (Intra-node, Reduce & Alltoall)

• Up to 5X and 3X performance improvement by MVAPICH2 for small and large messages, respectively

[Figure: intra-node (Nodes=1, PPN=20) Reduce and Alltoall latency (us) vs. message size (4 B-4 KB and 8 KB-1 MB), comparing MVAPICH2-GDR-Next, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0; annotated improvements range from 1.2X to 5.2X across message sizes]

Page 48: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Scalable Host-based Collectives with CMA (Intra-node, Gather & Scatter)

• Up to 24X and 15X performance improvement by MVAPICH2 for medium and large messages, respectively

[Figure: intra-node (Nodes=1, PPN=20) Gather and Scatter latency (us) vs. message size (4 B-4 KB and 8 KB-1 MB), comparing MVAPICH2-GDR-Next, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0; annotated improvements range from 1.5X to 24.5X]

Page 49: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Scalable Host-based Collectives with CMA (Multi-node, Reduce & Alltoall)

• Up to 12.4X and 8.5X performance improvement by MVAPICH2 for small and large messages, respectively

[Figure: multi-node (Nodes=4, PPN=20) Reduce and Alltoall latency (us) vs. message size (4 B-4 KB and 8 KB-1 MB), comparing MVAPICH2-GDR-Next and OpenMPI-3.0.0; annotations: 12.4X, 1.9X, and 8.5X]

Page 50: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Scalable Host-based Collectives with CMA (Multi-node, Gather & Scatter)

• Up to 17.8X and 3.9X improvement with MVAPICH2-GDR-Next for medium and large messages, respectively

[Figure: multi-node (Nodes=4, PPN=20) Gather and Scatter latency (us) vs. message size (4 B-4 KB and 8 KB-1 MB), comparing MVAPICH2-GDR-Next and OpenMPI-3.0.0; annotations: 17.8X, 3.9X, and 1.6X]

Page 51: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Outline (repeated)

Page 52: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Optimized All-Reduce with XPMEM (Nodes=1 and Nodes=2, PPN=20)

• Optimized MPI All-Reduce design in MVAPICH2
  – Up to 2X performance improvement over Spectrum MPI and 4X over OpenMPI for intra-node

[Figure: All-Reduce latency (us) vs. message size (16 KB-2 MB) for Nodes=1, PPN=20 and Nodes=2, PPN=20, comparing MVAPICH2-GDR-Next, SpectrumMPI-10.1.0, and OpenMPI-3.0.0; annotations include 2X, 3X, 3.3X, 4X, 34%, and 48%]

Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

Page 53: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Optimized All-Reduce with XPMEM (Nodes=3 and Nodes=4, PPN=20)

• Optimized MPI All-Reduce design in MVAPICH2
  – Up to 2X performance improvement over OpenMPI for inter-node

[Figure: All-Reduce latency (us) vs. message size (16 KB-2 MB) for Nodes=3, PPN=20 and Nodes=4, PPN=20, comparing MVAPICH2-GDR-Next and OpenMPI-3.0.0; annotations: 42%, 27%, 2X, and 35%]

Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

Page 54: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Optimized Reduce with XPMEM (Nodes=1 and Nodes=2, PPN=20)

• Optimized MPI Reduce design in MVAPICH2
  – Up to 3.1X performance improvement over OpenMPI at scale and up to 37% over Spectrum MPI intra-node

[Figure: Reduce latency (us) vs. message size for Nodes=1, PPN=20 and Nodes=2, PPN=20, comparing MVAPICH2-GDR-Next, SpectrumMPI-10.1.0, and OpenMPI-3.0.0; annotations include 26%, 27%, 37%, 93%, 1.9X, 2.8X, 3.1X, and 5.6X]

Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

Page 55: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Optimized Reduce with XPMEM (Nodes=3 and Nodes=4, PPN=20)

• Optimized MPI Reduce design in MVAPICH2
  – Up to 7X performance improvement over OpenMPI at scale

[Figure: Reduce latency (us) vs. message size for Nodes=3, PPN=20 and Nodes=4, PPN=20, comparing MVAPICH2-GDR-Next and OpenMPI-3.0.0; annotations: 4X, 6X, 7X, and 4X]

Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

Page 56: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Outline (repeated)

Page 57: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

MVAPICH2-GDR vs. NCCL2 – Broadcast Operation

• Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases
• MPI_Bcast (MVAPICH2-GDR) vs. ncclBcast (NCCL2) on 16 K-80 GPUs

[Figure: broadcast latency (us) vs. message size with 1 GPU/node and 2 GPUs/node, comparing NCCL2 and MVAPICH2-GDR; MVAPICH2-GDR is ~10X better with 1 GPU/node and ~4X better with 2 GPUs/node]

*Will be available with the upcoming MVAPICH2-GDR 2.3b

Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 2 K-80 GPUs, and an EDR InfiniBand interconnect

Page 58: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

MVAPICH2-GDR vs. NCCL2 – Reduce Operation

• Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases
• MPI_Reduce (MVAPICH2-GDR) vs. ncclReduce (NCCL2) on 16 GPUs

[Figure: reduce latency (us) vs. message size for small (4 B-64 KB) and large messages, comparing MVAPICH2-GDR and NCCL2; MVAPICH2-GDR is ~2.5X better for small messages and ~5X better for large messages]

*Will be available with the upcoming MVAPICH2-GDR 2.3b

Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPU, and an EDR InfiniBand interconnect

Page 59: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

MVAPICH2-GDR vs. NCCL2 – Allreduce Operation

• Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllReduce (NCCL2) on 16 GPUs

[Figure: allreduce latency (us) vs. message size for small (4 B-64 KB) and large messages, comparing MVAPICH2-GDR and NCCL2; MVAPICH2-GDR is ~3X better for small messages and ~1.2X better for large messages]

*Will be available with the upcoming MVAPICH2-GDR 2.3b

Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPU, and an EDR InfiniBand interconnect

Page 60: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Outline (repeated)

Page 61: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Out-of-Core Deep Neural Network Training with Caffe

• Large DNNs cannot be trained on GPUs due to memory limitations!
  – ResNet-50 is a state-of-the-art DNN architecture for image recognition, but current frameworks can only go up to a small batch size of 45
  – Next-generation models like Neural Machine Translation (NMT) are ridiculously large, consist of billions of parameters, and require even more memory
  – Can we design out-of-core DNN training support using new software features in CUDA 8/9 and hardware mechanisms in Pascal/Volta GPUs?
• The general intuition is that managed allocations "will be" slow!
  – The proposed framework, OC-Caffe (Out-of-Core Caffe), shows the potential of managed-memory designs that can provide performance with negligible/no overhead
  – In addition to out-of-core training support, productivity can be greatly enhanced in terms of DL framework design by using the new Unified Memory features

Submission Under Review
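The CUDA 8/9 features alluded to above are managed-memory oversubscription and prefetch hints on Pascal/Volta GPUs. The sketch below is a generic illustration of those primitives (not OC-Caffe code; the function and parameters are made up): an allocation larger than GPU memory is created with cudaMallocManaged(), and cudaMemPrefetchAsync() moves the current working set toward the GPU (or back to the host via cudaCpuDeviceId) ahead of each compute phase.

    #include <cuda_runtime.h>

    /* Allocate a buffer that may exceed the GPU's physical memory and prefetch
       only the first working-set window to the device; later windows can be
       prefetched while earlier ones are evicted on demand. */
    float *alloc_oversubscribed(size_t total_bytes, size_t window_bytes,
                                int device, cudaStream_t stream) {
        float *buf = NULL;
        cudaMallocManaged((void **)&buf, total_bytes, cudaMemAttachGlobal);
        cudaMemPrefetchAsync(buf, window_bytes, device, stream);
        return buf;
    }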

Page 62: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Performance Trends for OC-Caffe

• Comparable performance to Caffe-Default for "in-memory" batch sizes
• OC-Caffe-Opt: up to 5X improvement over Intel MKL-optimized CPU-based AlexNet training on a Volta V100 GPU with CUDA 9 and cuDNN 7

[Figure: trainable (in-memory) vs. out-of-core (over-subscription) performance comparison]

OC-Caffe will be released by the HiDL [email protected]

Submission Under Review

Page 63: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Performance Trends for OC-Caffe

• OC-Caffe-Opt: up to 80% better than Intel-optimized CPU Caffe for ResNet-50 training on a Volta V100 GPU with CUDA 9 and cuDNN 7
• OC-Caffe allows efficient scale-up on a DGX-1 system with Pascal P100 GPUs with CUDA 9 and cuDNN 7

[Figure: out-of-core (over-subscription) performance and scale-up on DGX-1]

Submission Under Review

Page 64: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Can Unified Memory also Simplify Framework Design?

• Enhanced and simplified the Caffe framework
• Simplified the Layer class and all inherited classes (e.g., ConvolutionLayer)
• Removed almost all of the memory allocation, movement, and state management code in the SyncedMemory and Blob classes
• An estimated 3,000 lines of repetitive and error-prone code can be eliminated
• Developers can add new inherited Layer classes in a much simpler manner
  – E.g., implement a single Forward() function instead of separate forward_gpu() and forward_cpu() functions

Existing Design:

    class ConvolutionLayer {
    public:
      void cpu_data();          void cpu_diff();
      void gpu_data();          void gpu_diff();
      void mutable_cpu_data();  void mutable_cpu_diff();
      void mutable_gpu_data();  void mutable_gpu_diff();
      void Forward_cpu();       void Forward_gpu();
      void forward_cpu_gemm();  void forward_gpu_gemm();
      void forward_cpu_bias();  void forward_gpu_bias();
      void Backward_cpu();      void Backward_gpu();
      void backward_cpu_gemm(); void backward_gpu_gemm();
      void backward_cpu_bias(); void backward_gpu_bias();
    };

Proposed High-Productivity Design based on Managed Memory Allocation and Data Movement:

    class ConvolutionLayer {
    public:
      void data();              void diff();
      void mutable_data();      void mutable_diff();
      void Forward();           void forward_gemm();
      void forward_bias();
      void Backward();          void backward_gemm();
      void backward_bias();
    };

Submission Under Review

Page 65: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Outline (repeated)

Page 66: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Conclusions

• The MVAPICH2-GDR MPI library optimizes MPI communication on InfiniBand and RoCE (v1 and v2) clusters with GPUs, on both x86 and OpenPOWER platforms
• Provides optimized designs for point-to-point two-sided and one-sided communication, datatype processing, and collective operations
• Takes advantage of CUDA features such as IPC and the GPUDirect RDMA family of technologies
• Allows flexible solutions for streaming applications with GPUs
• Provides optimized solutions for High-Performance Deep Learning (HiDL) frameworks and applications

Page 67: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Join Us for a Connect with the Experts Session

• CE8118 – Connect with the Experts: MVAPICH for GPU
• Session schedule
  • Wednesday, Mar 28, 3:00 PM - 4:00 PM, LL Pod A
• Session speakers
  • Dhabaleswar Panda – Professor and University Distinguished Scholar, OSU
  • Davide Rossetti – Senior Software Engineer, NVIDIA
  • Hari Subramoni – Research Scientist, OSU
• Session description
  • MVAPICH is a well-established MPI library supporting GPUDirect. Meet with the experts to discuss how you can optimize your HPC and AI applications using GPUDirect RDMA with MVAPICH GDR, and learn about MVAPICH GDS, the newest feature, supporting GPUDirect Async.

Page 68: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Personnel Acknowledgments

Current Students (Graduate): A. Awan (Ph.D.), M. Bayatpour (Ph.D.), R. Biswas (M.S.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), S. Guganani (Ph.D.), J. Hashmi (Ph.D.), H. Javed (Ph.D.), P. Kousha (Ph.D.), D. Shankar (Ph.D.), H. Shi (Ph.D.), J. Zhang (Ph.D.)

Current Students (Undergraduate): N. Sarkauskas (B.S.)

Current Research Scientists: X. Lu, H. Subramoni

Current Research Specialists: J. Smith, M. Arnold

Current Post-doc: A. Ruhela

Past Students: A. Augustine (M.S.), P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), S. Krishnamoorthy (M.S.), K. Kulkarni (M.S.), R. Kumar (M.S.), P. Lai (M.S.), M. Li (Ph.D.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), H. Subramoni (Ph.D.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)

Past Research Scientists: K. Hamidouche, S. Sur

Past Post-Docs: D. Banerjee, X. Besseron, H.-W. Jin, J. Lin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne, H. Wang

Past Programmers: D. Bureddy, J. Perkins

Page 69: MVAPICH2-GDR: Pushing the Frontier of MPI Libraries

Thank You!

[email protected], [email protected]

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/

The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/