Designing Convergent HPC, Deep Learning and Big Data Analytics Software Stacks for Exascale Systems
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Talk at the HPCAC-AI Switzerland Conference (April '19)
High-End Computing (HEC): PetaFlop to ExaFlop
Expected to have an ExaFlop system in 2021!
(Chart: peak system performance over time; 100 PFlops in 2017, 143 PFlops in 2018, 1 EFlops in 2021?)
Increasing Usage of HPC, Big Data and Deep Learning
• HPC (MPI, RDMA, Lustre, etc.)
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)
Convergence of HPC, Big Data, and Deep Learning!
Increasing Need to Run these applications on the Cloud!!
Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure?
(Figure: Hadoop, Spark, and Deep Learning jobs mapped onto shared physical compute resources.)
Designing Communication Middleware for Multi-Petaflop and Exaflop Systems: Challenges
• Application Kernels/Applications
• Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
• Communication Library or Runtime for Programming Models: Point-to-point Communication, Collective Communication, Energy-Awareness, Synchronization and Locks, I/O and File Systems, Fault Tolerance
• Networking Technologies (InfiniBand, 40/100GigE, Aries, and Omni-Path)
• Multi-/Many-core Architectures
• Accelerators (NVIDIA and FPGA)
• Middleware co-design opportunities and challenges across these layers: Performance, Scalability, Fault-Resilience
Presentation Overview
• MVAPICH Project – MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiDL Project – High-Performance Deep Learning
• HiBD Project – High-Performance Big Data Analytics Library
• Commercial Support from X-ScaleSolutions
• Conclusions and Q&A
Parallel Programming Models Overview
• Shared Memory Model (e.g., SHMEM, DSM): processes P1, P2, P3 operate directly on a single shared memory
• Distributed Memory Model (e.g., MPI – Message Passing Interface): each process has its own memory and communicates through explicit messages
• Partitioned Global Address Space (PGAS) (e.g., Global Arrays, UPC, Chapel, X10, CAF, ...): per-process memories presented as a logical shared memory
• Programming models provide abstract machine models
• Models can be mapped on different types of systems
– e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually gaining importance
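As a concrete illustration of the distributed-memory model above, here is a minimal sketch (not from the talk) assuming a standard MPI installation such as MVAPICH2 with the usual mpicc/mpirun tooling:

```c
/* Minimal sketch of the distributed-memory (message-passing) model:
 * each process owns private memory and exchanges data only through
 * explicit messages. Build with mpicc, run with at least two processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double value = (double)rank;          /* private to this process */
    if (rank == 0 && size > 1) {
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Rank 1 received %.1f from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```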
Overview of the MVAPICH2 Project
• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 3,000 organizations in 88 countries
– More than 531,000 (> 0.5 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ‘18 ranking)
• 3rd ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
• 14th, 556,104 cores (Oakforest-PACS) in Japan
• 17th, 367,024 cores (Stampede2) at TACC
• 27th, 241,108-core (Pleiades) at NASA and many others
– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
Partner in the upcoming TACC Frontera System
MVAPICH2 Release Timeline and Downloads
(Chart: cumulative number of downloads from the OSU site, September 2004 through November 2018, approaching 600,000; annotated with release milestones from MV 0.9.4 and MV2 0.9.0 through the MV2 1.x series, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-Virt 2.2, MV2-X 2.3rc1, MV2-GDR 2.3, MV2 2.3.1, and OSU INAM 0.9.4.)
Architecture of MVAPICH2 Software Family
• High Performance Parallel Programming Models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
• High Performance and Scalable Communication Runtime with diverse APIs and mechanisms: Point-to-point Primitives, Collectives Algorithms, Energy-Awareness, Remote Memory Access, I/O and File Systems, Fault Tolerance, Virtualization, Active Messages, Job Startup, Introspection & Analysis
• Support for Modern Networking Technologies (InfiniBand, iWARP, RoCE, Omni-Path) and Multi-/Many-core Architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
  – Transport Protocols: RC, XRC, UD, DC
  – Transport Mechanisms: Shared Memory, CMA, IVSHMEM, XPMEM
  – Modern Features: UMR, ODP, SR-IOV, Multi-Rail, MCDRAM*, NVLink, CAPI* (* upcoming)
MVAPICH2 Software Family
• MVAPICH2: MPI with IB, iWARP, Omni-Path, and RoCE
• MVAPICH2-X: Advanced MPI features/support, OSU INAM, PGAS and MPI+PGAS with IB, Omni-Path, and RoCE
• MVAPICH2-GDR: MPI with IB, RoCE & GPU, and support for Deep Learning
• MVAPICH2-Virt: HPC Cloud with MPI & IB
• MVAPICH2-EA: Energy-aware MPI with IB, iWARP, and RoCE
• OEMT: MPI Energy Monitoring Tool
• OSU INAM: InfiniBand Network Analysis and Monitoring
• OMB: Microbenchmarks for Measuring MPI and PGAS Performance
Convergent Software Stacks for HPC, Big Data and Deep Learning
• HPC (MPI, RDMA, Lustre, etc.), Big Data (Hadoop, Spark, HBase, Memcached, etc.), and Deep Learning (Caffe, TensorFlow, BigDL, etc.)
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication
  – Optimized Collectives using SHArP and Multi-Leaders
  – Optimized CMA-based and XPMEM-based Collectives
  – Asynchronous Progress
• Optimized MVAPICH2 for OpenPOWER (with NVLink) and ARM
• Application Scalability and Best Practices
• Exploiting Accelerators (NVIDIA GPGPUs)
One-way Latency: MPI over IB with MVAPICH2
(Charts: small- and large-message one-way latency vs. message size for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR, ConnectX-5-EDR, and Omni-Path; small-message latencies are 0.98, 1.04, 1.11, 1.15, and 1.19 us across these adapters.)
Platforms: 3.1 GHz deca-core (Haswell) Intel nodes with PCIe Gen3 and the corresponding IB or Omni-Path switch; ConnectX-3-FDR measured on 2.8 GHz deca-core (IvyBridge) Intel nodes with PCIe Gen3 and an IB switch.
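The latency figures above come from the OSU micro-benchmarks (OMB). As a rough illustration of how such one-way latency is typically measured, the following ping-pong sketch (not the actual osu_latency code, which adds warm-up iterations, window sizes, and a sweep over message sizes) reports half the average round-trip time between two ranks:

```c
/* Simplified ping-pong sketch of a one-way latency measurement between
 * ranks 0 and 1. Run with exactly two processes, e.g., mpirun -np 2. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MSG_SIZE 1        /* bytes per message     */
#define ITERS    10000    /* ping-pong repetitions */

int main(int argc, char **argv)
{
    int rank;
    char buf[MSG_SIZE];
    memset(buf, 'a', MSG_SIZE);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)   /* one-way latency is half of the round-trip time */
        printf("One-way latency: %.2f us\n",
               elapsed * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}
```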
Bandwidth: MPI over IB with MVAPICH2
(Charts: unidirectional and bidirectional bandwidth vs. message size for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR, ConnectX-5-EDR, and Omni-Path; peak unidirectional bandwidths range from 3,373 to 12,590 MBytes/sec and peak bidirectional bandwidths from 6,228 to 24,136 MBytes/sec. Same platforms as the latency results above.)
Cooperative Rendezvous Protocols
• Use both sender and receiver CPUs to progress communication concurrently
• Dynamically select the rendezvous protocol based on communication primitives and sender/receiver availability (load balancing)
• Up to 2x improvement in large-message latency and bandwidth
• Up to 19% improvement for Graph500 at 1536 processes (a sketch of the kind of large-message exchange these protocols accelerate follows below)
(Charts: execution time of Graph500, CoMD, and MiniGhost from 28 to 1536 processes for MVAPICH2, Open MPI, and the proposed design, with annotated gains of 19%, 16%, and 10%.)
Platform: 2x14-core Broadwell 2680 (2.4 GHz), Mellanox EDR ConnectX-5 (100 Gbps); baselines: MVAPICH2-X 2.3rc1 and Open MPI v3.1.0
S. Chakraborty, M. Bayatpour, J. Hashmi, H. Subramoni, and D. K. Panda, Cooperative Rendezvous Protocols for Improved Performance and Overlap, SC '18 (Best Student Paper Award Finalist)
Will be available in the upcoming MVAPICH2-X 2.3rc2
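The cooperative protocol selection happens inside the MPI library, so applications need no changes. A hedged sketch of the kind of large-message ring exchange that takes the rendezvous path (message size and neighbor pattern are illustrative assumptions, not from the talk):

```c
/* Blocking exchange of large messages with ring neighbors. Transfers of
 * this size go through the rendezvous path inside the MPI library, where
 * the cooperative designs described above apply transparently. */
#include <mpi.h>
#include <stdlib.h>

#define COUNT (4 * 1024 * 1024)   /* 4M doubles = 32 MB per message */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sendbuf = malloc(COUNT * sizeof(double));
    double *recvbuf = malloc(COUNT * sizeof(double));
    for (int i = 0; i < COUNT; i++) sendbuf[i] = (double)rank;

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    /* Send to the right neighbor while receiving from the left neighbor. */
    MPI_Sendrecv(sendbuf, COUNT, MPI_DOUBLE, right, 0,
                 recvbuf, COUNT, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```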
Benefits of SHARP Allreduce at Application Level
(Chart: average DDOT Allreduce time of HPCG with MVAPICH2 at (4,28), (8,28), and (16,28) (nodes, PPN); up to 12% benefit with SHARP.)
SHARP support available since MVAPICH2 2.3a
• MV2_ENABLE_SHARP=1: run-time parameter that enables SHARP-based collectives (default: disabled)
• --enable-sharp: configure flag to enable SHARP (default: disabled)
• Refer to the "Running Collectives with Hardware based SHARP support" section of the MVAPICH2 user guide for more information: http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.3-userguide.html#x1-990006.26
A minimal Allreduce sketch follows below.
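From the application's point of view, SHARP offload requires no code change; only the build flag and run-time parameter above are needed. A minimal, illustrative sketch of the MPI_Allreduce call that the switch-based reduction accelerates:

```c
/* Minimal MPI_Allreduce, the operation SHARP offloads to the switch.
 * Assumes MVAPICH2 was built with --enable-sharp and that
 * MV2_ENABLE_SHARP=1 is set in the launch environment (per the list
 * above); the application code itself is unchanged either way. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = rank + 1.0, global = 0.0;
    /* Small-message reductions such as HPCG's DDOT are where the
     * switch-based offload typically helps most. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum over all ranks = %.1f\n", global);

    MPI_Finalize();
    return 0;
}
```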
MPI_Allreduce on KNL + Omni-Path (10,240 Processes)
(Charts: OSU Micro Benchmark MPI_Allreduce latency at 64 PPN, from 4 bytes to 256K bytes, comparing MVAPICH2, MVAPICH2-OPT, and IMPI; up to 2.4X improvement.)
• For MPI_Allreduce latency with 32K bytes, MVAPICH2-OPT reduces the latency by 2.4X
M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, Supercomputing '17. Available in MVAPICH2-X 2.3b
Optimized CMA-based Collectives for Large Messages
(Charts: MPI_Gather latency on KNL nodes at 64 PPN for 2 nodes/128 processes, 4 nodes/256 processes, and 8 nodes/512 processes, 1K to 4M bytes, comparing MVAPICH2-2.3a, Intel MPI 2017, OpenMPI 2.1.0, and the tuned CMA design; roughly 2.5x to 17x better than the existing libraries.)
• Significant improvement over existing implementations for Scatter/Gather with 1 MB messages (up to 4x on KNL, 2x on Broadwell, 14x on OpenPOWER)
• New two-level algorithms for better scalability
• Improved performance for other collectives (Bcast, Allgather, and Alltoall); a large-message Gather sketch follows below
S. Chakraborty, H. Subramoni, and D. K. Panda, Contention Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems, IEEE Cluster '17, Best Paper Finalist
Available in MVAPICH2-X 2.3b
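The kernel-assisted (CMA) intra-node path is selected inside the library, so the application simply issues the standard collective. A hedged sketch of the 1 MB-per-process Gather measured above (buffer sizes are illustrative):

```c
/* Large-message MPI_Gather of the kind benchmarked above. Each rank
 * contributes 1 MB and root 0 collects size * 1 MB; whether the CMA-based
 * intra-node path is used is decided inside the MPI library. */
#include <mpi.h>
#include <stdlib.h>

#define COUNT (1 << 20)   /* 1 MB of chars contributed per process */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *sendbuf = malloc(COUNT);
    char *recvbuf = (rank == 0) ? malloc((size_t)COUNT * size) : NULL;
    for (int i = 0; i < COUNT; i++) sendbuf[i] = (char)rank;

    MPI_Gather(sendbuf, COUNT, MPI_CHAR,
               recvbuf, COUNT, MPI_CHAR, 0, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```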
Shared Address Space (XPMEM)-based Collectives Design
(Charts: OSU_Allreduce and OSU_Reduce latency on Broadwell with 256 processes, 16K to 4M bytes, comparing MVAPICH2-2.3b, IMPI-2017v1.132, and MVAPICH2-X 2.3rc1; up to 1.8X better for Allreduce and up to 4X better for Reduce at 4 MB.)
• "Shared Address Space"-based true zero-copy reduction collective designs in MVAPICH2
• Offloads computation/communication to peer ranks in the reduction collective operation
• Up to 4X improvement for 4 MB Reduce and up to 1.8X improvement for 4 MB Allreduce
J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018. Available in MVAPICH2-X 2.3rc1
Efficient Zero-copy MPI Datatypes for Emerging Architectures
• New designs for efficient zero-copy based MPI derived datatype processing
• Efficient schemes mitigate datatype translation, packing, and exchange overheads
• Demonstrated benefits over prevalent MPI libraries for various application kernels (a derived-datatype sketch follows below)
• To be available in the upcoming MVAPICH2-X release
(Charts: log-scale latency of MVAPICH2X-2.3, IMPI 2018, IMPI 2019, and MVAPICH2X-Opt for the 3D-Stencil datatype kernel on Broadwell (2x14 core, 2-28 processes), the MILC datatype kernel on KNL 7250 in flat-quadrant mode (64-core), and the NAS-MG datatype kernel on OpenPOWER (20-core); up to 5X, 19X, and 3X better, respectively.)
FALCON: Efficient Designs for Zero-copy MPI Datatype Processing on Emerging Architectures, J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, D. K. (DK) Panda, 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS '19), May 2019. Best Paper Nominee
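The datatype kernels above exchange non-contiguous regions described with MPI derived datatypes. A hedged sketch (grid dimensions and decomposition are illustrative assumptions) of the kind of strided face a 3D-stencil code would describe and send without manual packing:

```c
/* A strided face of a 3D grid described as an MPI derived datatype and
 * sent directly, with no manual packing; the zero-copy designs above
 * target exactly this kind of non-contiguous transfer. */
#include <mpi.h>
#include <stdlib.h>

#define NX 64
#define NY 64
#define NZ 64   /* grid[x][y][z], z varies fastest */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *grid = malloc((size_t)NX * NY * NZ * sizeof(double));
    for (size_t i = 0; i < (size_t)NX * NY * NZ; i++) grid[i] = rank;

    /* A y = const face: NX blocks of NZ contiguous doubles, with a
     * stride of NY*NZ elements between consecutive blocks. */
    MPI_Datatype face;
    MPI_Type_vector(NX, NZ, NY * NZ, MPI_DOUBLE, &face);
    MPI_Type_commit(&face);

    if (size > 1) {
        if (rank == 0)
            MPI_Send(grid, 1, face, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(grid, 1, face, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    MPI_Type_free(&face);
    free(grid);
    MPI_Finalize();
    return 0;
}
```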
Benefits of the New Asynchronous Progress Design: Broadwell + InfiniBand
(Charts: P3DFFT time per loop at 112-448 processes (lower is better) and HPL performance in GFLOPS at 224-896 processes (higher is better), PPN = 28, comparing MVAPICH2 Async, MVAPICH2 Default, and IMPI 2019; per-point gains range from 8% to 33%. HPL run at 69% memory consumption.)
• Up to 33% performance improvement in the P3DFFT application with 448 processes
• Up to 29% performance improvement in the HPL application with 896 processes
• An overlap sketch follows below
A. Ruhela, H. Subramoni, S. Chakraborty, M. Bayatpour, P. Kousha, and D. K. Panda, Efficient Asynchronous Communication Progress for MPI without Dedicated Resources, EuroMPI 2018. Enhanced version accepted for the PARCO journal.
Available in MVAPICH2-X 2.3rc1
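Asynchronous progress pays off when non-blocking transfers are overlapped with computation, as in the hedged sketch below (the ring pattern, buffer sizes, and the dummy compute kernel are illustrative assumptions; the MVAPICH2-X user guide describes the run-time parameter that enables the feature):

```c
/* Overlapping large non-blocking transfers with local computation; an
 * asynchronous-progress design lets the transfers advance while
 * compute_chunk() runs, instead of only inside MPI_Waitall(). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define COUNT (1 << 22)   /* 4M doubles = 32 MB */

static double compute_chunk(const double *a, int n)   /* stand-in for app work */
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i];
    return s;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sendbuf = malloc(COUNT * sizeof(double));
    double *recvbuf = malloc(COUNT * sizeof(double));
    double *work    = malloc(COUNT * sizeof(double));
    for (int i = 0; i < COUNT; i++) { sendbuf[i] = rank; work[i] = i; }

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, COUNT, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, COUNT, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    double partial = compute_chunk(work, COUNT);   /* overlapped computation */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    if (rank == 0)
        printf("partial sum on rank 0: %.1f\n", partial);

    free(sendbuf); free(recvbuf); free(work);
    MPI_Finalize();
    return 0;
}
```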
Intra-node Point-to-Point Performance on OpenPOWER
(Charts: intra-socket small-message latency (0.22 us for MVAPICH2-2.3.1), large-message latency, bandwidth, and bi-directional bandwidth, comparing MVAPICH2-2.3.1 and SpectrumMPI-2019.02.07.)
Platform: two OpenPOWER (POWER9-ppc64le) nodes with a Mellanox EDR (MT4121) HCA
Intra-node Point-to-point Performance on ARM Cortex-A72
(Charts: small-message latency (0.27 microseconds at 1 byte), large-message latency, bandwidth, and bi-directional bandwidth with MVAPICH2-2.3.)
Platform: ARM Cortex-A72 (aarch64) dual-socket system with 64 cores (32 cores per socket)
SPEC MPI 2007 Benchmarks: Broadwell + InfiniBand
(Chart: execution time of MILC, Leslie3D, POP2, LAMMPS, WRF2, and LU with Intel MPI 18.1.163 vs. MVAPICH2-X, with annotated per-benchmark differences of 31%, 29%, 5%, -12%, 1%, and 11%.)
• MVAPICH2-X outperforms Intel MPI by up to 31%
Configuration: 448 processes on 16 Intel E5-2680v4 (Broadwell) nodes at 28 PPN, interconnected with 100 Gbps Mellanox MT4115 EDR ConnectX-4 HCAs
Application Scalability on Skylake and KNL (Stampede2)
(Charts: execution time with MVAPICH2 for MiniFE (1300x1300x1300, ~910 GB) at 2048-8192 processes, NEURON (YuEtAl2012) at 48-4096 processes, and Cloverleaf (bm64, MPI+OpenMP with NUM_OMP_THREADS=2) at 48-4352 processes, on Skylake (48 PPN) and KNL (64/68 PPN).)
Runtime parameters: MV2_SMPI_LENGTH_QUEUE=524288 PSM2_MQ_RNDV_SHM_THRESH=128K PSM2_MQ_RNDV_HFI_THRESH=128K
Courtesy: Mahidhar Tatineni @SDSC, Dong Ju (DJ) Choi @SDSC, and Samuel Khuvis @OSC. Testbed: TACC Stampede2 using MVAPICH2-2.3b
Applications-Level Tuning: Compilation of Best Practices
• MPI runtime has many parameters
• Tuning a set of parameters can help you extract higher performance
• Compiled a list of such contributions through the MVAPICH website: http://mvapich.cse.ohio-state.edu/best_practices/
• Initial list of applications
  – Amber
  – HoomDBlue
  – HPCG
  – Lulesh
  – MILC
  – Neuron
  – SMG2000
  – Cloverleaf
  – SPEC (LAMMPS, POP2, TERA_TF, WRF2)
• Soliciting additional contributions; send your results to mvapich-help at cse.ohio-state.edu and we will link these results with credits to you.
Optimized MVAPICH2-GDR Design
(Charts: GPU-to-GPU inter-node latency, bandwidth, and bi-directional bandwidth for MV2 (no GDR) vs. MV2-GDR 2.3; 1.85 us small-message latency (11X better) and roughly 9x-10x higher bandwidth and bi-bandwidth.)
Platform: MVAPICH2-GDR 2.3 on an Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, and Mellanox OFED 4.0 with GPU-Direct RDMA
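With a CUDA-aware library such as MVAPICH2-GDR, GPU buffers are passed directly to MPI calls. A hedged sketch follows (buffer size and device selection are illustrative assumptions; CUDA support is enabled at run time with MV2_USE_CUDA=1, as in the HOOMD-blue settings on the next slide):

```c
/* CUDA-aware point-to-point transfer: device pointers are handed straight
 * to MPI, and the GPU data path (e.g., GPUDirect RDMA) is handled inside
 * the library. Build against the CUDA runtime; run with two processes. */
#include <mpi.h>
#include <cuda_runtime.h>

#define COUNT (1 << 20)   /* 1M floats = 4 MB */

int main(int argc, char **argv)
{
    int rank, ndev = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaGetDeviceCount(&ndev);
    if (ndev > 0) cudaSetDevice(rank % ndev);

    float *d_buf;                                   /* GPU-resident buffer */
    cudaMalloc((void **)&d_buf, COUNT * sizeof(float));
    cudaMemset(d_buf, 0, COUNT * sizeof(float));

    if (rank == 0)
        MPI_Send(d_buf, COUNT, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, COUNT, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```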
Application-Level Evaluation (HOOMD-blue)
(Charts: average time steps per second (TPS) for 64K and 256K particles at 4-32 processes, MV2 vs. MV2+GDR; about 2X improvement in both cases.)
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
Application-Level Evaluation (COSMO) and Weather Forecasting in Switzerland
(Charts: normalized execution time of the default, callback-based, and event-based designs on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs).)
• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16
On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the COSMO application
COSMO model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
Device-to-Device Performance on OpenPOWER (NVLink2 + Volta)
(Charts: intra-node and inter-node GPU-to-GPU latency (small and large messages, intra- and inter-socket) and bandwidth vs. message size.)
• Intra-node latency: 5.36 us (without GDRCopy); intra-node bandwidth: 70.4 GB/sec for 128 MB (via NVLink2)
• Inter-node latency: 5.66 us (without GDRCopy); inter-node bandwidth: 23.7 GB/sec (2-port EDR)
Available since MVAPICH2-GDR 2.3a
Platform: OpenPOWER (POWER9-ppc64le) nodes equipped with dual-socket CPUs, 4 Volta V100 GPUs, and a 2-port EDR InfiniBand interconnect
MVAPICH2-GDR: Enhanced Derived Datatype Processing
• Kernel-based and GDRCOPY-based one-shot packing for inter-socket and inter-node communication
• Zero-copy (packing-free) for GPUs with peer-to-peer direct access over PCIe/NVLink
(Charts: speedup on the GPU-based DDTBench mimicking the MILC communication kernel for OpenMPI 4.0.0, MVAPICH2-GDR 2.3, and MVAPICH2-GDR-Next on an NVIDIA DGX-2 (16 Volta GPUs connected with NVSwitch, CUDA 9.2), and execution time of the COSMO model communication kernel on 8-64 GPUs of a Cray CS-Storm (8 Tesla K80 GPUs per node, CUDA 8.0); improved by roughly 10X.)
Presentation Overview
• MVAPICH Project – MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiDL Project – High-Performance Deep Learning
• HiBD Project – High-Performance Big Data Analytics Library
• Commercial Support from X-ScaleSolutions
• Conclusions and Q&A
Convergent Software Stacks for HPC, Big Data and Deep Learning
• HPC (MPI, RDMA, Lustre, etc.), Big Data (Hadoop, Spark, HBase, Memcached, etc.), and Deep Learning (Caffe, TensorFlow, BigDL, etc.)
Scalable TensorFlow using Horovod, MPI, and NCCL
• Efficient Allreduce is crucial for Horovod's overall training performance; both MPI and NCCL designs are available
• We have evaluated Horovod extensively and compared it across a wide range of designs using gRPC and gRPC extensions
• MVAPICH2-GDR achieved up to 90% scaling efficiency for ResNet-50 training on 64 Pascal GPUs
(Charts: images/second for Horovod-MPI, Horovod-NCCL2, Horovod-MPI-Opt (proposed), and the ideal curve on 1-16 GPUs and 1-64 nodes.)
Awan et al., Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation, CCGrid '19 (to be presented). https://arxiv.org/abs/1810.11112
MVAPICH2-GDR vs. NCCL2 – Allreduce Operation (DGX-2)
• Optimized designs in the upcoming MVAPICH2-GDR offer better or comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on one DGX-2 node (16 Volta GPUs); a device-buffer Allreduce sketch follows below
(Charts: latency of MVAPICH2-GDR-Next vs. NCCL 2.3 from 8 bytes to large messages; up to ~2.5X better for small and medium messages and ~1.7X better for large messages.)
Platform: NVIDIA DGX-2 system (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 9.2
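The comparison above is between collective calls issued directly on GPU memory. A hedged sketch (the buffer size is an illustrative stand-in for a gradient tensor) of MPI_Allreduce on device buffers, the MPI-side operation being compared against ncclAllreduce:

```c
/* MPI_Allreduce invoked directly on GPU buffers; with a CUDA-aware MPI
 * such as MVAPICH2-GDR the device-resident data is reduced without
 * staging through host memory. */
#include <mpi.h>
#include <cuda_runtime.h>

#define COUNT (1 << 22)   /* 4M floats = 16 MB, e.g., a gradient tensor */

int main(int argc, char **argv)
{
    int rank, ndev = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaGetDeviceCount(&ndev);
    if (ndev > 0) cudaSetDevice(rank % ndev);

    float *d_grad, *d_sum;           /* local gradients and their global sum */
    cudaMalloc((void **)&d_grad, COUNT * sizeof(float));
    cudaMalloc((void **)&d_sum,  COUNT * sizeof(float));
    cudaMemset(d_grad, 0, COUNT * sizeof(float));

    /* Element-wise sum across all ranks, as data-parallel training
     * frameworks like Horovod perform every iteration. */
    MPI_Allreduce(d_grad, d_sum, COUNT, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    cudaFree(d_grad);
    cudaFree(d_sum);
    MPI_Finalize();
    return 0;
}
```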
MVAPICH2-GDR vs. NCCL2 – ResNet-50 Training
• ResNet-50 training using the TensorFlow benchmark on one DGX-2 node (8 Volta GPUs)
(Charts: images per second and scaling efficiency on 1-8 GPUs for NCCL 2.3 vs. MVAPICH2-GDR-Next; up to 7.5% higher throughput.)
Scaling efficiency = (actual throughput / ideal throughput at scale) x 100%
Platform: NVIDIA DGX-2 system (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 9.2
Container Support
• Increasing trend to provide container support for MPI libraries
  – Ease of build
  – Portability
  – Reproducibility
• MVAPICH2-GDR 2.3.1 provides container (Docker) support
• More details are available in the MVAPICH2-GDR user guide: http://mvapich.cse.ohio-state.edu/userguide/gdr/
• Synergistic with the HPC-Container-Maker (hpccm) efforts by NVIDIA (https://github.com/NVIDIA/hpc-container-maker)
MVAPICH2-GDR on Containers with Negligible Overhead
(Charts: GPU-to-GPU inter-node latency, bandwidth, and bi-directional bandwidth, Docker vs. native, showing nearly identical performance.)
Platform: MVAPICH2-GDR 2.3.1 on an Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, and Mellanox OFED 4.0 with GPU-Direct RDMA
RDMA-TensorFlow Distribution
• High-Performance Design of TensorFlow over RDMA-enabled Interconnects
  – High-performance RDMA-enhanced design with native InfiniBand support at the verbs level for gRPC and TensorFlow
  – RDMA-based data communication
  – Adaptive communication protocols
  – Dynamic message chunking and accumulation
  – Support for RDMA device selection
  – Easily configurable for different protocols (native InfiniBand and IPoIB)
• Current release: 0.9.1
  – Based on Google TensorFlow 1.3.0
  – Tested with
    • Mellanox InfiniBand adapters (e.g., EDR)
    • NVIDIA GPGPU K80
    • CUDA 8.0 and cuDNN 5.0
  – http://hidl.cse.ohio-state.edu
• OpenPOWER support will be coming soon
Performance Benefit for RDMA-TensorFlow (Inception3)
(Charts: images/second vs. batch size (16, 32, 64) on 4 nodes (8 GPUs), 8 nodes (16 GPUs), and 12 nodes (24 GPUs), comparing default gRPC (IPoIB-100Gbps), Verbs (RDMA-100Gbps), MPI (RDMA-100Gbps), and AR-gRPC (RDMA-100Gbps).)
• TensorFlow Inception3 performance evaluation on an IB EDR cluster
  – Up to 20% performance speedup over default gRPC (IPoIB) for 8 GPUs
  – Up to 34% performance speedup over default gRPC (IPoIB) for 16 GPUs
  – Up to 37% performance speedup over default gRPC (IPoIB) for 24 GPUs
Using HiDL Packages for Deep Learning on Existing HPC Infrastructure
(Figure: deep learning jobs running on the same HPC infrastructure alongside other jobs, e.g., Hadoop jobs.)
Presentation Overview
• MVAPICH Project – MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiDL Project – High-Performance Deep Learning
• HiBD Project – High-Performance Big Data Analytics Library
• Commercial Support from X-ScaleSolutions
• Conclusions and Q&A
Data Management and Processing on Modern Datacenters
• Substantial impact on designing and utilizing data management and processing systems in multiple tiers
  – Front-end data accessing and serving (online): Memcached + DB (e.g., MySQL), HBase
  – Back-end data analytics (offline): HDFS, MapReduce, Spark
Convergent Software Stacks for HPC, Big Data and Deep Learning
• HPC (MPI, RDMA, Lustre, etc.), Big Data (Hadoop, Spark, HBase, Memcached, etc.), and Deep Learning (Caffe, TensorFlow, BigDL, etc.)
The High-Performance Big Data (HiBD) Project
• RDMA for Apache Spark
• RDMA for Apache Hadoop 3.x (RDMA-Hadoop-3.x)
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
  – Plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions
• RDMA for Apache Kafka
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
  – HDFS, Memcached, HBase, and Spark micro-benchmarks
• http://hibd.cse.ohio-state.edu
• Users base: 305 organizations from 35 countries
• More than 29,600 downloads from the project site
• Available for InfiniBand and RoCE (also runs on Ethernet), for x86 and OpenPOWER, with support for Singularity and Docker
Different Modes of RDMA for Apache Hadoop 2.x
• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation for better fault-tolerance as well as performance. This mode is enabled by default in the package.
• HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-memory and obtain as much performance benefit as possible.
• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
• HHH-L-BB: This mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD.
• MapReduce over Lustre, with/without local disks: Besides the HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone, in two modes: with local disks and without local disks.
• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).
RDMA for Apache Hadoop 3.x Distribution (NEW)
• High-Performance Design of Hadoop 3.x over RDMA-enabled Interconnects
  – High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for HDFS
  – Enhanced HDFS with in-memory and heterogeneous storage
  – Supports deploying Hadoop with Slurm and PBS in different running modes (HHH and HHH-M)
  – On-demand connection setup
• Current release: 0.9.1
  – Based on Apache Hadoop 3.0.0
  – Compliant with Apache Hadoop 3.0.0 APIs and applications
  – Tested with
    • Mellanox InfiniBand adapters (QDR, FDR, and EDR)
    • RoCE support with Mellanox adapters
    • Different file systems with disks, SSDs, and Lustre
    • OpenJDK and IBM JDK
• http://hibd.cse.ohio-state.edu
Performance Numbers of RDMA for Apache Hadoop 3.x (NEW)
(Charts: TestDFSIO execution time and throughput for 120 GB, 240 GB, and 360 GB datasets on the OSU-RI2 cluster, IPoIB (100Gbps) vs. RDMA-IB (100Gbps).)
• The RDMA-IB design improves the job execution time of TestDFSIO by 1.4x-3.4x compared to IPoIB (100Gbps)
• For the TestDFSIO throughput experiment, the RDMA-IB design shows an improvement of 1.2x-3.1x compared to IPoIB (100Gbps)
• Support for Erasure Coding is being worked on (HPDC '19, accepted for presentation)
Using HiBD Packages for Big Data Processing on Existing HPC Infrastructure
RDMA for Apache Spark Distribution
• High-Performance Design of Spark over RDMA-enabled Interconnects
  – High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for Spark
  – RDMA-based data shuffle and SEDA-based shuffle architecture
  – Non-blocking and chunk-based data transfer
  – Off-JVM-heap buffer management
  – Support for OpenPOWER
  – Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.5
  – Based on Apache Spark 2.1.0
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms (x86, POWER)
    • RAM disks, SSDs, and HDDs
  – http://hibd.cse.ohio-state.edu
Presentation Overview
• MVAPICH Project – MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiDL Project – High-Performance Deep Learning
• HiBD Project – High-Performance Big Data Analytics Library
• Commercial Support from X-ScaleSolutions
• Conclusions and Q&A
Commercial Support for MVAPICH2, HiBD, and HiDL Libraries
• Supported through X-ScaleSolutions (http://x-scalesolutions.com)
• Benefits:
  – Help and guidance with installation of the library
  – Platform-specific optimizations and tuning
  – Timely support for operational issues encountered with the library
  – Web portal interface to submit issues and track their progress
  – Advanced debugging techniques
  – Application-specific optimizations and tuning
  – Obtaining guidelines on best practices
  – Periodic information on major fixes and updates
  – Information on major releases
  – Help with upgrading to the latest release
  – Flexible Service Level Agreements
• Support provided to Lawrence Livermore National Laboratory (LLNL) for the last two years
Concluding Remarks
• Upcoming Exascale systems need to be designed with a holistic view of HPC, Big Data, Deep Learning, and Cloud
• Presented an overview of designing convergent software stacks for HPC, Big Data, and Deep Learning
• The presented solutions enable the HPC, Big Data, and Deep Learning communities to take advantage of current and next-generation systems
Upcoming 7th Annual MVAPICH User Group (MUG) Meeting
• August 19-21, 2019; Columbus, Ohio, USA
• Keynote talks, invited talks, invited tutorials by ARM, IBM, and Mellanox, contributed presentations, student poster presentations, and tutorials on MVAPICH2, MVAPICH2-X, MVAPICH2-GDR, MVAPICH2-MIC, MVAPICH2-Virt, MVAPICH2-EA, and OSU INAM, as well as other optimization and tuning hints
• Keynote speakers (confirmed so far)
  – Dan Stanzione, Texas Advanced Computing Center (TACC)
• Invited speakers (confirmed so far)
  – Hyun-Wook Jin, Konkuk University (South Korea)
  – Jithin Jose, Microsoft Azure
  – Gilad Shainer, Mellanox
  – Sameer Shende, ParaTools and University of Oregon
  – Sayantan Sur, Intel
  – Mahidhar Tatineni, San Diego Supercomputer Center (SDSC)
• Tutorials (confirmed so far)
  – ARM: advanced features of Arm Forge, demonstrating how performance problems in applications using MVAPICH2 can be identified and resolved
  – IBM: advanced features of the OpenPOWER architecture, including CAPI/OpenCAPI and the SNAP framework, with demos and hands-on sessions
  – Mellanox: upcoming technologies, schemes, and techniques being made available in next-generation Mellanox adapters and switches for Exascale systems
  – OSU: advanced features of the various MVAPICH2 libraries and how to take advantage of them to get peak performance for your applications
More details at: http://mug.mvapich.cse.ohio-state.edu
Funding Acknowledgments
Funding support by: (sponsor logos)
Equipment support by: (vendor logos)
Personnel Acknowledgments
Current Students (Graduate): A. Awan (Ph.D.), M. Bayatpour (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), S. Guganani (Ph.D.), J. Hashmi (Ph.D.), A. Jain (Ph.D.), K. S. Khorassani (Ph.D.), P. Kousha (Ph.D.), D. Shankar (Ph.D.)
Current Students (Undergraduate): V. Gangal (B.S.), M. Haupt (B.S.), N. Sarkauskas (B.S.), A. Yeretzian (B.S.)
Current Research Scientist: H. Subramoni
Current Research Asst. Professor: X. Lu
Current Post-docs: A. Ruhela, K. Manian
Current Research Specialist: J. Smith
Past Students: A. Augustine (M.S.), P. Balaji (Ph.D.), R. Biswas (M.S.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), K. Kulkarni (M.S.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), K. Kandalla (Ph.D.), M. Li (Ph.D.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), H. Subramoni (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.), J. Zhang (Ph.D.)
Past Research Scientists: K. Hamidouche, S. Sur
Past Post-Docs: D. Banerjee, X. Besseron, H.-W. Jin, J. Lin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne, H. Wang
Past Programmers: D. Bureddy, J. Perkins
Past Research Specialist: M. Arnold
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/