
Page 1: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale Systems

Keynote Talk at Bench '19 Conference by Dhabaleswar K. (DK) Panda, The Ohio State University

E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Follow us on

https://twitter.com/mvapich

Page 2: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

2Network Based Computing Laboratory Bench ‘19

High-End Computing (HEC): PetaFlop to ExaFlop

Expected to have an ExaFlop system in 2020-2021!

100 PFlops in 2017, 149 PFlops in 2018, and 1 EFlops in 2020-2021?

Page 3: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

3Network Based Computing Laboratory Bench ‘19

Big Data (Hadoop, Spark, HBase, Memcached, etc.)

Deep Learning (Caffe, TensorFlow, BigDL, etc.)

HPC (MPI, RDMA, Lustre, etc.)

Increasing Usage of HPC, Big Data and Deep Learning

Convergence of HPC, Big Data, and Deep Learning!

Increasing Need to Run these applications on the Cloud!!

Page 4: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

4Network Based Computing Laboratory Bench ‘19

Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

Physical Compute

Page 5: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

5Network Based Computing Laboratory Bench ‘19

Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

Page 6: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

6Network Based Computing Laboratory Bench ‘19

Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

Page 7: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

7Network Based Computing Laboratory Bench ‘19

Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

Spark Job, Hadoop Job, and Deep Learning Job

Page 8: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

8Network Based Computing Laboratory Bench ‘19

Presentation Overview

• MVAPICH Project – MPI and PGAS Library with CUDA-Awareness
• HiBD Project – High-Performance Big Data Analytics Library
• HiDL Project – High-Performance Deep Learning
• Public Cloud Deployment – Microsoft-Azure and Amazon-AWS
• Conclusions

Page 9: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

9Network Based Computing Laboratory Bench ‘19

Overview of the MVAPICH2 Project
• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002

– MVAPICH2-X (MPI + PGAS), Available since 2011

– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014

– Support for Virtualization (MVAPICH2-Virt), Available since 2015

– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015

– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015

– Used by more than 3,050 organizations in 89 countries

– More than 614,000 (> 0.6 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (Nov ‘18 ranking)

• 3rd, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China

• 5th, 448,448 cores (Frontera) at TACC

• 8th, 391,680 cores (ABCI) in Japan

• 15th, 570,020 cores (Nurion) in South Korea and many others

– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)

– http://mvapich.cse.ohio-state.edu

• Empowering Top500 systems for over a decade
• Partner in the TACC Frontera System

Page 10: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

10Network Based Computing Laboratory Bench ‘19

MVAPICH2 Release Timeline and Downloads
[Chart: cumulative number of downloads from the OSU site, Sep-04 through Apr-19, growing to more than 600,000; annotated with release milestones from MV 0.9.4 and MV2 0.9.0 through MV2 2.3.2, MV2-X 2.3rc2, MV2-GDR 2.3.2, MV2-MIC 2.0, MV2-Virt 2.2, OSU INAM 0.9.3, MV2-Azure 2.3.2, and MV2-AWS 2.3]

Page 11: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

11Network Based Computing Laboratory Bench ‘19

Architecture of MVAPICH2 Software Family

High Performance Parallel Programming Models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime with Diverse APIs and Mechanisms: Point-to-point Primitives, Collectives Algorithms, Energy-Awareness, Remote Memory Access, I/O and File Systems, Fault Tolerance, Virtualization, Active Messages, Job Startup, Introspection & Analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter) and Modern Multi-/Many-core Architectures (Intel-Xeon, OpenPOWER, Xeon-Phi, ARM, NVIDIA GPGPU)

Transport Protocols: RC, SRD, UD, DC; Modern Features: UMR, ODP, SR-IOV, Multi-Rail; Transport Mechanisms: Shared Memory, CMA, IVSHMEM, XPMEM; Modern Features: Optane*, NVLink, CAPI* (* upcoming)

Page 12: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

12Network Based Computing Laboratory Bench ‘19

MVAPICH2 Software Family

Requirements – Library
• MPI with IB, iWARP, Omni-Path, and RoCE – MVAPICH2
• Advanced MPI Features/Support, OSU INAM, PGAS and MPI+PGAS with IB, Omni-Path, and RoCE – MVAPICH2-X
• MPI with IB, RoCE & GPU and Support for Deep Learning – MVAPICH2-GDR
• HPC Cloud with MPI & IB – MVAPICH2-Virt
• Energy-aware MPI with IB, iWARP and RoCE – MVAPICH2-EA
• MPI Energy Monitoring Tool – OEMT
• InfiniBand Network Analysis and Monitoring – OSU INAM
• Microbenchmarks for Measuring MPI and PGAS Performance – OMB

Page 13: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

13Network Based Computing Laboratory Bench ‘19

Big Data (Hadoop, Spark, HBase, Memcached, etc.)

Deep Learning (Caffe, TensorFlow, BigDL, etc.)

HPC (MPI, RDMA, Lustre, etc.)

Convergent Software Stacks for HPC, Big Data and Deep Learning

Page 14: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

14Network Based Computing Laboratory Bench ‘19

Need for Micro-Benchmarks to Design and Evaluate Programming Models

• Message Passing Interface (MPI) is the common programming model in scientific computing
• Has 100's of APIs and primitives (point-to-point, RMA, collectives, datatypes, …)
• Multiple challenges for MPI developers, users, and managers of HPC centers:
– How to optimize the designs of these APIs on various hardware platforms and configurations? (designers and developers)
– How to compare the performance of an MPI library (at the API level) across various platforms and configurations? (designers, developers, and users)
– How to compare the performance of multiple MPI libraries (at the API level) on a given platform and across platforms? (procurement decisions by managers)
– How to correlate the performance from the micro-benchmark level to the overall application level? (application developers and users; also beneficial for co-designs)

Page 15: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

15Network Based Computing Laboratory Bench ‘19

• Available since 2004 (https://mvapich.cse.ohio-state.edu/benchmarks)

• Suite of microbenchmarks to study communication performance of various programming models

• Benchmarks available for the following programming models
– Message Passing Interface (MPI)

– Partitioned Global Address Space (PGAS)

• Unified Parallel C (UPC)

• Unified Parallel C++ (UPC++)

• OpenSHMEM

• Benchmarks available for multiple accelerator-based architectures
– Compute Unified Device Architecture (CUDA)

– OpenACC Application Program Interface

• Part of various national resource procurement suites like NERSC-8 / Trinity Benchmarks

• Continuing to add support for newer primitives and features

OSU Micro-Benchmarks (OMB)
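For concreteness, the ping-pong pattern that point-to-point latency benchmarks such as osu_latency are built around can be sketched as below. This is a minimal illustration, not the OMB source; the 8-byte message size and the iteration/warm-up counts are arbitrary example values.

/* Minimal ping-pong latency sketch in the spirit of osu_latency (not the OMB source). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 10000, skip = 100;   /* skip = warm-up iterations excluded from timing */
    const int size = 8;                    /* message size in bytes (example value) */
    char *buf = malloc(size);
    double t_start = 0.0;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < iters + skip; i++) {
        if (i == skip) t_start = MPI_Wtime();          /* start timing after warm-up */
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0) {
        /* one-way latency = half the average round-trip time, in microseconds */
        double lat_us = (MPI_Wtime() - t_start) * 1e6 / (2.0 * iters);
        printf("%d bytes: %.2f us\n", size, lat_us);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}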

Page 16: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

16Network Based Computing Laboratory Bench ‘19

OSU Micro-Benchmarks (MPI): Examples and Capabilities

• Host-Based
– Point-to-point
– Collectives (Blocking and Non-Blocking)
– Job-startup
• GPU-Based
– CUDA-aware
• Point-to-point: Device-to-Device (DD), Device-to-Host (DH), Host-to-Device (HD)
• Collectives
– Managed Memory
• Point-to-point: Managed-Device-to-Managed-Device (MD-MD)

Page 17: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

17Network Based Computing Laboratory Bench ‘19

One-way Latency: MPI over IB with MVAPICH2

[Charts: Small Message Latency and Large Message Latency (us) vs. Message Size (bytes); small-message latencies of 1.11, 1.19, 1.01, 1.15, 1.04, and 1.10 us across the fabrics listed below]
TrueScale-QDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
ConnectX-4-EDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB Switch
Omni-Path - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch
ConnectX-6-HDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB Switch

Page 18: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

18Network Based Computing Laboratory Bench ‘19

Bandwidth: MPI over IB with MVAPICH2

[Charts: Unidirectional Bandwidth and Bidirectional Bandwidth (MBytes/sec) vs. Message Size (bytes); peak unidirectional bandwidths of 3,373; 6,356; 12,083; 12,366; 12,590; and 24,532 MBytes/sec and peak bidirectional bandwidths of 6,228; 12,161; 21,227; 21,983; 24,136; and 48,027 MBytes/sec across the fabrics listed below]
TrueScale-QDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
ConnectX-4-EDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB Switch
Omni-Path - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch
ConnectX-6-HDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB Switch
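Bandwidth benchmarks of this kind typically keep a window of non-blocking transfers in flight before synchronizing, roughly as sketched below. This is an illustration only, not the OMB implementation; the window size and 1 MB message size are arbitrary example values.

/* Rough window-based bandwidth sketch (illustrative; not the OMB implementation). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW 64                          /* messages kept in flight per iteration */

int main(int argc, char **argv)
{
    const int iters = 100;
    const int size = 1 << 20;              /* 1 MB messages (example value) */
    char *buf = malloc(size);
    MPI_Request reqs[WINDOW];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t_start = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(buf, size, MPI_CHAR, 1, w, MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Recv(buf, 1, MPI_CHAR, 1, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);  /* ack */
        } else if (rank == 1) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(buf, size, MPI_CHAR, 0, w, MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Send(buf, 1, MPI_CHAR, 0, 99, MPI_COMM_WORLD);                     /* ack */
        }
    }

    if (rank == 0) {
        double secs = MPI_Wtime() - t_start;
        double mbytes = (double)size * WINDOW * iters / 1e6;
        printf("%.2f MBytes/sec\n", mbytes / secs);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}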

Page 19: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

19Network Based Computing Laboratory Bench ‘19

Intra-node Point-to-Point Performance on OpenPOWER
Platform: Two nodes of OpenPOWER (Power9-ppc64le) CPU using Mellanox EDR (MT4121) HCA
[Charts: Intra-Socket Small Message Latency, Large Message Latency, Bandwidth, and Bi-directional Bandwidth vs. message size, comparing MVAPICH2-X 2.3.2 with SpectrumMPI-10.3.0.01; MVAPICH2-X achieves 0.25 us small-message latency]

Page 20: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

20Network Based Computing Laboratory Bench ‘19

Point-to-point: Latency & Bandwidth (Inter-socket) on ARM
[Charts: Latency (us) and Bandwidth (MB/s) vs. Message Size (Bytes) for small, medium, and large messages, comparing MVAPICH2-X with OpenMPI+UCX; chart callouts show MVAPICH2-X up to 3.5x, 8.3x, and 5x better]

Page 21: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

21Network Based Computing Laboratory Bench ‘19

OSU Micro-Benchmarks (MPI): Examples and Capabilities

• Host-Based
– Point-to-point
– Collectives (Blocking and Non-Blocking)
– Job-startup
• GPU-Based
– CUDA-aware
• Point-to-point: Device-to-Device (DD), Device-to-Host (DH), Host-to-Device (HD)
• Collectives
– Managed Memory
• Point-to-point: Managed-Device-to-Managed-Device (MD-MD)

Page 22: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

22Network Based Computing Laboratory Bench ‘19

MPI_Allreduce on KNL + Omni-Path (10,240 Processes)

[Charts: OSU Micro-Benchmark MPI_Allreduce latency (us) vs. message size (4 B–4 KB and 8 KB–256 KB) at 64 PPN, comparing MVAPICH2, MVAPICH2-OPT, and IMPI; MVAPICH2-OPT is up to 2.4X better]

• For MPI_Allreduce latency with 32K bytes, MVAPICH2-OPT can reduce the latency by 2.4X
M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, SuperComputing '17. Available since MVAPICH2-X 2.3b
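The collective numbers above follow the usual methodology of timing the operation in a loop after a warm-up and averaging across ranks; a minimal sketch of that idea (not the osu_allreduce code itself, with an example 32 KB payload) is:

/* Minimal MPI_Allreduce latency sketch (illustrative; not the OMB osu_allreduce code). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int count = 32768 / sizeof(float);   /* 32 KB of floats, matching the example above */
    const int iters = 1000, skip = 100;
    float *sendbuf = calloc(count, sizeof(float));
    float *recvbuf = calloc(count, sizeof(float));
    double t_start = 0.0, elapsed, sum;
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int i = 0; i < iters + skip; i++) {
        if (i == skip) {
            MPI_Barrier(MPI_COMM_WORLD);       /* align ranks before timing */
            t_start = MPI_Wtime();
        }
        MPI_Allreduce(sendbuf, recvbuf, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    }
    elapsed = (MPI_Wtime() - t_start) * 1e6 / iters;   /* per-call latency in us */

    /* report the latency averaged over all ranks */
    MPI_Reduce(&elapsed, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("MPI_Allreduce, 32 KB: %.2f us (average over %d ranks)\n", sum / nprocs, nprocs);

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}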

Page 23: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

23Network Based Computing Laboratory Bench ‘19

Shared Address Space (XPMEM)-based Collectives Design

[Chart: OSU_Allreduce latency (us), 256 processes on Broadwell, 16 KB–4 MB messages, comparing MVAPICH2-2.3b, IMPI-2017v1.132, and MVAPICH2-X-2.3rc1]

• “Shared Address Space”-based true zero-copy Reduction collective designs in MVAPICH2

• Offloaded computation/communication to peer ranks in the reduction collective operation

• Up to 4X improvement for 4MB Reduce and up to 1.8X improvement for 4M AllReduce

[Chart: OSU_Reduce latency (us), 256 processes on Broadwell, 16 KB–4 MB messages, comparing MVAPICH2-2.3b, IMPI-2017v1.132, and MVAPICH2-2.3rc1; chart callouts: 1.8X (73.2 us) for Allreduce and 4X (36.1, 37.9, 16.8 us) for Reduce]

J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018.

Available since MVAPICH2-X 2.3rc1

Page 24: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

24Network Based Computing Laboratory Bench ‘19

Evaluation of SHArP-based Non-Blocking Allreduce
[Charts: MPI_Iallreduce benchmark with 1 PPN* on 8 nodes, 4–128 byte messages, comparing MVAPICH2 and MVAPICH2-SHArP: pure communication latency (us, lower is better) and communication-computation overlap (%, higher is better), with up to 2.3x benefit from SHArP]
*PPN: Processes Per Node

• Complete offload of the Allreduce collective operation to the switch enables much higher overlap of communication and computation


Available since MVAPICH2 2.3a
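The overlap percentage in this benchmark is typically obtained by issuing the non-blocking collective, performing dummy computation before waiting, and comparing against the pure-communication time. A hedged sketch of that measurement idea (illustrative only; not the OMB osu_iallreduce implementation, single measurement, no warm-up):

/* Sketch of measuring communication/computation overlap with MPI_Iallreduce
   (illustrative only; not the OMB implementation). */
#include <mpi.h>
#include <stdio.h>

#define COUNT 32                       /* small message, e.g. 128 bytes of floats */

static void fake_compute(double usec)  /* spin to emulate application computation */
{
    double t0 = MPI_Wtime();
    while ((MPI_Wtime() - t0) * 1e6 < usec)
        ;
}

int main(int argc, char **argv)
{
    float in[COUNT] = {0}, out[COUNT];
    MPI_Request req;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 1. Pure communication time of the non-blocking allreduce */
    double t0 = MPI_Wtime();
    MPI_Iallreduce(in, out, COUNT, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    double t_pure = MPI_Wtime() - t0;

    /* 2. Same operation with computation injected between issue and wait */
    t0 = MPI_Wtime();
    MPI_Iallreduce(in, out, COUNT, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD, &req);
    fake_compute(t_pure * 1e6);        /* compute for roughly the pure communication time */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    double t_total = MPI_Wtime() - t0;

    /* overlap = fraction of communication hidden behind computation */
    double overlap = 100.0 * (1.0 - (t_total - t_pure) / t_pure);
    if (rank == 0)
        printf("pure = %.2f us, with compute = %.2f us, overlap = %.1f%%\n",
               t_pure * 1e6, t_total * 1e6, overlap);

    MPI_Finalize();
    return 0;
}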

Page 25: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

25Network Based Computing Laboratory Bench ‘19

OSU Micro-Benchmarks (MPI): Examples and Capabilities

• Host-Based
– Point-to-point
– Collectives (Blocking and Non-Blocking)
– Job-startup
• GPU-Based
– CUDA-aware
• Point-to-point: Device-to-Device (DD), Device-to-Host (DH), Host-to-Device (HD)
– Managed Memory
• Point-to-point: Managed-Device-to-Managed-Device (MD-MD)

Page 26: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

26Network Based Computing Laboratory Bench ‘19

Startup Performance on TACC Frontera

• MPI_Init takes 3.9 seconds on 57,344 processes on 1,024 nodes
• MPI_Init takes 195 seconds on 229,376 processes on 4,096 nodes, while MVAPICH2 takes 31 seconds
• All numbers reported with 56 processes per node

New designs available in MVAPICH2-2.3.2
[Chart: MPI_Init time (milliseconds) on Frontera vs. number of processes (56–57,344), comparing Intel MPI 2019 and MVAPICH2 2.3.2; callouts of 4.5 s and 3.9 s]
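Startup time at this scale is usually measured by wrapping MPI_Init with a wall-clock timer taken outside MPI, roughly as below (a simple illustration; the exact measurement methodology used for the chart may differ):

/* Simple sketch for timing MPI_Init (illustrative measurement only). */
#include <mpi.h>
#include <stdio.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);   /* MPI_Wtime is not available before MPI_Init */
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(int argc, char **argv)
{
    double t0 = now_sec();
    MPI_Init(&argc, &argv);
    double local = now_sec() - t0, max_time;
    int rank;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* report the slowest rank, since startup completes only when every rank is up */
    MPI_Reduce(&local, &max_time, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("MPI_Init time (max across ranks): %.2f s\n", max_time);

    MPI_Finalize();
    return 0;
}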

Page 27: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

27Network Based Computing Laboratory Bench ‘19

OSU Micro-Benchmarks (MPI): Examples and Capabilities

• Host-Based
– Point-to-point
– Collectives (Blocking and Non-Blocking)
– Job-startup
• GPU-Based
– CUDA-aware
• Point-to-point: Device-to-Device (DD), Device-to-Host (DH) and Host-to-Device (HD)
• Collectives
– Managed Memory
• Point-to-point: Managed-Device-to-Managed-Device (MD-MD)

Page 28: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

28Network Based Computing Laboratory Bench ‘19

[Figure: two nodes, each with CPUs, GPUs attached over PCIe, and an IB adapter, showing the memory-buffer communication paths listed below]
1. Intra-GPU
2. Intra-Socket GPU-GPU
3. Inter-Socket GPU-GPU
4. Inter-Node GPU-GPU
5. Intra-Socket GPU-Host
6. Inter-Socket GPU-Host
7. Inter-Node GPU-Host
8. Inter-Node GPU-GPU with IB adapter on remote socket, and more …
• For each path, different schemes: Shared_mem, IPC, GPUDirect RDMA, pipeline …
• Critical for runtimes to optimize data movement while hiding the complexity

• Connected as PCIe devices – Flexibility but Complexity

Optimizing MPI Data Movement on GPU Clusters

Page 29: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

29Network Based Computing Laboratory Bench ‘19

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU

At Sender: MPI_Send(s_devbuf, size, …);
At Receiver: MPI_Recv(r_devbuf, size, …);
(everything else is handled inside MVAPICH2)

• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers

High Performance and High Productivity
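A complete miniature of this usage pattern is sketched below: device buffers are allocated with CUDA and passed directly to MPI_Send/MPI_Recv, and the CUDA-aware library moves the data. This is a hedged illustration (requires a CUDA-aware MPI such as MVAPICH2-GDR; the 4 MB size is an example and error checks are omitted).

/* Minimal CUDA-aware MPI example: pass device pointers straight to MPI.
   Requires a CUDA-aware MPI library (e.g., MVAPICH2-GDR); error checks omitted. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int size = 4 * 1024 * 1024;       /* 4 MB payload (example value) */
    char *devbuf;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaSetDevice(0);                       /* in practice, map each rank to a local GPU */
    cudaMalloc((void **)&devbuf, size);
    cudaMemset(devbuf, rank, size);

    if (rank == 0) {
        MPI_Send(devbuf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);              /* s_devbuf */
    } else if (rank == 1) {
        MPI_Recv(devbuf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                                          /* r_devbuf */
        printf("rank 1 received %d bytes into GPU memory\n", size);
    }

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}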

Page 30: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

30Network Based Computing Laboratory Bench ‘19

Optimized MVAPICH2-GDR Design (D-D)
[Charts: GPU-GPU inter-node latency (us), bandwidth (MB/s), and bi-bandwidth (MB/s) vs. message size (1 B–8 KB), comparing MV2 (NO-GDR) with MV2-GDR 2.3; 1.85 us latency and roughly 9x–11x improvement with GDR]
MVAPICH2-GDR-2.3; Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores; NVIDIA Volta V100 GPU; Mellanox Connect-X4 EDR HCA; CUDA 9.0; Mellanox OFED 4.0 with GPU-Direct-RDMA

Page 31: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

31Network Based Computing Laboratory Bench ‘19

D-to-D Performance on OpenPOWER w/ GDRCopy (NVLink2 + Volta)

Platform: OpenPOWER (POWER9-ppc64le) nodes equipped with a dual-socket CPU, 4 Volta V100 GPUs, and 2-port EDR InfiniBand Interconnect

Intra-node Latency: 0.90 us (with GDRCopy); Intra-node Bandwidth: 62.79 GB/sec for 4 MB (via NVLINK2)
Inter-node Latency: 2.04 us (with GDRCopy); Inter-node Bandwidth: 12.03 GB/sec (2-port EDR)
Available since MVAPICH2-GDR 2.3.2

[Charts: intra-node latency (small and large messages, intra-socket vs. inter-socket), intra-node bandwidth, and inter-node latency and bandwidth vs. message size]
Page 32: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

32Network Based Computing Laboratory Bench ‘19

D-to-H & H-to-D Performance on OpenPOWER w/ GDRCopy (NVLink2 + Volta)

Platform: OpenPOWER (POWER9-ppc64le) nodes equipped with a dual-socket CPU, 4 Volta V100 GPUs, and 2-port EDR InfiniBand Interconnect

Intra-node D-H Latency: 0.49 us (with GDRCopy); Intra-node D-H Bandwidth: 16.70 GB/sec for 2 MB (via NVLINK2)
Intra-node H-D Latency: 0.49 us (with GDRCopy); Intra-node H-D Bandwidth: 26.09 GB/sec for 2 MB (via NVLINK2)
Available since MVAPICH2-GDR 2.3a

[Charts: D-H and H-D intra-node latency (small and large messages) and bandwidth vs. message size, comparing Spectrum MPI with MV2-GDR]

Page 33: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

33Network Based Computing Laboratory Bench ‘19

MVAPICH2-GDR: Enhanced MPI_Allreduce at Scale
• Optimized designs in upcoming MVAPICH2-GDR offer better performance for most cases

• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) up to 1,536 GPUs

[Charts: MPI_Allreduce latency (us) and bandwidth (GB/s) on 1,536 GPUs vs. message size, plus bandwidth for a 128 MB message vs. number of GPUs (24–1,536), comparing MVAPICH2-GDR-2.3.2 with NCCL 2.4, SpectrumMPI 10.2.0.11, and OpenMPI 4.0.1; MVAPICH2-GDR is up to 1.6X better on latency and up to 1.7X better on bandwidth]
Platform: Dual-socket IBM POWER9 CPU, 6 NVIDIA Volta V100 GPUs, and 2-port InfiniBand EDR Interconnect
Page 34: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

34Network Based Computing Laboratory Bench ‘19

OSU Micro-Benchmarks (MPI): Examples and Capabilities

• Host-Based
– Point-to-point
– Collectives (Blocking and Non-Blocking)
– Job-startup
• GPU-Based
– CUDA-aware
• Point-to-point: Device-to-Device (DD), Device-to-Host (DH) and Host-to-Device (HD)
• Collectives
– Managed Memory
• Point-to-point: Managed-Device-to-Managed-Device (MD-MD)

Page 35: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

35Network Based Computing Laboratory Bench ‘19

Managed Memory Performance (Inter-node x86) with MVAPICH2-GDR

[Charts: MD-MD latency, bandwidth, and bi-bandwidth]
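For context, the MD-MD benchmarks exchange buffers allocated with CUDA managed (unified) memory. A minimal hedged sketch of that setup (illustrative only; requires a CUDA-aware MPI such as MVAPICH2-GDR, and the 1 MB size is an example):

/* Managed-memory (MD-MD) exchange sketch: both ranks pass cudaMallocManaged buffers to MPI.
   Requires a CUDA-aware MPI library; illustrative only, error checks omitted. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int size = 1 << 20;          /* 1 MB (example value) */
    char *mbuf;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaSetDevice(0);
    cudaMallocManaged((void **)&mbuf, size, cudaMemAttachGlobal);   /* managed buffer */

    if (rank == 0) {
        cudaMemset(mbuf, 7, size);
        cudaDeviceSynchronize();       /* make GPU-side initialization visible before send */
        MPI_Send(mbuf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(mbuf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1: first byte = %d\n", mbuf[0]);   /* managed memory is host-accessible */
    }

    cudaFree(mbuf);
    MPI_Finalize();
    return 0;
}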

Page 36: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

36Network Based Computing Laboratory Bench ‘19

Managed Memory Performance (OpenPOWER Intra-node)

[Charts: MD-MD latency, bandwidth, and bi-bandwidth]

Page 37: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

37Network Based Computing Laboratory Bench ‘19

Presentation Overview

• MVAPICH Project – MPI and PGAS Library with CUDA-Awareness
• HiBD Project – High-Performance Big Data Analytics Library
• HiDL Project – High-Performance Deep Learning
• Public Cloud Deployment – Microsoft-Azure and Amazon-AWS
• Conclusions

Page 38: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

38Network Based Computing Laboratory Bench ‘19

• Substantial impact on designing and utilizing data management and processing systems in multiple tiers
– Front-end data accessing and serving (Online): Memcached + DB (e.g., MySQL), HBase
– Back-end data analytics (Offline): HDFS, MapReduce, Spark

Data Management and Processing on Modern Datacenters

Page 39: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

39Network Based Computing Laboratory Bench ‘19

Big Data (Hadoop, Spark, HBase, Memcached, etc.)

Deep Learning (Caffe, TensorFlow, BigDL, etc.)

HPC (MPI, RDMA, Lustre, etc.)

Convergent Software Stacks for HPC, Big Data and Deep Learning

Page 40: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

40Network Based Computing Laboratory Bench ‘19

• RDMA for Apache Spark

• RDMA for Apache Hadoop 3.x (RDMA-Hadoop-3.x)

• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)

– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions

• RDMA for Apache Kafka

• RDMA for Apache HBase

• RDMA for Memcached (RDMA-Memcached)

• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)

• OSU HiBD-Benchmarks (OHB)

– HDFS, Memcached, HBase, and Spark Micro-benchmarks

• http://hibd.cse.ohio-state.edu

• Users Base: 315 organizations from 35 countries

• More than 31,600 downloads from the project site

The High-Performance Big Data (HiBD) Project

Available for InfiniBand and RoCE; also runs on Ethernet

Available for x86 and OpenPOWER

Support for Singularity and Docker

Page 41: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

41Network Based Computing Laboratory Bench ‘19

Current set of Benchmarks for Big Data

• Hadoop Benchmarks – DFSIO, Terasort, Teragen, HiBench, …
• PUMA
• YCSB
• Spark Benchmarks – GroupBy, PageRank, K-means, …
• BigDataBench

Page 42: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

42Network Based Computing Laboratory Bench ‘19

• The current benchmarks capture some overall performance behavior
• However, they do not provide any information to the designer/developer on:
– What is happening at the lower layer?
– Where are the benefits coming from?
– Which design is leading to benefits or bottlenecks?
– Which component in the design needs to be changed, and what will be its impact?
– Can performance gain/loss at the lower layer be correlated to the performance gain/loss observed at the upper layer?

Are the Current Benchmarks Sufficient for Big Data?

Page 43: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

43Network Based Computing Laboratory Bench ‘19

Challenges in Benchmarking of Optimized Designs
[Figure: layered stack with Applications and Benchmarks on top of Big Data Middleware (HDFS, MapReduce, HBase, Spark, and Memcached), Programming Models (Sockets; other protocols such as RDMA?), and a Communication and I/O Library (point-to-point communication, threaded models and synchronization, virtualization (SR-IOV), I/O and file systems, QoS & fault tolerance, performance tuning), running over modern networking technologies (InfiniBand, 1/10/40/100 GigE and intelligent NICs), commodity multi-/many-core architectures and accelerators, and storage technologies (HDD, SSD, NVM, and NVMe-SSD); current benchmarks exist only at the application level, there are no benchmarks for the lower layers, and the correlation between the two is unclear]

Page 44: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

44Network Based Computing Laboratory Bench ‘19

Iterative Process – Requires Deeper Investigation and Design for Benchmarking Next Generation Big Data Systems and Applications
[Figure: the same layered stack as the previous slide, now annotated with both applications-level benchmarks and micro-benchmarks feeding an iterative benchmarking and design process]

Page 45: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

45Network Based Computing Laboratory Bench ‘19

• Evaluate the performance of standalone HDFS

• Five different benchmarks
– Sequential Write Latency (SWL)

– Sequential or Random Read Latency (SRL or RRL)

– Sequential Write Throughput (SWT)

– Sequential Read Throughput (SRT)

– Sequential Read-Write Throughput (SRWT)

OSU HiBD Micro-Benchmark (OHB) Suite - HDFS

[Table: which parameters (file name, file size, HDFS parameter, number of readers, number of writers, random/sequential read, seek interval) apply to each benchmark (SWL, SRL/RRL, SWT, SRT, SRWT)]

N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D. K. Panda, A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Int'l Workshop on Big Data Benchmarking (WBDB '12), December 2012

Page 46: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

46Network Based Computing Laboratory Bench ‘19

• Evaluate the performance of stand-alone MapReduce

• Does not require or involve HDFS or any other distributed file system

• Models the shuffle data patterns found in real Hadoop application workloads

• Considers various factors that influence the data shuffling phase
– underlying network configuration, number of map and reduce tasks, intermediate shuffle data pattern, shuffle data size, etc.
• Two different micro-benchmarks based on generic intermediate shuffle patterns
– MR-AVG: intermediate data is evenly (or approximately equally) distributed among reduce tasks
• MR-RR, i.e., round-robin distribution, and MR-RAND, i.e., pseudo-random distribution
– MR-SKEW: intermediate data is unevenly distributed among reduce tasks
• Total number of shuffle key/value pairs, max% per reducer, and min% per reducer configure the skew

OSU HiBD Micro-Benchmark (OHB) Suite - MapReduce

D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks, BPOE-5 (2014)

D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, Characterizing and benchmarking stand-alone Hadoop MapReduce on modern HPC clusters, The Journal of Supercomputing (2016)

Presenter Notes:
MR-RR: In this micro-benchmark, the map output key/value pairs are distributed among the reducers in a round-robin fashion, making sure that each reducer gets the same number of intermediate key/value pairs. MR-RAND: In this micro-benchmark, we randomly distribute the intermediate key/value pairs among the reduce tasks. MR-SKEW: In this micro-benchmark, we distribute the intermediate key/value pairs unevenly among the reducers, to represent the SKEW pattern, referred to as MR-SKEW. We employ two user-defined skew distribution parameters: (1) max%, which is the maximum percentage, and (2) min%, which is the minimum percentage, of the total number of key/value pairs to be distributed to a given reducer, at each map task. The key/value pairs are distributed among the reducers such that min% ≤ pairs_per_reducer ≤max%.
Page 47: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

47Network Based Computing Laboratory Bench ‘19

• Two different micro-benchmarks to evaluate the performance of standalone Hadoop RPC
– Latency: Single Server, Single Client

– Throughput: Single Server, Multiple Clients

• A simple script framework for job launching and resource monitoring

• Calculates statistics like Min, Max, Average

• Network configuration, Tunable parameters, DataType, CPU Utilization

OSU HiBD Micro-Benchmark (OHB) Suite - RPC

[Table: which parameters (network address, port, data type, min/max message size, number of iterations, number of clients, handlers, verbose) apply to each component (lat_client, lat_server, thr_client, thr_server)]

X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks, Int'l Workshop on Big Data Benchmarking (WBDB '13), July 2013

Page 48: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

48Network Based Computing Laboratory Bench ‘19

• Evaluates the performance of stand-alone Memcached in different modes

• Default API latency benchmarks for Memcached in-memory mode
– SET Micro-benchmark: Micro-benchmark for memcached set operations

– GET Micro-benchmark: Micro-benchmark for memcached get operations

– MIX Micro-benchmark: Micro-benchmark for a mix of memcached set/get operations (Read:Write ratio is 90:10)

• Latency benchmarks for Memcached hybrid-memory mode

• Non-Blocking API Latency Benchmark for Memcached (both in-memory and hybrid-memory mode)

• Calculates average latency of Memcached operations in different modes

OSU HiBD Micro-Benchmark (OHB) Suite - Memcached

D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, Benchmarking Key-Value Stores on High-Performance Storage and Interconnects for Web-Scale Workloads, IEEE International Conference on Big Data (IEEE BigData ‘15), Oct 2015
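As a rough illustration of what a SET/GET latency micro-benchmark measures, the sketch below times libmemcached operations against a running memcached server. It is a hedged stand-in, not the OHB code; the server address (127.0.0.1:11211), value size, and iteration count are arbitrary assumptions.

/* Rough SET/GET latency sketch using libmemcached (not the OHB implementation).
   Assumes a memcached server is running on 127.0.0.1:11211. */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void)
{
    const int iters = 10000;
    const char *key = "ohb_key";
    char value[512];
    memset(value, 'x', sizeof(value));

    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "127.0.0.1", 11211);

    double t0 = now_us();
    for (int i = 0; i < iters; i++)
        memcached_set(memc, key, strlen(key), value, sizeof(value), 0, 0);
    printf("SET: %.2f us/op\n", (now_us() - t0) / iters);

    t0 = now_us();
    for (int i = 0; i < iters; i++) {
        size_t len; uint32_t flags; memcached_return_t rc;
        char *got = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
        free(got);                     /* the caller owns the buffer returned by memcached_get */
    }
    printf("GET: %.2f us/op\n", (now_us() - t0) / iters);

    memcached_free(memc);
    return 0;
}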

Page 49: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

49Network Based Computing Laboratory Bench ‘19

• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation to have better fault-tolerance as well as performance. This mode is enabled by default in the package.

• HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-memory and obtain as much performance benefit as possible.

• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.

• HHH-L-BB: This mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD.

• MapReduce over Lustre, with/without local disks: Besides HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.

• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).

Different Modes of RDMA for Apache Hadoop 2.x

Page 50: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

50Network Based Computing Laboratory Bench ‘19

Using HiBD Packages for Big Data Processing on Existing HPC Infrastructure

Page 51: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

51Network Based Computing Laboratory Bench ‘19

• High-Performance Design of Spark over RDMA-enabled Interconnects

– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Spark

– RDMA-based data shuffle and SEDA-based shuffle architecture

– Non-blocking and chunk-based data transfer

– Off-JVM-heap buffer management

– Support for OpenPOWER

– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)

• Current release: 0.9.5

– Based on Apache Spark 2.1.0

– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)

• RoCE support with Mellanox adapters

• Various multi-core platforms (x86, POWER)

• RAM disks, SSDs, and HDD

– http://hibd.cse.ohio-state.edu

RDMA for Apache Spark Distribution

Page 52: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

52Network Based Computing Laboratory Bench ‘19

Using HiBD Packages for Big Data Processing on Existing HPC Infrastructure

Page 53: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

53Network Based Computing Laboratory Bench ‘19

Presentation Overview

• MVAPICH Project – MPI and PGAS Library with CUDA-Awareness
• HiBD Project – High-Performance Big Data Analytics Library
• HiDL Project – High-Performance Deep Learning
• Public Cloud Deployment – Microsoft-Azure and Amazon-AWS
• Conclusions

Page 54: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

54Network Based Computing Laboratory Bench ‘19

• Deep Learning frameworks are a different game altogether

– Unusually large message sizes (order of megabytes)

– Most communication based on GPU buffers

• Existing State-of-the-art
– cuDNN, cuBLAS, NCCL --> scale-up performance
– NCCL2, CUDA-Aware MPI --> scale-out performance
• For small and medium message sizes only!

• Proposed: Can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?

– Efficient Overlap of Computation and Communication

– Efficient Large-Message Communication (Reductions)

– What application co-designs are needed to exploit communication-runtime co-designs?

Deep Learning: New Challenges for MPI Runtimes

[Figure: scale-up vs. scale-out performance space, positioning cuDNN, MKL-DNN, NCCL2, gRPC, Hadoop, and MPI, with the desired solution delivering both]
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)
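Since DL training stresses very large reductions on GPU-resident gradient buffers, the core communication pattern can be sketched as below (an illustration under assumptions: a CUDA-aware MPI library such as MVAPICH2-GDR, one GPU per rank, and an arbitrary 256 MB gradient size).

/* Large-message MPI_Allreduce on GPU buffers, the dominant communication pattern in
   data-parallel DL training. Requires a CUDA-aware MPI library; illustrative sketch only. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int count = 64 * 1024 * 1024;     /* 64M floats = 256 MB of "gradients" (example) */
    float *grads;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaSetDevice(0);                       /* in practice, one GPU per local rank */
    cudaMalloc((void **)&grads, count * sizeof(float));
    cudaMemset(grads, 0, count * sizeof(float));

    double t0 = MPI_Wtime();
    /* sum the gradient buffer in place across all ranks, directly from GPU memory */
    MPI_Allreduce(MPI_IN_PLACE, grads, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    double secs = MPI_Wtime() - t0;

    if (rank == 0)
        printf("256 MB allreduce: %.3f s (%.2f GB/s)\n",
               secs, count * sizeof(float) / secs / 1e9);

    cudaFree(grads);
    MPI_Finalize();
    return 0;
}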

Page 55: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

55Network Based Computing Laboratory Bench ‘19

Big Data (Hadoop, Spark, HBase, Memcached, etc.)

Deep Learning (Caffe, TensorFlow, BigDL, etc.)

HPC (MPI, RDMA, Lustre, etc.)

Convergent Software Stacks for HPC, Big Data and Deep Learning

Page 56: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

56Network Based Computing Laboratory Bench ‘19

• CPU-based Deep Learning – Using MVAPICH2-X
• GPU-based Deep Learning – Using MVAPICH2-GDR

High-Performance Deep Learning

Page 57: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

57Network Based Computing Laboratory Bench ‘19

Large-Scale Benchmarking of DL Frameworks on Frontera
• TensorFlow, PyTorch, and MXNet are widely used Deep Learning Frameworks

• Optimized by Intel using Math Kernel Library for DNN (MKL-DNN) for Intel processors

• Single Node performance can be improved by running Multiple MPI processes

[Figures: impact of batch size on ResNet-50 performance, and performance improvement from running multiple MPI processes]

*Jain et al., “Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera”, DLS ’19 (in conjunction with SC ’19).

Page 58: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

58Network Based Computing Laboratory Bench ‘19

ResNet-50 using various DL benchmarks on Frontera

• Observed 260K images per sec for ResNet-50 on 2,048 Nodes

• Scaled MVAPICH2-X on 2,048 nodes on Frontera for Distributed Training using TensorFlow

• ResNet-50 can be trained in 7 minutes on 2048 nodes (114,688 cores)

*Jain et al., “Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera”, DLS ’19 (in conjunction with SC ’19).

Page 59: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

59Network Based Computing Laboratory Bench ‘19

Benchmarking TensorFlow (TF) and PyTorch
[Figures: scaling results for PyTorch and TensorFlow]

• Comprehensive and systematic performance benchmarking

– tf_cnn_benchmarks (TF)

– Horovod benchmark (PyTorch)

• TensorFlow is up to 2.5X faster than PyTorch for 128 Nodes.

• TensorFlow: up to 125X speedup for ResNet-152 on 128 nodes

• PyTorch: Scales well but overall lower performance than TensorFlow

*Jain et al., “Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters”, IEEE Cluster ’19.

Page 60: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

60Network Based Computing Laboratory Bench ‘19

• CPU based Hybrid-Parallel (Data Parallelism and Model Parallelism) training on Stampede2

• Benchmark developed for various configurations

– Batch sizes

– No. of model partitions

– No. of model replicas

• Evaluation on a very deep model – ResNet-1000 (a 1,000-layer model)

Benchmarking HyPar-Flow on Stampede

*Awan et al., “HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models”, arXiv ’19. https://arxiv.org/pdf/1911.05146.pdf

110x speedup on 128 Intel Xeon Skylake nodes (TACC Stampede2 Cluster)

Page 61: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

61Network Based Computing Laboratory Bench ‘19

• CPU-based Deep Learning – Using MVAPICH2-X
• GPU-based Deep Learning – Using MVAPICH2-GDR

High-Performance Deep Learning

Page 62: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

62Network Based Computing Laboratory Bench ‘19

Distributed Training with TensorFlow and MVAPICH2-GDR
• ResNet-50 training using the TensorFlow benchmark on Summit – 1,536 Volta GPUs!

• 1,281,167 (1.2 mil.) images

• Time/epoch = 3.6 seconds

• Total Time (90 epochs) = 3.6 x 90 = 332 seconds = 5.5 minutes!
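As a back-of-the-envelope check, these figures are mutually consistent: at the ~0.35 million images/second noted below, one pass over the 1,281,167 ImageNet-1k images takes roughly 1,281,167 / 350,000 ≈ 3.7 seconds, and 90 such epochs take about 90 × 3.7 ≈ 330 seconds, i.e., roughly the 5.5 minutes quoted above.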

[Chart: images per second (thousands) vs. number of GPUs (1–1,536), comparing NCCL-2.4 and MVAPICH2-GDR-2.3.2]

Platform: The Summit Supercomputer (#1 on Top500.org) – 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 9.2

*We observed errors for NCCL2 beyond 96 GPUs

MVAPICH2-GDR reaching ~0.35 million images per second for ImageNet-1k!

ImageNet-1k has 1.2 million images

Presenter Notes:
Time is calculated based on the images/second performance. It might be different when actual training happens.
Page 63: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

63Network Based Computing Laboratory Bench ‘19

• Near-linear scaling may be achieved by tuning Horovod/MPI
• Optimizing MPI/Horovod towards large message sizes for high-resolution images
• Develop a generic Image Segmentation benchmark
• Tuned DeepLabV3+ model using the benchmark and Horovod – up to 1.3X better than default

New Benchmark for Image Segmentation on Summit

*Anthony et al., “Scaling Semantic Image Segmentation using Tensorflow and MVAPICH2-GDR on HPC Systems” (Submission under review)

Page 64: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

64Network Based Computing Laboratory Bench ‘19

Using HiDL Packages for Deep Learning on Existing HPC Infrastructure


Page 65: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

65Network Based Computing Laboratory Bench ‘19

Presentation Overview

• MVAPICH Project – MPI and PGAS Library with CUDA-Awareness
• HiBD Project – High-Performance Big Data Analytics Library
• HiDL Project – High-Performance Deep Learning
• Public Cloud Deployment – Microsoft-Azure and Amazon-AWS
• Conclusions

Page 66: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

66Network Based Computing Laboratory Bench ‘19

• Released on 08/16/2019

• Major Features and Enhancements

– Based on MVAPICH2-2.3.2

– Enhanced tuning for point-to-point and collective operations

– Targeted for Azure HB & HC virtual machine instances

– Flexibility for 'one-click' deployment
– Tested with Azure HB & HC VM instances

MVAPICH2-Azure 2.3.2

Page 67: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

67Network Based Computing Laboratory Bench ‘19

• Released on 08/12/2019

• Major Features and Enhancements

– Based on MVAPICH2-X 2.3

– New design based on Amazon EFA adapter's Scalable Reliable Datagram (SRD) transport protocol

– Support for XPMEM based intra-node communication for point-to-point and collectives

– Enhanced tuning for point-to-point and collective operations

– Targeted for AWS instances with Amazon Linux 2 AMI and EFA support

– Tested with c5n.18xlarge instance

MVAPICH2-X-AWS 2.3

Page 68: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

68Network Based Computing Laboratory Bench ‘19

• Upcoming Exascale systems need to be designed with a holistic view of HPC, Big Data, Deep Learning, and Cloud

• Presented an overview of designing convergent software stacks

• Presented benchmarks and middleware to enable HPC, Big Data, and Deep Learning communities to take advantage of current and next-generation systems

Concluding Remarks

Page 69: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

69Network Based Computing Laboratory Bench ‘19

• Supported through X-ScaleSolutions (http://x-scalesolutions.com)
• Benefits:

– Help and guidance with installation of the library

– Platform-specific optimizations and tuning

– Timely support for operational issues encountered with the library

– Web portal interface to submit issues and tracking their progress

– Advanced debugging techniques

– Application-specific optimizations and tuning

– Obtaining guidelines on best practices

– Periodic information on major fixes and updates

– Information on major releases

– Help with upgrading to the latest release

– Flexible Service Level Agreements
• Support provided to Lawrence Livermore National Laboratory (LLNL) for the last two years

Commercial Support for MVAPICH2, HiBD, and HiDL Libraries

Page 70: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

70Network Based Computing Laboratory Bench ‘19

• Has joined the OpenPOWER Consortium as a silver ISV member
• Provides flexibility:

– To have MVAPICH2, HiDL and HiBD libraries getting integrated into the OpenPOWER software stack

– A part of the OpenPOWER ecosystem

– Can participate with different vendors for bidding, installation and deployment process

• Introduced two new integrated products with support for OpenPOWER systems (Presented at the OpenPOWER North America Summit)

– X-ScaleHPC

– X-ScaleAI

– Send an e-mail to [email protected] for free trial!!

Silver ISV Member for the OpenPOWER Consortium + Products

Page 71: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

71Network Based Computing Laboratory Bench ‘19

• Presentations at OSU and X-Scale Booth (#2094)
– Members of the MVAPICH, HiBD and HiDL teams

– External speakers

• Presentations at SC main program (Tutorials and Workshops)

• Presentation at many other booths and satellite events

• Complete details available at http://mvapich.cse.ohio-state.edu/conference/752/talks/

Multiple Events at SC ‘19

Page 72: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

72Network Based Computing Laboratory Bench ‘19

Funding AcknowledgmentsFunding Support by

Equipment Support by

Page 73: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

73Network Based Computing Laboratory Bench ‘19

Personnel AcknowledgmentsCurrent Students (Graduate)

– A. Awan (Ph.D.)

– M. Bayatpour (Ph.D.)

– C.-H. Chu (Ph.D.)

– J. Hashmi (Ph.D.)– A. Jain (Ph.D.)

– K. S. Kandadi (M.S.)

Past Students – A. Augustine (M.S.)

– P. Balaji (Ph.D.)

– R. Biswas (M.S.)

– S. Bhagvat (M.S.)

– A. Bhat (M.S.)

– D. Buntinas (Ph.D.)

– L. Chai (Ph.D.)

– B. Chandrasekharan (M.S.)

– S. Chakraborthy (Ph.D.)

– N. Dandapanthula (M.S.)

– V. Dhanraj (M.S.)

– R. Rajachandrasekar (Ph.D.)

– D. Shankar (Ph.D.)

– G. Santhanaraman (Ph.D.)

– A. Singh (Ph.D.)

– J. Sridhar (M.S.)

– S. Sur (Ph.D.)

– H. Subramoni (Ph.D.)

– K. Vaidyanathan (Ph.D.)

– A. Vishnu (Ph.D.)

– J. Wu (Ph.D.)

– W. Yu (Ph.D.)

– J. Zhang (Ph.D.)

Past Research Scientist– K. Hamidouche

– S. Sur

– X. Lu

Past Post-Docs– D. Banerjee

– X. Besseron

– H.-W. Jin

– T. Gangadharappa (M.S.)

– K. Gopalakrishnan (M.S.)

– W. Huang (Ph.D.)

– W. Jiang (M.S.)

– J. Jose (Ph.D.)

– S. Kini (M.S.)

– M. Koop (Ph.D.)

– K. Kulkarni (M.S.)

– R. Kumar (M.S.)

– S. Krishnamoorthy (M.S.)

– K. Kandalla (Ph.D.)

– M. Li (Ph.D.)

– P. Lai (M.S.)

– J. Liu (Ph.D.)

– M. Luo (Ph.D.)

– A. Mamidala (Ph.D.)

– G. Marsh (M.S.)

– V. Meshram (M.S.)

– A. Moody (M.S.)

– S. Naravula (Ph.D.)

– R. Noronha (Ph.D.)

– X. Ouyang (Ph.D.)

– S. Pai (M.S.)

– S. Potluri (Ph.D.)

– Kamal Raj (M.S.)

– K. S. Khorassani (Ph.D.)– P. Kousha (Ph.D.)

– A. Quentin (Ph.D.)

– B. Ramesh (M. S.)

– S. Xu (M.S.)

– J. Lin

– M. Luo

– E. Mancini

Past Programmers– D. Bureddy

– J. Perkins

Current Research Specialist– J. Smith

– S. Marcarelli

– J. Vienne

– H. Wang

Current Post-doc– M. S. Ghazimeersaeed

– A. Ruhela

– K. ManianCurrent Students (Undergraduate)

– V. Gangal (B.S.)

– N. Sarkauskas (B.S.)

Past Research Specialist– M. Arnold

Current Research Scientist– H. Subramoni– Q. Zhou (Ph.D.)

Page 74: Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

74Network Based Computing Laboratory Bench ‘19

Thank You!

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

[email protected]

The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/

The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/

Follow us on

https://twitter.com/mvapich