
Page 1: Scaling Alltoall Collective on Multi-core Systems

Scaling Alltoall Collective on Multi-core Systems

Rahul Kumar, Amith R Mamidala, Dhabaleswar K Panda
Department of Computer Science & Engineering, The Ohio State University
{kumarra, mamidala, panda}@cse.ohio-state.edu

Page 2: Scaling Alltoall Collective on Multi-core Systems

Presentation Outline

• Introduction
• Motivation & Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion & Future Work

Page 3: Scaling Alltoall Collective on Multi-core Systems

Introduction

• Multi-core architectures are being widely used for high-performance computing
  – The Ranger cluster at TACC has 16 cores per node and more than 60,000 cores in total

• Message Passing is the default programming model for distributed-memory systems
• MPI provides many communication primitives
• MPI collective operations are widely used in applications

Page 4: Scaling Alltoall Collective on Multi-core Systems

Introduction

• MPI_Alltoall is the most communication-intensive collective and is widely used in many applications such as CPMD, NAMD, FFT, and matrix transpose
• In MPI_Alltoall, every process has different data to send to every other process (a minimal usage example follows)
• An efficient alltoall is highly desirable for multi-core systems, as the number of processes has increased dramatically due to the low cost per core of multi-core architectures
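
For reference, a minimal MPI_Alltoall call in C is sketched below; the block size (COUNT) and the integer payload are illustrative choices, not values taken from this work.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Every process prepares a distinct block of COUNT ints for every other process. */
    const int COUNT = 512;                               /* illustrative block size */
    int *sendbuf = malloc((size_t)nprocs * COUNT * sizeof(int));
    int *recvbuf = malloc((size_t)nprocs * COUNT * sizeof(int));
    for (int i = 0; i < nprocs * COUNT; i++)
        sendbuf[i] = rank;                               /* block i/COUNT is destined for process i/COUNT */

    /* Each process sends block j of sendbuf to process j and receives block j of recvbuf from it. */
    MPI_Alltoall(sendbuf, COUNT, MPI_INT, recvbuf, COUNT, MPI_INT, MPI_COMM_WORLD);

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}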

Page 5: Scaling Alltoall Collective on Multi-core Systems

Introduction

• 24% of the Top 500 supercomputers use InfiniBand as their interconnect (based on the Nov '07 rankings)
• There are several different implementations of InfiniBand network interfaces:
  – Offload implementation, e.g. InfiniHost III (3rd-generation cards from Mellanox)
  – Onload implementation, e.g. QLogic InfiniPath
  – Combination of both onload and offload, e.g. ConnectX from Mellanox

Page 6: Scaling Alltoall Collective on Multi-core Systems

Offload & Onload Architecture

[Figure: offload vs. onload architecture — the cores of each node connect through the NIC to the InfiniBand fabric]

• In an offload architecture, network processing is offloaded to the network interface; the NIC can send messages on its own, relieving the CPU of communication work
• In an onload architecture, the CPU is involved in communication in addition to performing the computation
• In an onload architecture, the faster CPU can speed up the communication; however, communication cannot be overlapped with computation (overlap is illustrated below)
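
As an illustration of the overlap that an offload NIC makes possible, nonblocking MPI point-to-point calls let the application compute while the interface moves data. The sketch below is generic, and do_local_compute() is a hypothetical placeholder for application work.

#include <mpi.h>

extern void do_local_compute(void);   /* hypothetical computation routine */

/* Post nonblocking transfers, compute while the NIC moves the data, then wait. */
void exchange_with_overlap(double *sendbuf, double *recvbuf, int count, int peer)
{
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    do_local_compute();               /* overlapped with the in-flight transfers */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}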

Page 7: Scaling Alltoall Collective on Multi-core Systems

Characteristics of Various Network Interfaces

• Some basic experiments were performed on the various network architectures and the following observations were made
• The bi-directional bandwidth of onload network interfaces increases as more cores are used to push data onto the network
• This is shown in the following slides

Page 8: Scaling Alltoall Collective on Multi-core Systems

Bi-directional Bandwidth: InfiniPath (onload)

• Bi-directional bandwidth increases as more cores are used to push data
• With an onload interface, more cores help achieve better network utilization (a benchmark sketch follows)
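
The kind of multi-pair test behind these measurements can be sketched as follows; it assumes exactly two nodes with contiguous rank placement, and the message size and iteration count are illustrative.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Multi-pair bi-directional bandwidth sketch: rank r on the first node
   exchanges data with rank r + size/2 on the second node; all pairs run
   concurrently, so more active cores push more data onto the wire. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int MSG   = 1 << 20;          /* 1 MB per message (illustrative) */
    const int ITERS = 100;
    int half = size / 2;
    int peer = (rank < half) ? rank + half : rank - half;

    char *sbuf = calloc(MSG, 1), *rbuf = calloc(MSG, 1);
    MPI_Request reqs[2];

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        MPI_Irecv(rbuf, MSG, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sbuf, MSG, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)   /* every rank both sends and receives, so this counts both directions */
        printf("Aggregate bi-directional bandwidth: %.1f MB/s\n",
               (double)MSG * ITERS * size / ((t1 - t0) * 1e6));

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}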

Page 9: Scaling Alltoall Collective on Multi-core Systems

Bi-directional Bandwidth: ConnectX

• A similar trend is also observed for the ConnectX network interface

Page 10: Scaling Alltoall Collective on Multi-core Systems

Bi-directional Bandwidth: InfiniHost III (offload)

• However, with offload network interfaces the bandwidth drops when more cores are used
• We believe this is due to congestion at the network interface when many cores are used simultaneously

Page 11: Scaling Alltoall Collective on Multi-core Systems

Results from the Experiments

• Depending on the interface implementation, the characteristics differ
  – QLogic onload implementation: using more cores simultaneously for inter-node communication is beneficial
  – Mellanox offload implementation: using fewer cores at the same time for inter-node communication is beneficial
  – Mellanox ConnectX architecture: using more cores simultaneously is beneficial

Page 12: Scaling Alltoall Collective on Multi-core Systems

Presentation Outline

• Introduction
• Motivation & Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion & Future Work

Page 13: Scaling Alltoall Collective on Multi-core Systems

Motivation

• To evaluate the performance of the existing alltoall algorithm, we conduct the following experiment
• In the experiment, the alltoall time is measured on a set of nodes
• The number of cores per node participating in the alltoall is increased gradually

Page 14: Scaling Alltoall Collective on Multi-core Systems

Motivation

• The alltoall time doubles when the number of cores per node is doubled

Page 15: Scaling Alltoall Collective on Multi-core Systems

What is the problem with the Algorithm?

• Alltoall between two nodes, with one core per node, involves one inter-node communication step per core
• With two cores per node, the number of inter-node communications performed by each core increases to two
• So, on doubling the cores, the alltoall time almost doubles
• This is exactly what we observed in the previous experiment (a rough message count follows below)

[Figure: two nodes whose cores exchange messages pairwise]
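
A rough count makes this concrete. With $N$ nodes, $c$ cores per node, and a per-pair message of size $m$ (our own illustrative model, not a formula from the slides): in the standard pairwise alltoall, each core exchanges one message with every other core, so each core performs $(N-1)\,c$ inter-node transfers and the traffic a node injects into the network is

$V_{\mathrm{node}} = c \cdot (N-1)\, c\, m = (N-1)\, c^{2} m .$

Doubling $c$ doubles the number of inter-node sends each core must issue (and quadruples the per-node traffic), which is consistent with the near-doubling of alltoall time observed in the previous experiment.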

Page 16: Scaling Alltoall Collective on Multi-core Systems

Problem Statement

• Can low-cost shared memory help to avoid network transactions?
• Can the performance of alltoall be improved, especially for multi-core systems?
• Which algorithms should be chosen for the different InfiniBand implementations?

Page 17: Scaling Alltoall Collective on Multi-core Systems

Related Work

• There have been studies that propose a leader-based hierarchical scheme for other collectives
• A leader is chosen on each node
• Only the leader is involved in inter-node communication
• The communication takes place in three stages:
  – The cores aggregate data at the leader of the node
  – The leader performs the inter-node communication
  – The leader distributes the data to the cores
• We implemented the above scheme for Alltoall, as illustrated in the diagram on the next slide

Page 18: Scaling Alltoall Collective on Multi-core Systems

Leader-based Scheme for Alltoall

[Figure: leader-based alltoall across two nodes — step 1 (gather at the leader), step 2 (alltoall among leaders), step 3 (scatter to the cores)]

• Step 1: all cores send their data to the leader
• Step 2: the leaders perform an alltoall among themselves
• Step 3: each leader distributes the received data to the cores on its node (a code sketch of the three steps follows)
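
A minimal sketch of the three stages with standard MPI calls is given below. It assumes a fixed PPN (cores per node) with contiguous rank placement and glosses over the local block reordering; it is our simplification, not the implementation evaluated here.

#include <mpi.h>
#include <stdlib.h>

#define PPN 8   /* assumed cores per node, contiguous rank placement */

/* Leader-based alltoall sketch: gather at the node leader, alltoall among
   the leaders, then scatter back.  'count' ints go to each destination rank. */
void leader_alltoall(const int *sendbuf, int *recvbuf, int count, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int node  = rank / PPN;     /* node index                    */
    int local = rank % PPN;     /* local core index on the node  */

    MPI_Comm node_comm, leader_comm;
    MPI_Comm_split(comm, node, local, &node_comm);                 /* cores of one node */
    MPI_Comm_split(comm, (local == 0) ? 0 : MPI_UNDEFINED, node, &leader_comm);

    int per_core = size * count;   /* ints each core contributes */
    int *gathered = NULL, *exchanged = NULL;
    if (local == 0) {
        gathered  = malloc((size_t)PPN * per_core * sizeof(int));
        exchanged = malloc((size_t)PPN * per_core * sizeof(int));
    }

    /* Step 1: aggregate every core's send buffer at the node leader. */
    MPI_Gather((void *)sendbuf, per_core, MPI_INT, gathered, per_core, MPI_INT, 0, node_comm);

    /* Step 2: leaders exchange the aggregated data, one block per node.
       A real implementation first reorders 'gathered' so that the data for
       each remote node is contiguous; that packing is omitted here.        */
    if (local == 0)
        MPI_Alltoall(gathered, PPN * PPN * count, MPI_INT,
                     exchanged, PPN * PPN * count, MPI_INT, leader_comm);

    /* Step 3: the leader hands each core its portion of the received data
       (again assuming it has been reordered per destination core).         */
    MPI_Scatter(exchanged, per_core, MPI_INT, recvbuf, per_core, MPI_INT, 0, node_comm);

    if (local == 0) { free(gathered); free(exchanged); }
    MPI_Comm_free(&node_comm);
    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
}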

Page 19: Scaling Alltoall Collective on Multi-core Systems

Issues with Leader-based Scheme

• It uses only one core to send the data out on the network
• It does not take advantage of the increase in bandwidth available when more cores are used to send data out of the node

Page 20: Scaling Alltoall Collective on Multi-core Systems

Presentation Outline

• Introduction
• Motivation & Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion & Future Work

Page 21: Scaling Alltoall Collective on Multi-core Systems

Proposed Design

[Figure: two nodes, two steps — step 1 is an intra-node exchange among the cores, step 2 is an inter-node alltoall within each group of same-indexed cores (GROUP 1, GROUP 2)]

• All the cores take part in the communication
• Each core communicates with one and only one core of each other node
• Step 1: intra-node communication
  – The data destined for other nodes is exchanged among the cores
  – The core that communicates with the corresponding core of the other node receives the data
• Step 2: inter-node communication
  – An alltoall is called within each group (see the sketch below)
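
The two steps can be sketched with plain MPI communicator splits as follows; PPN, contiguous rank placement, and the omitted block reordering are our assumptions for illustration, not the paper's implementation.

#include <mpi.h>
#include <stdlib.h>

#define PPN 8   /* assumed cores per node, contiguous rank placement */

/* Proposed two-step alltoall sketch: an intra-node exchange stages each core's
   outgoing data on the core that owns that destination index, then an alltoall
   runs inside each group of cores sharing the same local index across nodes.  */
void two_step_alltoall(const int *sendbuf, int *recvbuf, int count, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int node   = rank / PPN;    /* node index        */
    int local  = rank % PPN;    /* local core index  */
    int nnodes = size / PPN;

    MPI_Comm node_comm, group_comm;
    MPI_Comm_split(comm, node,  local, &node_comm);    /* cores of one node            */
    MPI_Comm_split(comm, local, node,  &group_comm);   /* same local index, every node */

    /* Step 1: intra-node exchange over shared memory.  Each core receives the
       blocks that the other local cores want delivered to its partner cores.  */
    int chunk = nnodes * count;                         /* ints per (local core, node) */
    int *staged = malloc((size_t)PPN * chunk * sizeof(int));
    MPI_Alltoall((void *)sendbuf, chunk, MPI_INT, staged, chunk, MPI_INT, node_comm);
    /* A real implementation reorders 'staged' so the data for each remote node
       is contiguous; that packing is omitted in this sketch.                   */

    /* Step 2: inter-node alltoall within the group of same-indexed cores.      */
    MPI_Alltoall(staged, PPN * count, MPI_INT, recvbuf, PPN * count, MPI_INT, group_comm);

    free(staged);
    MPI_Comm_free(&node_comm);
    MPI_Comm_free(&group_comm);
}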

Page 22: Scaling Alltoall Collective on Multi-core Systems

Advantages of the Proposed Scheme

• The scheme takes advantage of low-cost shared memory
• It uses multiple cores to send the data out on the network, thus achieving better network utilization (see the cost sketch below)
• Each core issues the same number of sends as in the leader-based scheme, hence start-up costs are lower
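
To make the trade-off concrete, here is a rough latency-bandwidth sketch under the usual $\alpha$-$\beta$ model (per-message start-up $\alpha$, per-byte cost $\beta$, $N$ nodes, $c$ cores per node, per-pair message size $m$); these expressions are our illustration, not formulas from the slides.

$T_{\mathrm{original}} \approx (Nc-1)(\alpha + \beta m)$ per core, since every core sends directly to every other core.

$T_{\mathrm{leader}} \approx (N-1)(\alpha + \beta\, c^{2} m)$ on the leader, which funnels all of the node's inter-node traffic through one core, plus the intra-node gather and scatter stages.

$T_{\mathrm{proposed}} \approx (N-1)(\alpha + \beta\, c\, m)$ per core, plus one intra-node exchange; each core issues only $N-1$ inter-node sends while the node's traffic is pushed by all $c$ cores in parallel.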

Page 23: Scaling Alltoall Collective on Multi-core Systems

Presentation Outline

• Introduction
• Motivation & Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion & Future Work

Page 24: Scaling Alltoall Collective on Multi-core Systems

Evaluation Framework

• Testbed
  – Cluster A: 64 nodes (512 cores)
    • dual 2.33 GHz Intel Xeon "Clovertown" quad-core processors
    • InfiniPath SDR network interface (QLE7140)
    • InfiniHost III DDR network interface (MT25208)
  – Cluster B: 4 nodes (32 cores)
    • dual 2.33 GHz Intel Xeon "Clovertown" quad-core processors
    • Mellanox DDR ConnectX network interface
• Experiments
  – Alltoall collective time
    • Onload InfiniPath network interface
    • Offload InfiniHost III network interface
    • ConnectX network interface
  – CPMD application performance

Page 25: Scaling Alltoall Collective on Multi-core Systems

Alltoall: InfiniPath

[Figure: alltoall time (us) vs. message size (1 B to 2 KB) on InfiniPath, 512 cores — original, leader-based, and proposed schemes]

• The figure shows the alltoall time for different message sizes on a 512-core system
• The leader-based scheme reduces the alltoall time
• The proposed design gives the best performance on the onload network interface

Page 26: Scaling Alltoall Collective on Multi-core Systems

Alltoall-InfiniPath: 512 Bytes Message

[Figure: alltoall time (us) for 512-byte messages vs. number of nodes (2 to 64) — original, leader-based, and proposed schemes]

• The figure shows the alltoall time for 512-byte messages as the system size is varied
• The proposed scheme scales much better than the other schemes as the system size increases

Page 27: Scaling Alltoall Collective on Multi-core Systems

Alltoall: InfiniHost III

[Figure: alltoall time (us) vs. message size (1 B to 2 KB) on InfiniHost III — original, leader-based, and proposed schemes]

• The figure shows the performance of the schemes on the offload network interface
• The leader-based scheme performs best on the offload NIC, as it avoids congestion at the network interface
• This matches our expectations

Page 28: Scaling Alltoall Collective on Multi-core Systems

Alltoall: ConnectX

[Figure: alltoall time (us) vs. message size (1 B to 8 KB) on ConnectX — original, leader-based, and proposed schemes]

• As seen earlier, the bi-directional bandwidth increases with the use of more cores on the ConnectX architecture
• Therefore, the proposed scheme attains the best performance

Page 29: Scaling Alltoall Collective on Multi-core Systems

CPMD Application

[Figure: CPMD execution time (sec) for the 32-wat, si63-10ryd, si63-70ryd, and si63-120ryd benchmarks — original, leader-based, and proposed schemes]

• CPMD is designed for ab-initio molecular dynamics and makes extensive use of alltoall communication
• The figure shows the performance improvement of the CPMD application on a 128-core system
• The proposed design delivers the best execution time

Page 30: Scaling Alltoall Collective on Multi-core Systems

CPMD Application Performance on Varying System Size

[Figure: CPMD execution time (sec) vs. system size (8X8 to 64X8 nodes x cores) — original, leader-based, and proposed schemes]

• The figure shows the application execution time for different system sizes
• The reduction in application execution time grows with increasing system size; the proposed design scales very well

Page 31: Scaling Alltoall Collective on Multi-core Systems

Presentation Outline

• Introduction
• Motivation & Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion & Future Work

Page 32: Scaling Alltoall Collective on Multi-core Systems

Conclusion & Future Work

• Interfaces implemented for the same interconnect exhibit different network characteristics
• A single collective algorithm does not perform optimally for all network interfaces
• We have proposed an optimized alltoall collective algorithm for multi-core systems connected with modern InfiniBand network interfaces
• The proposed design reduces MPI_Alltoall time by 55% and speeds up the CPMD application by 33%
• We plan to evaluate our designs on 10GigE-based systems and to extend the study to other collectives such as broadcast and allgather

Page 33: Scaling Alltoall Collective on Multi-core Systems


Web Pointers

http://nowlab.cse.ohio-state.edu/

MVAPICH Web Page

http://mvapich.cse.ohio-state.edu/

Page 34: Scaling Alltoall Collective on Multi-core Systems


Acknowledgements

Our research is supported by the following organizations:
• Current funding support by [sponsor logos]
• Current equipment support by [vendor logos]