Collective Communication on Architectures that Support Simultaneous Communication over Multiple Links

Ernie Chan

Authors

Ernie Chan Robert van de Geijn

Department of Computer Sciences

The University of Texas at Austin

William Gropp Rajeev Thakur

Mathematics and Computer Science Division

Argonne National Laboratory

Testbed Architecture

IBM Blue Gene/L
3D torus point-to-point interconnect network
One rack: 1024 dual-processor nodes, two 8 x 8 x 8 midplanes
Special feature to send simultaneously: use multiple calls to MPI_Isend

Outline

Testbed Architecture
Model of Parallel Computation
Sending Simultaneously
Collective Communication
Generalized Algorithms
Performance Results
Conclusion

Model of Parallel Computation

Target Architectures: distributed-memory parallel architectures
Indexing: p computational nodes, indexed 0 … p - 1
Logically Fully Connected: a node can send directly to any other node

Model of Parallel Computation

Topology: N-dimensional torus
[Figure: torus of 16 nodes, indexed 0 through 15]

Model of Parallel Computation

Old Model of Communicating Between Nodes: unidirectional sending or receiving

Model of Parallel Computation

Old Model of Communicating Between Nodes: simultaneous sending and receiving

Model of Parallel Computation

Old Model of Communicating Between Nodes: bidirectional exchange

Model of Parallel Computation

Communicating Between Nodes: a node can send or receive with 2N other nodes simultaneously along its 2N different links
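The 2N links of a node in an N-dimensional torus can be enumerated with wraparound arithmetic. The following is a minimal sketch; the function name `torus_neighbors` and the coordinate-tuple representation are illustrative assumptions, not part of the original slides.

```python
def torus_neighbors(coord, dims):
    """Return the distinct neighbors of `coord` in a torus with side lengths `dims`."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            candidate = list(coord)
            # Wrap around at the boundary; this is what makes the mesh a torus.
            candidate[axis] = (coord[axis] + step) % size
            candidate = tuple(candidate)
            # A dimension of size 1 or 2 yields duplicate neighbors; keep each once.
            if candidate != coord and candidate not in neighbors:
                neighbors.append(candidate)
    return neighbors


# One 8 x 8 x 8 midplane: N = 3, so each node has 2N = 6 links.
print(len(torus_neighbors((0, 0, 0), (8, 8, 8))))
```

When every dimension has at least three nodes, the function returns exactly 2N neighbors, matching the 2N links in the model.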

Model of Parallel Computation

Communicating Between Nodes: cannot perform bidirectional exchange on any link while sending or receiving simultaneously with multiple nodes

Model of Parallel Computation

Cost of Communication

α + nβ

α: startup time (latency)
n: number of bytes to communicate
β: transmission time per byte (bandwidth term)
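As a small worked illustration of the α + nβ model, the sketch below evaluates the cost for a few message sizes. The values chosen for `alpha` and `beta` are made-up placeholders for illustration only, not measurements from Blue Gene/L.

```python
def send_cost(n_bytes, alpha, beta):
    """Cost of one point-to-point message under the alpha + n * beta model."""
    return alpha + n_bytes * beta


# Placeholder parameters (assumed, not measured): seconds and seconds per byte.
alpha = 3.0e-6
beta = 6.0e-9

for n_bytes in (8, 1024, 4 * 1024 * 1024):
    print(n_bytes, send_cost(n_bytes, alpha, beta))
```

For short messages the α term dominates, and for long messages the nβ term dominates, which is why short-vector and long-vector algorithms are treated separately.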

Sending Simultaneously

Old Cost of Communication with Sends to Multiple Nodes
Cost to send to m separate nodes:

(α + nβ) m

Sending Simultaneously

New Cost of Communication with Simultaneous Sends

(α + nβ) m

can be replaced with

(α + nβ) + (α + nβ) (m - 1) τ

The first term is the cost of one send; the second term is the cost of the extra sends, with 0 ≤ τ ≤ 1.

Sending Simultaneously

Benchmarking Sending Simultaneously
Log-log timing graphs
Midplane: 512 nodes
Sending simultaneously with 1 – 6 neighbors
Message sizes: 8 bytes – 4 MB

Sending Simultaneously

Cost of Communication with Simultaneous Sends

(α + nβ) (1 + (m - 1) τ)
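The factored cost above can be evaluated directly, and it can also be inverted to estimate τ from two timings. The sketch below is an illustration under that model only; the function names and the idea of fitting τ from a single pair of timings are assumptions, not the benchmarking procedure used by the authors.

```python
def simultaneous_send_cost(n_bytes, m, alpha, beta, tau):
    """Cost of sending n_bytes to m nodes at once: (alpha + n*beta) * (1 + (m-1)*tau)."""
    if m < 1:
        raise ValueError("m must be at least 1")
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return (alpha + n_bytes * beta) * (1 + (m - 1) * tau)


def estimate_tau(time_one_send, time_m_sends, m):
    """Solve time_m_sends = time_one_send * (1 + (m - 1) * tau) for tau."""
    if m < 2:
        raise ValueError("need at least two simultaneous sends to estimate tau")
    return (time_m_sends / time_one_send - 1) / (m - 1)


# tau = 0: extra sends are free; tau = 1: same as m separate sends.
print(simultaneous_send_cost(4, 3, 2.0, 0.5, 0.0))
print(simultaneous_send_cost(4, 3, 2.0, 0.5, 1.0))
print(estimate_tau(4.0, 8.0, 3))
```

With τ = 1 the expression reduces to the old cost (α + nβ) m, and with τ = 0 it reduces to the cost of a single send.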

Collective Communication

Broadcast (Bcast): motivating example
[Figure: data held by each node before and after the broadcast]

Generalized Algorithms

Short-Vector Algorithms: Minimum-Spanning Tree
Long-Vector Algorithms: Bucket Algorithm

Generalized Algorithms

Minimum-Spanning Tree: divide p nodes into N+1 partitions
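The partitioning idea can be sketched as a small simulation. This is a simplified illustration that treats the nodes as a linear array of indices and ignores the torus geometry; the choice of the first node in each partition as the receiver and the names `split` and `mst_bcast` are assumptions made for this sketch, not the authors' implementation.

```python
def split(lo, hi, k):
    """Split the index range [lo, hi) into at most k non-empty contiguous parts."""
    size = hi - lo
    parts = []
    start = lo
    for i in range(k):
        length = size // k + (1 if i < size % k else 0)
        if length > 0:
            parts.append((start, start + length))
            start += length
    return parts


def mst_bcast(lo, hi, root, k, received):
    """Broadcast from `root` within [lo, hi); return the number of steps used."""
    received.add(root)
    if hi - lo <= 1:
        return 0
    deepest = 0
    for part_lo, part_hi in split(lo, hi, k):
        if part_lo <= root < part_hi:
            part_root = root  # the root keeps serving its own partition
        else:
            part_root = part_lo  # root sends to this node in the current step
        # All partitions proceed in parallel, so only the deepest one counts.
        deepest = max(deepest, mst_bcast(part_lo, part_hi, part_root, k, received))
    return 1 + deepest


# One midplane (p = 512) on a 3D torus: N = 3, so N + 1 = 4 partitions.
received = set()
steps = mst_bcast(0, 512, 0, 4, received)
print(steps, len(received))
```

In each step the current root sends to one node in every other partition simultaneously, so using N+1 partitions instead of 2 reduces the number of steps from 9 to 5 for 512 nodes in this simulation.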

Generalized Algorithms

Minimum-Spanning Tree: disjoint partitions on an N-dimensional mesh
[Figure: 16 nodes, indexed 0 through 15, divided into partitions]

Generalized Algorithms

Minimum-Spanning Tree: divide dimensions by a decrementing counter from N+1
[Figure: 16 nodes, indexed 0 through 15, divided into partitions]

Generalized Algorithms

Minimum-Spanning Tree: now divide into 2N+1 partitions
[Figure: 16 nodes, indexed 0 through 15, divided into partitions]
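To see how the number of partitions trades off against the cost of each step, the sketch below combines the simultaneous-send cost model with a step count. It rests on two assumptions that are not stated in the slides: a broadcast with k partitions takes the smallest number of steps s with k^s ≥ p, and each step costs (α + nβ)(1 + (k - 2) τ) because the root sends to k - 1 nodes at once. It is an estimate for illustration, not a formula taken from the presentation.

```python
def mst_steps(p, k):
    """Smallest s such that k**s >= p, computed with integers to avoid rounding."""
    steps, reach = 0, 1
    while reach < p:
        reach *= k
        steps += 1
    return steps


def mst_bcast_cost(p, k, n_bytes, alpha, beta, tau):
    """Estimated broadcast cost with k partitions (assumed model, see lead-in)."""
    per_step = (alpha + n_bytes * beta) * (1 + (k - 2) * tau)
    return mst_steps(p, k) * per_step


# p = 512 nodes on a 3D torus (N = 3): compare 2, N + 1 = 4 and 2N + 1 = 7 partitions.
for k in (2, 4, 7):
    print(k, mst_steps(512, k))
```

Under these assumptions, more partitions mean fewer steps but a higher cost per step whenever τ > 0, so the benefit depends on how close τ is to 0 on the target machine.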

Performance Results

Single point-to-point communication

Performance Results

my-bcast-MST

Conclusion

IBM Blue Gene/L supports the functionality of sending simultaneously
Benchmarking along with model checking verifies this claim
New generalized algorithms show clear performance gains

Conclusion

Future Directions
Room for optimization to reduce implementation overhead
What if not using MPI_COMM_WORLD?
Possible new algorithm for Bucket Algorithm

Questions?