Locality-aware Connection Management and Rank Assignment for Wide-area MPI
Hideo Saito, Kenjiro Taura
The University of Tokyo
May 16, 2007

TRANSCRIPT

Page 1: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Hideo Saito, Kenjiro Taura
The University of Tokyo
May 16, 2007

Page 2: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Background

Increase in the bandwidth of WANs
➭ More opportunities to perform parallel computation using multiple clusters

[Figure: multiple clusters connected over a WAN]

Page 3: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Requirements for Wide-area MPI

Wide-area connectivity
 Firewalls and private addresses: only some nodes can connect to each other
 Perform routing using the connections that happen to be possible

[Figure: clusters behind a firewall and a NAT]

Page 4: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Reqs. for Wide-area MPI (2)

Scalability
 The number of connections must be limited in order to scale to thousands of nodes
  Various allocation limits of the system (e.g., memory, file descriptors, router sessions)
 Simplistic schemes that may potentially result in O(n^2) connections won't scale
 Lazy connect strategies work for many apps, but not for those that involve all-to-all communication

Page 5: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Reqs. for Wide-area MPI (3)

Locality awareness
 To achieve high performance with few connections, select connections in a locality-aware manner
 Many connections with nearby nodes, few connections with faraway nodes

[Figure: many connections within a cluster, few connections between clusters]

Page 6: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Reqs. for Wide-area MPI (4)

Application awareness
 Select connections according to the application's communication pattern
 Assign ranks* according to the application's communication pattern

Adaptivity
 Automatically, without tedious manual configuration

* rank = process ID in MPI

Page 7: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Contributions of Our Work

Locality-aware connection management
 Uses latency and traffic information obtained from a short profiling run

Locality-aware rank assignment
 Uses the same information to discover rank-process mappings with low communication overhead

➭ Multi-Cluster MPI (MC-MPI): a wide-area-enabled MPI library

Page 8: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Outline

1. Introduction
2. Related Work
3. Proposed Method
   Profiling Run
   Connection Management
   Rank Assignment
4. Experimental Results
5. Conclusion

Page 9: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Grid-enabled MPI Libraries

MPICH-G2 [Karonis et al. '03], MagPIe [Kielmann et al. '99]
 Locality-aware communication optimizations
  E.g., wide-area-aware collective operations (broadcast, reduction, ...)
 Don't work with firewalls

Page 10: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Grid-enabled MPI Libraries (cont'd)

MPICH/MADIII [Aumage et al. '03], StaMPI [Imamura et al. '00]
 Forwarding mechanisms that allow nodes to communicate even in the presence of firewalls
 Manual configuration: the amount of necessary configuration becomes overwhelming as more resources are used

[Figure: a message forwarded around a firewall]

Page 11: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

P2P Overlays

Pastry [Rowstron et al. '00]
 Each node maintains just O(log n) connections
 Messages are routed using those connections
 Highly scalable, but routing properties are unfavorable for high performance computing
  Few connections between nearby nodes
  Messages between nearby nodes need to be forwarded, causing large latency penalties

Page 12: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Adaptive MPI [Huang et al. '06]

Performs load balancing by migrating virtual processors
 Balance the execution times of the physical processors
 Minimize inter-processor communication
Adapts to applications by tracking the amount of communication performed between processors
Assumes that the communication cost of every processor pair is the same
 MC-MPI takes differences in communication costs into account

[Figure: virtual processors mapped onto physical processors]

Page 13: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Lazy Connect Strategies

MPICH [Gropp et al. '96], Scalable MPI over InfiniBand [Yu et al. '06]
 Establish connections only on demand
 Reduces the number of connections if each process only communicates with a few other processes
 Some apps generate all-to-all communication patterns, resulting in many connections
  E.g., IS in the NAS Parallel Benchmarks
 Doesn't extend to wide-area environments where some communication may be blocked

Page 14: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Outline

1. Introduction
2. Related Work
3. Proposed Method
   Profiling Run
   Connection Management
   Rank Assignment
4. Experimental Results
5. Conclusion

Page 15: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Overview of Our Method

Short profiling run
 ➭ Latency matrix (L) and traffic matrix (T)
 ➭ Locality-aware connection management and locality-aware rank assignment
 ➭ Optimized real run

Page 16: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Outline

1. Introduction
2. Related Work
3. Proposed Method
   Profiling Run
   Connection Management
   Rank Assignment
4. Experimental Results
5. Conclusion

Page 17: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Latency Matrix

Latency matrix L = {l_ij}
 l_ij: latency between processes i and j in the target environment
Each process autonomously measures the RTT between itself and other processes
Reduce the number of measurements by using the triangle inequality to estimate RTTs:

 if rtt_pr > α·rtt_rq, then estimate rtt_pq = rtt_pr   (α: constant)

[Figure: processes p, q, and r with RTTs rtt_pq, rtt_pr, and rtt_rq]
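The estimation rule above can be sketched as follows. The function name, data layout, and the value of ALPHA are illustrative assumptions, not MC-MPI's API; the deck only states that α is a constant.

```python
ALPHA = 4.0  # example threshold; the deck only says "α: constant"

def estimate_rtts(measured, peer_rtts):
    """Estimate unmeasured RTTs via the triangle inequality (sketch).

    measured:  {node r: rtt_pr measured from this process p}
    peer_rtts: {node r: {node q: rtt_rq reported by r}}
    If rtt_pr > ALPHA * rtt_rq, then q is effectively at r's distance
    from p, so reuse rtt_pr as the estimate for rtt_pq instead of
    measuring rtt_pq directly.
    """
    estimates = dict(measured)
    for r, rtt_pr in measured.items():
        for q, rtt_rq in peer_rtts.get(r, {}).items():
            if q not in estimates and rtt_pr > ALPHA * rtt_rq:
                estimates[q] = rtt_pr
    return estimates
```

Each process thus measures only a subset of peers and fills in the rest from RTTs its already-measured peers report about their own close neighbors.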

Page 18: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Traffic Matrix

Traffic matrix T = {t_ij}
 t_ij: traffic between ranks i and j in the target application
Many applications repeat similar communication patterns
➭ Execute the application for a short amount of time and make t_ij the number of transmitted messages
 (E.g., one iteration of an iterative app)
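A minimal sketch of this message counting during the profiling run; the class and method names are assumptions for illustration, not MC-MPI's internals.

```python
class TrafficProfiler:
    """Build the traffic matrix T = {t_ij} during a short profiling run."""

    def __init__(self, n_ranks):
        # t[i][j] = number of messages sent from rank i to rank j
        self.t = [[0] * n_ranks for _ in range(n_ranks)]

    def record(self, src_rank, dst_rank):
        # Called once per transmitted message during the profiling run
        self.t[src_rank][dst_rank] += 1

    def matrix(self):
        return self.t
```

After one representative iteration, the accumulated matrix approximates the application's steady-state communication pattern.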

Page 19: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Outline

1. Introduction
2. Related Work
3. Proposed Method
   Profiling Run
   Connection Management
   Rank Assignment
4. Experimental Results
5. Conclusion

Page 20: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Connection Management

MPI_Init: build the bounding graph (candidate connections) and a spanning tree
Application body: establish candidate connections on demand (lazy connection establishment)

Page 21: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Selection of Candidate Connections

Each process selects O(log n) neighbors based on L and T
 A parameter controls the connection density
 n: number of processes

[Figure: per-group quotas of the density parameter, its half, its quarter, ... — many nearby processes selected, few faraway processes]
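The selection idea can be sketched as follows. The density parameter's symbol is garbled in this transcript, so the sketch calls it k; the halving quotas over distance-sorted groups match the "full, 1/2, 1/4, ..." progression on the slide, but the exact bucket boundaries are an illustrative guess, not MC-MPI's published rule.

```python
def select_neighbors(rtts, k):
    """Pick many nearby and few faraway neighbors (illustrative sketch).

    rtts: {peer: round-trip time}.  Take up to k peers from the nearest
    half of the remaining peers, k//2 from the next quarter, k//4 from
    the next eighth, ..., giving O(k log n) candidates biased toward
    low-latency peers.
    """
    peers = sorted(rtts, key=rtts.get)  # nearest first
    chosen, start, quota = [], 0, k
    while start < len(peers) and quota >= 1:
        end = start + max(1, (len(peers) - start) // 2)
        chosen.extend(peers[start:start + min(quota, end - start)])
        start, quota = end, quota // 2
    return chosen
```

With 64 peers and k = 4, this picks the 4 nearest peers, 2 from the next band, and 1 from a far band: roughly 2k candidates, concentrated locally.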

Page 22: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Bounding Graph

Processes try to establish temporary connections to their selected neighbors
The collective set of successful connections ➭ bounding graph
 (Some connections may fail due to firewalls)

Page 23: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Routing Table Construction

Construct a routing table using just the bounding graph
Close the temporary connections
 Connections of the bounding graph are reestablished lazily as "real" connections
 Temporary connections => small buffers; real connections => large buffers
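One way to build such a routing table from the bounding graph alone is a breadth-first search per node that records the first hop toward every destination. This is a sketch of the general idea under that assumption, not MC-MPI's actual routing algorithm.

```python
from collections import deque

def routing_table(adj, src):
    """Next-hop table from src over the bounding graph (BFS sketch).

    adj: {node: iterable of directly connected nodes}.  Messages to
    nodes that src cannot reach directly are forwarded hop by hop via
    the returned next-hop entries.
    """
    next_hop, dist = {}, {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                # first hop on the path: v itself if adjacent to src,
                # otherwise inherit the first hop used to reach u
                next_hop[v] = v if u == src else next_hop[u]
                queue.append(v)
    return next_hop
```

BFS yields minimum-hop routes, which keeps forwarding penalties small as long as the bounding graph is locality-aware.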

Page 24: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Lazy Connection Establishment

If a lazy connect fails due to a firewall:
 Send a connect request to the peer using the spanning tree
 The peer then connects in the reverse direction

[Figure: bounding graph and spanning tree spanning a firewall]

Page 25: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Outline

1. Introduction
2. Related Work
3. Proposed Method
   Profiling Run
   Connection Management
   Rank Assignment
4. Experimental Results
5. Conclusion

Page 26: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Commonly-used Method

Sort the processes by host name (or IP address) and assign ranks in that order
Assumptions:
 Most communication takes place between processes with close ranks
 The communication cost between processes with close host names is low
However,
 Applications have various communication patterns
 Host names don't necessarily correlate with communication costs

Page 27: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Our Rank Assignment Scheme

Find a rank-process mapping with low communication overhead
 Map the rank assignment problem to the Quadratic Assignment Problem (QAP)

QAP: given two n×n cost matrices, L and T, find a permutation p of {0, 1, ..., n-1} that minimizes:

 Σ_i Σ_j t_ij · l_p(i),p(j)
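To make the objective concrete, here is a sketch that evaluates that sum for a given permutation and, for very small n, finds the minimizer by brute force. The function names are illustrative, and the exhaustive search is only a stand-in for the heuristic solver used in practice.

```python
from itertools import permutations

def qap_cost(T, L, p):
    """Cost of assigning rank i to process p[i]:
    sum over rank pairs (i, j) of t_ij * l_{p(i), p(j)}."""
    n = len(p)
    return sum(T[i][j] * L[p[i]][p[j]] for i in range(n) for j in range(n))

def best_assignment(T, L):
    """Exhaustive minimizer -- feasible only for very small n."""
    n = len(T)
    return min(permutations(range(n)), key=lambda p: qap_cost(T, L, p))
```

For example, if ranks 0 and 1 exchange most of the traffic, the minimizer places them on the pair of processes with the lowest mutual latency.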

Page 28: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Solving QAPs

NP-hard, but there are heuristics for finding good suboptimal solutions
 Library based on GRASP [Resende et al. '96]
 Tested against QAPLIB [Burkard et al. '97]
  Instances of up to n = 256 (n processors for problem size n)
  Approximate solutions within one to two percent of the best known solution in under one second
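GRASP combines greedy randomized construction with local search. As a minimal, self-contained stand-in for the local-search phase, the sketch below improves an assignment by pairwise swaps over the QAP objective; it is not the cited Resende et al. library, and the names are illustrative.

```python
def qap_cost(T, L, p):
    # Sum over rank pairs (i, j) of t_ij * l_{p(i), p(j)}
    n = len(p)
    return sum(T[i][j] * L[p[i]][p[j]] for i in range(n) for j in range(n))

def swap_local_search(T, L, p):
    """Greedily swap pairs in assignment p until no swap lowers the cost."""
    p, best = list(p), qap_cost(T, L, p)
    improved = True
    while improved:
        improved = False
        for i in range(len(p)):
            for j in range(i + 1, len(p)):
                p[i], p[j] = p[j], p[i]
                cost = qap_cost(T, L, p)
                if cost < best:
                    best, improved = cost, True
                else:
                    p[i], p[j] = p[j], p[i]  # undo the unhelpful swap
    return p, best
```

A full GRASP run would restart this from many randomized greedy starting assignments and keep the best result found.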

Page 29: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Outline

1. Introduction
2. Related Work
3. Profiling Run
4. Connection Management
5. Rank Assignment
6. Experimental Results
7. Conclusion

Page 30: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Experimental Environment

Xeon/Pentium M, Linux
Intra-cluster RTT: 60-120 microseconds
TCP send/recv buffers: 256 KB each

[Figure: four 64-node clusters — sheepXX, istbsXXX, hongoXXX, chibaXXX — with pairwise RTTs of 0.3, 4.3, 4.4, 6.8, 6.9, and 10.8 ms; one cluster is behind a firewall]

Page 31: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Experiment 1: Connection Management

Measure the performance of the NPB with limited numbers of connections
MC-MPI: limit the number of connections to 10%, 20%, ..., 100% by varying the density parameter
Random: establish a comparable number of connections randomly

Page 32: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

BT, LU, MG and SP

[Chart: relative performance vs. maximum % of connections, MC-MPI vs. Random — LU (Lower-Upper), an SOR (Successive Over-Relaxation) solver]

Page 33: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

BT, LU, MG and SP (2)

[Charts: relative performance vs. maximum % of connections, MC-MPI vs. Random — BT (Block Tridiagonal) and MG (Multi-Grid)]

Page 34: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

BT, LU, MG and SP (3)

[Chart: relative performance vs. maximum % of connections, MC-MPI vs. Random — SP (Scalar Pentadiagonal)]

The % of connections actually established was lower than that shown by the x-axis
 Because of lazy connection establishment; to be discussed in more detail later

Page 35: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

EP

[Chart: relative performance vs. maximum % of connections, MC-MPI vs. Random — EP (Embarrassingly Parallel)]

EP involves very little communication

Page 36: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

IS

[Charts: relative performance vs. maximum % of connections (MC-MPI vs. Random), and relative performance vs. buffer size (0-128 KB) at 20%, 60%, and 100% connections — IS (Integer Sort)]

Performance decrease due to congestion!

Page 37: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Experiment 2: Lazy Connection Establishment

Compare our lazy connection establishment method with an MPICH-like method
MC-MPI: select the density parameter so that the maximum number of allowed connections is 30%
MPICH-like: establish connections on demand without preselecting candidate connections
 (we can also say that we preselect all connections)

Page 38: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Experiment 2: Results

[Charts: connections established (%) and relative performance per benchmark (BT, EP, IS, MG, LU, SP), MC-MPI vs. MPICH-like]

Comparable number of connections except for IS
Comparable performance except for IS

Page 39: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Experiment 3: Rank Assignment

Compare 3 assignment algorithms
 Random
 Hostname (24 patterns): real host names (1); what if istbsXXX were named sheepXX, etc. (23)
 MC-MPI (QAP)

[Figure: the four clusters — chibaXXX, sheepXX, hongoXXX, istbsXXX]

Page 40: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

LU and MG

[Charts: speedup vs. Random for LU and MG — Random, Hostname (Best), Hostname (Worst), MC-MPI (QAP)]

Page 41: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

BT and SP

[Charts: speedup vs. Random for BT and SP — Random, Hostname (Best), Hostname (Worst), MC-MPI (QAP)]

Page 42: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

BT and SP (cont'd)

[Figure: traffic matrix (source rank vs. destination rank) and the rank assignments produced by Hostname and MC-MPI (QAP) over Clusters A-D]

Page 43: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

EP and IS

[Charts: speedup vs. Random for EP and IS — Random, Hostname (Best), Hostname (Worst), MC-MPI (QAP)]

Page 44: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Outline

1. Introduction
2. Related Work
3. Profiling Run
4. Connection Management
5. Rank Assignment
6. Experimental Results
7. Conclusion

Page 45: Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Conclusion

MC-MPI
 Connection management: high performance with connections between just 10% of all process pairs
 Rank assignment: up to 300% faster than locality-unaware assignments

Future Work
 An API to perform profiling within a single run
 Integration of adaptive collectives