
  • This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551

    LLNL-PRES-529376

    SIAM Conference on Parallel Processing ◆ February 15, 2012

    Topology aware resource allocation and mapping challenges at exascale

    Abhinav Bhatele and Laxmikant V. Kale

  • IBM Blue Gene

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Qbox: FPMD Simulations

    • Two-dimensional process grid

    • Collective communication over rows and columns

    • MPI_Bcast (14.5%) and MPI_Allreduce (3.7%) on 8K nodes

    3

    performance of MPI_Bcast, since this dominates communication costs in Qbox (see Table 1). Figure 2 shows the communication pattern of a single broadcast on a 4x4 plane of BG/L nodes using three eight-node communicators. Broadcasts over a compact rectangle of nodes (left panel), which use the torus network's broadcast functionality, have the most balanced packet counts as well as the lowest maximum count. When we split the nodes across multiple lines, resulting in disjoint sets of nodes (middle panel), the communication requires significantly more packets with less balanced link utilization. The node mappings in Figures 1(c) and 1(d), on the other hand, lead to a more balanced link utilization (as illustrated in Figure 2, right panel) and hence to higher overall performance.

    Figure 1. Illustration of different node mappings for a 64k-node partition, panels (a)-(d). Each color represents the nodes belonging to one 512-node column of the process grid.

    This analysis of Qbox communication led to several node mapping optimizations. In particular, our initial mappings did not optimize the placement of tasks within communicators. We have refined the bipartite mapping shown in Figure 1(c) to map tasks with a variant of Z ordering within a plane. This modification effectively isolates the subtree of a binomial software-tree broadcast within subplanes of the torus. In addition, we observed that substantial time was spent in MPI_Type_commit calls in the BLACS library. The types being created were in many cases just contiguous native types, which allowed us to hand-optimize BLACS to eliminate the calls in those cases. We are investigating other possible optimizations, including multi-level collective operations [13] that could substantially improve the performance of the middle configuration in Figure 2.
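    As a minimal illustration of this kind of hand-optimization (not the actual BLACS code; the function names below are hypothetical), a derived type that is just a contiguous run of native elements can be replaced by sending a count of the native type directly, avoiding the type creation and commit cost:

```c
#include <mpi.h>

/* Before: build and commit a contiguous derived type, then send one instance of it. */
void send_block_with_type(const double *buf, int count, int dest, MPI_Comm comm) {
    MPI_Datatype block;
    MPI_Type_contiguous(count, MPI_DOUBLE, &block);
    MPI_Type_commit(&block);                 /* this commit cost is what showed up in profiles */
    MPI_Send(buf, 1, block, dest, 0, comm);
    MPI_Type_free(&block);
}

/* After: the type was just 'count' contiguous doubles, so send them directly. */
void send_block_native(const double *buf, int count, int dest, MPI_Comm comm) {
    MPI_Send(buf, count, MPI_DOUBLE, dest, 0, comm);
}
```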

    Francois Gygi et al. Large-scale electronic structure calculations of high-Z metals on the BlueGene/L platform. In Proceedings of the 2006 ACM/IEEE conference on Supercomputing (SC '06). ACM, New York, NY, USA

    Measured Qbox performance for the four node mappings of Figure 1: 39.5 TFlop/s, 38.2 TFlop/s, 64.0 TFlop/s, and 64.7 TFlop/s.

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    OpenAtom: ab-initio MD

    • Transposes, multicasts and reductions

    • Significant contention on the network

    4


    Figure 2. Mapping of different chare arrays to the 3D torus of the machine

    The number of planes in GSpace is different from that in RealSpace. GSpace also interacts with the PairCalculator arrays. Each plane of GSpace, G(∗, p), interacts with the corresponding plane, P(∗, ∗, p), of the PairCalculators (plane-wise communication) through multicasts and reductions. So, GSpace interacts state-wise with RealSpace and plane-wise with PairCalculators. If all planes of GSpace are placed together, then the transpose operation is favored, but if all states of GSpace are placed together, the multicasts/reductions are favored. To strike a balance between the two extremes, a hybrid map is built, where a subset of planes and states of these three arrays are placed on one processor.

    Mapping GSpace and RealSpace Arrays: Initially, the GSpace array is placed on the torus and other objects are mapped relative to GSpace's mapping. The 3D torus is divided into rectangular boxes (which will be referred to as "prisms") such that the number of prisms is equal to the number of planes in GSpace. The longest dimension of the prism is chosen to be the same as one dimension of the torus. Each prism is used for all states of one plane of GSpace. Within each prism for a specific plane, the states in G(∗, p) are laid out in increasing order along the long axis of the prism. Once GSpace is mapped, the RealSpace objects are placed. Prisms perpendicular to the GSpace prisms are created, formed by including the processors holding all planes for a particular state of GSpace, G(s, ∗), and the corresponding states of RealSpace, R(s, ∗), are mapped on to these prisms. Figure 2 shows the GSpace objects (on the right) and the RealSpace objects (in the foreground) being mapped along the long dimension of the torus (box in the center).

    Mapping of Density Arrays: RhoR objects communicate with RealSpace plane-wise and hence Rρ(p, ∗) have to be placed close to R(∗, p). To achieve this, we start with the centroid of the prism used by R(∗, p) and place RhoR objects in proximity to it. RhoG objects, Gρ(p), are mapped near RhoR objects, Rρ(p, ∗), but not on the same processors as RhoR, to maximize overlap. The density computation is inherently smaller and hence occupies the center of the torus.
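    A hypothetical sketch of the prism idea follows (this is not the OpenAtom code; the function and its parameters are illustrative). It tiles the X-Y face of an X x Y x Z torus with one prism per GSpace plane, uses the Z dimension as the prisms' long axis, and lays out the states of each plane in increasing order along that axis:

```c
/* Assumes gx * gy equals the number of GSpace planes, gx divides X, and gy divides Y. */
void place_gspace_object(int state, int plane,
                         int X, int Y, int Z, int gx, int gy,
                         int *x, int *y, int *z)
{
    int bx = X / gx, by = Y / gy;           /* cross-section of each prism */
    int px = plane % gx, py = plane / gx;   /* which prism holds this plane */
    *z = state % Z;                         /* walk the long (Z) axis first */
    int slot = state / Z;                   /* then wrap into the cross-section */
    *x = px * bx + slot % bx;
    *y = py * by + (slot / bx) % by;
}
```

    The RealSpace states R(s, ∗) would then be placed on prisms running perpendicular to these, as the excerpt above describes.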


    [Plot: time per step (s) versus number of cores (1024 to 8192), comparing Default Mapping and Topology Mapping.]

    Abhinav Bhatele, Eric Bohm, and Laxmikant V. Kale. Optimizing communication for Charm++ applications by reducing network contention. Concurrency and Computation: Practice and Experience, 23(2):211-222, February 2011.

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    ddcMD: PPPM algorithm

    • Optimize communication within short-range and long-range and also between the two

    • Reduction in comm. time:
      • short-range: 3 times
      • long-range (FFT): 2 times
      • between the two: 50%

    • Overall reduction: 36% on 32K nodes (64 x 32 x 16)

    5

    Figure 2: Task layout on the BlueGene/P torus. Cyan arrows indicate communication from MD tasks to collector tasks, blue arrows indicate communication from collector tasks to FFT tasks.

    …is the reduction of the contributions to ρ at each point from these multiple tasks to a single sum.

    To perform the reduction we nominate a subset of the particle tasks as "collector" tasks (see Fig. 2). Each mesh point is uniquely assigned to a collector task that is responsible for gathering all contributions to ρ for that mesh point and performing the sum. The number and arrangement of the collector tasks is a tunable parameter and all communication is local.

    In Stage 2 of the communication, each collector task sends mesh information to the appropriate mesh task using MPI_Isend. We think of this stage as a gather operation since the mesh data is being gathered to the mesh tasks. This communication is long-range, but can be efficiently organized.

    Once the long-range portion of the potential has been calculated, we perform the communication stages again but in reverse order. The mesh tasks scatter mesh data back to the collector tasks, which in turn send values to the neighboring particle tasks. To maximize the overlap of communication and computation, the MPI_Irecvs on the collector tasks are posted before we start the pair calculation. This allows data to begin moving from the mesh tasks as soon as it is available.
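    The overlap pattern just described can be sketched as follows (a hedged illustration, not the ddcMD source; the buffer size, tag, and pair_computation callback are placeholders):

```c
#include <mpi.h>

void collector_step(double *mesh_buf, int n_mesh_vals, int mesh_task_rank,
                    MPI_Comm comm, void (*pair_computation)(void))
{
    MPI_Request req;

    /* Post the receive before starting local work, so mesh data can flow
       (e.g. via the BG/P DMA engine) while the CPU computes. */
    MPI_Irecv(mesh_buf, n_mesh_vals, MPI_DOUBLE, mesh_task_rank, 0, comm, &req);

    pair_computation();                   /* explicit short-range pair computation */

    MPI_Wait(&req, MPI_STATUS_IGNORE);    /* mesh data is now available locally */
}
```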

    In benchmark calculations on Dawn using 144,384 processors (9,216 mesh tasks and 135,168 particle tasks, of which 22,400 are collector tasks) we observe that when work is properly balanced between the particle and mesh tasks there is a pronounced asymmetry in the communication times between the particle and mesh tasks. Although the mesh tasks spend roughly 15% of total runtime waiting for data to arrive, the particle tasks spend under 2% of the total runtime sending it. Asynchronous communication allows the particle tasks to continue with other work while the communication proceeds. Hence, we have successfully accomplished the goal of minimizing communication time on the particle tasks by overlapping the communication of mesh data with the explicit pair computation.

    This success demonstrates the effectiveness of the direct memory access (DMA) engine that was added to the BG/P design as an improvement over BG/L [23]. The DMA is directly coupled to the L3 (shared) cache on each node and is responsible for sending and receiving data to and from the torus network. The CPU is thus relieved of these tasks and is free to continue on to other computations. From comparisons with benchmark simulations performed on BG/L, it is clear that there is a significant benefit from the DMA.

    The two-stage approach just described has at least two advantages over a single-stage method in which each particle task simply sends all of the mesh points it populates to the appropriate mesh tasks and the reduction of partial sums is performed on the mesh tasks. The first advantage is a reduction of communication bandwidth from the particle to the mesh tasks. Although the number of mesh points to which a particle task contributes varies with n_g, for our typical problems of interest it is roughly 2-5 times the number that lie strictly within its computational domain. Hence a single-stage solution would require 2-5 times the network bandwidth to complete communication in the same time. In the two-stage approach the mesh points are gathered and reduced locally, so a larger number of torus links can be active, increasing the aggregate bandwidth available to communicate mesh points. A second advantage is that the number of collector tasks can be tuned to optimize total communication cost. Changing the number of collector tasks allows trade-offs between the number of messages sent to each mesh task in Stage 2 (with corresponding changes in message size) and the bandwidth available for the reduction in Stage 1.

    4.3 Layout

    For a 3D torus network, the assignment of MPI tasks onto compute nodes at specific torus coordinates can significantly impact parallel efficiency at full machine scale. It is necessary to optimize communication both within the short-range and long-range subcommunicators, as well as between the two. For the latter, we focused on splitting the torus into separate sections for the particle and mesh tasks such that communication between the "collector" tasks described in Section 4.2 and the mesh tasks takes place along a single torus dimension to reduce contention and avoid bottlenecks, as illustrated in Figure 2. The tasks are then ordered within each subcommunicator to provide nearest-neighbor communication for spatially adjacent particle tasks and reduce transpose communication times for the mesh tasks.

    For a system of 1.2 billion particles on a 64 × 32 × 16 Blue Gene/P partition (32,768 nodes, 131,072 tasks), we see a significant decrease in communication times using a custom task map constructed as described above compared with the default (TXYZ) layout. The total run time decreased by 36% when the custom mapping was used, with the greatest improvement being seen in the intra-particle task communication times, which decreased by a factor of 3, likely due to the poor correspondence of the default mapping to the simulation box shape. Communication within the 3D FFT was decreased by over a factor of 2, and communication between the particle and mesh tasks was more than 50% faster. These results highlight the need to carefully understand and optimize the communication patterns on large torus networks.

    4.4 FFT Implementation

    As described in Section 2.1, the long-range interaction term in Eqn. 3 involves the use of a 3D Fourier transform between the real-space (real-valued) density and the k-space (complex-valued) density. To obtain optimal 3D FFT performance in the massively-parallel regime, a custom real-to-complex 3D Fast Fourier Transform implementation (bigFFT) was developed using a 2D decomposition…

    D. F. Richards et al. Beyond homogeneous decomposition: scaling long-range forces on Massively Parallel Systems. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09). ACM, New York, NY, USA

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Other applications

    • Blue Matter - Blake G. Fitch et al. Blue matter: approaching the limits of concurrency for classical molecular dynamics. In Proceedings of the 2006 ACM/IEEE conference on Supercomputing (SC '06). ACM, New York, NY, USA

    • NAMD - Abhinav Bhatele et al. Dynamic topology aware load balancing algorithms for molecular dynamics applications. In Proceedings of the 23rd international conference on Supercomputing (ICS '09). ACM, New York, NY, USA

    • NAS BT, CG, MG - Brian E. Smith et al. Performance Effects of Node Mappings on the IBM Blue Gene/L Machine. In Euro-Par, pages 1005–1013, 2005

    • SAGE, UMT2000 - G. Bhanot et al. Optimizing task layout on the Blue Gene/L supercomputer. IBM Journal of Research and Development, 49(2/3):489–500, 2005

    • GTC Particle-in-cell - Leonid Oliker et al. Scientific Application Performance on Candidate PetaScale Platforms. In Proceedings of IEEE Parallel and Distributed Processing Symposium (IPDPS), March 2007

    6

  • Cray XT/XE

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    pF3D

    • Slowest communicators are those that are most spread out on the network

    • Runs on Cielo - Cray XE6 at LANL

    8

    Fig. 4: Message passing during 2D FFTs for a single xy-slab. (a) Communication patterns for xyzt in the x-direction, color coded by communicator; only links in a single direction are used. (b) Communication for block-mapped data, color coded by communicator; the communicating nodes are distributed within the 4x4x4 block and use links in multiple directions. (c) xyzt in the y-direction uses vertical links. (d) Y-communication for the block-mapped scheme uses links in multiple directions.

    Fig. 5: Message passing for y-communication on Cielo. (a) The locations on the Cielo torus are shown for the five fastest y-communicators. (b) The locations on the Cielo torus are shown for the five slowest y-communicators. The slowest communicators are all "long and skinny" and are thus subject to contention for bandwidth along the links in the z-direction.

    The Cielo message passing performance doesn't show a strong trend as a function of the number of processes. The variability among the three runs with 1024 processes appears to be bigger than differences due to the number of processes. All runs used a 16x16xN decomposition so that X messages are all passed on node. All Y messages go off node and are significantly slower than the X messages.

    The message passing rate during the 32K process Cielo run varied between 80 and 100 MB/s for different batch jobs. The variability appears to be due to the placement of the communicators on the interconnect. The run used a 16x16x128 decomposition. The message passing rate for Y messages is probably about 1 GB/s per node. That is significantly less than the roughly 5 GB/s rate that MPI benchmarks achieve.

    We investigated the effect of varying the MPI eager limit. It made a noticeable difference in smaller runs but does not appear to help for large runs. The likely explanation is that a large run is statistically very likely to have a communicator that is slow, and that masks any effects of varying the MPI parameter. We plan to investigate custom mappings between the physical location of domains and their placement on the interconnect in future work.
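    To make the on-node/off-node distinction in the excerpt concrete, here is a small sketch (my illustration, not from the paper). It assumes an x-major rank ordering and 16 ranks per node (a Cielo XE6 node has 16 cores), so each 16-rank x-line of the 16x16xN decomposition maps to one node: every X message stays on node, while every Y message goes off node.

```c
#include <stdio.h>

int main(void) {
    const int NX = 16, NY = 16, NZ = 8;   /* 16x16xN decomposition; NZ = 8 is illustrative */
    const int ranks_per_node = 16;        /* assumed: one x-line of ranks per node */
    int off_node_x = 0, off_node_y = 0;

    for (int z = 0; z < NZ; z++)
        for (int y = 0; y < NY; y++)
            for (int x = 0; x < NX; x++) {
                int rank = (z * NY + y) * NX + x;              /* x-major rank ordering */
                int node = rank / ranks_per_node;
                if (x + 1 < NX &&
                    ((z * NY + y) * NX + (x + 1)) / ranks_per_node != node)
                    off_node_x++;                              /* +x neighbor lands off node */
                if (y + 1 < NY &&
                    ((z * NY + (y + 1)) * NX + x) / ranks_per_node != node)
                    off_node_y++;                              /* +y neighbor lands off node */
            }

    printf("off-node +x messages: %d, off-node +y messages: %d\n",
           off_node_x, off_node_y);
    return 0;
}
```

    With these assumptions the count of off-node X messages is zero and every Y message crosses a node boundary, matching the observation above.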


    Steven H. Langer et al. Cielo Full-System Simulations of Multi-Beam Laser-Plasma Interaction in NIF Experiments. In Proceedings of the Cray User Group, 2011.

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    OpenAtom: ab-initio MD

    9

    Abhinav Bhatele, Eric Bohm, and Laxmikant V. Kale. Optimizing communication for Charm++ applications by reducing network contention. Concurrency and Computation: Practice and Experience, 23(2):211-222, February 2011.

    [Plot: time per step (s) versus number of cores (512 to 2048), comparing Default Mapping and Topology Mapping.]

    • Job schedulers on Cray machines are typically not topology aware

    • Performance benefit at 2048 cores: 40% (XT3), 45% (BG/P), 41% (BG/L)

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Other work

    • M. Muller and Michael Resch. PE mapping and the congestion problem in the T3E. In Proceedings of the Fourth European Cray-SGI MPP Workshop, Garching, Germany, 1998

    • Thierry Cornu and Michel Pahud. Contention in the Cray T3D Communication Network. In Euro-Par '96: Proceedings of the Second International Euro-Par Conference on Parallel Processing - Volume II, pages 689-696, London, UK, 1996. Springer-Verlag

    • Eduardo Huedo, Manuel Prieto, Ignacio Martin Llorente and Francisco Tirado. Impact of PE Mapping on Cray T3E Message-Passing Performance. In Euro-Par '00: Proceedings from the 6th International Euro-Par Conference on Parallel Processing, pages 199-207, London, UK, 2000. Springer-Verlag

    • Deborah Weisser, Nick Nystrom, Chad Vizino, Shawn T. Brown, and John Urbanic. Optimizing Job Placement on the Cray XT3. 48th Cray User Group Proceedings, 2006

    10

  • Infiniband clusters

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Optimizing collective communication

    12

    D. K. Panda, K. Schulz, B. Barth and A. Majumdar. Topology-Aware MPI Communication and Scheduling for Petascale Systems. Poster, NSF STCI. https://confluence.pegasus.isi.edu/download/attachments/5242944/topology-aware-poster.pdf

    Topology-Aware MPI Communication and Scheduling for Petascale Systems
    PIs: D. K. Panda (The Ohio State University), K. Schulz and B. Barth (Texas Advanced Computing Center), and A. Majumdar (San Diego Supercomputer Center)

    [Poster sections: Motivation; Vision and Problem Statement; Framework and Approach; Network Topology of TACC Ranger; Current Job Allocation Strategies and their Impact: A Case Study with TACC Ranger (job allocation for the entire system; jobs using 16-4800, 4800-16000, and 16000-64000 cores); Application Level Performance Impact: A Case Study with MPCUGLES; Topology Aware MPI_Gather Design; Topology Aware MPI_Scatter Design; Conclusions and Continuing Work. Table 1: Data Collected from TACC Ranger System.]

    Modern networks (like InfiniBand and 10 GigE) are capable of providing topology and routing information

    Research Challenge:

    Can the next-generation petascale systems provide topology-aware MPI communication, mapping and scheduling which can improve performance and scalability for a range of scientific applications?

    • On the left, we compare performance of MPCUGLES on the Normal Batch Queue as compared to runs conducted on an exclusive queue

    • We observe that performance may be impacted by up to 15%

    • On the right, we compare performance of MPCUGLES on the Normal Batch Queue, but with special randomization of hostfiles

    • We observe that there is greater variance in performance, with up to 16% difference between best-case and worst-case runs

    [Framework diagram: MPI applications (turbulence prediction, earthquake modeling, flow modeling, kinetic simulation) pass application hints to a unified abstraction layer containing a topology information interface (topology graph, network status graph) and a dynamic state & topology management framework; an enhanced subnet management layer over the high-performance interconnect and the Ethernet management network performs system topology discovery and traffic monitoring; the job scheduler performs topology-aware scheduling with performance feedback and profiling information; topology-aware task mapping and topology-aware communication (collectives and point-to-point) feed an integrated evaluation.]

    [Diagram: Racks 1 through 82, each connected to InfiniBand Switch (Magnum) 1 and InfiniBand Switch (Magnum) 2.]

    • Ranger's compute nodes are a blade-based configuration

    • 12 blades in a chassis, 4 chassis in a rack, 82 racks in all, for a total of 4096 compute nodes

    • Each chassis embeds a NEM (network express module) combining a 24-port leaf switch with 12 dual-rail SDR HCAs, one per node

    • Each NEM is connected to the core Magnum switch(es) with four 12X connectors, two to each Magnum

    [Plots: MPCUGLES performance normalized to the exclusive queue (y-axis 0 to 1.3) for individual runs on 192 cores; left: batch queue with normal ordering (8 runs); right: batch queue with random ordering (6 runs).]

    Research Questions:

    (1) What are the topology aware communication and scheduling requirements of petascale applications?

    (2) How to design a network topology and state management framework with static and dynamic network information?

    (3) How to design topology-aware point-to-point and collective communication schemes?

    (4) How to design topology-aware task mapping and scheduling schemes?

    (5) How to design a flexible topology information interface?

    • On the left, we compare performance of the default "Binomial Tree" algorithm under quiet and busy conditions

    • We observe that the algorithm is impacted by background traffic

    • On the right, we compare performance of the proposed topology-aware algorithm under quiet and busy conditions

    • The proposed algorithm outperforms the default "Binomial Tree" algorithm under both quiet and busy conditions

    • 23% performance improvement under quiet network conditions and 10% under busy conditions

    • The graphs present the analysis of the jobs run on the TACC Ranger system in September '09

    • There were a total of 19,441 multi-node jobs, most of which used 16-4800 cores

    • We observe that for the majority of jobs, the average inter-node distance is significantly more than the best possible

    • Modern high-end computing (HEC) systems enable scientists to tackle grand challenge problems

    • Design and deployment of such ultra-scale HEC systems is being fueled by the increasing use of multi-core/many-core architectures and commodity networking technologies like InfiniBand.

    • As a recent example, the TACC Ranger system was deployed with a total of 62,976 cores using a fat-tree InfiniBand interconnect to provide a peak performance of 579 TFlops.

    • Most current petascale applications are written using the Message Passing Interface (MPI) programming model.

    • By necessity, large-scale systems that support MPI are built using hierarchical topologies (multiple levels involving intra-socket, intra-node, intra-blade, intra-rack, and multi-stages within a high-speed switch).

    • Current generation MPI libraries and schedulers do not take into account these various levels for optimizing communications

    • Consequently, this leads to non-optimal performance and scalability for many applications.

    Process Location | Number of Hops             | MPI Latency (us)
    Intra-Rack       | 0 Hops in Leaf Switch      | 1.57
    Intra-Chassis    | 1 Hop in Leaf Switch       | 2.04
    Inter-Chassis    | 3 Hops Across Spine Switch | 2.45
    Inter-Rack       | 5 Hops Across Spine Switch | 2.85

    Results of current work:

    • We have observed a major impact on end applications if schedulers and communication libraries are not topology aware

    • We have proposed topology aware collective communication algorithms for MPI_Gather and MPI_Scatter

    • Our proposed algorithms outperform the default implementation under both network-quiet and busy conditions

    Continuing work:

    • Work towards a topology-aware scheduling scheme

    • Adapt more collective algorithms dynamically according to topology, interfacing with schedulers

    • Gather more data from real-world application runs

    • Integrated solutions will be available in future versions of the MVAPICH/MVAPICH2 software

    Publications:

    • K. Kandalla, H. Subramoni and D. K. Panda, "Designing Topology-Aware Collective Communication Algorithms for Large Scale InfiniBand Clusters: Case Studies with Scatter and Gather", Communication Architecture for Clusters (CAC) Workshop, in conjunction with IPDPS 2010

    Additional Personnel:

    • Hari Subramoni, Krishna Kandalla, Sayantan Sur, Karen Tomko (OSU)

    • Mahidhar Tatineni, Yifeng Cui, Dmitry Pekurovsky (SDSC)

    • We conduct experiments with MVAPICH2 stack, which is one of the most popular MPI implementations over InfiniBand, currently used by more than 1,050 organizations worldwide (http://www.mvapich.cse.ohio-state.edu)

    • On the left, we compare performance of the default "Binomial Tree" algorithm under quiet and busy conditions

    • We observe that the default algorithm is very sensitive to background traffic, with degradation of up to 21% for large messages

    • On the right, we compare performance of the proposed topology-aware algorithm

    • The proposed algorithm outperforms the default "Binomial Tree" algorithm under both quiet and busy conditions

    • Over 50% performance improvement even when the network was busy (a sketch of a two-level, topology-aware gather in this spirit follows below)
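    As a hedged sketch of the two-level idea behind such topology-aware collectives (my illustration, not the MVAPICH2 implementation), ranks first gather within their leaf switch to a local leader, and the leaders then gather to the root. Here leaf_id and ranks_per_leaf are assumptions standing in for real topology information:

```c
#include <mpi.h>
#include <stdlib.h>

/* Which leaf switch (chassis) a rank sits under; on a real system this would
   come from a topology query rather than from the rank number. */
static int leaf_id(int rank, int ranks_per_leaf) { return rank / ranks_per_leaf; }

/* Gathers 'count' ints from every rank to rank 0 of 'comm'. Assumes the
   communicator size is a multiple of ranks_per_leaf and ranks are numbered
   contiguously within each leaf. */
void two_level_gather(const int *sendbuf, int count, int *recvbuf,
                      int ranks_per_leaf, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* Stage 1: gather inside each leaf switch to a local leader (lowest rank). */
    MPI_Comm leaf;
    MPI_Comm_split(comm, leaf_id(rank, ranks_per_leaf), rank, &leaf);
    int lrank;
    MPI_Comm_rank(leaf, &lrank);
    int *stage1 = NULL;
    if (lrank == 0)
        stage1 = malloc((size_t)count * ranks_per_leaf * sizeof(int));
    MPI_Gather(sendbuf, count, MPI_INT, stage1, count, MPI_INT, 0, leaf);

    /* Stage 2: leaders forward their per-leaf blocks to the global root. */
    MPI_Comm leaders;
    MPI_Comm_split(comm, lrank == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);
    if (leaders != MPI_COMM_NULL) {
        MPI_Gather(stage1, count * ranks_per_leaf, MPI_INT,
                   recvbuf, count * ranks_per_leaf, MPI_INT, 0, leaders);
        MPI_Comm_free(&leaders);
    }

    free(stage1);
    MPI_Comm_free(&leaf);
}
```

    A topology-aware scatter would follow the mirror pattern: the root scatters per-leaf blocks to the leaders, which then scatter within their own leaf.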


  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Mapping is important

    • As evidenced by several applications on Blue Gene

    • Also on Cray machines, even though:
      • Link bandwidth: 3.8 GB/s (XT3), 0.425 GB/s (BG/P), 0.175 GB/s (BG/L)
      • Bytes per flop: 8.77 (XT3), 0.375 (BG/P and BG/L)

    • Even more important now that:
      • Bytes per flop that the network can handle is decreasing: 8.77 (XT3), 1.36 (XT4), 0.23 (XT5), 0.23-0.46 (XE6)★

    13

    ★ Based on BigBen (XT3, PSC), Jaguar (XT4/XT5, ORNL) and Hopper (XE6, NERSC)

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Wormhole Routing

    • Ni et al. 1993; Oh et al. 1997 - Equation for modeling message latencies:

    • Relatively small-sized supercomputers

    • It was safe to assume message latencies were independent of distance

    …sharing network resources. For common networks with asymptotically inadequate link bandwidth, chances of contention increase as messages travel farther and farther. Network congestion on a link slows down all messages passing through that link. Delays in message delivery can affect overall application performance. Thus, it becomes necessary to consider the topology of the machine while mapping parallel applications to job partitions.

    This dissertation will demonstrate that it is not wise to assume that message latencies are independent of the distance a message travels. This assumption has been supported all these years by the advantages of virtual cut-through and wormhole routing, suggesting that the message latency is independent of the distance in the absence of blocking [5-12]. When virtual cut-through or wormhole routing is deployed, message latency is modeled by the equation

        (L_f / B) × D + L / B    (1.1)

    where L_f is the length of each flit, B is the link bandwidth, D is the number of links (hops) traversed and L is the length of the message. In the absence of blocking, for sufficiently large messages (where L_f …
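    For reference, Eq. (1.1) can be evaluated directly; a small helper (illustrative only, with all quantities in consistent units):

```c
/* Modeled message latency under virtual cut-through / wormhole routing, Eq. (1.1):
   per-hop flit latency plus message serialization time. When L is much larger
   than L_f * D, the second term dominates and latency looks distance-independent. */
double modeled_latency(double Lf, double B, int D, double L)
{
    return (Lf / B) * (double)D + L / B;
}
```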

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Topology friendly supercomputers?

    • Topology aware resource allocation

    • Ability to query the software stack for network topology information:
      • At job allocation time/pre-launch
      • During runtime

    • Ability to change the mapping
      • Fixed shape partitions - mapping can be specified a priori (at job submission time)
      • At job allocation time - specify new mapping before job is launched

    • Topology aware MPI implementations
      • Optimized collectives

    15

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Survey of machines (top500.org)

    16

    Machine              | Network / topology         | Resource manager     | Topology friendly?
    K computer           | Tofu / 6D torus            |                      |
    Tianhe-1A            | Proprietary fat-tree       | SLURM                |
    Jaguar (XT5)         | SeaStar2+ / 3D torus       | Moab/TORQUE/ALPS     |
    Nebulae              | Infiniband                 |                      |
    Tsubame 2.0          | Infiniband                 | N1 Grid Engine/PBS   |
    Cielo (XE6)          | Gemini / 3D torus          | Moab/TORQUE/ALPS     |
    Pleiades (SGI Altix) | Infiniband / 11D hypercube | PBS                  |
    Tera 100             | Infiniband                 | SLURM                |
    RoadRunner           | Infiniband                 |                      |
    Jugene (BG/P)        | 3D torus                   | LoadLeveler          |

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Overall system utilization?

    17

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Resource Management in HPC

    • Manage incoming jobs in the queue (assign priorities) - job/batch scheduler

    • Launch/monitor jobs on the compute nodes - job launcher

    • Manage and allocate resources - workload/resource manager

    18

    Typically referred to as a resource manager or job scheduler

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Resource manager requirements

    • Resource allocator needs to be topology aware
      • contiguous partitions on 3D mesh/torus
      • nodes on the same switch for fat-trees

    • Job launcher should be able to launch MPI processes on specific nodes
      • On BG/P systems, mpirun can take a mapfile as input (a sketch of generating one follows below)
      • On Cray systems, aprun can take a list of node ids

    • Within-node mapping
      • Most job launchers support this through cpu/task affinity options

    19
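    As a sketch of the mapfile route mentioned above (hedged: BG/P mapfiles are commonly one "x y z t" line per MPI rank, but the exact format accepted by a given launcher should be checked against the system documentation), a custom rank-to-coordinate order can be generated offline and passed to the job launcher:

```c
#include <stdio.h>

/* Emit one torus coordinate per MPI rank, here in a Z-Y-X-T sweep; a different
   loop nest (or any permutation of coordinates) yields a different mapping. */
int main(void)
{
    const int X = 8, Y = 8, Z = 8, T = 4;   /* assumed partition shape and ranks per node */
    for (int z = 0; z < Z; z++)
        for (int y = 0; y < Y; y++)
            for (int x = 0; x < X; x++)
                for (int t = 0; t < T; t++)
                    printf("%d %d %d %d\n", x, y, z, t);
    return 0;
}
```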

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Mapping algorithms

    • Steps involved in mapping:
      • Collect processor topology information
      • Collect application topology information
      • Run the mapping algorithm
      • Communicate the new mapping to processes

    • Approaches:
      • Centralized
      • Completely distributed
      • Distributed with global view

    20

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Mapping Challenges

    • At exascale, we cannot store O(p) information on one processor

    • For asymmetric topologies, how to query/store the topology information in a scalable fashion (MPI topology routines)

    • How to collect/store the communication graph in a scalable fashion (MPI graph routines); see the sketch below

    • Centralized algorithms: O(n log n) or O(n) might be too slow

    • Distributed algorithms: convergence is slow

    21
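    One scalable building block here is the distributed graph constructor added in MPI-2.2, where each process specifies only its own neighbors instead of the full communication graph. A minimal sketch for a 1D ring pattern (illustrative only):

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process declares only its two ring neighbors: O(1) data per process. */
    int nbrs[2] = { (rank - 1 + size) % size, (rank + 1) % size };
    int weights[2] = { 1, 1 };

    MPI_Comm ring;
    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                   2, nbrs, weights,    /* in-edges  */
                                   2, nbrs, weights,    /* out-edges */
                                   MPI_INFO_NULL,
                                   1,                   /* reorder: allow the library to remap ranks */
                                   &ring);

    /* 'ring' can now be used with neighborhood collectives or queried for the
       (possibly reordered) rank placement. */
    MPI_Comm_free(&ring);
    MPI_Finalize();
    return 0;
}
```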

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Scalable mapping algorithms

    • Scalable algorithms: distributed algorithms with global view

    • as good as centralized algorithms?

    22

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Completely Distributed Mapping

    • Do the mapping in parallel
      • With some sense of global load distribution
      • Using parallel prefix (discussed in literature)

    • Map n objects communicating in a 1D ring pattern to a linear array of p processors

    • Map n objects communicating in a 2D stencil pattern to a 2D mesh of p processors

    23

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    • Object i communicates with i-1 and i+1

    • We want to make cuts in the 1D ring based on the loads of the objects

    • Each processor will then have only two external communication arcs

    24

    1D ring to a linear array

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    1D ring to a linear array

    • Perform a parallel prefix sum between objects and send total load to all objects

    25

    Figure 12.1: Prefix sum in parallel to obtain partial sums of the loads of all objects up to a certain object.

        Object loads (v1..v8):      5  9  1  2  3  1  6  4
        After step 1 (distance 1):  5 14 10  3  5  4  7 10
        After step 2 (distance 2):  5 14 15 17 15  7 12 14
        After step 3 (distance 4):  5 14 15 17 20 21 27 31

    prefix operation and then migrate to the respective processors.

    12.1.1 Complexity Analysis

    Let us compare the running time and memory requirements for the centralized versus

    completely distributed load balancing algorithms. Let us assume that there are n

    objects (or VPs) to be placed on p physical processors.

    In the centralized scheme, one processor stores all the information and hence

    the memory requirements are proportional to the number of objects, v. In the

    distributed case, each processor stores information about its objects which is v/p.

    In the centralized case, each processor sends a message to one processor with its

    load information, which leads to p messages of size v/p each. On the other hand, in

    the parallel prefix there are logv phases and v messages of constant size exchanged

    in each phase. These comparisons are summarized in Table 12.1 below.

    In the centralized case, if we assume that the fastest algorithm for load balancing

    will have a linear running time, the time complexity for the decision making can be

    119

    the basic technique in a simpler context.

    2. The second scenario is where we have a two-dimensional array of objects where

    each object communicates with two immediate neighbors in its row and col-

    umn. We wish to map this group of objects on to a 2D mesh of processors.

    12.1 Mapping of a 1D Ring

    Problem: Load balancing a 1D array of v objects which communicate in a ring

    pattern to a 1D linear array of p processors.

    Solution: We want to map these objects on to processors while considering the

    load of each object and the communication patterns among the objects. In order to

    optimize communication, we want to place objects next to each other on the same

    processor as much as possible and cross processor boundaries only for ensuring load

    balance. We assume that the IDs of objects denote the nearness in terms of who

    communicates with whom. Hence the problem reduces to finding contiguous groups

    of objects in the 1D array such that the load on all processors is nearly the same.

    We arrange the objects virtually by their IDs and perform a prefix sum in parallel

    between them based on the object loads. At the conclusion of a prefix sum, every

    object knows the sum of loads of all objects that appear before it (Figure 12.1).

    Then the last object broadcasts the sum of loads of all objects so that every object

    knows the global load of the system. Each object i, can calculate its destination

    processor (di), based on the total load of all objects (Lv), prefix sum of loads up to

    it (Li), its load (li) and the total number of processors (p), by this equation,

    di = �p ∗Li − li/2

    Lv� (12.1)

    So, every object can decide its destination processor in parallel through a parallel

    118

    Li = Prefix sumLv = Sum of all loadsp = no. of pesli = load of objectdi = destination pe

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    1D ring to a linear array

    • Perform a parallel prefix sum over the objects and broadcast the total load to all objects

    Figure 12.1: Prefix sum in parallel to obtain partial sums of the loads of all objects up to a certain object (initial loads followed by the three doubling steps):

                 v1   v2   v3   v4   v5   v6   v7   v8
      loads       5    9    1    2    3    1    6    4
      step 1      5   14   10    3    5    4    7   10
      step 2      5   14   15   17   15    7   12   14
      step 3      5   14   15   17   20   21   27   31

    ... the basic technique in a simpler context.

    2. The second scenario is where we have a two-dimensional array of objects where each object communicates with two immediate neighbors in its row and column. We wish to map this group of objects onto a 2D mesh of processors.

    12.1 Mapping of a 1D Ring

    Problem: Load balancing a 1D array of v objects which communicate in a ring pattern onto a 1D linear array of p processors.

    Solution: We want to map these objects onto processors while considering the load of each object and the communication patterns among the objects. In order to optimize communication, we want to place objects next to each other on the same processor as much as possible and cross processor boundaries only to ensure load balance. We assume that the IDs of the objects denote nearness in terms of who communicates with whom. Hence the problem reduces to finding contiguous groups of objects in the 1D array such that the load on all processors is nearly the same.

    We arrange the objects virtually by their IDs and perform a prefix sum in parallel between them based on the object loads. At the conclusion of the prefix sum, every object knows the sum of the loads of all objects that appear before it (Figure 12.1). Then the last object broadcasts the sum of the loads of all objects so that every object knows the global load of the system. Each object i can calculate its destination processor d_i, based on the total load of all objects (L_v), the prefix sum of loads up to and including it (L_i), its own load (l_i), and the total number of processors (p), by this equation:

        d_i = ⌊ p × (L_i − l_i / 2) / L_v ⌋    (12.1)

    So, every object can decide its destination processor in parallel through a parallel prefix operation and then migrate to the respective processor.

    12.1.1 Complexity Analysis

    Let us compare the running time and memory requirements of the centralized versus the completely distributed load balancing algorithms. Assume that there are v objects (or VPs) to be placed on p physical processors.

    In the centralized scheme, one processor stores all the information, and hence the memory requirement is proportional to the number of objects, v. In the distributed case, each processor stores information only about its own objects, which is v/p. In the centralized case, each processor sends a message with its load information to one processor, which leads to p messages of size v/p each. In the parallel prefix, on the other hand, there are log v phases, with v messages of constant size exchanged in each phase. These comparisons are summarized in Table 12.1 of the thesis. In the centralized case, if we assume that the fastest algorithm for load balancing has a linear running time, the time complexity for the decision making can be ...

    L_i = prefix sum    L_v = sum of all loads    p = no. of PEs    l_i = load of object    d_i = destination PE

    • Each object now decides which processor it should be on
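    To make Equation 12.1 concrete, here is a minimal sequential sketch that applies the formula to the loads from Figure 12.1, assuming p = 4 processors (an arbitrary choice for illustration). In the actual scheme the prefix sum runs in parallel over the objects and every object evaluates its destination independently.

    // Sequential sketch of Equation 12.1: given per-object loads, compute
    // each object's destination processor from its inclusive prefix sum.
    #include <cstdio>
    #include <vector>

    int main() {
        // Loads from Figure 12.1; p = 4 processors is an example value.
        std::vector<double> load = {5, 9, 1, 2, 3, 1, 6, 4};
        int p = 4;

        // Inclusive prefix sums L_i and the total load L_v.
        std::vector<double> prefix(load.size());
        double total = 0.0;
        for (size_t i = 0; i < load.size(); ++i) {
            total += load[i];
            prefix[i] = total;          // L_i = sum of loads of objects 1..i
        }

        // d_i = floor( p * (L_i - l_i/2) / L_v )  -- Equation 12.1
        for (size_t i = 0; i < load.size(); ++i) {
            int dest = (int)(p * (prefix[i] - load[i] / 2.0) / total);
            std::printf("object v%zu (load %g) -> processor %d\n",
                        i + 1, load[i], dest);
        }
        // Expected mapping: v1 -> 0; v2,v3 -> 1; v4,v5,v6 -> 2; v7,v8 -> 3,
        // i.e. contiguous groups with per-processor loads 5, 10, 6, 10.
        return 0;
    }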


    2D stencil to 2D mesh

    • Linearize using a space filling curve
    • Perform a 1D parallel prefix to obtain the linear index of the processor

    Solution 2: Another solution is to use a parallel prefix in 1D as we did in the previous section. To do this, we have to linearize the objects in some fashion. Space filling curves can be used to map the 2D object grid to a 1D line [87, 88]. Space filling curves preserve the neighborhood properties of the objects in 2D. Figure 12.2 shows the linearization of an object grid of dimensions 32 × 32 and a processor grid of dimensions 8 × 8.

    Figure 12.2: Hilbert order linearization of an object grid of dimensions 32 × 32 and a processor mesh of dimensions 8 × 8

    Having linearized the object grid, we can perform a parallel prefix on the 1D array of objects and obtain a destination processor for each object. This processor number is the linearized index of each processor if we create a space filling curve for the 2D processor mesh. Hence, based on this linearized index, we can obtain the x and y coordinates of the processor by decoding with the same logic used for generating space filling curves. The hope is that nearby objects in the 2D object graph end up close to one another on the 2D processor mesh.


    • Decode (x, y) of processor using a space filling curve
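    For the decoding step, the sketch below uses the classic Hilbert index-to-coordinate conversion (a standard bit-manipulation routine, not code from the thesis), assuming the mesh side is a power of two, e.g. the 8 × 8 processor mesh of Figure 12.2. An object whose prefix step assigns it linear index d is then placed on the processor at the decoded (x, y).

    // Standard Hilbert-curve conversion from a 1D index d to coordinates
    // (x, y) on an n x n grid, n a power of two. Run over all indices it
    // reproduces the linearization order of the mesh.
    #include <cstdio>

    static void rotate(int s, int &x, int &y, int rx, int ry) {
        // Rotate/flip a quadrant so the curve's orientation is preserved.
        if (ry == 0) {
            if (rx == 1) {
                x = s - 1 - x;
                y = s - 1 - y;
            }
            int t = x; x = y; y = t;   // swap x and y
        }
    }

    // Decode a linear index d (0 <= d < n*n) into (x, y) on an n x n Hilbert curve.
    static void d2xy(int n, int d, int &x, int &y) {
        x = y = 0;
        for (int s = 1; s < n; s *= 2) {
            int rx = 1 & (d / 2);
            int ry = 1 & (d ^ rx);
            rotate(s, x, y, rx, ry);
            x += s * rx;
            y += s * ry;
            d /= 4;
        }
    }

    int main() {
        // Example: decode every linearized processor index of an 8 x 8 mesh.
        const int n = 8;
        for (int d = 0; d < n * n; ++d) {
            int x, y;
            d2xy(n, d, x, y);
            std::printf("index %2d -> processor (%d, %d)\n", d, x, y);
        }
        return 0;
    }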


    Running Time

    • Benefits from distributed mapping:
      • Memory usage per processor: O(v/p) compared to O(v)
      • No communication bottleneck at a single processor
      • Faster decision time: O(log p)

    • Time (in seconds) for load balancing on Cray XT5:

        # objects      4,096 cores    16,384 cores
        1 million          1.81            6.77
        1 million          3.29            3.49
        4 million         12.96            9.59
        4 million         12.65            5.68


    Results: 1 million objects on 4k cores

    [Two charts comparing the Random, Default, and Topology-aware mappings over five trials: hops per byte on a log scale (0.1-100), captioned "Reduction in Hop-bytes", and the ratio of maximum to average load (0-1.4), captioned "Ratio of maximum to average load".]

    d_i = distance (hops)    b_i = bytes    n = no. of messages

    5 Hop-bytes as an Evaluation Metric

    The volume of inter-processor communication can be characterized by the hop-bytes metric, which is the weighted sum of message sizes where the weights are the number of hops (links) traveled by the respective messages. Hop-bytes can be calculated by the equation

        HB = Σ_{i=1..n} d_i × b_i    (5.1)

    where d_i is the number of links traversed by message i, b_i is the message size in bytes for message i, and the summation is over all messages sent.

    Hop-bytes is an indication of the average communication load on each link of the network. This assumes that the application generates nearly uniform traffic over all links in the partition. The metric does not give an indication of hot-spots generated on specific links of the network, but it is an easily derivable metric and correlates well with actual application performance.

    In VLSI circuit design and early parallel computing work, emphasis was placed on another metric called maximum dilation, which is defined as

        d(e) = max{ d_i | e_i ∈ E }    (5.2)

    where d_i is the dilation of edge e_i. This metric aims at minimizing the longest length of a wire in a circuit. We claim that reducing the largest number of links traveled by any message is not as critical as reducing the average hops across all messages.
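    As an illustration of Equations 5.1 and 5.2, the sketch below computes hop-bytes and maximum dilation for a set of messages mapped onto a 3D torus, taking the per-message hop count d_i to be the Manhattan distance with wraparound. The torus dimensions and message list are made-up examples, not data from the slides.

    // Sketch: hop-bytes (Eq. 5.1) and maximum dilation (Eq. 5.2) for a set
    // of messages on a 3D torus. Dimensions and messages are illustrative.
    #include <algorithm>
    #include <array>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    using Coord = std::array<int, 3>;

    // Hops between two nodes of a torus: Manhattan distance with wraparound.
    int torus_hops(const Coord &a, const Coord &b, const Coord &dims) {
        int hops = 0;
        for (int k = 0; k < 3; ++k) {
            int d = std::abs(a[k] - b[k]);
            hops += std::min(d, dims[k] - d);   // shorter way around the ring
        }
        return hops;
    }

    struct Message { Coord src, dst; long bytes; };

    int main() {
        Coord dims = {8, 8, 8};                 // 8 x 8 x 8 torus (example)
        std::vector<Message> msgs = {
            {{0, 0, 0}, {1, 0, 0}, 4096},       // nearest neighbor: 1 hop
            {{0, 0, 0}, {7, 0, 0}, 4096},       // wraps around the torus: 1 hop
            {{2, 3, 1}, {6, 6, 5}, 65536},      // long-distance message
        };

        long long hop_bytes = 0;                // HB = sum_i d_i * b_i
        int max_dilation = 0;                   // d(e) = max_i d_i
        for (const Message &m : msgs) {
            int d = torus_hops(m.src, m.dst, dims);
            hop_bytes += (long long)d * m.bytes;
            if (d > max_dilation) max_dilation = d;
        }
        std::printf("hop-bytes = %lld, max dilation = %d hops\n",
                    hop_bytes, max_dilation);
        return 0;
    }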


    Summary

    • Topology awareness required at various levels:
      • Resource managers (allocator, job launcher): topology aware scheduling
      • MPI implementations (optimized collectives)
      • Software support for querying network topology
      • Distributed graph representations for communication structure

    • Scalable distributed algorithms
      • With a global view of load and a partial view of the communication structure

  • This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551

    LLNL-PRES-529376

    SIAM  Conference  on  Parallel  Processing  ◆  February  15,  2012

    Questions?

    Abhinav Bhatele, Automating Topology Aware Mapping for Supercomputers, PhD Thesis, Department of Computer Science, University of Illinois.

    http://charm.cs.uiuc.edu/research/topology
