
  • This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551

    LLNL-PRES-529376

    SIAM Conference on Parallel Processing ◆ February 15, 2012

    Topology aware resource allocation and mapping challenges at exascale

    Abhinav Bhatele and Laxmikant V. Kale

  • IBM Blue Gene

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Qbox: FPMD Simulations

    • Two-dimensional process grid

    • Collective communication over rows and columns

    • MPI_Bcast (14.5%) and MPI_Allreduce (3.7%) on 8K nodes

    3

    performance of MPI_Bcast, since this dominates communication costs in Qbox (see Table 1). Figure 2 shows the communication pattern of a single broadcast on a 4x4 plane of BG/L nodes using three eight-node communicators. Broadcasts over a compact rectangle of nodes (left panel), which use the torus network's broadcast functionality, have the most balanced packet counts as well as the lowest maximum count. When we split the nodes across multiple lines, resulting in disjoint sets of nodes (middle panel), the communication requires significantly more packets with less balanced link utilization. The node mappings in Figures 1(c) and 1(d), on the other hand, lead to a more balanced link utilization (as illustrated in Figure 2, right panel) and hence to higher overall performance.

    Figure 1. Illustration of different node mappings for a 64k-node partition, panels (a)-(d). Each color represents the nodes belonging to one 512-node column of the process grid.

    This analysis of Qbox communication led to several node mapping optimizations. In particular, our initial mappings did not optimize the placement of tasks within communicators. We have refined the bipartite mapping shown in Figure 1(c) to map tasks with a variant of Z ordering within a plane. This modification effectively isolates the subtree of a binomial software-tree broadcast within subplanes of the torus. In addition, we observed that substantial time was spent in MPI_Type_commit calls in the BLACS library. The types being created were in many cases just contiguous native types, which allowed us to hand-optimize BLACS to eliminate the calls in those cases. We are investigating other possible optimizations, including multi-level collective operations [13] that could substantially improve the performance of the middle configuration in Figure 2.
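    As a minimal illustration of this kind of hand-optimization (not the actual BLACS code; the function names below are hypothetical), a derived type that is just a contiguous run of native elements can be replaced by sending a count of the native type directly, avoiding the type creation and commit cost:

```c
#include <mpi.h>

/* Before: build and commit a contiguous derived type, then send one instance of it. */
void send_block_with_type(const double *buf, int count, int dest, MPI_Comm comm) {
    MPI_Datatype block;
    MPI_Type_contiguous(count, MPI_DOUBLE, &block);
    MPI_Type_commit(&block);                 /* this commit cost is what showed up in profiles */
    MPI_Send(buf, 1, block, dest, 0, comm);
    MPI_Type_free(&block);
}

/* After: the type was just 'count' contiguous doubles, so send them directly. */
void send_block_native(const double *buf, int count, int dest, MPI_Comm comm) {
    MPI_Send(buf, count, MPI_DOUBLE, dest, 0, comm);
}
```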

    Francois Gygi et al. Large-scale electronic structure calculations of high-Z metals on the BlueGene/L platform. In Proceedings of the 2006 ACM/IEEE conference on Supercomputing (SC '06). ACM, New York, NY, USA

    Measured Qbox performance for the four node mappings of Figure 1: 39.5 TFlop/s, 38.2 TFlop/s, 64.0 TFlop/s, and 64.7 TFlop/s.

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    OpenAtom: ab-initio MD

    • Transposes, multicasts and reductions

    • Significant contention on the network

    4


    Figure 2. Mapping of different chare arrays to the 3D torus of the machine

    The number of planes in GSpace is different from that in RealSpace. GSpace also interacts with the PairCalculator arrays. Each plane of GSpace, G(∗, p), interacts with the corresponding plane, P(∗, ∗, p), of the PairCalculators (plane-wise communication) through multicasts and reductions. So, GSpace interacts state-wise with RealSpace and plane-wise with PairCalculators. If all planes of GSpace are placed together, then the transpose operation is favored, but if all states of GSpace are placed together, the multicasts/reductions are favored. To strike a balance between the two extremes, a hybrid map is built, where a subset of planes and states of these three arrays are placed on one processor.

    Mapping GSpace and RealSpace Arrays: Initially, the GSpace array is placed on the torus and other objects are mapped relative to GSpace's mapping. The 3D torus is divided into rectangular boxes (which will be referred to as "prisms") such that the number of prisms is equal to the number of planes in GSpace. The longest dimension of the prism is chosen to be the same as one dimension of the torus. Each prism is used for all states of one plane of GSpace. Within each prism for a specific plane, the states in G(∗, p) are laid out in increasing order along the long axis of the prism. Once GSpace is mapped, the RealSpace objects are placed. Prisms perpendicular to the GSpace prisms are created, formed by including the processors holding all planes for a particular state of GSpace, G(s, ∗), and the corresponding states of RealSpace, R(s, ∗), are mapped on to these prisms. Figure 2 shows the GSpace objects (on the right) and the RealSpace objects (in the foreground) being mapped along the long dimension of the torus (box in the center).

    Mapping of Density Arrays: RhoR objects communicate with RealSpace plane-wise and hence Rρ(p, ∗) have to be placed close to R(∗, p). To achieve this, we start with the centroid of the prism used by R(∗, p) and place RhoR objects in proximity to it. RhoG objects, Gρ(p), are mapped near RhoR objects, Rρ(p, ∗), but not on the same processors as RhoR, to maximize overlap. The density computation is inherently smaller and hence occupies the center of the torus.
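    A hypothetical sketch of the prism idea follows (this is not the OpenAtom code; the function and its parameters are illustrative). It tiles the X-Y face of an X x Y x Z torus with one prism per GSpace plane, uses the Z dimension as the prisms' long axis, and lays out the states of each plane in increasing order along that axis:

```c
/* Assumes gx * gy equals the number of GSpace planes, gx divides X, and gy divides Y. */
void place_gspace_object(int state, int plane,
                         int X, int Y, int Z, int gx, int gy,
                         int *x, int *y, int *z)
{
    int bx = X / gx, by = Y / gy;           /* cross-section of each prism */
    int px = plane % gx, py = plane / gx;   /* which prism holds this plane */
    *z = state % Z;                         /* walk the long (Z) axis first */
    int slot = state / Z;                   /* then wrap into the cross-section */
    *x = px * bx + slot % bx;
    *y = py * by + (slot / bx) % by;
}
```

    The RealSpace states R(s, ∗) would then be placed on prisms running perpendicular to these, as the excerpt above describes.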


    [Plot: time per step (s) versus number of cores (1024 to 8192), comparing Default Mapping and Topology Mapping.]

    Abhinav Bhatele, Eric Bohm, and Laxmikant V. Kale. Optimizing communication for Charm++ applications by reducing network contention. Concurrency and Computation: Practice and Experience, 23(2):211-222, February 2011.

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    ddcMD: PPPM algorithm

    • Optimize communication within short-range and long-range and also between the two

    • Reduction in comm. time:
      • short-range: 3 times
      • long-range (FFT): 2 times
      • between the two: 50%

    • Overall reduction: 36% on 32K nodes (64 x 32 x 16)

    5

    Figure 2: Task layout on the BlueGene/P torus. Cyan arrows indicate communication from MD tasks to collector tasks, blue arrows indicate communication from collector tasks to FFT tasks.

    …is the reduction of the contributions to ρ at each point from these multiple tasks to a single sum.

    To perform the reduction we nominate a subset of the particle tasks as "collector" tasks (see Fig. 2). Each mesh point is uniquely assigned to a collector task that is responsible for gathering all contributions to ρ for that mesh point and performing the sum. The number and arrangement of the collector tasks is a tunable parameter and all communication is local.

    In Stage 2 of the communication, each collector task sends mesh information to the appropriate mesh task using MPI_Isend. We think of this stage as a gather operation since the mesh data is being gathered to the mesh tasks. This communication is long-range, but can be efficiently organized.

    Once the long-range portion of the potential has been calculated, we perform the communication stages again but in reverse order. The mesh tasks scatter mesh data back to the collector tasks, which in turn send values to the neighboring particle tasks. To maximize the overlap of communication and computation, the MPI_Irecvs on the collector tasks are posted before we start the pair calculation. This allows data to begin moving from the mesh tasks as soon as it is available.
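    The overlap pattern just described can be sketched as follows (a hedged illustration, not the ddcMD source; the buffer size, tag, and pair_computation callback are placeholders):

```c
#include <mpi.h>

void collector_step(double *mesh_buf, int n_mesh_vals, int mesh_task_rank,
                    MPI_Comm comm, void (*pair_computation)(void))
{
    MPI_Request req;

    /* Post the receive before starting local work, so mesh data can flow
       (e.g. via the BG/P DMA engine) while the CPU computes. */
    MPI_Irecv(mesh_buf, n_mesh_vals, MPI_DOUBLE, mesh_task_rank, 0, comm, &req);

    pair_computation();                   /* explicit short-range pair computation */

    MPI_Wait(&req, MPI_STATUS_IGNORE);    /* mesh data is now available locally */
}
```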

    In benchmark calculations on Dawn using 144,384 processors (9,216 mesh tasks and 135,168 particle tasks, of which 22,400 are collector tasks) we observe that when work is properly balanced between the particle and mesh tasks there is a pronounced asymmetry in the communication times between the particle and mesh tasks. Although the mesh tasks spend roughly 15% of total runtime waiting for data to arrive, the particle tasks spend under 2% of the total runtime sending it. Asynchronous communication allows the particle tasks to continue with other work while the communication proceeds. Hence, we have successfully accomplished the goal of minimizing communication time on the particle tasks by overlapping the communication of mesh data with the explicit pair computation.

    This success demonstrates the effectiveness of the direct memory access (DMA) engine that was added to the BG/P design as an improvement over BG/L [23]. The DMA is directly coupled to the L3 (shared) cache on each node and is responsible for sending and receiving data to and from the torus network. The CPU is thus relieved of these tasks and is free to continue on to other computations. From comparisons with benchmark simulations performed on BG/L, it is clear that there is a significant benefit from the DMA.

    The two-stage approach just described has at least two advantages over a single-stage method in which each particle task simply sends all of the mesh points it populates to the appropriate mesh tasks and the reduction of partial sums is performed on the mesh tasks. The first advantage is a reduction of communication bandwidth from the particle to the mesh tasks. Although the number of mesh points to which a particle task contributes varies with n_g, for our typical problems of interest it is roughly 2-5 times the number that lie strictly within its computational domain. Hence a single-stage solution would require 2-5 times the network bandwidth to complete communication in the same time. In the two-stage approach the mesh points are gathered and reduced locally, so a larger number of torus links can be active, increasing the aggregate bandwidth available to communicate mesh points. A second advantage is that the number of collector tasks can be tuned to optimize total communication cost. Changing the number of collector tasks allows trade-offs between the number of messages sent to each mesh task in Stage 2 (with corresponding changes in message size) and the bandwidth available for the reduction in Stage 1.

    4.3 Layout

    For a 3D torus network, the assignment of MPI tasks onto compute nodes at specific torus coordinates can significantly impact parallel efficiency at full machine scale. It is necessary to optimize communication both within the short-range and long-range subcommunicators, as well as between the two. For the latter, we focused on splitting the torus into separate sections for the particle and mesh tasks such that communication between the "collector" tasks described in Section 4.2 and the mesh tasks takes place along a single torus dimension to reduce contention and avoid bottlenecks, as illustrated in Figure 2. The tasks are then ordered within each subcommunicator to provide nearest-neighbor communication for spatially adjacent particle tasks and reduce transpose communication times for the mesh tasks.

    For a system of 1.2 billion particles on a 64 × 32 × 16 Blue Gene/P partition (32,768 nodes, 131,072 tasks), we see a significant decrease in communication times using a custom task map constructed as described above compared with the default (TXYZ) layout. The total run time decreased by 36% when the custom mapping was used, with the greatest improvement being seen in the intra-particle task communication times, which decreased by a factor of 3, likely due to the poor correspondence of the default mapping to the simulation box shape. Communication within the 3D FFT was decreased by over a factor of 2, and communication between the particle and mesh tasks was more than 50% faster. These results highlight the need to carefully understand and optimize the communication patterns on large torus networks.

    4.4 FFT Implementation

    As described in Section 2.1, the long-range interaction term in Eqn. 3 involves the use of a 3D Fourier transform between the real-space (real-valued) density and the k-space (complex-valued) density. To obtain optimal 3D FFT performance in the massively-parallel regime, a custom real-to-complex 3D Fast Fourier Transform implementation (bigFFT) was developed using a 2D decomposition…

    D. F. Richards et al. Beyond homogeneous decomposition: scaling long-range forces on Massively Parallel Systems. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09). ACM, New York, NY, USA

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Other applications

    • Blue Matter - Blake G. Fitch et al. Blue matter: approaching the limits of concurrency for classical molecular dynamics. In Proceedings of the 2006 ACM/IEEE conference on Supercomputing (SC '06). ACM, New York, NY, USA

    • NAMD - Abhinav Bhatele et al. Dynamic topology aware load balancing algorithms for molecular dynamics applications. In Proceedings of the 23rd international conference on Supercomputing (ICS '09). ACM, New York, NY, USA

    • NAS BT, CG, MG - Brian E. Smith et al. Performance Effects of Node Mappings on the IBM Blue Gene/L Machine. In Euro-Par, pages 1005–1013, 2005

    • SAGE, UMT2000 - G. Bhanot et al. Optimizing task layout on the Blue Gene/L supercomputer. IBM Journal of Research and Development, 49(2/3):489–500, 2005

    • GTC Particle-in-cell - Leonid Oliker et al. Scientific Application Performance on Candidate PetaScale Platforms. In Proceedings of IEEE Parallel and Distributed Processing Symposium (IPDPS), March 2007

    6

  • Cray XT/XE

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    pF3D

    • Slowest communicators are those that are most spread out on the network

    • Runs on Cielo - Cray XE6 at LANL

    8

    Fig. 4: Message passing during 2D FFTs for a single xy-slab. (a) Communication patterns for xyzt in the x-direction, color coded by communicator; only links in a single direction are used. (b) Communication for block-mapped data, color coded by communicator; the communicating nodes are distributed within the 4x4x4 block and use links in multiple directions. (c) xyzt in the y-direction uses vertical links. (d) Y-communication for the block-mapped scheme uses links in multiple directions.

    Fig. 5: Message passing for y-communication on Cielo. (a) The locations on the Cielo torus are shown for the five fastest y-communicators. (b) The locations on the Cielo torus are shown for the five slowest y-communicators. The slowest communicators are all "long and skinny" and are thus subject to contention for bandwidth along the links in the z-direction.

    The Cielo message passing performance doesn't show a strong trend as a function of the number of processes. The variability among the three runs with 1024 processes appears to be bigger than differences due to the number of processes. All runs used a 16x16xN decomposition so that X messages are all passed on node. All Y messages go off node and are significantly slower than the X messages.

    The message passing rate during the 32K process Cielo run varied between 80 and 100 MB/s for different batch jobs. The variability appears to be due to the placement of the communicators on the interconnect. The run used a 16x16x128 decomposition. The message passing rate for Y messages is probably about 1 GB/s per node. That is significantly less than the roughly 5 GB/s rate that MPI benchmarks achieve.

    We investigated the effect of varying the MPI eager limit. It made a noticeable difference in smaller runs but does not appear to help for large runs. The likely explanation is that a large run is statistically very likely to have a communicator that is slow, and that masks any effects of varying the MPI parameter. We plan to investigate custom mappings between the physical location of domains and their placement on the interconnect in future work.
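    To make the on-node/off-node distinction in the excerpt concrete, here is a small sketch (my illustration, not from the paper). It assumes an x-major rank ordering and 16 ranks per node (a Cielo XE6 node has 16 cores), so each 16-rank x-line of the 16x16xN decomposition maps to one node: every X message stays on node, while every Y message goes off node.

```c
#include <stdio.h>

int main(void) {
    const int NX = 16, NY = 16, NZ = 8;   /* 16x16xN decomposition; NZ = 8 is illustrative */
    const int ranks_per_node = 16;        /* assumed: one x-line of ranks per node */
    int off_node_x = 0, off_node_y = 0;

    for (int z = 0; z < NZ; z++)
        for (int y = 0; y < NY; y++)
            for (int x = 0; x < NX; x++) {
                int rank = (z * NY + y) * NX + x;              /* x-major rank ordering */
                int node = rank / ranks_per_node;
                if (x + 1 < NX &&
                    ((z * NY + y) * NX + (x + 1)) / ranks_per_node != node)
                    off_node_x++;                              /* +x neighbor lands off node */
                if (y + 1 < NY &&
                    ((z * NY + (y + 1)) * NX + x) / ranks_per_node != node)
                    off_node_y++;                              /* +y neighbor lands off node */
            }

    printf("off-node +x messages: %d, off-node +y messages: %d\n",
           off_node_x, off_node_y);
    return 0;
}
```

    With these assumptions the count of off-node X messages is zero and every Y message crosses a node boundary, matching the observation above.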


    Steven H. Langer et al. Cielo Full-System Simulations of Multi-Beam Laser-Plasma Interaction in NIF Experiments. In Proceedings of the Cray User Group, 2011.

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    OpenAtom: ab-initio MD

    9

    Abhinav Bhatele, Eric Bohm, and Laxmikant V. Kale. Optimizing communication for Charm++ applications by reducing network contention. Concurrency and Computation: Practice and Experience, 23(2):211-222, February 2011.

    [Plot: time per step (s) versus number of cores (512 to 2048), comparing Default Mapping and Topology Mapping.]

    • Job schedulers on Cray machines are typically not topology aware

    • Performance benefit at 2048 cores: 40% (XT3), 45% (BG/P), 41% (BG/L)

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Other work

    • M. Muller and Michael Resch. PE mapping and the congestion problem in the T3E. In Proceedings of the Fourth European Cray-SGI MPP Workshop, Garching, Germany, 1998

    • Thierry Cornu and Michel Pahud. Contention in the Cray T3D Communication Network. In Euro-Par '96: Proceedings of the Second International Euro-Par Conference on Parallel Processing - Volume II, pages 689-696, London, UK, 1996. Springer-Verlag

    • Eduardo Huedo, Manuel Prieto, Ignacio Martin Llorente and Francisco Tirado. Impact of PE Mapping on Cray T3E Message-Passing Performance. In Euro-Par '00: Proceedings from the 6th International Euro-Par Conference on Parallel Processing, pages 199-207, London, UK, 2000. Springer-Verlag

    • Deborah Weisser, Nick Nystrom, Chad Vizino, Shawn T. Brown, and John Urbanic. Optimizing Job Placement on the Cray XT3. 48th Cray User Group Proceedings, 2006

    10

  • Infiniband clusters

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Optimizing collective communication

    12

    D. K. Panda, K. Schulz, B. Barth and A. Majumdar. Topology-Aware MPI Communication and Scheduling for Petascale Systems. Poster, NSF STCI. https://confluence.pegasus.isi.edu/download/attachments/5242944/topology-aware-poster.pdf

    Topology-Aware MPI Communication and Scheduling for Petascale Systems
    PIs: D. K. Panda (The Ohio State University), K. Schulz and B. Barth (Texas Advanced Computing Center), and A. Majumdar (San Diego Supercomputer Center)

    [Poster sections: Motivation; Vision and Problem Statement; Framework and Approach; Network Topology of TACC Ranger; Current Job Allocation Strategies and their Impact: A Case Study with TACC Ranger (job allocation for the entire system; jobs using 16-4800, 4800-16000, and 16000-64000 cores); Application Level Performance Impact: A Case Study with MPCUGLES; Topology Aware MPI_Gather Design; Topology Aware MPI_Scatter Design; Conclusions and Continuing Work. Table 1: Data Collected from TACC Ranger System.]

    Modern networks (like InfiniBand and 10 GigE) are capable of providing topology and routing information

    Research Challenge:

    Can the next-generation petascale systems provide topology-aware MPI communication, mapping and scheduling which can improve performance and scalability for a range of scientific applications?

    • On the left, we compare performance of MPCUGLES on the Normal Batch Queue as compared to runs conducted on an exclusive queue

    • We observe that performance may be impacted by up to 15%

    • On the right, we compare performance of MPCUGLES on the Normal Batch Queue, but with special randomization of hostfiles

    • We observe that there is greater variance in performance, with up to 16% difference between best-case and worst-case runs

    [Framework diagram: MPI applications (turbulence prediction, earthquake modeling, flow modeling, kinetic simulation) pass application hints to a unified abstraction layer containing a topology information interface (topology graph, network status graph) and a dynamic state & topology management framework; an enhanced subnet management layer over the high-performance interconnect and the Ethernet management network performs system topology discovery and traffic monitoring; the job scheduler performs topology-aware scheduling with performance feedback and profiling information; topology-aware task mapping and topology-aware communication (collectives and point-to-point) feed an integrated evaluation.]

    [Diagram: Racks 1 through 82, each connected to InfiniBand Switch (Magnum) 1 and InfiniBand Switch (Magnum) 2.]

    • Ranger's compute nodes are a blade-based configuration

    • 12 blades in a chassis, 4 chassis in a rack, 82 racks in all, for a total of 4096 compute nodes

    • Each chassis embeds a NEM (network express module) combining a 24-port leaf switch with 12 dual-rail SDR HCAs, one per node

    • Each NEM is connected to the core Magnum switch(es) with four 12X connectors, two to each Magnum

    [Plots: MPCUGLES performance normalized to the exclusive queue (y-axis 0 to 1.3) for individual runs on 192 cores; left: batch queue with normal ordering (8 runs); right: batch queue with random ordering (6 runs).]

    Research Questions:

    (1) What are the topology aware communication and scheduling requirements of petascale applications?

    (2) How to design a network topology and state management framework with static and dynamic network information?

    (3) How to design topology-aware point-to-point and collective communication schemes?

    (4) How to design topology-aware task mapping and scheduling schemes?

    (5) How to design a flexible topology information interface?

    • On the left, we compare performance of the default "Binomial Tree" algorithm under quiet and busy conditions

    • We observe that the algorithm is impacted by background traffic

    • On the right, we compare performance of the proposed topology-aware algorithm under quiet and busy conditions

    • The proposed algorithm outperforms the default "Binomial Tree" algorithm under both quiet and busy conditions

    • 23% performance improvement under quiet network conditions and 10% under busy conditions

    • The graphs present the analysis of the jobs run on the TACC Ranger system in September '09

    • There were a total of 19,441 multi-node jobs, most of which used 16-4800 cores

    • We observe that for the majority of jobs, the average inter-node distance is significantly more than the best possible

    • Modern high-end computing (HEC) systems enable scientists to tackle grand challenge problems

    • Design and deployment of such ultra-scale HEC systems is being fueled by the increasing use of multi-core/many-core architectures and commodity networking technologies like InfiniBand.

    • As a recent example, the TACC Ranger system was deployed with a total of 62,976 cores using a fat-tree InfiniBand interconnect to provide a peak performance of 579 TFlops.

    • Most current petascale applications are written using the Message Passing Interface (MPI) programming model.

    • By necessity, large-scale systems that support MPI are built using hierarchical topologies (multiple levels involving intra-socket, intra-node, intra-blade, intra-rack, and multi-stages within a high-speed switch).

    • Current generation MPI libraries and schedulers do not take into account these various levels for optimizing communications

    • Consequently, this leads to non-optimal performance and scalability for many applications.

    Process Location | Number of Hops             | MPI Latency (us)
    Intra-Rack       | 0 Hops in Leaf Switch      | 1.57
    Intra-Chassis    | 1 Hop in Leaf Switch       | 2.04
    Inter-Chassis    | 3 Hops Across Spine Switch | 2.45
    Inter-Rack       | 5 Hops Across Spine Switch | 2.85

    Results of current work:

    • We have observed a major impact on end applications if schedulers and communication libraries are not topology aware

    • We have proposed topology aware collective communication algorithms for MPI_Gather and MPI_Scatter

    • Our proposed algorithms outperform the default implementation under both network-quiet and busy conditions

    Continuing work:

    • Work towards a topology-aware scheduling scheme

    • Adapt more collective algorithms dynamically according to topology, interfacing with schedulers

    • Gather more data from real-world application runs

    • Integrated solutions will be available in future versions of the MVAPICH/MVAPICH2 software

    Publications:

    • K. Kandalla, H. Subramoni and D. K. Panda, "Designing Topology-Aware Collective Communication Algorithms for Large Scale InfiniBand Clusters: Case Studies with Scatter and Gather", Communication Architecture for Clusters (CAC) Workshop, in conjunction with IPDPS 2010

    Additional Personnel:

    • Hari Subramoni, Krishna Kandalla, Sayantan Sur, Karen Tomko (OSU)

    • Mahidhar Tatineni, Yifeng Cui, Dmitry Pekurovsky (SDSC)

    • We conduct experiments with MVAPICH2 stack, which is one of the most popular MPI implementations over InfiniBand, currently used by more than 1,050 organizations worldwide (http://www.mvapich.cse.ohio-state.edu)

    • On the left, we compare performance of the default "Binomial Tree" algorithm under quiet and busy conditions

    • We observe that the default algorithm is very sensitive to background traffic, with degradation of up to 21% for large messages

    • On the right, we compare performance of the proposed topology-aware algorithm

    • The proposed algorithm outperforms the default "Binomial Tree" algorithm under both quiet and busy conditions

    • Over 50% performance improvement even when the network was busy (a sketch of a two-level, topology-aware gather in this spirit follows below)
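    As a hedged sketch of the two-level idea behind such topology-aware collectives (my illustration, not the MVAPICH2 implementation), ranks first gather within their leaf switch to a local leader, and the leaders then gather to the root. Here leaf_id and ranks_per_leaf are assumptions standing in for real topology information:

```c
#include <mpi.h>
#include <stdlib.h>

/* Which leaf switch (chassis) a rank sits under; on a real system this would
   come from a topology query rather than from the rank number. */
static int leaf_id(int rank, int ranks_per_leaf) { return rank / ranks_per_leaf; }

/* Gathers 'count' ints from every rank to rank 0 of 'comm'. Assumes the
   communicator size is a multiple of ranks_per_leaf and ranks are numbered
   contiguously within each leaf. */
void two_level_gather(const int *sendbuf, int count, int *recvbuf,
                      int ranks_per_leaf, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* Stage 1: gather inside each leaf switch to a local leader (lowest rank). */
    MPI_Comm leaf;
    MPI_Comm_split(comm, leaf_id(rank, ranks_per_leaf), rank, &leaf);
    int lrank;
    MPI_Comm_rank(leaf, &lrank);
    int *stage1 = NULL;
    if (lrank == 0)
        stage1 = malloc((size_t)count * ranks_per_leaf * sizeof(int));
    MPI_Gather(sendbuf, count, MPI_INT, stage1, count, MPI_INT, 0, leaf);

    /* Stage 2: leaders forward their per-leaf blocks to the global root. */
    MPI_Comm leaders;
    MPI_Comm_split(comm, lrank == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);
    if (leaders != MPI_COMM_NULL) {
        MPI_Gather(stage1, count * ranks_per_leaf, MPI_INT,
                   recvbuf, count * ranks_per_leaf, MPI_INT, 0, leaders);
        MPI_Comm_free(&leaders);
    }

    free(stage1);
    MPI_Comm_free(&leaf);
}
```

    A topology-aware scatter would follow the mirror pattern: the root scatters per-leaf blocks to the leaders, which then scatter within their own leaf.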


  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Mapping is important

    • As evidenced by several applications on Blue Gene

    • Also on Cray machines, even though:
      • Link bandwidth: 3.8 GB/s (XT3), 0.425 GB/s (BG/P), 0.175 GB/s (BG/L)
      • Bytes per flop: 8.77 (XT3), 0.375 (BG/P and BG/L)

    • Even more important now that:
      • Bytes per flop that the network can handle is decreasing: 8.77 (XT3), 1.36 (XT4), 0.23 (XT5), 0.23-0.46 (XE6)★

    13

    ★ Based on BigBen (XT3, PSC), Jaguar (XT4/XT5, ORNL) and Hopper (XE6, NERSC)

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Wormhole Routing

    • Ni et al. 1993; Oh et al. 1997 - Equation for modeling message latencies:

    • Relatively small-sized supercomputers

    • It was safe to assume message latencies were independent of distance

    …sharing network resources. For common networks with asymptotically inadequate link bandwidth, chances of contention increase as messages travel farther and farther. Network congestion on a link slows down all messages passing through that link. Delays in message delivery can affect overall application performance. Thus, it becomes necessary to consider the topology of the machine while mapping parallel applications to job partitions.

    This dissertation will demonstrate that it is not wise to assume that message latencies are independent of the distance a message travels. This assumption has been supported all these years by the advantages of virtual cut-through and wormhole routing, suggesting that the message latency is independent of the distance in the absence of blocking [5-12]. When virtual cut-through or wormhole routing is deployed, message latency is modeled by the equation

        (L_f / B) × D + L / B    (1.1)

    where L_f is the length of each flit, B is the link bandwidth, D is the number of links (hops) traversed and L is the length of the message. In the absence of blocking, for sufficiently large messages (where L_f …
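    For reference, Eq. (1.1) can be evaluated directly; a small helper (illustrative only, with all quantities in consistent units):

```c
/* Modeled message latency under virtual cut-through / wormhole routing, Eq. (1.1):
   per-hop flit latency plus message serialization time. When L is much larger
   than L_f * D, the second term dominates and latency looks distance-independent. */
double modeled_latency(double Lf, double B, int D, double L)
{
    return (Lf / B) * (double)D + L / B;
}
```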

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Topology friendly supercomputers?

    • Topology aware resource allocation

    • Ability to query the software stack for network topology information:
      • At job allocation time/pre-launch
      • During runtime

    • Ability to change the mapping
      • Fixed shape partitions - mapping can be specified a priori (at job submission time)
      • At job allocation time - specify new mapping before job is launched

    • Topology aware MPI implementations
      • Optimized collectives

    15

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Survey of machines (top500.org)

    16

    Machine              | Network / topology         | Resource manager     | Topology friendly?
    K computer           | Tofu / 6D torus            |                      |
    Tianhe-1A            | Proprietary fat-tree       | SLURM                |
    Jaguar (XT5)         | SeaStar2+ / 3D torus       | Moab/TORQUE/ALPS     |
    Nebulae              | Infiniband                 |                      |
    Tsubame 2.0          | Infiniband                 | N1 Grid Engine/PBS   |
    Cielo (XE6)          | Gemini / 3D torus          | Moab/TORQUE/ALPS     |
    Pleiades (SGI Altix) | Infiniband / 11D hypercube | PBS                  |
    Tera 100             | Infiniband                 | SLURM                |
    RoadRunner           | Infiniband                 |                      |
    Jugene (BG/P)        | 3D torus                   | LoadLeveler          |

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Overall system utilization?

    17

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Resource Management in HPC

    • Manage incoming jobs in the queue (assign priorities) - job/batch scheduler

    • Launch/monitor jobs on the compute nodes - job launcher

    • Manage and allocate resources - workload/resource manager

    18

    Typically referred to as a resource manager or job scheduler

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Resource manager requirements

    • Resource allocator needs to be topology aware
      • contiguous partitions on 3D mesh/torus
      • nodes on the same switch for fat-trees

    • Job launcher should be able to launch MPI processes on specific nodes
      • On BG/P systems, mpirun can take a mapfile as input (a sketch of generating one follows below)
      • On Cray systems, aprun can take a list of node ids

    • Within-node mapping
      • Most job launchers support this through cpu/task affinity options

    19
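    As a sketch of the mapfile route mentioned above (hedged: BG/P mapfiles are commonly one "x y z t" line per MPI rank, but the exact format accepted by a given launcher should be checked against the system documentation), a custom rank-to-coordinate order can be generated offline and passed to the job launcher:

```c
#include <stdio.h>

/* Emit one torus coordinate per MPI rank, here in a Z-Y-X-T sweep; a different
   loop nest (or any permutation of coordinates) yields a different mapping. */
int main(void)
{
    const int X = 8, Y = 8, Z = 8, T = 4;   /* assumed partition shape and ranks per node */
    for (int z = 0; z < Z; z++)
        for (int y = 0; y < Y; y++)
            for (int x = 0; x < X; x++)
                for (int t = 0; t < T; t++)
                    printf("%d %d %d %d\n", x, y, z, t);
    return 0;
}
```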

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Mapping algorithms

    • Steps involved in mapping:
      • Collect processor topology information
      • Collect application topology information
      • Run the mapping algorithm
      • Communicate the new mapping to processes

    • Approaches:
      • Centralized
      • Completely distributed
      • Distributed with global view

    20

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Mapping Challenges

    • At exascale, we cannot store O(p) information on one processor

    • For asymmetric topologies, how to query/store the topology information in a scalable fashion (MPI topology routines)

    • How to collect/store the communication graph in a scalable fashion (MPI graph routines); see the sketch below

    • Centralized algorithms: O(n log n) or O(n) might be too slow

    • Distributed algorithms: convergence is slow

    21
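    One scalable building block here is the distributed graph constructor added in MPI-2.2, where each process specifies only its own neighbors instead of the full communication graph. A minimal sketch for a 1D ring pattern (illustrative only):

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process declares only its two ring neighbors: O(1) data per process. */
    int nbrs[2] = { (rank - 1 + size) % size, (rank + 1) % size };
    int weights[2] = { 1, 1 };

    MPI_Comm ring;
    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                   2, nbrs, weights,    /* in-edges  */
                                   2, nbrs, weights,    /* out-edges */
                                   MPI_INFO_NULL,
                                   1,                   /* reorder: allow the library to remap ranks */
                                   &ring);

    /* 'ring' can now be used with neighborhood collectives or queried for the
       (possibly reordered) rank placement. */
    MPI_Comm_free(&ring);
    MPI_Finalize();
    return 0;
}
```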

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Scalable mapping algorithms

    • Scalable algorithms: distributed algorithms with global view

    • as good as centralized algorithms?

    22

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    Completely Distributed Mapping

    • Do the mapping in parallel
      • With some sense of global load distribution
      • Using parallel prefix (discussed in literature)

    • Map n objects communicating in a 1D ring pattern to a linear array of p processors

    • Map n objects communicating in a 2D stencil pattern to a 2D mesh of p processors

    23

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    • Object i communicates with i-1 and i+1

    • We want to make cuts in the 1D ring based on the loads of the objects

    • Each processor will then have only two external communication arcs

    24

    1D ring to a linear array

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    1D ring to a linear array

    • Perform a parallel prefix sum between objects and send total load to all objects

    25

    Figure 12.1: Prefix sum in parallel to obtain partial sums of the loads of all objects up to a certain object.

        Object loads (v1..v8):      5  9  1  2  3  1  6  4
        After step 1 (distance 1):  5 14 10  3  5  4  7 10
        After step 2 (distance 2):  5 14 15 17 15  7 12 14
        After step 3 (distance 4):  5 14 15 17 20 21 27 31

    prefix operation and then migrate to the respective processors.

    12.1.1 Complexity Analysis

    Let us compare the running time and memory requirements for the centralized versus

    completely distributed load balancing algorithms. Let us assume that there are n

    objects (or VPs) to be placed on p physical processors.

    In the centralized scheme, one processor stores all the information and hence

    the memory requirements are proportional to the number of objects, v. In the

    distributed case, each processor stores information about its objects which is v/p.

    In the centralized case, each processor sends a message to one processor with its

    load information, which leads to p messages of size v/p each. On the other hand, in

    the parallel prefix there are logv phases and v messages of constant size exchanged

    in each phase. These comparisons are summarized in Table 12.1 below.

    In the centralized case, if we assume that the fastest algorithm for load balancing

    will have a linear running time, the time complexity for the decision making can be

    119

    the basic technique in a simpler context.

    2. The second scenario is where we have a two-dimensional array of objects where

    each object communicates with two immediate neighbors in its row and col-

    umn. We wish to map this group of objects on to a 2D mesh of processors.

    12.1 Mapping of a 1D Ring

    Problem: Load balancing a 1D array of v objects which communicate in a ring

    pattern to a 1D linear array of p processors.

    Solution: We want to map these objects on to processors while considering the

    load of each object and the communication patterns among the objects. In order to

    optimize communication, we want to place objects next to each other on the same

    processor as much as possible and cross processor boundaries only for ensuring load

    balance. We assume that the IDs of objects denote the nearness in terms of who

    communicates with whom. Hence the problem reduces to finding contiguous groups

    of objects in the 1D array such that the load on all processors is nearly the same.

    We arrange the objects virtually by their IDs and perform a prefix sum in parallel

    between them based on the object loads. At the conclusion of a prefix sum, every

    object knows the sum of loads of all objects that appear before it (Figure 12.1).

    Then the last object broadcasts the sum of loads of all objects so that every object

    knows the global load of the system. Each object i, can calculate its destination

    processor (di), based on the total load of all objects (Lv), prefix sum of loads up to

    it (Li), its load (li) and the total number of processors (p), by this equation,

    di = �p ∗Li − li/2

    Lv� (12.1)

    So, every object can decide its destination processor in parallel through a parallel

    118

    Li = Prefix sumLv = Sum of all loadsp = no. of pesli = load of objectdi = destination pe

  • LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing

    1D ring to a linear array

    • Perform a parallel prefix sum over the objects and broadcast the total load to all objects

    Figure 12.1: Prefix sum in parallel to obtain partial sums of the loads of all objects up to a certain object (initial loads followed by the three doubling steps):

                 v1   v2   v3   v4   v5   v6   v7   v8
      loads       5    9    1    2    3    1    6    4
      step 1      5   14   10    3    5    4    7   10
      step 2      5   14   15   17   15    7   12   14
      step 3      5   14   15   17   20   21   27   31

    ... the basic technique in a simpler context.

    2. The second scenario is where we have a two-dimensional array of objects where each object communicates with two immediate neighbors in its row and column. We wish to map this group of objects onto a 2D mesh of processors.

    12.1 Mapping of a 1D Ring

    Problem: Load balancing a 1D array of v objects which communicate in a ring pattern onto a 1D linear array of p processors.

    Solution: We want to map these objects onto processors while considering the load of each object and the communication patterns among the objects. In order to optimize communication, we want to place objects next to each other on the same processor as much as possible and cross processor boundaries only to ensure load balance. We assume that the IDs of the objects denote nearness in terms of who communicates with whom. Hence the problem reduces to finding contiguous groups of objects in the 1D array such that the load on all processors is nearly the same.

    We arrange the objects virtually by their IDs and perform a prefix sum in parallel between them based on the object loads. At the conclusion of the prefix sum, every object knows the sum of the loads of all objects that appear before it (Figure 12.1). Then the last object broadcasts the sum of the loads of all objects so that every object knows the global load of the system. Each object i can calculate its destination processor d_i, based on the total load of all objects (L_v), the prefix sum of loads up to and including it (L_i), its own load (l_i), and the total number of processors (p), by this equation:

        d_i = ⌊ p × (L_i − l_i / 2) / L_v ⌋    (12.1)

    So, every object can decide its destination processor in parallel through a parallel prefix operation and then migrate to the respective processor.

    12.1.1 Complexity Analysis

    Let us compare the running time and memory requirements of the centralized versus the completely distributed load balancing algorithms. Assume that there are v objects (or VPs) to be placed on p physical processors.

    In the centralized scheme, one processor stores all the information, and hence the memory requirement is proportional to the number of objects, v. In the distributed case, each processor stores information only about its own objects, which is v/p. In the centralized case, each processor sends a message with its load information to one processor, which leads to p messages of size v/p each. In the parallel prefix, on the other hand, there are log v phases, with v messages of constant size exchanged in each phase. These comparisons are summarized in Table 12.1 of the thesis. In the centralized case, if we assume that the fastest algorithm for load balancing has a linear running time, the time complexity for the decision making can be ...

    L_i = prefix sum    L_v = sum of all loads    p = no. of PEs    l_i = load of object    d_i = destination PE

    • Each object now decides which processor it should be on
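    To make Equation 12.1 concrete, here is a minimal sequential sketch that applies the formula to the loads from Figure 12.1, assuming p = 4 processors (an arbitrary choice for illustration). In the actual scheme the prefix sum runs in parallel over the objects and every object evaluates its destination independently.

    // Sequential sketch of Equation 12.1: given per-object loads, compute
    // each object's destination processor from its inclusive prefix sum.
    #include <cstdio>
    #include <vector>

    int main() {
        // Loads from Figure 12.1; p = 4 processors is an example value.
        std::vector<double> load = {5, 9, 1, 2, 3, 1, 6, 4};
        int p = 4;

        // Inclusive prefix sums L_i and the total load L_v.
        std::vector<double> prefix(load.size());
        double total = 0.0;
        for (size_t i = 0; i < load.size(); ++i) {
            total += load[i];
            prefix[i] = total;          // L_i = sum of loads of objects 1..i
        }

        // d_i = floor( p * (L_i - l_i/2) / L_v )  -- Equation 12.1
        for (size_t i = 0; i < load.size(); ++i) {
            int dest = (int)(p * (prefix[i] - load[i] / 2.0) / total);
            std::printf("object v%zu (load %g) -> processor %d\n",
                        i + 1, load[i], dest);
        }
        // Expected mapping: v1 -> 0; v2,v3 -> 1; v4,v5,v6 -> 2; v7,v8 -> 3,
        // i.e. contiguous groups with per-processor loads 5, 10, 6, 10.
        return 0;
    }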


    2D stencil to 2D mesh

    • Linearize using a space filling curve
    • Perform a 1D parallel prefix to obtain the linear index of the processor

    Solution 2: Another solution is to use a parallel prefix in 1D as we did in the previous section. To do this, we have to linearize the objects in some fashion. Space filling curves can be used to map the 2D object grid to a 1D line [87, 88]. Space filling curves preserve the neighborhood properties of the objects in 2D. Figure 12.2 shows the linearization of an object grid of dimensions 32 × 32 and a processor grid of dimensions 8 × 8.

    Figure 12.2: Hilbert order linearization of an object grid of dimensions 32 × 32 and a processor mesh of dimensions 8 × 8

    Having linearized the object grid, we can perform a parallel prefix on the 1D array of objects and obtain a destination processor for each object. This processor number is the linearized index of each processor if we create a space filling curve for the 2D processor mesh. Hence, based on this linearized index, we can obtain the x and y coordinates of the processor by decoding with the same logic used for generating space filling curves. The hope is that nearby objects in the 2D object graph end up close to one another on the 2D processor mesh.


    • Decode (x, y) of processor using a space filling curve
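    For the decoding step, the sketch below uses the classic Hilbert index-to-coordinate conversion (a standard bit-manipulation routine, not code from the thesis), assuming the mesh side is a power of two, e.g. the 8 × 8 processor mesh of Figure 12.2. An object whose prefix step assigns it linear index d is then placed on the processor at the decoded (x, y).

    // Standard Hilbert-curve conversion from a 1D index d to coordinates
    // (x, y) on an n x n grid, n a power of two. Run over all indices it
    // reproduces the linearization order of the mesh.
    #include <cstdio>

    static void rotate(int s, int &x, int &y, int rx, int ry) {
        // Rotate/flip a quadrant so the curve's orientation is preserved.
        if (ry == 0) {
            if (rx == 1) {
                x = s - 1 - x;
                y = s - 1 - y;
            }
            int t = x; x = y; y = t;   // swap x and y
        }
    }

    // Decode a linear index d (0 <= d < n*n) into (x, y) on an n x n Hilbert curve.
    static void d2xy(int n, int d, int &x, int &y) {
        x = y = 0;
        for (int s = 1; s < n; s *= 2) {
            int rx = 1 & (d / 2);
            int ry = 1 & (d ^ rx);
            rotate(s, x, y, rx, ry);
            x += s * rx;
            y += s * ry;
            d /= 4;
        }
    }

    int main() {
        // Example: decode every linearized processor index of an 8 x 8 mesh.
        const int n = 8;
        for (int d = 0; d < n * n; ++d) {
            int x, y;
            d2xy(n, d, x, y);
            std::printf("index %2d -> processor (%d, %d)\n", d, x, y);
        }
        return 0;
    }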


    Running Time

    • Benefits from distributed mapping:
      • Memory usage per processor: O(v/p) compared to O(v)
      • No communication bottleneck at a single processor
      • Faster decision time: O(log p)

    • Time (in seconds) for load balancing on Cray XT5:

        # objects      4,096 cores    16,384 cores
        1 million          1.81            6.77
        1 million          3.29            3.49
        4 million         12.96            9.59
        4 million         12.65            5.68


    Results: 1 million objects on 4k cores

    [Two charts comparing the Random, Default, and Topology-aware mappings over five trials: hops per byte on a log scale (0.1-100), captioned "Reduction in Hop-bytes", and the ratio of maximum to average load (0-1.4), captioned "Ratio of maximum to average load".]

    d_i = distance (hops)    b_i = bytes    n = no. of messages

    5 Hop-bytes as an Evaluation Metric

    The volume of inter-processor communication can be characterized by the hop-bytes metric, which is the weighted sum of message sizes where the weights are the number of hops (links) traveled by the respective messages. Hop-bytes can be calculated by the equation

        HB = Σ_{i=1..n} d_i × b_i    (5.1)

    where d_i is the number of links traversed by message i, b_i is the message size in bytes for message i, and the summation is over all messages sent.

    Hop-bytes is an indication of the average communication load on each link of the network. This assumes that the application generates nearly uniform traffic over all links in the partition. The metric does not give an indication of hot-spots generated on specific links of the network, but it is an easily derivable metric and correlates well with actual application performance.

    In VLSI circuit design and early parallel computing work, emphasis was placed on another metric called maximum dilation, which is defined as

        d(e) = max{ d_i | e_i ∈ E }    (5.2)

    where d_i is the dilation of edge e_i. This metric aims at minimizing the longest length of a wire in a circuit. We claim that reducing the largest number of links traveled by any message is not as critical as reducing the average hops across all messages.
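    As an illustration of Equations 5.1 and 5.2, the sketch below computes hop-bytes and maximum dilation for a set of messages mapped onto a 3D torus, taking the per-message hop count d_i to be the Manhattan distance with wraparound. The torus dimensions and message list are made-up examples, not data from the slides.

    // Sketch: hop-bytes (Eq. 5.1) and maximum dilation (Eq. 5.2) for a set
    // of messages on a 3D torus. Dimensions and messages are illustrative.
    #include <algorithm>
    #include <array>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    using Coord = std::array<int, 3>;

    // Hops between two nodes of a torus: Manhattan distance with wraparound.
    int torus_hops(const Coord &a, const Coord &b, const Coord &dims) {
        int hops = 0;
        for (int k = 0; k < 3; ++k) {
            int d = std::abs(a[k] - b[k]);
            hops += std::min(d, dims[k] - d);   // shorter way around the ring
        }
        return hops;
    }

    struct Message { Coord src, dst; long bytes; };

    int main() {
        Coord dims = {8, 8, 8};                 // 8 x 8 x 8 torus (example)
        std::vector<Message> msgs = {
            {{0, 0, 0}, {1, 0, 0}, 4096},       // nearest neighbor: 1 hop
            {{0, 0, 0}, {7, 0, 0}, 4096},       // wraps around the torus: 1 hop
            {{2, 3, 1}, {6, 6, 5}, 65536},      // long-distance message
        };

        long long hop_bytes = 0;                // HB = sum_i d_i * b_i
        int max_dilation = 0;                   // d(e) = max_i d_i
        for (const Message &m : msgs) {
            int d = torus_hops(m.src, m.dst, dims);
            hop_bytes += (long long)d * m.bytes;
            if (d > max_dilation) max_dilation = d;
        }
        std::printf("hop-bytes = %lld, max dilation = %d hops\n",
                    hop_bytes, max_dilation);
        return 0;
    }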


    Summary

    • Topology awareness required at various levels:
      • Resource managers (allocator, job launcher): topology aware scheduling
      • MPI implementations (optimized collectives)
      • Software support for querying network topology
      • Distributed graph representations for communication structure

    • Scalable distributed algorithms
      • With a global view of load and a partial view of the communication structure

  • This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551

    LLNL-PRES-529376

    SIAM  Conference  on  Parallel  Processing  ◆  February  15,  2012

    Questions?

    Abhinav Bhatele, Automating Topology Aware Mapping for Supercomputers, PhD Thesis, Department of Computer Science, University of Illinois.

    http://charm.cs.uiuc.edu/research/topology
