Using Charm++ to Mask Latency in Grid Computing Applications

  • Using Charm++ to Mask Latency in Grid Computing Applications
    Gregory A. Koenig
    Parallel Programming Laboratory
    Department of Computer Science
    University of Illinois at Urbana-Champaign

    2004 Charm++ Workshop

  • Problem: Latency Tolerance for Multi-Cluster Applications

    Goal: Good performance for tightly-coupled applications running across multiple clusters
    - Single campus
    - Grid environment

    Scenarios
    - Very large applications
    - On-demand computing

    Challenge: Masking the effects of latency on inter-cluster messages

  • Solution: Processor Virtualization

    Charm++ chares and Adaptive MPI threads virtualize the notion of a processor.

    A programmer decomposes a program into a large number of virtual processors.

    The adaptive runtime system maps virtual processors onto physical processors; the runtime may adjust this mapping as the program executes (load balancing).

    If one virtual processor mapped to a physical processor cannot make progress, another virtual processor on the same physical processor may be able to do useful work.

    No modification of application software or problem-specific tricks are necessary!

  • Hypothetical Timeline View of a Multi-Cluster Computation

    Processors A and B are on one cluster; Processor C is on a second cluster.
    Communication between the clusters goes over a high-latency WAN.
    Processor virtualization allows the latency to be masked.

  • Charm++ on Virtual Machine Interface (VMI)

    Message data are passed along the VMI send chain and receive chain.
    Devices on each chain may deliver data directly, manipulate data, and/or pass data to the next device.

    [Diagram: software stack — Application atop Charm++/AMPI, atop Converse (machine layer), atop VMI with its send chain and receive chain]

  • Description of Experiments

    Experimental environments:
    - Artificial latency environment: a VMI delay device adds a pre-defined latency between arbitrary pairs of nodes
    - TeraGrid environment: experiments run between NCSA and ANL machines (~1.725 ms one-way latency)

    Experiments:
    - Five-point stencil (2D Jacobi) for matrix sizes 2048x2048 and 8192x8192
    - LeanMD molecular dynamics code running a 30,652-atom system

  • Five-Point Stencil Results (P=2)

  • Five-Point Stencil Results (P=16)

  • Five-Point Stencil Results (P=32)

  • Five-Point Stencil Results (P=64)

  • LeanMD Results

  • Conclusion

    Processor virtualization is a useful technique for masking latency in grid computing environments.

    Future Work
    - Testing across NCSA-SDSC
    - Leverage Charm++ prioritized messages
    - Grid-topology-aware load balancer
    - Processor speed normalization
    - Leverage Adaptive MPI