Using Charm++ to Mask Latency in Grid Computing Applications
Gregory A. Koenig (email@example.com)
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
2004 Charm++ Workshop
Problem: Latency Tolerance for Multi-Cluster Applications
Goal: Good performance for tightly-coupled applications running across multiple clusters
  Single campus
  Grid environment
Scenarios
  Very large applications
  On-demand computing
Challenge: Masking the effects of latency on inter-cluster messages
Solution: Processor Virtualization
Charm++ chares and Adaptive MPI threads virtualize the notion of a processor.
A programmer decomposes a program into a large number of virtual processors.
The adaptive runtime system maps virtual processors onto physical processors; the runtime may adjust this mapping as the program executes (load balancing).
If one virtual processor that is mapped to a physical processor cannot make progress, some other virtual processor on the same physical processor may be able to do useful work.
No modification of application software or problem-specific tricks are necessary!
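The masking effect described above can be illustrated with a toy model. This is a minimal sketch, not Charm++ code: the function name `simulate` and its parameters are hypothetical, and it assumes each virtual processor computes an equal share of the per-step work, then waits one fixed latency for a remote message before its next step. The physical processor runs whichever virtual processor becomes ready first.

```python
import heapq


def simulate(num_virtual, work_per_step, latency, steps):
    """Return the time one physical processor needs to finish all steps.

    Assumption: the per-step work is split evenly among the virtual
    processors, and each one waits `latency` after a step before it
    can start its next step.
    """
    chunk = work_per_step / num_virtual
    # Heap entries: (time the virtual processor becomes ready, id, steps left)
    ready = [(0.0, vp, steps) for vp in range(num_virtual)]
    heapq.heapify(ready)
    clock = 0.0
    while ready:
        ready_at, vp, remaining = heapq.heappop(ready)
        # Idle only if no virtual processor is ready yet, then compute.
        clock = max(clock, ready_at) + chunk
        if remaining > 1:
            heapq.heappush(ready, (clock + latency, vp, remaining - 1))
    return clock


unmasked = simulate(num_virtual=1, work_per_step=8.0, latency=4.0, steps=3)
masked = simulate(num_virtual=4, work_per_step=8.0, latency=4.0, steps=3)
```

With one virtual processor the physical processor sits idle during every wait, so `unmasked` includes the latency of each step. With four, another virtual processor always has work available during a wait, so `masked` equals the pure compute time in this model.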
Hypothetical Timeline View of a Multi-Cluster Computation
Processors A and B are on one cluster, Processor C on a second cluster
Communication between clusters via high-latency WAN
Processor Virtualization allows latency to be masked
Charm++ on Virtual Machine Interface (VMI)
Message data are passed along VMI send chain and receive chain
Devices on each chain may deliver data directly, manipulate data, and/or pass data to next device
[Diagram labels: Application; Charm++; Converse (machine layer); VMI; send chain; receive chain; AMPI]
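The chain-of-devices idea can be sketched as follows. This is an illustrative model only and does not use the real VMI API: every class and function name here (`Device`, `CompressDevice`, `DelayDevice`, `DeliverDevice`, `build_chain`, `demo`) is hypothetical. It shows the three behaviors named above: a device that manipulates data, a device that passes data along after a delay (loosely modeled on the delay device used in the experiments), and a device that delivers data directly.

```python
import time
import zlib


class Device:
    """Base device: passes data unchanged to the next device on the chain."""

    def __init__(self, next_device=None):
        self.next_device = next_device

    def handle(self, data):
        if self.next_device is None:
            return data  # end of chain
        return self.next_device.handle(data)


class CompressDevice(Device):
    """Manipulates data (compresses it), then passes it along."""

    def handle(self, data):
        return super().handle(zlib.compress(data))


class DecompressDevice(Device):
    """Reverses CompressDevice on the receive chain."""

    def handle(self, data):
        return super().handle(zlib.decompress(data))


class DelayDevice(Device):
    """Adds a fixed delay before passing data along (hypothetical)."""

    def __init__(self, delay_seconds, next_device=None, sleep=time.sleep):
        super().__init__(next_device)
        self.delay_seconds = delay_seconds
        self._sleep = sleep  # injectable so a demo need not really sleep

    def handle(self, data):
        self._sleep(self.delay_seconds)
        return super().handle(data)


class DeliverDevice(Device):
    """Delivers data directly and ends the chain."""

    def __init__(self):
        super().__init__(None)
        self.delivered = []

    def handle(self, data):
        self.delivered.append(data)
        return data


def build_chain(*devices):
    """Link devices in order and return the head of the chain."""
    for current, following in zip(devices, devices[1:]):
        current.next_device = following
    return devices[0]


def demo():
    delays = []  # record requested delays instead of sleeping
    wire = DeliverDevice()
    send_chain = build_chain(CompressDevice(),
                             DelayDevice(0.001725, sleep=delays.append),
                             wire)
    inbox = DeliverDevice()
    receive_chain = build_chain(DecompressDevice(), inbox)

    send_chain.handle(b"boundary row " * 64)
    receive_chain.handle(wire.delivered[0])
    return inbox.delivered[0], len(wire.delivered[0]), delays
```

Because each device only knows its successor, a delay or transformation can be added or removed without changing the layers above the chain, which is the property the artificial-latency experiments rely on.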
Description of Experiments
Experimental environment
  Artificial latency environment: VMI delay device adds a pre-defined latency between arbitrary pairs of nodes
  TeraGrid environment: Experiments run between NCSA and ANL machines (~1.725 ms one-way latency)
Experiments
  Five-point stencil (2D Jacobi) for matrix sizes 2048x2048 and 8192x8192
  LeanMD molecular dynamics code running a 30,652 atom system
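For readers unfamiliar with the stencil benchmark, one Jacobi sweep can be sketched as below. This is a serial, pure-Python illustration, not the benchmark code from the experiments. The slides do not state the exact update rule, so the averaging of the four neighbors (the classic Jacobi update for Laplace's equation) is an assumption, as is the function name `jacobi_sweep`.

```python
def jacobi_sweep(grid):
    """Return a new grid after one five-point stencil sweep.

    Each interior cell is replaced by the average of its four neighbors
    (assumed update rule); boundary cells are left unchanged.
    """
    rows, cols = len(grid), len(grid[0])
    new_grid = [row[:] for row in grid]  # copy so reads use old values only
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new_grid[i][j] = (grid[i - 1][j] + grid[i + 1][j] +
                              grid[i][j - 1] + grid[i][j + 1]) / 4.0
    return new_grid


# Hot top boundary, cold everywhere else.
grid = [[1.0, 1.0, 1.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0]]
after_one_sweep = jacobi_sweep(grid)
```

In a parallel decomposition, each block of the matrix needs the edge rows and columns of its neighboring blocks before it can sweep. Those boundary exchanges are the messages that cross the high-latency link when neighboring blocks sit on different clusters.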
Five-Point Stencil Results (P=2)
Five-Point Stencil Results (P=16)
Five-Point Stencil Results (P=32)
Five-Point Stencil Results (P=64)
Conclusion
Processor virtualization is a useful technique for masking latency in grid computing environments.
Future Work
  Testing across NCSA-SDSC
  Leverage Charm++ prioritized messages
  Grid-topology-aware load balancer
  Processor speed normalization
  Leverage Adaptive MPI