Taking Advantages of Collective Operation Semantics for Loosely
Coupled Simulations
Shang-Chieh Joe Wu*, Alan Sussman
Department of Computer Science, University of Maryland, USA
*graduating soon
Roadmap
• Motivation
• Approximate Matching [Grid 2004]
• Collective Semantics
• Dissection of Execution Time
• Smart Buffering
• Future Work
What is the overall problem?
• Obtain more accurate results by coupling existing (parallel) physical simulation components
• Different time and space scales for data produced in shared or overlapped regions
• Runtime decisions for which time-stamped data objects should be exchanged
• Performance becomes a concern
Coupling, is it important?
• Special issue in May/Jun 2004 of IEEE/AIP Computing in Science & Engineering (CSE)
“It’s then possible to couple several existing calculations together through an interface and obtain accurate answers.”
• Multi-scale multi-resolution simulations and models – multiphysics (May/Jun 2005 CSE)
– adaptive small-scale noise capture (hydrodynamics)
– complex fluid and dense suspension (fluid dynamics)
– patch dynamics (material science)
• Earth System Modeling Framework – several US federal agencies and universities (http://www.esmf.ucar.edu)
Matching is OUTSIDE components
• Separate matching (coupling) information from the participating components
– Maintainability – components can be developed/upgraded individually
– Flexibility – change participants/components easily
– Functionality – support variable-sized time interval numerical algorithms or visualizations
• Matching information is specified separately by the application integrator
• Runtime match via simulation timestamps
• POSIX thread-based implementation
Separate codes from matching
Exporter App0:
  define region R1
  define region R4
  define region R5
  ...
  Do t = 1, N, Step0
    ... // computation jobs
    export(R1, t)
    export(R4, t)
    export(R5, t)
  EndDo

Importer App1:
  define region R2
  ...
  Do t = 1, M, Step1
    import(R2, t)
    ... // computation jobs
  EndDo
[Figure: coupling diagram – exporter regions App0.R1, App0.R4, App0.R5 are connected to importer regions App1.R0, App2.R0, and App4.R0]
Configuration file:
#
App0 cluster0 /bin/App0 2 ...
App1 cluster1 /bin/App1 4 ...
App2 cluster2 /bin/App2 16 ...
App4 cluster4 /bin/App4 4
#
App0.R1 App1.R0 REGL 0.05
App0.R1 App2.R0 REGU 0.1
App0.R4 App4.R0 REG 1.0
#
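To make the configuration format concrete, here is a minimal sketch of a parser for it in Python. The field meanings (component name, cluster, executable, process count; then exporter region, importer region, policy, precision) are inferred from the example above, and `parse_config` is a hypothetical helper, not part of the actual framework:

```python
def parse_config(text):
    """Parse a coupling configuration: '#' lines separate sections;
    component-deployment lines are followed by region-matching lines.
    Sketch only -- field meanings are inferred from the slide's example."""
    components, couplings = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # section separator
        fields = line.split()
        if "." in fields[0]:
            # matching line: exporter.region importer.region policy [precision]
            exp, imp, policy = fields[0], fields[1], fields[2]
            precision = float(fields[3]) if len(fields) > 3 else None
            couplings.append((exp, imp, policy, precision))
        else:
            # component line: name cluster executable nprocs (extra fields ignored)
            name, cluster, exe, nprocs = fields[0], fields[1], fields[2], int(fields[3])
            components[name] = (cluster, exe, nprocs)
    return components, couplings
```

With the example file, `parse_config` yields four components and the three region couplings, ready to drive the runtime matcher.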
Basic Operation
[Figure: the importer component requests Array@T3.1; the Approximate Match library matches the request against exported timestamps T1–T4 and selects Array@T3; the Distributed Array Transfer Library then moves the exported distributed array to the imported distributed array (SPMD)]
Arrays are distributed among multiple processes
Collective Semantics
• Collective operations
– All processes in the same component must perform the same operation, but not necessarily at the same time
• Approximate match is a collective operation
– All processes in the same exporter component asynchronously generate distributed data with the same timestamps (T1 T2 T3 T4)
– All processes in the same importer component asynchronously make requests with the same timestamps (T3.1)
– All processes in the same exporter component must reply to the requests with the same timestamps (T3 matches T3.1)
– Consistent decisions must be made about which copy of data (Array@T3) should be transferred for shared or overlapped regions
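The consistency requirement above can be met without extra communication if the match function is deterministic. A minimal sketch in Python (a REG-style nearest-timestamp match; `approximate_match` is an illustrative name, not the actual library API):

```python
def approximate_match(t_req, stamps, precision):
    """Deterministic nearest-timestamp match (REG-style sketch):
    return the exported timestamp closest to the request, within
    `precision`; ties break toward the earlier stamp."""
    candidates = [t for t in stamps if abs(t - t_req) <= precision]
    if not candidates:
        return None  # no exported timestamp is close enough
    return min(candidates, key=lambda t: (abs(t - t_req), t))
```

Because every process in the exporter component produces the same timestamp sequence and the function is deterministic, each process independently reaches the identical decision for a given request, which is exactly the property the collective semantics rely on.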
• Approximate match is a runtime-based approach, so source code-based optimizations help little
• Different components execute at different speeds, and export/import data at their own rates
• Not all exported data are required by importer components
– Exported data, whose size might be very large, may be buffered when matching decisions cannot yet be made
• Not all processes in the same component execute at the same speed
– Some complex components can be very hard to perfectly load balance across all processes
Performance Concerns
• Execution time is composed of
– Computation Time
– Local Copy Time (might be unnecessary)
– Runtime Match Time + Remote Data Transfer Time
• The same match decisions, for each request, are made repeatedly by all exporter processes in exporter components
• Smart buffering
– Faster processes help slower processes in the same exporter component
Dissection of Execution Time
Smart Buffering
Smart Buffering
• Exported data are buffered in the framework
• A slow exporting process may be able to avoid memory copies, based on
– Its responses for previously received import requests (self-help)
– The responses for previous requests satisfied by the fastest process in the same component (buddy-help)
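The self-help/buddy-help idea can be sketched as follows. This is an illustrative Python model, not the actual POSIX-thread implementation; it assumes import requests arrive with increasing timestamps, and the class and method names are hypothetical:

```python
class SmartBuffer:
    """Sketch of smart buffering: a process buffers an exported
    timestamp only if it might still satisfy a future request."""

    def __init__(self):
        self.matched = {}       # request timestamp -> matched export timestamp
        self.last_request = None

    def record_match(self, t_req, t_match):
        """Record a match decision -- either this process's own earlier
        reply (self-help) or one made by the fastest process in the
        component (buddy-help)."""
        self.matched[t_req] = t_match
        if self.last_request is None or t_req > self.last_request:
            self.last_request = t_req

    def should_buffer(self, t_export):
        """Return False when the memory copy can be skipped: data older
        than the newest satisfied request, and not equal to any matched
        timestamp, can never be requested again (assuming monotonically
        increasing request timestamps)."""
        if self.last_request is None:
            return True  # no match information yet: must buffer
        if t_export <= self.last_request and t_export not in self.matched.values():
            return False
        return True
```

A slow process that learns (from the fastest process) that request T3.1 was matched to T3 can then skip buffering its own copies of T1 and T2 entirely, saving the local copy time identified in the execution-time breakdown.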
Smart Buffering Example
[Figure: the fastest process has already matched an import request (region Req, timestamp treq); a slower process arriving at the same timestamps reuses that match decision]
Load Balance
• No assumptions about load balance inside each component
• Smart buffering will help with load imbalance at runtime
– Slower processes can avoid some unnecessary work (memory copies)
– Component tunes itself at runtime when some processes fall behind
– Framework-level approach – no restrictions on algorithms/applications
Micro-Benchmark Experiment
• u_tt = u_xx + u_yy + f(t,x,y), solve the 2-d diffusion equation by the finite element method
• A 1024x1024 distributed array is evenly distributed over the participating processes
• The importer component U: 4/8/16/32 P4 2.8GHz processors, connected by Myrinet
• The exporter component F: 4 PIII-650 processors, connected by channel-bonded Fast Ethernet
• The two clusters are connected by Gigabit Ethernet
• 1001 data objects exported, 50 data objects transferred (20:1)
• One process (fs) in the exporter component F performs extra computation – measuring its data exporting time
• Smart buffering can be observed when fs falls (far) behind the other processes
Smart Buffering Results – 8 Importer Processes
Exporter component does NOT run slower
[Figure: data exporting time for the slowest process; buffering only matched data is the optimal state]
Smart Buffering Results – 32 Importer Processes
Exporter component runs more slowly from the beginning
Smart Buffering Results – 16 Importer Processes
[Figure: data exporting time for the slowest process – nearly no skips at first, then some skips, then entering the optimal state]
Related Work
• Parallel Data Redistribution
– Shared data among coupled parallel models
– InterComm (Meta-Chaos), PAWS, MCT, CUMULVS, Roccom, etc.
– MxN Working Group in the Common Component Architecture (CCA) Forum
• Coordination Languages
– Creating and coordinating execution threads in a distributed computing environment
– Linda (tuple space model + directives); Delirium, Strand (new languages); C-Linda, Fortran-M (extending old languages); plus many others
Conclusion
• Described a runtime-based approach to speed up slower processes in the same exporter component in (loosely) coupled simulations
• Tries to minimize unnecessary buffering of exported data that ends up not being transferred during component execution; no post-processing in the simulation components, or other tools, is needed
• Perfect synchronization across participating components is not required – can especially benefit hard-to-load-balance components
Future Work
• Investigate buffering issues between processes, such as non-blocking transfers or RDMA over InfiniBand
• Performance optimizations for slow importers (pattern-based semantic cache)
• Applying the framework to a set of large-scale coupled scientific applications from the space weather domain (in progress)
The End
Questions?
Supported matching policies
<importer request, exporter matched, desired precision> = <x, f(x), p>
• LUB: minimum f(x) with f(x) ≥ x
• GLB: maximum f(x) with f(x) ≤ x
• REG: f(x) minimizing |f(x)−x|, with |f(x)−x| ≤ p
• REGU: f(x) minimizing f(x)−x, with 0 ≤ f(x)−x ≤ p
• REGL: f(x) minimizing x−f(x), with 0 ≤ x−f(x) ≤ p
• FASTR: any f(x) with |f(x)−x| ≤ p
• FASTU: any f(x) with 0 ≤ f(x)−x ≤ p
• FASTL: any f(x) with 0 ≤ x−f(x) ≤ p
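The deterministic policies above translate directly into timestamp-selection functions. A minimal sketch in Python (function names are illustrative, not the actual library API; each returns None when no exported timestamp satisfies the constraint):

```python
def lub(x, stamps):
    """LUB: minimum f(x) with f(x) >= x."""
    candidates = [t for t in stamps if t >= x]
    return min(candidates) if candidates else None

def glb(x, stamps):
    """GLB: maximum f(x) with f(x) <= x."""
    candidates = [t for t in stamps if t <= x]
    return max(candidates) if candidates else None

def reg(x, stamps, p):
    """REG: f(x) minimizing |f(x)-x|, subject to |f(x)-x| <= p."""
    candidates = [t for t in stamps if abs(t - x) <= p]
    return min(candidates, key=lambda t: abs(t - x)) if candidates else None

def regu(x, stamps, p):
    """REGU: f(x) minimizing f(x)-x, subject to 0 <= f(x)-x <= p."""
    candidates = [t for t in stamps if 0 <= t - x <= p]
    return min(candidates) if candidates else None

def regl(x, stamps, p):
    """REGL: f(x) minimizing x-f(x), subject to 0 <= x-f(x) <= p."""
    candidates = [t for t in stamps if 0 <= x - t <= p]
    return max(candidates) if candidates else None
```

The FAST variants relax the minimization to "any candidate satisfying the constraint", allowing the exporter to answer with the first acceptable timestamp it finds instead of the best one.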