Taking Advantages of Collective Operation Semantics for Loosely
Coupled Simulations
Shang-Chieh Joe Wu*, Alan Sussman
Department of Computer Science, University of Maryland, USA
*graduating soon
Roadmap
• Motivation
• Approximate Matching [Grid 2004]
• Collective Semantics
• Dissection of Execution Time
• Smart Buffering
• Future Work
What is the overall problem?
• Obtain more accurate results by coupling existing (parallel) physical simulation components
• Different time and space scales for data produced in shared or overlapped regions
• Runtime decisions for which time-stamped data objects should be exchanged
• Performance becomes a concern
Coupling, is it important?
• Special issue in May/Jun 2004 of IEEE/AIP Computing in Science & Engineering (CSE)
“It’s then possible to couple several existing calculations together through an interface and obtain accurate answers.”
• Multi-scale multi-resolution simulations and models – multiphysics (May/Jun 2005 CSE)
– adaptive small-scale noise capture (hydrodynamics)
– complex fluid and dense suspension (fluid dynamics)
– patch dynamics (material science)
• Earth System Modeling Framework – several US federal agencies and universities (http://www.esmf.ucar.edu)
Matching is OUTSIDE components
• Separate matching (coupling) information from the participating components
– Maintainability – components can be developed/upgraded individually
– Flexibility – change participants/components easily
– Functionality – support variable-sized time interval numerical algorithms or visualizations
• Matching information is specified separately by the application integrator
• Runtime match via simulation timestamps
• POSIX thread-based implementation
Separate codes from matching
Exporter App0:
  define region R1
  define region R4
  define region R5
  ...
  Do t = 1, N, Step0
    ... // computation jobs
    export(R1, t)
    export(R4, t)
    export(R5, t)
  EndDo

Importer App1:
  define region R2
  ...
  Do t = 1, M, Step1
    import(R2, t)
    ... // computation jobs
  EndDo
[Figure: coupling diagram – exporter regions App0.R1, App0.R4, App0.R5 are connected to importer regions App1.R0, App2.R0, and App4.R0]
Configuration file:
#
App0 cluster0 /bin/App0 2 ...
App1 cluster1 /bin/App1 4 ...
App2 cluster2 /bin/App2 16 ...
App4 cluster4 /bin/App4 4
#
App0.R1 App1.R0 REGL 0.05
App0.R1 App2.R0 REGU 0.1
App0.R4 App4.R0 REG 1.0
#
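To make the configuration format concrete, here is a minimal sketch of a parser for it in Python. The field meanings (component name, cluster, executable, process count; then exporter region, importer region, policy, precision) are inferred from the example above, and `parse_config` is a hypothetical helper, not part of the actual framework:

```python
def parse_config(text):
    """Parse a coupling configuration: '#' lines separate sections;
    component-deployment lines are followed by region-matching lines.
    Sketch only -- field meanings are inferred from the slide's example."""
    components, couplings = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # section separator
        fields = line.split()
        if "." in fields[0]:
            # matching line: exporter.region importer.region policy [precision]
            exp, imp, policy = fields[0], fields[1], fields[2]
            precision = float(fields[3]) if len(fields) > 3 else None
            couplings.append((exp, imp, policy, precision))
        else:
            # component line: name cluster executable nprocs (extra fields ignored)
            name, cluster, exe, nprocs = fields[0], fields[1], fields[2], int(fields[3])
            components[name] = (cluster, exe, nprocs)
    return components, couplings
```

With the example file, `parse_config` yields four components and the three region couplings, ready to drive the runtime matcher.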
Basic Operation
[Figure: the importer component requests Array@T3.1; the Approximate Match library matches the request against exported timestamps T1–T4 and selects Array@T3; the Distributed Array Transfer Library then moves the exported distributed array to the imported distributed array (SPMD)]
Arrays are distributed among multiple processes
Collective Semantics
• Collective operations
– All processes in the same component must perform the same operation, but not necessarily at the same time
• Approximate match is a collective operation
– All processes in the same exporter component asynchronously generate distributed data with the same timestamps (T1 T2 T3 T4)
– All processes in the same importer component asynchronously make requests with the same timestamps (T3.1)
– All processes in the same exporter component must reply to the requests with the same timestamps (T3 matches T3.1)
– Consistent decisions must be made about which copy of data (Array@T3) should be transferred for shared or overlapped regions
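The consistency requirement above can be met without extra communication if the match function is deterministic. A minimal sketch in Python (a REG-style nearest-timestamp match; `approximate_match` is an illustrative name, not the actual library API):

```python
def approximate_match(t_req, stamps, precision):
    """Deterministic nearest-timestamp match (REG-style sketch):
    return the exported timestamp closest to the request, within
    `precision`; ties break toward the earlier stamp."""
    candidates = [t for t in stamps if abs(t - t_req) <= precision]
    if not candidates:
        return None  # no exported timestamp is close enough
    return min(candidates, key=lambda t: (abs(t - t_req), t))
```

Because every process in the exporter component produces the same timestamp sequence and the function is deterministic, each process independently reaches the identical decision for a given request, which is exactly the property the collective semantics rely on.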
• Approximate match is a runtime-based approach, so source code-based optimizations help little
• Different components execute at different speeds, and export/import data at their own rates
• Not all exported data are required by importer components
– Exported data, whose size might be very large, may be buffered when matching decisions cannot yet be made
• Not all processes in the same component execute at the same speed
– Some complex components can be very hard to perfectly load balance across all processes
Performance Concerns
• Execution time is composed of
– Computation Time
– Local Copy Time (might be unnecessary)
– Runtime Match Time + Remote Data Transfer Time
• The same match decisions, for each request, are made repeatedly by all exporter processes in exporter components
• Smart buffering
– Faster processes help slower processes in the same exporter component
Dissection of Execution Time
Smart Buffering
Smart Buffering
• Exported data are buffered in the framework
• A slow exporting process may be able to avoid memory copies, based on
– Its responses for previously received import requests (self-help)
– The responses for previous requests satisfied by the fastest process in the same component (buddy-help)
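The self-help/buddy-help idea can be sketched as follows. This is an illustrative Python model, not the actual POSIX-thread implementation; it assumes import requests arrive with increasing timestamps, and the class and method names are hypothetical:

```python
class SmartBuffer:
    """Sketch of smart buffering: a process buffers an exported
    timestamp only if it might still satisfy a future request."""

    def __init__(self):
        self.matched = {}       # request timestamp -> matched export timestamp
        self.last_request = None

    def record_match(self, t_req, t_match):
        """Record a match decision -- either this process's own earlier
        reply (self-help) or one made by the fastest process in the
        component (buddy-help)."""
        self.matched[t_req] = t_match
        if self.last_request is None or t_req > self.last_request:
            self.last_request = t_req

    def should_buffer(self, t_export):
        """Return False when the memory copy can be skipped: data older
        than the newest satisfied request, and not equal to any matched
        timestamp, can never be requested again (assuming monotonically
        increasing request timestamps)."""
        if self.last_request is None:
            return True  # no match information yet: must buffer
        if t_export <= self.last_request and t_export not in self.matched.values():
            return False
        return True
```

A slow process that learns (from the fastest process) that request T3.1 was matched to T3 can then skip buffering its own copies of T1 and T2 entirely, saving the local copy time identified in the execution-time breakdown.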
Smart Buffering Example
[Figure: the fastest process has already matched an import request (region Req, timestamp treq); a slower process arriving at the same timestamps reuses that match decision]
Load Balance
• No assumptions about load balance inside each component
• Smart buffering will help with load imbalance at runtime
– Slower processes can avoid some unnecessary work (memory copies)
– Component tunes itself at runtime when some processes fall behind
– Framework-level approach – no restrictions on algorithms/applications
Micro-Benchmark Experiment
• u_tt = u_xx + u_yy + f(t,x,y), solve the 2-d diffusion equation by the finite element method
• A 1024x1024 distributed array is evenly distributed over the participating processes
• The importer component U: 4/8/16/32 P4 2.8GHz processors, connected by Myrinet
• The exporter component F: 4 PIII-650 processors, connected by channel-bonded Fast Ethernet
• The two clusters are connected by Gigabit Ethernet
• 1001 data objects exported, 50 data objects transferred (20:1)
• One process (fs) in the exporter component F performs extra computation – measuring its data exporting time
• Smart buffering can be observed when fs falls (far) behind the other processes
Smart Buffering Results – 8 Importer Processes
Exporter component does NOT run slower
[Figure: data exporting time for the slowest process; buffering only matched data is the optimal state]
Smart Buffering Results – 32 Importer Processes
Exporter component runs more slowly from the beginning
Smart Buffering Results – 16 Importer Processes
[Figure: data exporting time for the slowest process – nearly no skips at first, then some skips, then entering the optimal state]
Related Work
• Parallel Data Redistribution
– Shared data among coupled parallel models
– InterComm (Meta-Chaos), PAWS, MCT, CUMULVS, Roccom, etc.
– MxN Working Group in the Common Component Architecture (CCA) Forum
• Coordination Languages
– Creating and coordinating execution threads in a distributed computing environment
– Linda (tuple space model + directives); Delirium, Strand (new languages); C-Linda, Fortran-M (extending old languages); plus many others
Conclusion
• Described a runtime-based approach to speed up slower processes in the same exporter component in (loosely) coupled simulations
• Tries to minimize unnecessary buffering of exported data that ends up not being transferred during component execution; no post-processing in the simulation components, or other tools, is needed
• Perfect synchronization across participating components is not required – can especially benefit hard-to-load-balance components
Future Work
• Investigate buffering issues between processes, such as non-blocking transfers or RDMA over InfiniBand
• Performance optimizations for slow importers (pattern-based semantic cache)
• Applying the framework to a set of large-scale coupled scientific applications from the space weather domain (in progress)
The End
Questions?
Supported matching policies
<importer request, exporter matched, desired precision> = <x, f(x), p>
• LUB: minimum f(x) with f(x) ≥ x
• GLB: maximum f(x) with f(x) ≤ x
• REG: f(x) minimizing |f(x)−x|, with |f(x)−x| ≤ p
• REGU: f(x) minimizing f(x)−x, with 0 ≤ f(x)−x ≤ p
• REGL: f(x) minimizing x−f(x), with 0 ≤ x−f(x) ≤ p
• FASTR: any f(x) with |f(x)−x| ≤ p
• FASTU: any f(x) with 0 ≤ f(x)−x ≤ p
• FASTL: any f(x) with 0 ≤ x−f(x) ≤ p
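The deterministic policies above translate directly into timestamp-selection functions. A minimal sketch in Python (function names are illustrative, not the actual library API; each returns None when no exported timestamp satisfies the constraint):

```python
def lub(x, stamps):
    """LUB: minimum f(x) with f(x) >= x."""
    candidates = [t for t in stamps if t >= x]
    return min(candidates) if candidates else None

def glb(x, stamps):
    """GLB: maximum f(x) with f(x) <= x."""
    candidates = [t for t in stamps if t <= x]
    return max(candidates) if candidates else None

def reg(x, stamps, p):
    """REG: f(x) minimizing |f(x)-x|, subject to |f(x)-x| <= p."""
    candidates = [t for t in stamps if abs(t - x) <= p]
    return min(candidates, key=lambda t: abs(t - x)) if candidates else None

def regu(x, stamps, p):
    """REGU: f(x) minimizing f(x)-x, subject to 0 <= f(x)-x <= p."""
    candidates = [t for t in stamps if 0 <= t - x <= p]
    return min(candidates) if candidates else None

def regl(x, stamps, p):
    """REGL: f(x) minimizing x-f(x), subject to 0 <= x-f(x) <= p."""
    candidates = [t for t in stamps if 0 <= x - t <= p]
    return max(candidates) if candidates else None
```

The FAST variants relax the minimization to "any candidate satisfying the constraint", allowing the exporter to answer with the first acceptable timestamp it finds instead of the best one.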