TRANSCRIPT
PADS 2010, Georgia Institute of Technology, Atlanta, GA, USA
Exploring Multi-Grained Parallelism in Compute-Intensive DEVS Simulations
Qi (Jacky) Liu and Gabriel Wainer
Department of Systems and Computer Engineering
Carleton University
Ottawa, Canada
Outline
Motivation & Background
Fine-Grained Event Parallelism
Parallel DEVS Simulation on Cell
Experimental Results
Conclusion & Future Work
Event Processing Kernel
Motivation
Accelerate general-purpose DEVS-based simulations on heterogeneous CMP architectures like the Cell processor
Develop new parallelization strategies based on fine-grained event-level parallelism inherent in the simulation process
Exploit multi-grained parallelism simultaneously at different levels of the system
Allow general users to gain performance transparently w/o being distracted by multicore programming details
Provide some generalizable methods & insight for PDES on emerging CMP architectures
Cell Processor Overview
» Nine-core heterogeneous CMP with two distinct ISAs
» Software-managed LS with explicitly-addressed DMA transfer
» Low-latency EIB channels – 32-bit mailbox & signal messages
Discrete-EVent System Specification (DEVS)
[Figure: hierarchical coupled DEVS model with components M1–M4]
Parallel DEVS (P-DEVS) Formalism
Cell-DEVS Formalism
Layered View of M&S
Parallel Simulation with CD++
Flat LP Structure
Structured Simulation Process
Event types:
» (I) LP and model initialization
» (@) model output
» (*) model state transition
» (D) model synchronization
» (X) model input data
» (Y) model output data
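The event kinds listed above can be sketched as a simple dispatch table. This is an illustrative sketch only: the handler names and bodies are hypothetical placeholders, not the CD++ API.

```python
# Illustrative sketch of dispatching the CD++ event kinds listed above.
# Handler names and bodies are hypothetical placeholders, not the CD++ API.

def make_dispatcher(handlers):
    """Return a function that routes an event to its handler by kind."""
    def dispatch(kind, payload):
        if kind not in handlers:
            raise ValueError(f"unknown event kind: {kind!r}")
        return handlers[kind](payload)
    return dispatch

# One entry per event kind from the legend above.
handlers = {
    "I": lambda p: f"init {p}",        # (I) LP and model initialization
    "@": lambda p: f"output {p}",      # (@) model output
    "*": lambda p: f"transition {p}",  # (*) model state transition
    "D": lambda p: f"sync {p}",        # (D) model synchronization
    "X": lambda p: f"input {p}",       # (X) model input data
    "Y": lambda p: f"out-data {p}",    # (Y) model output data
}

dispatch = make_dispatcher(handlers)
```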
Fine-Grained Event Parallelism
Event-embarrassing parallelism
» Independent events within a step
» Executed in an arbitrary order
Event-streaming parallelism
» Causally-related events between consecutive steps
» Executed in a pipelined fashion
Phase-changing events
» Exchanged between NC & FC
» Natural fork & join points
Data-flow oriented parallelization
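The two kinds of event parallelism above can be expressed as a minimal sketch (this is not the CD++/Cell implementation): independent events within a step run concurrently in any order, while consecutive steps stay causally ordered.

```python
from concurrent.futures import ThreadPoolExecutor

def run_step(events, worker, max_workers=4):
    """Event-embarrassing parallelism: events within one step are
    independent, so they may execute concurrently in any order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, events))

def run_simulation(steps, worker):
    """Event-streaming parallelism is modeled here only by ordering:
    each step's results would feed the next step in a real pipeline."""
    results = []
    for events in steps:
        results.append(run_step(events, worker))
    return results
```

In a true pipelined execution, the next step would begin consuming outputs before the current step fully drains; the sketch keeps the steps strictly sequential for clarity.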
Event Processing Kernel
Hydrological Watershed Simulation
» 320×320×2 cell space with 204,800 Simulators
» Compute-intensive state transitions
» Over 300 million events across 663 phases
» Cell-DEVS model defined in the CD++ specification language
Simulation Profile on the PPE (SEK)
» Concurrent execution across SPEs – 98.02% (event-embarrassing parallelism)
» Pipelined execution between PPE & SPEs – 1.15% (event-streaming parallelism)
Parallel DEVS Simulation on Cell - Overview
Vector Parallelism (SPE SIMD)
Thread Parallelism
Event-Embarrassing Parallelism
Event-Streaming Parallelism (two-stage pipeline)
Data-Streaming Parallelism (double-buffered DMA at three layers)
Compute-I/O Parallelism
Parallel DEVS Simulation on Cell – LP Virtualization
Purpose
» Map active Simulators to a limited group of SPE threads
» Fit into the small on-chip LS
» Assign each SPE a reusable task operating on a stream of data
» Facilitate fine-grained dynamic load-balancing between SPEs
Solution
» Turn Simulators (and associated atomic models) into virtual LPs
» Separate event-processing logic (wrapped in SPE threads) from state data (maintained in main memory buffers)
» Match the states of active Simulators to available SPE threads dynamically at each virtual time – SEK job scheduling
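The LP-virtualization idea above can be sketched minimally: simulator state lives in main-memory buffers, while a small fixed pool of reusable worker slots (standing in for SPE threads) is matched to active states each virtual time. The names (`VirtualLP`, `process_step`) are hypothetical, not CD++/Cell code.

```python
# Illustrative sketch of LP virtualization as described above; not the
# actual CD++/Cell implementation.

class VirtualLP:
    """A Simulator reduced to pure state, with no dedicated thread."""
    def __init__(self, sim_id, state):
        self.sim_id = sim_id
        self.state = state

def process_step(active_lps, num_slots, transition):
    """Match active simulator states to a fixed pool of worker slots,
    reusing each slot for many virtual LPs within one virtual time."""
    for start in range(0, len(active_lps), num_slots):
        for lp in active_lps[start:start + num_slots]:
            # On Cell, each slot in the batch would run on its own SPE;
            # here the batch is processed sequentially for clarity.
            lp.state = transition(lp.state)
    return active_lps
```

Because the pool is decoupled from the LPs, the number of Simulators can far exceed the number of threads, which is what lets 204,800 Simulators fit a handful of SPEs.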
Parallel DEVS Simulation on Cell – More Details
» Virtual Simulator state management
» Decentralized event management
Parallel DEVS Simulation on Cell – More Details
Rule Evaluation on SPEs
SEK Job Scheduling
Platform and Configuration
IBM BladeCenter QS22
» 3.2 GHz PowerXCell 8i × 2
» 32 GB RAM
Red Hat Enterprise Linux 5.2
IBM SDK for Multicore Acceleration 3.1
Parallel DEVS simulator on Cell: CD++/Cell
SEK job scheduling policy
» round-robin or shortest-queue-first
CD++ event logging turned off
» minimizes the impact of file I/O
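The two SEK scheduling policies named above can be sketched as follows. This is an illustrative sketch, not the CD++/Cell implementation; job cost is stood in for by string length.

```python
import heapq

def round_robin(jobs, n):
    """Assign job i to queue i mod n, ignoring per-job cost."""
    queues = [[] for _ in range(n)]
    for i, job in enumerate(jobs):
        queues[i % n].append(job)
    return queues

def shortest_queue_first(jobs, n, cost=len):
    """Place each job on the currently least-loaded queue."""
    heap = [(0, i) for i in range(n)]   # (accumulated cost, queue index)
    heapq.heapify(heap)
    queues = [[] for _ in range(n)]
    for job in jobs:
        load, i = heapq.heappop(heap)
        queues[i].append(job)
        heapq.heappush(heap, (load + cost(job), i))
    return queues
```

Round-robin is cheaper to compute; shortest-queue-first balances better when per-event work varies across Simulators.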
Total Simulation Time with Watershed Model
Performance gain with just one SPE: 5.84×
» OO C++ code on PPE vs. SIMD-aware C code on SPEs
» Memory latency & cache misses vs. data locality & double-buffered DMA
» Low-level optimizations on SPEs (LS data alignment, call stack usage, branch minimization, loop unrolling, inline substitution, pipelined event execution)
Overall performance with 8 SPEs: 33.06×
Speedups over the (PPE with 1 SPE) Version
Speedup grows more slowly as SPEs are added
» Higher overhead for SEK job scheduling and orchestration
» Increased DMA contention & channel stalls
Conclusion
Formalism-Based Design Methodology
» Facilitate model reuse & portability
» Reduce validation & verification cost
Performance-Centric Approach
» Accelerate event processing for compute-intensive DEVS models
» Minimize communication & synchronization overhead
» Achieve fine-grained dynamic load balancing
New Parallelization Strategy for PDES
» Exploit fine-grained event parallelism from a data-flow perspective
» Combine multi-grained parallelism at different system levels
» Break LP boundaries with LP virtualization
Insight for PDES on Heterogeneous CMP Architectures
» Match workload characteristics to functional specialization of cores
» Address data locality, memory latency, & code optimization issues
Future Work
Porting different types of models to Cell for performance testing
» Transparency
» Minimal knowledge (and learning curve) required from users
Integrating with existing conservative/optimistic approaches
» Combine cluster-level LP-based conservative simulation using both synchronous & asynchronous algorithms
» Combine cluster-level Time Warp optimistic simulation using Lightweight Time Warp (DS-RT 2008, PADS 2009)
Testing on large-scale hybrid supercomputers
Using the Cell processor in new ways
This research was supported in part
by the MITACS Accelerate Ontario program, Canada,
and by the IBM T. J. Watson Research Center, NY.
http://www.sce.carleton.ca/~liuqi/
ARS Lab: http://cell-devs.sce.carleton.ca/ars/
Questions?
Some Applications
Battlefield Simulations
Crowd Behavior & Evacuation Analysis
Defense & Emergency Planning
Some Applications
Biomedical & Environmental Analysis
Presynaptic Nerve
Krebs cycle in living organisms
Forest fire propagation
Watershed formation
Deformable Membrane