TRANSCRIPT
PADS 2010, Georgia Institute of Technology, Atlanta, GA, USA
Exploring Multi-Grained Parallelism in Compute-Intensive DEVS Simulations
Qi (Jacky) Liu and Gabriel Wainer
Department of Systems and Computer Engineering
Carleton University
Ottawa, Canada
Outline
Motivation & Background
Fine-Grained Event Parallelism
Parallel DEVS Simulation on Cell
Experimental Results
Conclusion & Future Work
Event Processing Kernel
Motivation
Accelerate general-purpose DEVS-based simulations on heterogeneous CMP architectures like the Cell processor
Develop new parallelization strategies based on fine-grained event-level parallelism inherent in the simulation process
Exploit multi-grained parallelism simultaneously at different levels of the system
Allow general users to gain performance transparently w/o being distracted by multicore programming details
Provide some generalizable methods & insight for PDES on emerging CMP architectures
Cell Processor Overview
» Nine-core heterogeneous CMP with two distinct ISAs
» Software-managed LS with explicitly-addressed DMA transfer
» Low-latency EIB channels – 32-bit mailbox & signal messages
Discrete-EVent System Specification (DEVS)
[Figure: hierarchical coupled DEVS model with components M1–M4]
Parallel DEVS (P-DEVS) Formalism
Cell-DEVS Formalism
Layered View of M&S
Parallel Simulation with CD++
Flat LP Structure
Structured Simulation Process
Event types:
» (I) LP and model initialization
» (@) model output
» (*) model state transition
» (D) model synchronization
» (X) model input data
» (Y) model output data
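The event kinds listed above can be sketched as a simple dispatch table. This is an illustrative sketch only: the handler names and bodies are hypothetical placeholders, not the CD++ API.

```python
# Illustrative sketch of dispatching the CD++ event kinds listed above.
# Handler names and bodies are hypothetical placeholders, not the CD++ API.

def make_dispatcher(handlers):
    """Return a function that routes an event to its handler by kind."""
    def dispatch(kind, payload):
        if kind not in handlers:
            raise ValueError(f"unknown event kind: {kind!r}")
        return handlers[kind](payload)
    return dispatch

# One entry per event kind from the legend above.
handlers = {
    "I": lambda p: f"init {p}",        # (I) LP and model initialization
    "@": lambda p: f"output {p}",      # (@) model output
    "*": lambda p: f"transition {p}",  # (*) model state transition
    "D": lambda p: f"sync {p}",        # (D) model synchronization
    "X": lambda p: f"input {p}",       # (X) model input data
    "Y": lambda p: f"out-data {p}",    # (Y) model output data
}

dispatch = make_dispatcher(handlers)
```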
Fine-Grained Event Parallelism
Event-embarrassing parallelism
» Independent events within a step
» Executed in an arbitrary order
Event-streaming parallelism
» Causally-related events between consecutive steps
» Executed in a pipelined fashion
Phase-changing events
» Exchanged between NC & FC
» Natural fork & join points
Data-flow oriented parallelization
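The two kinds of event parallelism above can be expressed as a minimal sketch (this is not the CD++/Cell implementation): independent events within a step run concurrently in any order, while consecutive steps stay causally ordered.

```python
from concurrent.futures import ThreadPoolExecutor

def run_step(events, worker, max_workers=4):
    """Event-embarrassing parallelism: events within one step are
    independent, so they may execute concurrently in any order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, events))

def run_simulation(steps, worker):
    """Event-streaming parallelism is modeled here only by ordering:
    each step's results would feed the next step in a real pipeline."""
    results = []
    for events in steps:
        results.append(run_step(events, worker))
    return results
```

In a true pipelined execution, the next step would begin consuming outputs before the current step fully drains; the sketch keeps the steps strictly sequential for clarity.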
Event Processing Kernel
Hydrological Watershed Simulation
» 320×320×2 cell space with 204,800 Simulators
» Compute-intensive state transitions
» Over 300 million events across 663 phases
» Cell-DEVS model defined in the CD++ specification language
Simulation Profile on the PPE (SEK)
» Concurrent execution across SPEs – 98.02% (event-embarrassing parallelism)
» Pipelined execution between PPE & SPEs – 1.15% (event-streaming parallelism)
Parallel DEVS Simulation on Cell - Overview
Vector Parallelism (SPE SIMD)
Thread Parallelism
Event-Embarrassing Parallelism
Event-Streaming Parallelism (two-stage pipeline)
Data-Streaming Parallelism (double-buffered DMA at three layers)
Compute-I/O Parallelism
Parallel DEVS Simulation on Cell – LP Virtualization
Purpose
» Map active Simulators to a limited group of SPE threads
» Fit into the small on-chip LS
» Assign each SPE a reusable task operating on a stream of data
» Facilitate fine-grained dynamic load-balancing between SPEs
Solution
» Turn Simulators (and associated atomic models) into virtual LPs
» Separate event-processing logic (wrapped in SPE threads) from state data (maintained in main memory buffers)
» Match the states of active Simulators to available SPE threads dynamically at each virtual time – SEK job scheduling
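The LP-virtualization idea above can be sketched minimally: simulator state lives in main-memory buffers, while a small fixed pool of reusable worker slots (standing in for SPE threads) is matched to active states each virtual time. The names (`VirtualLP`, `process_step`) are hypothetical, not CD++/Cell code.

```python
# Illustrative sketch of LP virtualization as described above; not the
# actual CD++/Cell implementation.

class VirtualLP:
    """A Simulator reduced to pure state, with no dedicated thread."""
    def __init__(self, sim_id, state):
        self.sim_id = sim_id
        self.state = state

def process_step(active_lps, num_slots, transition):
    """Match active simulator states to a fixed pool of worker slots,
    reusing each slot for many virtual LPs within one virtual time."""
    for start in range(0, len(active_lps), num_slots):
        for lp in active_lps[start:start + num_slots]:
            # On Cell, each slot in the batch would run on its own SPE;
            # here the batch is processed sequentially for clarity.
            lp.state = transition(lp.state)
    return active_lps
```

Because the pool is decoupled from the LPs, the number of Simulators can far exceed the number of threads, which is what lets 204,800 Simulators fit a handful of SPEs.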
Parallel DEVS Simulation on Cell – More Details
» Virtual Simulator state management
» Decentralized event management
Parallel DEVS Simulation on Cell – More Details
Rule Evaluation on SPEs
SEK Job Scheduling
Platform and Configuration
IBM BladeCenter QS22
» 3.2 GHz PowerXCell 8i × 2
» 32 GB RAM
Red Hat Enterprise Linux 5.2
IBM SDK for Multicore Acceleration 3.1
Parallel DEVS simulator on Cell: CD++/Cell
SEK job scheduling policy
» round-robin or shortest-queue-first
CD++ event logging turned off
» minimizes the impact of file I/O
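The two SEK scheduling policies named above can be sketched as follows. This is an illustrative sketch, not the CD++/Cell implementation; job cost is stood in for by string length.

```python
import heapq

def round_robin(jobs, n):
    """Assign job i to queue i mod n, ignoring per-job cost."""
    queues = [[] for _ in range(n)]
    for i, job in enumerate(jobs):
        queues[i % n].append(job)
    return queues

def shortest_queue_first(jobs, n, cost=len):
    """Place each job on the currently least-loaded queue."""
    heap = [(0, i) for i in range(n)]   # (accumulated cost, queue index)
    heapq.heapify(heap)
    queues = [[] for _ in range(n)]
    for job in jobs:
        load, i = heapq.heappop(heap)
        queues[i].append(job)
        heapq.heappush(heap, (load + cost(job), i))
    return queues
```

Round-robin is cheaper to compute; shortest-queue-first balances better when per-event work varies across Simulators.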
Total Simulation Time with Watershed Model
Performance gain with just one SPE: 5.84×
» OO C++ code on PPE vs. SIMD-aware C code on SPEs
» Memory latency & cache misses vs. data locality & double-buffered DMA
» Low-level optimizations on SPEs (LS data alignment, call stack usage, branch minimization, loop unrolling, inline substitution, pipelined event execution)
Overall performance with 8 SPEs: 33.06×
Speedups over the (PPE with 1 SPE) Version
Speedup grows more slowly as SPEs are added
» Higher overhead for SEK job scheduling and orchestration
» Increased DMA contention & channel stalls
Conclusion
Formalism-Based Design Methodology
» Facilitate model reuse & portability
» Reduce validation & verification cost
Performance-Centric Approach
» Accelerate event processing for compute-intensive DEVS models
» Minimize communication & synchronization overhead
» Achieve fine-grained dynamic load balancing
New Parallelization Strategy for PDES
» Exploit fine-grained event parallelism from a data-flow perspective
» Combine multi-grained parallelism at different system levels
» Break LP boundaries with LP virtualization
Insight for PDES on Heterogeneous CMP Architectures
» Match workload characteristics to functional specialization of cores
» Address data locality, memory latency, & code optimization issues
Future Work
Porting different types of models to Cell for performance testing
» Transparency
» Minimal knowledge (and learning curve) required from users
Integrating with existing conservative/optimistic approaches
» Combine cluster-level LP-based conservative simulation using both synchronous & asynchronous algorithms
» Combine cluster-level Time Warp optimistic simulation using Lightweight Time Warp (DS-RT 2008, PADS 2009)
Testing on large-scale hybrid supercomputers
Using the Cell processor in new ways
This research was supported in part
by the MITACS Accelerate Ontario program, Canada,
and by the IBM T. J. Watson Research Center, NY.
http://www.sce.carleton.ca/~liuqi/
ARS Lab: http://cell-devs.sce.carleton.ca/ars/
Questions?
Some Applications
Battlefield Simulations
Crowd Behavior & Evacuation Analysis
Defense & Emergency Planning
Some Applications
Biomedical & Environmental Analysis
Presynaptic Nerve
Krebs cycle in living organisms
Forest fire propagation
Watershed formation
Deformable Membrane