How to Create Applications With Multi-million Way Parallelism

WIMPS: Feb 2002 PPL-Dept of Computer Science, UIUC

Laxmikant (Sanjay) Kale
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
http://charm.cs.uiuc.edu



Group Mission and Approach

• To enhance performance and productivity in programming complex parallel applications
  – Performance: scalable to very large numbers of processors
  – Productivity: of human programmers
  – Complex: irregular structure, dynamic variations
• Approach: application-oriented yet CS-centered research
  – Develop enabling technology for a wide collection of applications
  – Develop, use, and test it in the context of real applications
  – Develop a standard library of reusable parallel components


Multi-partition Decomposition

• Idea: divide the computation into a large number of pieces
  – Independent of the number of processors
  – Typically larger than the number of processors
  – Let the system map entities to processors
• Optimal division of labor between “system” and programmer:
  – Decomposition done by the programmer
  – Everything else automated
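
The division of labor above can be sketched in a few lines. This is plain Python, not Charm++; the names `decompose` and `map_pieces` and the round-robin mapping policy are hypothetical and chosen only for illustration. The point is that the decomposition is written once and stays the same when the processor count changes.

```python
def decompose(n_items, n_pieces):
    """Programmer's job: split n_items of work into n_pieces contiguous ranges.

    The piece count is chosen from the problem, not from the processor count.
    """
    base, extra = divmod(n_items, n_pieces)
    pieces, start = [], 0
    for i in range(n_pieces):
        size = base + (1 if i < extra else 0)
        pieces.append(range(start, start + size))
        start += size
    return pieces


def map_pieces(n_pieces, n_processors):
    """System's job (hypothetical policy): round-robin pieces onto processors."""
    return {piece: piece % n_processors for piece in range(n_pieces)}


pieces = decompose(n_items=1000, n_pieces=64)   # same decomposition ...
mapping_on_4 = map_pieces(len(pieces), 4)       # ... mapped onto 4 processors
mapping_on_16 = map_pieces(len(pieces), 16)     # ... or onto 16, unchanged
```

A real runtime would replace the round-robin policy with a measured, adaptive one; only `map_pieces` would change.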


Object-based Parallelization

[Figure: “User View” and “System implementation”]

User is only concerned with interaction between objects


Charm++

• Parallel C++ with data-driven objects
• Object arrays / object collections
• Object groups:
  – Global object with a “representative” on each PE
• Asynchronous method invocation
• Prioritized scheduling
• Information-sharing abstractions: readonly, tables, ...
• Mature, robust, portable
• http://charm.cs.uiuc.edu


Data driven execution

[Figure: each processor shown with its own Scheduler and Message Q]
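
A minimal sketch of data-driven execution on one processor, assuming a single priority queue per scheduler. It is plain Python, not the Charm++ scheduler; the `Scheduler` and `Counter` classes and the convention that a lower priority number is served first are assumptions made for this example.

```python
import heapq
import itertools


class Scheduler:
    """One per processor: repeatedly picks a queued message and runs its method."""

    def __init__(self):
        self._queue = []                  # the processor's message queue
        self._order = itertools.count()   # tie-breaker keeps FIFO order

    def send(self, obj, method, *args, priority=0):
        # Asynchronous invocation: enqueue and return immediately.
        heapq.heappush(self._queue, (priority, next(self._order), obj, method, args))

    def run(self):
        # Execution is driven by message availability, not by a fixed program order.
        while self._queue:
            _, _, obj, method, args = heapq.heappop(self._queue)
            getattr(obj, method)(*args)


class Counter:
    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.log = []

    def add(self, value):
        self.log.append(value)
        if value < 3:
            # An executing method may itself send further messages.
            self.scheduler.send(self, "add", value + 1)


sched = Scheduler()
counter = Counter(sched)
sched.send(counter, "add", 10, priority=5)   # lower number = served first
sched.send(counter, "add", 1, priority=0)
sched.run()
```

After `run()` returns, `counter.log` holds the values in the order the scheduler chose to execute them, which differs from the order in which they were sent.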


Load Balancing Framework

• Based on object migration
  – Partitions implemented as objects (or threads) are mapped to available processors by the LB framework
• Measurement-based load balancers:
  – Principle of persistence
    • Computational loads and communication patterns tend to persist
  – Runtime system measures actual computation times of every partition, as well as communication patterns
• Variety of “plug-in” LB strategies available
  – Including those for situations when the principle of persistence does not apply
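
One possible plug-in strategy, sketched under simplifying assumptions: a greedy heuristic that uses only measured computation times and ignores communication patterns. It is plain Python, not one of the Charm++ balancers; `greedy_rebalance` and the sample loads are hypothetical.

```python
import heapq


def greedy_rebalance(measured_load, n_processors):
    """Assign objects to processors, heaviest first, each to the least-loaded processor.

    measured_load: {object_id: seconds measured during recent time-steps}.
    Relies on the principle of persistence: recent loads predict future loads.
    """
    heap = [(0.0, pe) for pe in range(n_processors)]   # (total load, processor)
    heapq.heapify(heap)
    placement = {}
    for obj, load in sorted(measured_load.items(), key=lambda kv: -kv[1]):
        total, pe = heapq.heappop(heap)
        placement[obj] = pe
        heapq.heappush(heap, (total + load, pe))
    per_pe = [0.0] * n_processors
    for obj, pe in placement.items():
        per_pe[pe] += measured_load[obj]
    return placement, per_pe


loads = {"a": 8.0, "b": 7.0, "c": 6.0, "d": 5.0, "e": 4.0, "f": 2.0}
placement, per_pe = greedy_rebalance(loads, n_processors=2)
```

The returned `placement` is what a migration step would act on: any object whose new processor differs from its current one would be moved.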


Building on Object-based Parallelism

• Application-induced load imbalances
• Environment-induced performance issues:
  – Dealing with extraneous loads on shared machines
  – Vacating workstations
  – Heterogeneous clusters
  – Shrinking and expanding jobs to available PEs
• Object “migration”: novel uses
  – Automatic checkpointing
  – Automatic prefetching for out-of-core execution
• Reuse: object-based components


Applications

• Charm++ developed in the context of real applications
• Current applications we are involved with:
  – Molecular dynamics
  – Crack propagation
  – Rocket simulation: fluid dynamics + structures +
  – QM/MM: material properties via quantum mechanics
  – Cosmology simulations: parallel analysis + visualization
  – Cosmology: gravitational with multiple timestepping


Molecular Dynamics

• Collection of [charged] atoms, with bonds
• Newtonian mechanics
• At each time-step
  – Calculate forces on each atom
    • Bonds:
    • Non-bonded: electrostatic and van der Waals
  – Calculate velocities and advance positions
• 1 femtosecond time-step, millions needed!
• Thousands of atoms (1,000 - 100,000)
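
The per-time-step loop above can be sketched as follows. This is a toy, sequential illustration in plain Python with several assumptions: only the non-bonded electrostatic term is computed (no bonds, no van der Waals), units are reduced so the Coulomb constant is 1, and the integrator is a simple semi-implicit Euler step rather than whatever a production MD code uses.

```python
def pair_forces(positions, charges):
    """Non-bonded electrostatic forces only (Coulomb, constant set to 1)."""
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = [positions[i][k] - positions[j][k] for k in range(3)]
            r2 = sum(c * c for c in d)
            scale = charges[i] * charges[j] / (r2 ** 1.5)
            for k in range(3):
                forces[i][k] += scale * d[k]   # like charges repel
                forces[j][k] -= scale * d[k]   # Newton's third law
    return forces


def time_step(positions, velocities, charges, masses, dt):
    """One step: forces -> velocities -> positions (semi-implicit Euler)."""
    forces = pair_forces(positions, charges)
    for i in range(len(positions)):
        for k in range(3):
            velocities[i][k] += dt * forces[i][k] / masses[i]
            positions[i][k] += dt * velocities[i][k]


pos = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
vel = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
time_step(pos, vel, charges=[1.0, 1.0], masses=[1.0, 1.0], dt=0.01)
```

The all-pairs force loop is quadratic in the number of atoms, and the step must be repeated millions of times, which is why the force calculation is the part worth decomposing into many parallel pieces.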


BC1 complex: 200k atoms


Performance Data: SC2000

[Figure: Speedup on ASCI Red: BC1 (200k atoms). x-axis: Processors (0 to 2500); y-axis: Speedup (0 to 1400)]


Charm++ Is a Good Match for M-PIM

• Encapsulation: objects
• Cost model:
  – Object data, read-only data, remote data
• Migration and resource management: automatic
• One-sided communication: since the beginning
• Asynchronous global operations (reductions, ...)
• Modularity: see 1996 paper
• Acceptability:
  – C++
  – Now also: AMPI on top of Charm++


AMPI: Goals

• Runtime adaptivity for MPI programs
  – Based on multi-domain decomposition and dynamic load balancing features of Charm++
  – Minimal changes to the original MPI code
  – Full MPI 1.1 standard compliance
  – Additional support for coupled codes
  – Automatic conversion of existing MPI programs

[Figure labels: Original MPI Code, AMPIzer, AMPI Code, AMPI Runtime]


How Good Is the Programmability?

• I.e., do programmers find it easy/good?
  – We think so
  – Certainly a good intermediate-level model
• Higher-level abstractions can be built on it
• But what kinds of abstractions?
• We think domain-specific ones


Specialization

[Figure labels: MPI, HPF, Charm++, domain-specific frameworks; expression, scheduling, mapping, decomposition]


Further Match With M-PIM

• Ability to predict:
  – Which data is going to be needed and
  – Which code will execute
  – Based on the ready queue of object method invocations


Remember data driven execution?

[Figure: each processor shown with its own Scheduler and Message Q]


Further Match With M-PIM

• Ability to predict:
  – Which data is going to be needed and
  – Which code will execute
  – Based on the ready queue of object method invocations
  – So, we can:
    • Prefetch data accurately
    • Prefetch code if needed

[Figure: schedulers (S) and message queues (Q)]
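
A sketch of why the ready queue makes prefetching accurate: the scheduler can look at the next few queued invocations and fetch exactly the objects they name. This is plain Python and purely illustrative; `run_with_prefetch`, the `lookahead` parameter, and the dictionary standing in for remote memory are assumptions, and the prefetch here is sequential, whereas real hardware would overlap it with execution.

```python
from collections import deque


def run_with_prefetch(ready_queue, remote_store, lookahead=2):
    """Run queued invocations, prefetching the data that upcoming ones will need.

    ready_queue: deque of (object_id, method_name) invocations, in execution order.
    remote_store: {object_id: data} standing in for slow, far-away memory.
    """
    local = {}            # data already brought close to the processor
    trace = []            # (invocation, was its data already local?)
    while ready_queue:
        # The queue tells us exactly which objects will run next.
        for obj, _ in list(ready_queue)[1:1 + lookahead]:
            local.setdefault(obj, remote_store[obj])      # prefetch
        obj, method = ready_queue.popleft()
        hit = obj in local
        local.setdefault(obj, remote_store[obj])          # demand fetch on a miss
        trace.append(((obj, method), hit))
    return trace


queue = deque([("A", "step"), ("B", "step"), ("C", "step"), ("A", "reduce")])
trace = run_with_prefetch(queue, {"A": [1], "B": [2], "C": [3]})
```

Only the very first invocation misses; every later one finds its data already local because the queue announced it in advance.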


So, What Are We Doing About It?

• How to develop any programming environment for a machine that isn’t built yet?
• Blue Gene/C emulator using Charm++
  – Completed last year
  – Implements low-level BG/C API
    • Packet sends, extract packet from comm buffers
  – Emulation runs on machines with hundreds of “normal” processors
• Charm++ on Blue Gene/C emulator


Structure of the Emulators

[Figure labels: Blue Gene/C low-level API, Converse, Charm++]


Emulation on a Parallel Machine

[Figure labels: Simulating (Host) Processor, BG/C Nodes, Hardware thread]


Extensions to Charm++ for BG/C

• Microtasks:
  – Objects may fork microtasks that can be executed by any thread on the same node
  – Increases parallelism
  – Overhead: sub-microsecond
• Issue:
  – Object affinity: map to thread or node?
    • Thread, currently.
    • Microtasks alleviate load imbalance within a node
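
The microtask idea can be illustrated with a shared thread pool standing in for a node's hardware threads. This is plain Python with hypothetical `Node` and `Patch` classes, not the BG/C extension itself, and it shows only the structure: Python threads do not reproduce the sub-microsecond overhead mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor


class Node:
    """A node whose hardware threads are modelled by a shared thread pool."""

    def __init__(self, n_threads):
        self.pool = ThreadPoolExecutor(max_workers=n_threads)

    def fork(self, fn, *args):
        # A microtask is not tied to the forking object's thread:
        # any idle thread on this node may pick it up.
        return self.pool.submit(fn, *args)

    def shutdown(self):
        self.pool.shutdown(wait=True)


class Patch:
    """An object with affinity to one thread that forks work to the whole node."""

    def __init__(self, node, values):
        self.node, self.values = node, values

    def sum_of_squares(self, chunk=4):
        chunks = [self.values[i:i + chunk] for i in range(0, len(self.values), chunk)]
        futures = [self.node.fork(lambda c: sum(v * v for v in c), c) for c in chunks]
        return sum(f.result() for f in futures)


node = Node(n_threads=4)
total = Patch(node, list(range(16))).sum_of_squares()
node.shutdown()
```

The object itself stays mapped to one thread, while the chunks it forks are free to run wherever the node has spare capacity.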


Emulation efficiency

• How much time does it take to run an emulation?
  – 8 million processors being emulated on 100
  – In addition, lower cache performance
  – Lots of tiny messages
• On a Linux cluster:
  – Emulation shows good speedup


Emulator to Simulator

• Step 1: Coarse-grained simulation
  – I.e., performance prediction capability
  – Models contention for processor/thread
  – Also models communication delay based on distance
  – Doesn’t model memory access on chip, or network
  – How to do this in spite of out-of-order message delivery?
    • Rely on determinism of Charm++ programs
    • Time-stamped messages and threads
    • Parallel time-stamp correction algorithm
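
A much-simplified, sequential sketch of the coarse timing model for a single emulated thread. It is not the parallel time-stamp correction algorithm named above, whose details the slides do not give. It assumes each message carries a virtual send time, a hop count, and a work estimate, that messages are independent, and that delay is linear in distance; `predict_finish_times` and `hop_latency` are hypothetical names.

```python
def predict_finish_times(messages, hop_latency=1.0):
    """Coarse timing model for one emulated thread.

    messages: list of dicts with 'id', 'send_time', 'hops', 'work', given in the
    order the emulator happened to deliver them (possibly out of order).
    Returns {id: predicted finish time}.
    """
    def arrival(m):
        # Communication delay grows with distance.
        return m["send_time"] + hop_latency * m["hops"]

    finish = {}
    thread_free_at = 0.0
    # Correction step: replay in virtual-time order, not delivery order.
    for m in sorted(messages, key=lambda m: (arrival(m), m["id"])):
        start = max(arrival(m), thread_free_at)    # contention for the thread
        thread_free_at = start + m["work"]
        finish[m["id"]] = thread_free_at
    return finish


delivered = [
    {"id": "late",  "send_time": 5.0, "hops": 3, "work": 2.0},
    {"id": "early", "send_time": 0.0, "hops": 2, "work": 10.0},
]
finish = predict_finish_times(delivered)
```

Because the prediction depends only on the time stamps, the result is the same whichever order the emulator actually delivered the two messages in.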


Applications on the current system

• Using BG Charm++

• LeanMD:
  – Research-quality molecular dynamics
  – Version 0: only electrostatics + van der Waals
• Simple AMR kernel
  – Adaptive tree to generate millions of objects
    • Each holding a 3D array
  – Communication with “neighbors”
    • Tree makes it harder to find neighbors, but Charm makes it easy
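
A sketch of how an adaptive tree generates many leaf objects, each holding a 3D array. It is plain Python, assumes an octree (8 children per block), and uses a made-up refinement rule; neighbor lookup and the actual Charm++ object indexing are not shown.

```python
def refine(depth, max_depth, needs_refinement, index=()):
    """Yield the index (path of child numbers from the root) of every leaf block."""
    if depth < max_depth and needs_refinement(index):
        for child in range(8):                       # octree: 8 children per block
            yield from refine(depth + 1, max_depth, needs_refinement, index + (child,))
    else:
        yield index


def make_block(n=2):
    """Each leaf object holds its own small 3D array (nested lists here)."""
    return [[[0.0] * n for _ in range(n)] for _ in range(n)]


# Hypothetical refinement rule: always refine, except stop early under child 0.
def rule(index):
    return 0 not in index


leaves = {index: make_block() for index in refine(0, 3, rule)}
```

With uniform refinement the leaf count is 8 to the power of the depth, so depth 7 already gives 2,097,152 leaves, which is how a small kernel reaches millions of objects.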


Emulator to Simulator

• Step 2: Add fine-grained simulation
  – Sarita Adve: RSIM-based simulation of a node
    • SMP node first
  – Millions of thread units/caches to simulate in detail?
• Step 3: Hybrid simulation
  – Instead: use detailed simulation to build model
  – Drive coarse simulation using model behavior


Summary

• Charm++ (data-driven migratable objects)
  – is a well-matched candidate programming model for M-PIMs
• We have developed an emulator/simulator
  – For BG/C
  – Runs on parallel machines
• We have implemented multi-million object applications using Charm++
  – And tested on emulated Blue Gene/C
• More info: http://charm.cs.uiuc.edu