Transcript

CS 267, Spring 2008
Horst Simon
UC Berkeley
May 15, 2008

Code Generation Framework for Process Network Models onto Parallel Platforms

Man-Kit Leung, Isaac Liu, Jia Zou

Final Project Presentation

Outline

• Motivation
• Demo
• Code Generation Framework
• Application and Results
• Conclusion

Motivation

• Parallel programming is difficult…
  – Functional correctness
  – Performance debugging + tuning (basically, trial & error)
• Code generation as a tool
  – Systematically explore the implementation space
  – Rapid development / prototyping
  – Optimize performance
  – Maximize (programming) reusability
  – Correct-by-construction [E. Dijkstra ’70]
  – Minimize human errors (bugs)
  – Eliminate the need for low-level testing
  – Because, otherwise, manual coding is too costly
    • Especially true for multiprocessor/distributed platforms

Higher-Level Programming Model

[Figure: Source Actor1 and Source Actor2 feed a SinkActor, with an implicit buffer on each connection]

• A Kahn Process Network (KPN) is a distributed model of computation (MoC) in which a group of processing units is connected by communication channels to form a network of processes (see the process sketch below).
  – The communication channels are FIFO queues.
  – “The Semantics of a Simple Language for Parallel Programming” [G. Kahn ’74]
• Deterministic
• Inherently parallel
• Expressive
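To make the blocking-read semantics concrete, here is a minimal sketch (ours, not from the slides) of a KPN process as ordinary sequential code; Channel, get(), and put() are hypothetical blocking FIFO operations:

typedef struct Channel Channel;               /* opaque FIFO channel */
extern double get(Channel* ch);               /* blocking read of one token */
extern void   put(Channel* ch, double tok);   /* write of one token */

/* A sink process: each read blocks until a token arrives, so the
   output stream depends only on the input streams, not on timing. */
void SinkActor(Channel* in1, Channel* in2, Channel* out) {
    for (;;) {
        double x = get(in1);
        double y = get(in2);
        put(out, x + y);      /* stand-in for the actor's computation */
    }
}

Because a process may only block on reads and cannot test a channel for emptiness, the streams a KPN computes are the same under any schedule; that is the source of the determinism noted above.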

MPI Code Generation Workflow

Given a (KPN) model:

• Partitioning (mapping): analyze & annotate the model
  – Assume weights on edges & nodes
  – Generate cluster info (buffer & grouping)
• Code generation: generate MPI code
  – SIMD (Single Instruction Multiple Data)
• Executable: execute the code
  – Obtain execution statistics for tuning

Demo

The codegen facility is in the Ptolemy II nightly release

- http://chess.eecs.berkeley.edu/ptexternal/nightly/

Role of Code Generation

[Diagram: within Ptolemy II, models pass through Partitioning (Mapping) and Code Generation to yield an Executable]

Platform-based Design [AS ’02]

Implementation Space for a Distributed Environment

• Mapping
  – # of logical processing units
  – # of cores / processors
• Network costs
  – Latency
  – Throughput
• Memory constraint
  – Communication buffer size
• Minimization metrics
  – Costs
  – Power consumption
  – …

Partition

• Uses node and edge weight abstractions
  – Annotated on the model
• From the model, the input file to Chaco is generated (see the sketch below).
• After Chaco produces the output file, the partitions are automatically annotated back onto the model.
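To make the Chaco step concrete, here is a sketch (ours, not the actual Ptolemy II code) of emitting a Chaco graph-input file. As we understand the format, the header line gives the vertex and edge counts plus a "11" flag for node and edge weights, and each subsequent line gives one vertex's weight followed by (neighbor, edge-weight) pairs, with vertices numbered from 1:

#include <stdio.h>

#define N 3   /* toy model with 3 actors */

/* Hypothetical emitter for a Chaco-format graph input file.
   nodeWeight[i] is actor i's computation weight; edgeWeight[i][j] > 0
   is the communication weight between actors i and j (0 = no edge). */
void writeChacoInput(FILE* f, int nodeWeight[N], int edgeWeight[N][N]) {
    int edges = 0;
    for (int i = 0; i < N; i++)
        for (int j = i + 1; j < N; j++)
            if (edgeWeight[i][j] > 0)
                edges++;                      /* count undirected edges */
    fprintf(f, "%d %d 11\n", N, edges);       /* header: |V| |E| weight flags */
    for (int i = 0; i < N; i++) {
        fprintf(f, "%d", nodeWeight[i]);      /* node weight */
        for (int j = 0; j < N; j++)
            if (j != i && edgeWeight[i][j] > 0)
                fprintf(f, " %d %d", j + 1, edgeWeight[i][j]);
        fprintf(f, "\n");
    }
}

Chaco's output file then assigns each vertex (actor) to a partition, which the framework reads back and annotates onto the model.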

Multiprocessor Architectures

• Shared memory vs. message passing
  – We want to generate code that will run on both kinds of architectures
  – Message passing:
    • Message Passing Interface (MPI) as the implementation
  – Shared memory:
    • Pthread implementation available for comparison
    • UPC and OpenMP as future work

Pthread Implementation

void* Actor1(void* arg) { ... }
void* Actor2(void* arg) { ... }

void Model(void) {
    pthread_t actor1, actor2;
    /* one thread per actor */
    pthread_create(&actor1, NULL, Actor1, NULL);
    pthread_create(&actor2, NULL, Actor2, NULL);
    /* wait for both actors to finish */
    pthread_join(actor1, NULL);
    pthread_join(actor2, NULL);
}

MPI Code Generation

[Diagram: local buffers on each partition, connected across partitions by MPI send/recv with MPI tag matching]

KPN scheduling:
• Determine when actors are safe to fire
• Actors can’t block other actors on the same partition
• Termination is based on a firing count
(a per-partition scheduler sketch follows below)
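As a concrete, hypothetical illustration of this policy, the sketch below shows a per-partition loop: each actor's fire function returns without blocking when it is not safe to fire, and the loop stops after a fixed firing count. NUM_ACTORS, MAX_FIRINGS, and FireFn are our own names, not from the generated code:

#define NUM_ACTORS 2
#define MAX_FIRINGS 5000                 /* e.g., the iteration count */

typedef int (*FireFn)(void);             /* returns 1 if the actor fired */

/* Round-robin over the actors mapped to this MPI rank.  A fire
   attempt never blocks, so one stalled actor cannot prevent the
   others on this partition from making progress. */
void RunPartition(FireFn actors[NUM_ACTORS]) {
    long firings = 0;
    while (firings < MAX_FIRINGS) {
        for (int i = 0; i < NUM_ACTORS; i++) {
            if (actors[i]())             /* fires only when safe */
                firings++;
        }
    }
}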

Sample MPI Program

main() {
    ...
    if (rank == 0) {
        Actor0();
        Actor1();
    }
    if (rank == 1) {
        Actor2();
    }
    ...
}

Actor#() {
    MPI_Irecv(input);                    // post a nonblocking receive
    if (hasInput && !sendBufferFull) {   // fire only when it is safe
        output = localCalc();            // the actor's local computation
        MPI_Isend(1, output);            // nonblocking send of the result
    }
}
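The MPI_Irecv/MPI_Isend calls above are pseudocode. Below is a sketch of how the pattern might look with real MPI calls; the tag, the double token type, and the ActorFire name are our assumptions, not the generator's actual output:

#include <mpi.h>

#define TAG_CH0 0   /* assumed tag for this actor's input channel */

static MPI_Request recvReq = MPI_REQUEST_NULL;
static MPI_Request sendReq = MPI_REQUEST_NULL;
static double input, output;
static int tokenReady = 0;   /* an unconsumed input token is buffered */

void ActorFire(int dest) {
    int flag, sendDone;

    /* Keep one receive outstanding until its token is consumed. */
    if (!tokenReady) {
        if (recvReq == MPI_REQUEST_NULL)
            MPI_Irecv(&input, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_CH0,
                      MPI_COMM_WORLD, &recvReq);
        MPI_Test(&recvReq, &flag, MPI_STATUS_IGNORE);
        if (flag)
            tokenReady = 1;              /* hasInput */
    }

    /* The previous send must complete before its buffer is reused;
       MPI_Test on a null request simply reports completion. */
    MPI_Test(&sendReq, &sendDone, MPI_STATUS_IGNORE);

    if (tokenReady && sendDone) {        /* hasInput && !sendBufferFull */
        output = input * 2.0;            /* stand-in for localCalc() */
        MPI_Isend(&output, 1, MPI_DOUBLE, dest, TAG_CH0,
                  MPI_COMM_WORLD, &sendReq);
        tokenReady = 0;
    }
}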

Application

Execution Platform

NERSC Jacquard Characteristics

Processor type:                                    Opteron, 2.2 GHz
Processor theoretical peak:                        4.4 GFlops/sec
Number of application processors:                  712
System theoretical peak (computational nodes):     3.13 TFlops/sec
Number of shared-memory application nodes:         356
Number of shared-memory spare application nodes:   8 (in service when possible)
Processors per node:                               2
Physical memory per node:                          6 GBytes
Usable memory per node:                            3-5 GBytes
Number of login nodes:                             4
Switch interconnect:                               InfiniBand
Switch MPI unidirectional latency:                 4.5 usec
Switch MPI unidirectional bandwidth (peak):        620 MB/s
Global shared disk:                                GPFS
Usable disk space:                                 30 TBytes
Batch system:                                      PBS Pro

Preliminary Results

Execution time (ms) by number of cores and iteration count; Pthread results were shown for 2 cores only:

# cores   MPI 500 Iter   MPI 1000 Iter   MPI 2500 Iter   MPI 5000 Iter   Pthread 500 Iter   Pthread 1000 Iter   Pthread 2500 Iter   Pthread 5000 Iter
   2         23.0            49.0            137.6           304.0            17.9               47.1                182.0               406.0
   3         18.8            37.4             95.4           195.0
   4         19.4            38.3             97.5           193.0

Conclusion & Future Work

• Conclusion
  – A framework for code generation to parallel platforms
  – Generates scalable MPI code from Kahn Process Network models
• Future Work
  – Target more platforms (UPC, OpenMP, etc.)
  – Additional profiling techniques
  – Support more partitioning tools
  – Improve the performance of generated code

Acknowledgments

• Edward Lee
• Horst Simon
• Shoaib Kamil
• Ptolemy II developers
• NERSC
• John Kubiatowicz

Extra slides

Why MPI

• Message passing
  – Good for distributed (shared-nothing) systems
• Very generic
  – Easy to set up
  – Setup (i.e., mpicc, etc.) is required on only one “master” node
  – Worker nodes only need to have SSH
• Flexible (explicit)
  – Nonblocking + blocking send/recv
• Con: requires explicit syntax modification (as opposed to OpenMP, Erlang, etc.)
  – Solution: automatic code generation

Actor-oriented design: a formalized model of concurrency

[Diagram: a producer actor connected to a consumer actor through an IOPort and a Receiver, governed by a Director; the figure contrasts object-oriented with actor-oriented design]

• Actor-oriented design hides the state of each actor and makes it inaccessible from other actors
• The emphasis on data flow over control flow leads to conceptually concurrent execution of actors
• The interaction between actors happens in a highly disciplined way
• Threads and mutexes become an implementation mechanism instead of part of the programming model

Pthread implementation

• Each actor runs as a separate thread
• Implicit buffers (see the sketch below)
  – Each buffer has a read and a write count
  – Condition variables put threads to sleep and wake them up
  – Each buffer has a fixed capacity
• A global notion of scheduling exists
  – At the OS level
  – When all actors are in blocking-read mode, the model should terminate
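A minimal sketch of such an implicit buffer, assuming a fixed capacity and double tokens (the generated code's details may differ):

#include <pthread.h>

#define CAPACITY 64   /* assumed buffer capacity */

typedef struct {
    double data[CAPACITY];
    int readCount, writeCount;        /* total tokens read / written */
    pthread_mutex_t lock;             /* initialize with pthread_mutex_init */
    pthread_cond_t notFull, notEmpty; /* initialize with pthread_cond_init */
} Buffer;

void put(Buffer* b, double token) {
    pthread_mutex_lock(&b->lock);
    while (b->writeCount - b->readCount == CAPACITY)   /* buffer full */
        pthread_cond_wait(&b->notFull, &b->lock);
    b->data[b->writeCount % CAPACITY] = token;
    b->writeCount++;
    pthread_cond_signal(&b->notEmpty);                 /* wake the reader */
    pthread_mutex_unlock(&b->lock);
}

double get(Buffer* b) {
    pthread_mutex_lock(&b->lock);
    while (b->writeCount == b->readCount)              /* buffer empty */
        pthread_cond_wait(&b->notEmpty, &b->lock);     /* blocking read */
    double token = b->data[b->readCount % CAPACITY];
    b->readCount++;
    pthread_cond_signal(&b->notFull);                  /* wake the writer */
    pthread_mutex_unlock(&b->lock);
    return token;
}

When every actor thread is parked in the pthread_cond_wait inside get(), all actors are in blocking-read mode, which is exactly the termination condition above.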

MPI Implementation

• A mapping of actors to cores is needed.
  – A classic graph-partitioning problem:
  – Nodes: actors
  – Edges: messages
  – Node weights: computation on each actor
  – Edge weights: amount of messages communicated
  – Partitions: processors
• Chaco was chosen as the graph partitioner.

Partition Profiling

• Challenge: provide the user with enough information that node and edge weights can be annotated and modified to achieve load balancing.
  – Solution 1: Static analysis
  – Solution 2: Simulation
  – Solution 3: Dynamic load balancing
  – Solution 4: Profile the current run and feed the information back to the user

