openhpi - parallel programming concepts - week 6



DESCRIPTION

Week 6 in the OpenHPI course on parallel programming concepts is about patterns and best practices. Find the whole course at http://bit.ly/1l3uD4h.

TRANSCRIPT

Page 1: OpenHPI - Parallel Programming Concepts - Week 6

Parallel Programming Concepts OpenHPI Course Week 6: Patterns and Best Practices Unit 6.1: Parallel Programming Patterns

Dr. Peter Tröger + Teaching Team

Page 2: OpenHPI - Parallel Programming Concepts - Week 6

Summary: Week 5

■  “Shared nothing” systems provide very good scalability
  □  Adding new processing elements is not limited by “walls”
  □  Different options for interconnect technology
■  Task granularity is essential
  □  Surface-to-volume effect
  □  Task mapping problem
■  De-facto standard is MPI programming
■  High-level abstractions with
  □  Channels
  □  Actors
  □  MapReduce

“What steps / strategy would you apply to parallelize a given compute-intensive program?”

Page 3: OpenHPI - Parallel Programming Concepts - Week 6

The Parallel Programming Problem

(Figure: a parallel application must be matched to the execution environment; the environment has a type and a flexible configuration.)

Page 4: OpenHPI - Parallel Programming Concepts - Week 6

Parallelization and Design Patterns

■  Parallel programming relies on experience
  □  Identification of concurrency
  □  Identification of feasible algorithmic structures
  □  If done wrong, performance / correctness may suffer
■  Rule of thumb: somebody else is smarter than you!
■  Design Pattern
  □  Best practices, formulated as a template
  □  Focus on general applicability to common problems
  □  Well-known in object-oriented programming (“gang of four”)
■  Parallel design patterns in literature
  □  Structured parallelization methodologies (== pattern)
  □  Algorithmic building blocks commonly found (== pattern)

Page 5: OpenHPI - Parallel Programming Concepts - Week 6

Patterns for Parallel Programming [Mattson et al.]

■  Phases in creating a parallel program
  □  Finding Concurrency: Identify and analyze exploitable concurrency
  □  Algorithm Structure: Structure the algorithm to take advantage of potential concurrency
  □  Supporting Structures: Define program structures and data structures needed for the code
  □  Implementation Mechanisms: Threads, processes, messages, …
■  Each phase is a design space

Page 6: OpenHPI - Parallel Programming Concepts - Week 6

Finding Concurrency Design Space

■  Identify and analyze exploitable concurrency
■  Example: Data Decomposition Pattern
  □  Context: Computation is organized around large data manipulation, similar operations on different data parts
  □  Solution: Array-based data access (row, block), recursive data structure traversal
■  Example: Group Tasks Pattern
  □  Context: Tasks share temporal constraints (e.g. intermediate data), work on a shared data structure
  □  Solution: Apply ordering constraints to groups of tasks, put truly independent tasks in one group for better scheduling

Page 7: OpenHPI - Parallel Programming Concepts - Week 6

Algorithm Structure Design Space

■  Structure the algorithm
■  Consider how the identified concurrency is organized
  □  Organize algorithm by tasks
    ◊  Tasks are embarrassingly parallel, or organized linearly -> Task Parallelism
    ◊  Tasks organized by recursive procedure -> Divide and Conquer
  □  Organize algorithm by data dependencies
    ◊  Linear data dependencies -> Geometric Decomposition
    ◊  Recursive data dependencies -> Recursive Data
  □  Organize algorithm by application data flow
    ◊  Regular data flow for computation -> Pipeline
    ◊  Irregular data flow -> Event-Based Coordination

Page 8: OpenHPI - Parallel Programming Concepts - Week 6

Example: Parallelize Bubble Sort

■  Bubble sort
  □  Compare pair-wise and swap, if in wrong order
■  Finding concurrency demands data dependency consideration
  □  Compare-exchange approach needs some operation order
  □  Algorithm idea implies hidden data dependency
  □  Idea: Parallelize serial rounds
■  Odd-even sort – compare [odd|even]-indexed pairs and swap if needed
  □  Apply task parallelism pattern (a code sketch follows after the traces)

Trace (serial bubble sort, one compare-exchange at a time):
1 24 18 12 77 → 1 18 24 12 77 → 1 18 12 24 77 → 1 12 18 24 77 → …

Page 9: OpenHPI - Parallel Programming Concepts - Week 6

Example: Parallelize Bubble Sort

Trace comparison on [1 24 18 12 77]:
Bubble sort (one compare-exchange per step): 1 24 18 12 77 → 1 18 24 12 77 → 1 18 12 24 77 → 1 12 18 24 77 → …
Odd-even sort (all independent [odd|even]-indexed pairs per step): 1 24 18 12 77 → 1 24 12 18 77 → 1 12 24 18 77 → 1 12 18 24 77
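As an illustration of the task parallelism pattern named above, here is a minimal odd-even transposition sort sketch in C with OpenMP (the pragma assumes an OpenMP-capable compiler, as used in week 3; the sketch is not from the slides). Within each phase the pair comparisons are independent and may run in parallel:

#include <stdio.h>
#include <omp.h>

static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Odd-even transposition sort: n phases; each phase compares only
   disjoint (independent) pairs, so the inner loop can be parallel. */
void odd_even_sort(int *data, int n) {
    for (int phase = 0; phase < n; phase++) {
        int start = phase % 2;               /* even or odd phase */
        #pragma omp parallel for
        for (int i = start; i + 1 < n; i += 2)
            if (data[i] > data[i + 1])
                swap(&data[i], &data[i + 1]);
    }
}

int main(void) {
    int a[] = { 1, 24, 18, 12, 77 };
    odd_even_sort(a, 5);
    for (int i = 0; i < 5; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}

For an array this small the parallel overhead clearly dominates; the sketch only shows where the concurrency lies, the granularity trade-off is the one discussed in week 5.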

Page 10: OpenHPI - Parallel Programming Concepts - Week 6

Supporting Structures Design Space

■  Software structures that support the expression of parallelism
■  Program structuring patterns – Single Program Multiple Data (SPMD), master / worker, loop parallelism, fork / join
■  Data structuring patterns – Shared data, shared queue, distributed array
  □  Example: Shared data pattern (a small sketch follows below)
    ◊  Define shared abstract data type with concurrency control (read only, read / write, independent subsets, …)
    ◊  Choose appropriate synchronization construct
■  Supporting structures map to algorithm structure
  □  Example: SPMD works well with geometric decomposition
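A minimal sketch of the shared data pattern mentioned above, assuming POSIX threads (illustrative, not from the slides): a shared counter as an abstract data type whose operations hide the chosen synchronization construct, here a mutex:

#include <pthread.h>
#include <stdio.h>

/* Shared data pattern sketch: shared abstract data type with its own lock. */
typedef struct {
    long value;
    pthread_mutex_t lock;
} shared_counter;

void counter_init(shared_counter *c) {
    c->value = 0;
    pthread_mutex_init(&c->lock, NULL);
}

void counter_add(shared_counter *c, long amount) {
    pthread_mutex_lock(&c->lock);    /* chosen synchronization construct */
    c->value += amount;
    pthread_mutex_unlock(&c->lock);
}

static shared_counter counter;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++)
        counter_add(&counter, 1);
    return NULL;
}

int main(void) {
    pthread_t t[4];
    counter_init(&counter);
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("%ld\n", counter.value);   /* 400000 */
    return 0;
}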

Page 11: OpenHPI - Parallel Programming Concepts - Week 6

Patterns for Parallel Programming

Design Space → Parallelization Patterns

1. Finding Concurrency: Task Decomposition, Data Decomposition, Group Tasks, Order Tasks, Data Sharing, Design Evaluation
2. Algorithm Structure: Task Parallelism, Divide and Conquer, Geometric Decomposition, Recursive Data, Pipeline, Event-Based Coordination
3. Supporting Structures: SPMD, Master/Worker, Loop Parallelism, Fork/Join, Shared Data, Shared Queue, Distributed Array
4. Implementation Mechanisms: Thread & Process Creation and Destruction, Memory Synchronization, Fences, Barriers, Mutual Exclusion, Message Passing, Collective Communication

Page 12: OpenHPI - Parallel Programming Concepts - Week 6

Our Pattern Language (OPL)

■  Extended version of the Mattson et al. proposals
■  http://parlab.eecs.berkeley.edu/wiki/patterns/patterns

Structural Patterns (map/reduce, ...)

Computational Patterns (Monte Carlo, ...)

Algorithm Strategy Patterns (Task / data parallelism, pipelining, decomposition, ...)

Implementation Strategy Patterns (SPMD, fork/join, Actors, shared queue, BSP, ...)

Concurrent Execution Patterns (SIMD, MIMD, task graph, message passing, mutex, ...)

Page 13: OpenHPI - Parallel Programming Concepts - Week 6

Our Pattern Language (OPL)

■  Structural patterns
  □  Describe the overall computational goal of the application
  □  “Boxes and arrows”
■  Computational patterns
  □  Classes of computations (Berkeley dwarfs)
  □  “Computations occurring in the boxes”
■  Algorithm strategy patterns
  □  High-level strategies to exploit concurrency and parallelism
■  Implementation strategy patterns
  □  Structures realized in source code
  □  Program organization and data structures
■  Concurrent execution patterns
  □  Approaches to support the execution of parallel algorithms
  □  Strategies that advance a program
  □  Basic building blocks for coordination of concurrent tasks

Page 14: OpenHPI - Parallel Programming Concepts - Week 6


Page 15: OpenHPI - Parallel Programming Concepts - Week 6

Example: Discrete Event Pattern

■  Name: Discrete Event Pattern
■  Problem: Suppose a computational pattern can be decomposed into groups of semi-independent tasks interacting in an irregular fashion. The interaction is determined by the flow of data between them, which implies ordering constraints between the tasks. How can these tasks and their interaction be implemented so they can execute concurrently?

Page 16: OpenHPI - Parallel Programming Concepts - Week 6

Example: Discrete Event Pattern

■  Solution: A good solution is based on expressing the data flow using abstractions called events, with each event having a task that generates it and a task that processes it. Because an event must be generated before it can be processed, events also define ordering constraints between the tasks. Computation within each task consists of processing events.

Event-processing loop of a task:

initialize
while (not done) {
    receive event
    process event
    send events
}
finalize

Page 17: OpenHPI - Parallel Programming Concepts - Week 6

Patterns for Efficient Computation [McCool et al.]

■  Nesting Patterns
■  Structured Serial Control Flow Patterns (Selection, Iteration, Recursion, …)
■  Parallel Control Patterns (Fork-Join, Stencil, Reduction, Scan, …)
■  Serial Data Management Patterns (Closures, Objects, …)
■  Parallel Data Management Patterns (Pack, Pipeline, Decomposition, Gather, Scatter, …)
■  Other Parallel Patterns (Futures, Speculative Selection, Workpile, Search, Segmentation, Category Reduction, …)
■  Non-Deterministic Patterns (Branch and Bound, Transactions, …)
■  Programming Model Support

Page 18: OpenHPI - Parallel Programming Concepts - Week 6

Parallel Programming Concepts OpenHPI Course Week 6: Patterns and Best Practices Unit 6.2: Foster’s Methodology

Dr. Peter Tröger + Teaching Team

Page 19: OpenHPI - Parallel Programming Concepts - Week 6

Designing Parallel Algorithms [Foster]

■  Map workload problem on an execution environment
  □  Concurrency & locality for speedup, scalability
■  Four distinct stages of a methodological approach
■  A) Search for concurrency and scalability
  □  Partitioning: Decompose computation and data into small tasks
  □  Communication: Define necessary coordination of task execution
■  B) Search for locality and performance
  □  Agglomeration: Consider performance and implementation costs
  □  Mapping: Maximize processor utilization, minimize communication

Page 20: OpenHPI - Parallel Programming Concepts - Week 6

Partitioning

■  Expose opportunities for parallel execution through fine-grained decomposition
■  A good partition keeps computation and data together
  □  Data partitioning leads to data parallelism
  □  Computation partitioning leads to task parallelism
  □  Complementary approaches, can lead to different algorithms
  □  Reveal hidden structures of the algorithm that have potential
  □  Investigate complementary views on the problem
■  Avoid replication of either computation or data; this can be revised later to reduce communication overhead
■  This activity results in multiple candidate solutions

Page 21: OpenHPI - Parallel Programming Concepts - Week 6

Partitioning - Decomposition Types

■  Domain Decomposition
  □  Define small data fragments
  □  Specify computation for them
  □  Different phases of computation on the same data are handled separately
  □  Rule of thumb: first focus on large, or frequently used, data structures
■  Functional Decomposition
  □  Split up computation into disjoint tasks, ignore the data accessed for the moment
  □  With significant data overlap, domain decomposition is more appropriate

[Foster]

Page 22: OpenHPI - Parallel Programming Concepts - Week 6

Partitioning - Checklist

■  Checklist for the resulting partitioning scheme
  □  Order of magnitude more tasks than processors?
    ◊  Keeps flexibility for the next steps
  □  Avoidance of redundant computation and storage needs?
    ◊  Scalability for large problem sizes
  □  Tasks of comparable size?
    ◊  Goal is to allocate equal work to processors
  □  Does the number of tasks scale with the problem size?
    ◊  Algorithm should be able to solve larger tasks with more given resources
■  Identify bad partitioning by estimating performance behavior
■  If necessary, re-formulate the partitioning (backtracking)
  □  May even happen in later steps

Page 23: OpenHPI - Parallel Programming Concepts - Week 6

Communication

■  Specify links between data consumers and data producers
■  Specify kind and number of messages on these links
■  Domain decomposition problems might have tricky communication infrastructures, due to data dependencies
■  Communication in functional decomposition problems can easily be modeled from the data flow between the tasks
■  Categorization of communication patterns
  □  Local communication (few neighbors) vs. global communication
  □  Structured communication (e.g. tree) vs. unstructured communication
  □  Static vs. dynamic communication structure
  □  Synchronous vs. asynchronous communication

Page 24: OpenHPI - Parallel Programming Concepts - Week 6

Communication - Hints

■  Distribute computation and communication, don’t centralize the algorithm
  □  Bad example: Central manager for parallel summation
■  Unstructured communication is hard to agglomerate, better avoid it
■  Checklist for communication design
  □  Do all tasks perform the same amount of communication?
  □  Does each task perform only local communication?
  □  Can communication happen concurrently?
  □  Can computation happen concurrently?
■  Solve issues by distributing or replicating communication hot spots

Page 25: OpenHPI - Parallel Programming Concepts - Week 6

Communication - Ghost Cells

■  Domain decomposition might lead to chunks that demand data from each other
■  Solution 1: Copy the necessary portion of data (“ghost cells”) - a minimal MPI sketch follows below
  □  If no synchronization is needed after an update
  □  Data amount and frequency of update influence the resulting overhead and efficiency
  □  Additional memory consumption
■  Solution 2: Access relevant data “remotely”
  □  Delays thread coordination until the data is really needed
  □  Correctness (“old” data vs. “new” data) must be considered on parallel progress
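A minimal MPI sketch of solution 1 (ghost cells), under the assumption of a one-dimensional, non-periodic domain decomposition (illustrative, not from the slides): each rank keeps two extra array slots and fills them with copies of its neighbors’ boundary values.

#include <mpi.h>
#include <stdio.h>

#define N 8   /* local chunk size (illustrative) */

int main(int argc, char **argv) {
    int rank, size;
    double u[N + 2];                 /* u[0] and u[N+1] are ghost cells */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 1; i <= N; i++) u[i] = rank;   /* owned data */
    u[0] = u[N + 1] = -1.0;                     /* stays -1 at the domain boundary */

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send own boundary values, receive the neighbors' into the ghost cells */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[N + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d: left ghost %.0f, right ghost %.0f\n", rank, u[0], u[N + 1]);
    MPI_Finalize();
    return 0;
}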

Page 26: OpenHPI - Parallel Programming Concepts - Week 6

Agglomeration

■  Algorithm so far is correct, but not specialized for a particular execution environment
■  Check partitioning and communication decisions again
  □  Agglomerate tasks for efficient execution on the target hardware
  □  Replicate data and / or computation for efficiency reasons
■  Resulting number of tasks can still be greater than the number of processors
■  Three conflicting guiding decisions
  □  Reduce communication costs by coarser granularity of computation and communication
  □  Preserve flexibility for later mapping by finer granularity
  □  Reduce engineering costs for creating a parallel version

Page 27: OpenHPI - Parallel Programming Concepts - Week 6

Agglomeration - Granularity

■  Since execution environment is now considered, the surface-to-volume effect becomes relevant

■  Late consideration keeps core algorithm flexibility

(Figure: surface-to-volume effect [Foster])

Page 28: OpenHPI - Parallel Programming Concepts - Week 6

Agglomeration - Checklist

■  Are communication costs reduced by increasing locality?
■  Does replicated computation outweigh its costs in all cases?
■  Does data replication restrict the problem size?
■  Do the larger tasks still have similar computation / communication costs?
■  Do the larger tasks still act with sufficient concurrency?
■  Does the number of tasks still scale with the problem size?
■  How much can the task count decrease without disturbing load balancing, scalability, or engineering costs?
■  Is the transition to parallel code worth the engineering costs?

Page 29: OpenHPI - Parallel Programming Concepts - Week 6

Mapping

■  Historically only relevant for shared-nothing systems
  □  Shared memory systems have the operating system scheduler
  □  With NUMA, this may also become relevant in shared memory systems of the future (e.g. PGAS task placement)
■  Minimize execution time by …
  □  … placing concurrent tasks on different nodes
  □  … placing tasks with heavy communication on the same node
■  Conflicting strategies, additionally restricted by resource limits
  □  Task mapping problem
  □  Known to be compute-intensive (bin packing)
■  Set of sophisticated (dynamic) heuristics for load balancing
  □  Preference for local algorithms that do not need global scheduling state

Page 30: OpenHPI - Parallel Programming Concepts - Week 6

Parallel Programming Concepts OpenHPI Course Week 6: Patterns and Best Practices Unit 6.3: Berkeley Dwarfs

Dr. Peter Tröger + Teaching Team

Page 31: OpenHPI - Parallel Programming Concepts - Week 6

Common Algorithmic Problems

■  Sources
  □  Parallel programming courses
  □  Parallel benchmarks
  □  Development guides
  □  Parallel programming books
  □  User stories

Page 32: OpenHPI - Parallel Programming Concepts - Week 6

A View From Berkeley

■  Technical report from Berkeley (2006), defining parallel computing research questions and recommendations
■  Definition of “13 dwarfs”
  □  Common designs of parallel computation and communication
  □  Allow better evaluation of programming models and architectures

The Landscape of Parallel Computing Research: A View from Berkeley
Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, Katherine A. Yelick
Electrical Engineering and Computer Sciences, University of California at Berkeley
Technical Report No. UCB/EECS-2006-183, December 18, 2006
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html

Page 33: OpenHPI - Parallel Programming Concepts - Week 6

A View From Berkeley

■  Sources
  □  EEMBC benchmarks (embedded systems), SPEC benchmarks
  □  Database and text mining technology
  □  Algorithms in computer game design and graphics
  □  Machine learning algorithms
  □  Original “7 Dwarfs” for supercomputing [Colella]
■  “Anti-benchmark”
  □  Dwarfs are not tied to code or language artifacts
  □  Can serve as an understandable vocabulary across disciplines
  □  Allow feasibility studies of hardware and software design
    ◊  No need to wait for applications to be developed

Page 34: OpenHPI - Parallel Programming Concepts - Week 6

13 Dwarfs

■  Dwarfs currently defined
  □  Dense Linear Algebra
  □  Sparse Linear Algebra
  □  Spectral Methods
  □  N-Body Methods
  □  Structured Grids
  □  Unstructured Grids
  □  MapReduce
  □  Combinational Logic
  □  Graph Traversal
  □  Dynamic Programming
  □  Backtrack and Branch-and-Bound
  □  Graphical Models
  □  Finite State Machines
■  One dwarf may be implemented based on another one
■  Increasing uptake in scientific publications
■  Several reference implementations for CPU / GPU

Page 35: OpenHPI - Parallel Programming Concepts - Week 6

Dwarfs in Popular Applications

(Figure [Patterson]: heat map of the dwarfs versus popular application domains, from hot to cold.)

Page 36: OpenHPI - Parallel Programming Concepts - Week 6

Dense Linear Algebra

■  Classic vector and matrix operations on non-sparse data (vector op vector, matrix op vector, matrix op matrix)
■  Data layout as contiguous array(s)
■  High degree of data dependencies
■  Computation on elements, rows, columns or matrix blocks
■  Issues with memory hierarchy, data distribution is critical
■  Demands overlapping of computation and communication
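A minimal dense matrix-matrix multiplication sketch in C (illustrative only, not from the slides): every element of C can be computed independently, which is where data decomposition applies; the memory-hierarchy issues named above would be addressed by blocking.

#include <stdio.h>

#define N 256

static double A[N][N], B[N][N], C[N][N];

/* Naive dense matrix-matrix multiply: the (i, j) results are
   independent, only the k-loop carries a data dependency. */
int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }

    printf("C[0][0] = %f\n", C[0][0]);   /* expect 512.0 */
    return 0;
}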

Page 37: OpenHPI - Parallel Programming Concepts - Week 6

Sparse Linear Algebra

■  Operations on a sparse matrix (with lots of zeros)
■  Typically compressed data structures, integer operations, only non-zero entries + indices
  □  Dense blocks to exploit caches
■  Complex dependency structure
■  Scatter-gather vector operations are often helpful
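A small sketch of a sparse matrix-vector product in the common CSR (compressed sparse row) layout, illustrating the compressed data structures and gather-style access described above (illustrative code, not from the slides):

#include <stdio.h>

/* Sparse matrix-vector product in CSR format: only non-zero values
   plus column indices and row pointers are stored. */
int main(void) {
    /* 3x3 matrix [[4 0 0], [0 5 1], [0 0 9]] in CSR form */
    double val[]     = { 4.0, 5.0, 1.0, 9.0 };
    int    col_idx[] = { 0, 1, 2, 2 };
    int    row_ptr[] = { 0, 1, 3, 4 };
    double x[] = { 1.0, 2.0, 3.0 };
    double y[3];

    for (int row = 0; row < 3; row++) {          /* rows are independent */
        double sum = 0.0;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; k++)
            sum += val[k] * x[col_idx[k]];       /* gather access on x */
        y[row] = sum;
    }
    printf("%f %f %f\n", y[0], y[1], y[2]);      /* 4 13 27 */
    return 0;
}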

Page 38: OpenHPI - Parallel Programming Concepts - Week 6

N-Body Methods

■  Physics: predicting the individual motions of a group of objects interacting gravitationally
  □  Calculations on interactions between many discrete points
■  Hierarchical tree-based and mesh-based methods avoid computing all pair-wise interactions
■  Variations with particle-particle methods (one point to all others)
■  Large number of independent calculations in a time step, followed by all-to-all communication
■  Issues with load balancing and a missing fixed hierarchy

Page 39: OpenHPI - Parallel Programming Concepts - Week 6

Structured Grid

■  Data as a regular multidimensional grid
  □  Access is regular and statically determinable
■  Computation as a sequence of grid updates
  □  Points are updated concurrently using values from a small neighborhood
■  Spatial locality to use long cache lines
■  Temporal locality to allow cache reuse
■  Parallel mapping with a sub-grid per processor
  □  Ghost cells, surface-to-volume ratio
■  Latency hiding
  □  Increased number of ghost cells
  □  Coarse-grained data exchange
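A minimal structured-grid sketch (not from the slides): a Jacobi-style update where every interior point takes the average of its four neighbors. All points of one step are independent; a parallel version would split the grid into sub-grids with ghost cells as described above.

#include <stdio.h>

#define NX 64
#define NY 64
#define STEPS 100

static double grid[2][NX][NY];

/* Jacobi-style stencil: read from the current copy, write the next copy. */
int main(void) {
    for (int i = 0; i < NX; i++)
        grid[0][i][0] = grid[1][i][0] = 100.0;   /* fixed hot boundary */

    for (int s = 0; s < STEPS; s++) {
        int cur = s % 2, nxt = 1 - cur;
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++)
                grid[nxt][i][j] = 0.25 * (grid[cur][i - 1][j] + grid[cur][i + 1][j] +
                                          grid[cur][i][j - 1] + grid[cur][i][j + 1]);
    }
    printf("center value: %f\n", grid[STEPS % 2][NX / 2][NY / 2]);
    return 0;
}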

Page 40: OpenHPI - Parallel Programming Concepts - Week 6

Unstructured Grid

■  Elements update neighbors in an irregular mesh/grid - static or dynamic structure
■  Problematic data distribution and access requirements, indirection through tables
■  Modeling domain (e.g. physics)
  □  Mesh represents a surface or volume
  □  Entities are points, edges, faces, ...
  □  Applying pressure, temperature, …
  □  Computations involve numerical solutions of differential equations
  □  Sequence of mesh updates
■  Massively data parallel, but irregularly distributed data and communication

Page 41: OpenHPI - Parallel Programming Concepts - Week 6

MapReduce

■  Originally called “Monte Carlo” in the dwarf concept
  □  Repeated independent execution of a function (e.g. random number generation, map function)
  □  Results aggregated at the end
  □  Nearly no communication between tasks, embarrassingly parallel
■  Examples: Monte Carlo, BOINC project, protein structures

[climatesanity.wordpress.com]
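A minimal sketch of the embarrassingly parallel structure described above: Monte Carlo estimation of pi in C (illustrative, not from the slides). The trials are independent and only the final aggregation needs coordination.

#include <stdio.h>
#include <stdlib.h>

#define SAMPLES 10000000L

/* Monte Carlo estimation of pi: independent trials, aggregated at the
   end. A parallel version would give every task its own random number
   generator and combine the hit counts at the end, e.g. with an OpenMP
   reduction(+:hits). */
int main(void) {
    long hits = 0;
    srand(42);
    for (long i = 0; i < SAMPLES; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            hits++;                              /* point inside the quarter circle */
    }
    printf("pi is approximately %f\n", 4.0 * (double)hits / SAMPLES);
    return 0;
}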

Page 42: OpenHPI - Parallel Programming Concepts - Week 6

Backtrack / Branch-and-Bound

■  Global optimization problem in a large search space
■  Divide and Conquer principle
  □  Branching into subdivisions
  □  Optimize execution by ruling out regions
■  Examples: Integer linear programming, boolean satisfiability, combinatorial optimization, traveling salesman, constraint programming, …
■  Heuristics to guide the search to productive regions
■  Parallel checking of sub-regions
  □  Demands invariants about the search space
  □  Demands dynamic load balancing, load prediction is hard
■  Example: Place N queens on a chessboard so that no two attack each other (a backtracking sketch follows below)
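A small backtracking sketch in C for the N-queens example named above (illustrative, not from the slides): the search branches over the columns of each row and bounds the search by ruling out attacked squares; the independent first-row sub-trees would be natural units for parallel checking.

#include <stdio.h>

#define N 8

static int cols[N];   /* cols[r] = column of the queen in row r */

/* Bound: is (row, col) attacked by any queen already placed? */
static int safe(int row, int col) {
    for (int r = 0; r < row; r++)
        if (cols[r] == col || r - cols[r] == row - col || r + cols[r] == row + col)
            return 0;
    return 1;
}

/* Branch: try every column of the current row, recurse on success. */
static long solve(int row) {
    if (row == N) return 1;            /* complete placement found */
    long count = 0;
    for (int col = 0; col < N; col++)
        if (safe(row, col)) {
            cols[row] = col;
            count += solve(row + 1);
        }
    return count;
}

int main(void) {
    printf("%ld solutions for %d queens\n", solve(0), N);   /* 92 for N = 8 */
    return 0;
}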

Page 43: OpenHPI - Parallel Programming Concepts - Week 6

Branch-and-Bound

(Figure: branch-and-bound search example [docs.jboss.org/drools])

Page 44: OpenHPI - Parallel Programming Concepts - Week 6

(Figure, continued: branch-and-bound search example [docs.jboss.org/drools])

Page 45: OpenHPI - Parallel Programming Concepts - Week 6

Berkeley Dwarfs

■  Relevance of single dwarfs widely differs
■  No widely accepted single benchmark implementation
■  Computational dwarfs on different layers, implementations may be based on each other
■  OpenDwarfs project
  □  Optimized code for different platforms
■  Parallel Dwarfs project
  □  In C++, C#, F# for Visual Studio

The Landscape of Parallel Computing Research: A View from Berkeley, Technical Report No. UCB/EECS-2006-183, December 18, 2006

Page 46: OpenHPI - Parallel Programming Concepts - Week 6

Parallel Programming Concepts OpenHPI Course Week 6: Patterns and Best Practices Unit 6.4: Some Future Trends

Dr. Peter Tröger + Teaching Team

Page 47: OpenHPI - Parallel Programming Concepts - Week 6

NUMA Impact Increases

(Figure: four-socket NUMA system; each processor has several cores with a shared L3 cache, its own memory controller with locally attached memory, and I/O links, and the processors are connected via QPI, so remote memory access costs more than local access.)

Page 48: OpenHPI - Parallel Programming Concepts - Week 6

Innovation in Memory Technology

■  3D NAND
■  Hybrid Memory Cube
  □  Intel, Micron, …
  □  3D array of DDR-alike memory cells
  □  Early samples available, 160 GB/s
  □  Through-silicon via (TSV) approach with embedded controllers, attached to the CPU
■  RRAM / ReRAM
  □  Non-volatile memory

[computerworld.com, extremetech.com]

Page 49: OpenHPI - Parallel Programming Concepts - Week 6

Power Wall 2.0 = Dark Silicon

“Dark Silicon and the End of Multicore Scaling” by Hadi Esmaeilzadeh, Emily Blem, Renée St. Amant, Karthikeyan Sankaralingam, Doug Burger


Page 50: OpenHPI - Parallel Programming Concepts - Week 6

Hardware / Software Co-Design

■  Increasing number of cores by Moore‘s law
■  Power wall / dark silicon problem will become worse
  □  In addition, battery-powered devices become more relevant
■  Idea: Use additional transistors for specialization
  □  Design hardware for a software problem
  □  Make it part of the processor (“compile into hardware”)
  □  More efficiency, less flexibility
  □  Partially known from ILP / SIMD support
  □  Examples: Cryptography, regular expressions
■  Example: Cell processor (PlayStation 3)
  □  64-bit Power core
  □  8 specialized co-processors

Page 51: OpenHPI - Parallel Programming Concepts - Week 6

Software at Scale [Dongarra]

■  Effective utilization of many-core and hybrid hardware
  □  Break fork-join parallelism
  □  Dynamic data-driven execution, consider block layout
  □  Exploiting mixed precision (GPU vs. CPU, power consumption)
■  Aim for self-adapting software and auto-tuning support
  □  Manual optimization is too hard
  □  Let software optimize the software
■  Consider fault-tolerant software
  □  With millions of cores, things break all the time
■  Focus on algorithm classes that reduce communication
  □  A special problem in dense computation
  □  Aim for asynchronous iterations

Page 52: OpenHPI - Parallel Programming Concepts - Week 6

OpenMP 4.0

■  SIMD extensions
  □  Portable primitives to describe SIMD parallelization
  □  Loop vectorization with the simd construct
  □  Several arguments for guiding the compiler (e.g. alignment)
■  Targeting extensions
  □  The thread with the OpenMP program executes on the host device
  □  An implementation may support multiple target devices
  □  Control off-loading of loops and code regions onto such devices
■  New API for the device data environment
  □  OpenMP-managed data items can be moved to the device
  □  New primitives for better cancellation support
  □  User-defined reduction operations
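A small sketch combining the simd and target constructs described above (assuming an OpenMP-4.0-capable compiler; illustrative, not from the slides). Without a target device the region falls back to the host.

#include <stdio.h>

#define N 1000000

int main(void) {
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    /* off-load the loop region to a target device if available;
       map() describes the required data movement, simd asks for
       loop vectorization */
    #pragma omp target map(to: a, b) map(from: c)
    #pragma omp parallel for simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[0] = %f\n", c[0]);
    return 0;
}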

Page 53: OpenHPI - Parallel Programming Concepts - Week 6

OpenACC

■  “OpenMP for accelerators” (GPUs, FPGAs, ...)
  □  Partners: Cray, supercomputing centers, NVIDIA, PGI
  □  Annotation of C, C++, and Fortran source code
  □  OpenACC code can also be started on the accelerator
■  Features
  □  Specification of data locality and asynchronous execution
  □  Abstract specification of data movement, loop parallelization
  □  Caching and synchronization support
  □  Management of data movement by compiler and runtime
  □  Implementations available, e.g. for Xeon Phi
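A minimal OpenACC sketch (assuming an OpenACC-capable compiler; illustrative, not from the slides): the annotated loop may be offloaded to an accelerator, and the copyin/copy clauses give the abstract specification of data movement mentioned above, which compiler and runtime then manage.

#include <stdio.h>

#define N 1000000

int main(void) {
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* loop may run on the accelerator; x is copied in, y in and out */
    #pragma acc parallel loop copyin(x) copy(y)
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i] + y[i];       /* SAXPY-style update */

    printf("y[0] = %f\n", y[0]);
    return 0;
}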

Page 54: OpenHPI - Parallel Programming Concepts - Week 6

Autotuners

■  Optimize parallel code by generating many variants
  □  Try many or all optimization switches
    ◊  Loop unrolling, utilization of processor registers, …
  □  Rely on parallelization variations defined in the application
■  Automatically tested on the target platform
■  Research shows promising results
  □  Can be better than manually optimized code
  □  Optimization can fit multiple execution environments
  □  Known examples are sparse and dense linear algebra libraries
    ◊  ATLAS (Automatically Tuned Linear Algebra Software)

Page 55: OpenHPI - Parallel Programming Concepts - Week 6

Intel Math Kernel Library (MKL)

■  Intel library with heavily optimized functionality, for C & Fortran
  □  Linear algebra
    ◊  Basic Linear Algebra Subprograms (BLAS) API
    ◊  Follows standards in high-performance computing
    ◊  Vector-vector, matrix-vector, matrix-matrix operations
  □  Fast Fourier Transforms (FFT)
    ◊  Single precision, double precision, complex, real, ...
  □  Vector math and statistics functions
    ◊  Random number generators and probability distributions
    ◊  Spline-based data fitting
■  High-level abstraction of functionality, parallelization completely transparent for the developer
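A small sketch of the BLAS level of MKL (assuming an MKL installation whose mkl.h provides the CBLAS interface; illustrative, not from the slides): one cblas_dgemm call computes C = alpha·A·B + beta·C, and the library parallelizes it internally, transparently for the caller.

#include <stdio.h>
#include <mkl.h>    /* assumption: MKL headers with the CBLAS interface */

int main(void) {
    double A[2 * 2] = { 1, 2, 3, 4 };       /* row-major 2x2 */
    double B[2 * 2] = { 5, 6, 7, 8 };
    double C[2 * 2] = { 0, 0, 0, 0 };

    /* C = 1.0 * A * B + 0.0 * C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,                    /* m, n, k */
                1.0, A, 2, B, 2,            /* alpha, A, lda, B, ldb */
                0.0, C, 2);                 /* beta, C, ldc */

    printf("%.0f %.0f\n%.0f %.0f\n", C[0], C[1], C[2], C[3]);  /* 19 22 / 43 50 */
    return 0;
}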

Page 56: OpenHPI - Parallel Programming Concepts - Week 6

Future Trends

■  Active research on next-generation hardware
  □  Driven by exa-scale efforts in supercomputing
  □  Driven by the combined power wall and memory wall
  □  Driven by a shift in computer markets (desktop -> mobile)
■  Impact on software development will get more visible
  □  Hybrid computing is the future default
  □  Heterogeneous mixture of CPU + specialized accelerators
  □  Old assumptions are broken (flat memory, constant access time, homogeneous processing elements)
  □  Old programming models no longer match
  □  Extending the existing programming paradigms seems to work
  □  High-level specialized libraries gain more relevance

Page 57: OpenHPI - Parallel Programming Concepts - Week 6

Parallel Programming Concepts OpenHPI Course Summary

Dr. Peter Tröger + Teaching Team

Page 58: OpenHPI - Parallel Programming Concepts - Week 6

Course Organization

■  Week 1: Terminology and fundamental concepts
  □  Moore’s law, power wall, memory wall, ILP wall
  □  Speedup vs. scaleup, Amdahl’s law, Flynn’s taxonomy, …
■  Week 2: Shared memory parallelism – The basics
  □  Concurrency, race condition, semaphore, deadlock, monitor, …
■  Week 3: Shared memory parallelism – Programming
  □  Threads, OpenMP, Cilk, Scala, …
■  Week 4: Accelerators
  □  Hardware today, GPU Computing, OpenCL, …
■  Week 5: Distributed memory parallelism
  □  CSP, Actor model, clusters, HPC, MPI, MapReduce, …
■  Week 6: Patterns and best practices
  □  Foster’s methodology, Berkeley dwarfs, OPL collection, …

Page 59: OpenHPI - Parallel Programming Concepts - Week 6

Week 1: The Free Lunch Is Over

■  Clock speed curve flattened in 2003
  □  Heat, power, leakage
■  Speeding up serial instruction execution through clock speed improvements no longer works
■  Additional issues
  □  ILP wall
  □  Memory wall

[Herb Sutter, 2009]

Page 60: OpenHPI - Parallel Programming Concepts - Week 6

Three Ways Of Doing Anything Faster [Pfister]

■  Work harder (clock speed)
  →  Power wall problem
  →  Memory wall problem
■  Work smarter (optimization, caching)
  →  ILP wall problem
  →  Memory wall problem
■  Get help (parallelization)
  □  More cores per single CPU
  □  Software needs to exploit them in the right way
  →  Memory wall problem

(Figure: one problem, one CPU with multiple cores.)

Page 61: OpenHPI - Parallel Programming Concepts - Week 6

Parallelism on Different Levels

(Figure: parallelism on different levels - programs consist of processes, processes consist of tasks, and tasks are mapped to processing elements (PEs); several PEs share a node’s memory, and the nodes are connected by a network.)

Page 62: OpenHPI - Parallel Programming Concepts - Week 6

The Parallel Programming Problem

■  The execution environment has a particular type (SIMD, MIMD, UMA, NUMA, …)
■  The execution environment may be configurable (number of resources)
■  The parallel application must be mapped to the available resources

(Figure: matching a parallel application to the execution environment, its type and flexible configuration.)

Page 63: OpenHPI - Parallel Programming Concepts - Week 6

Amdahl’s Law

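The slide shows the resulting speedup curves. For reference (the formula is not shown as text on the slide, but was covered in week 1), Amdahl’s law bounds the speedup of a program with parallelizable fraction P on N processors as

S(N) = 1 / ((1 - P) + P / N)

so the serial fraction (1 - P) limits the achievable speedup no matter how large N grows.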

Page 64: OpenHPI - Parallel Programming Concepts - Week 6

Gustafson-Barsis’ Law (1988)

■  Gustafson and Barsis: people are typically not interested in the shortest execution time
  □  Rather solve a bigger problem in reasonable time
■  Problem size could then scale with the number of processors
  □  Typical in simulation and farmer / worker problems
  □  Leads to a larger parallel fraction with increasing N
  □  Serial part is usually fixed or grows slower
■  Maximum scaled speedup with N processors:

  S = (T_SER + N · T_PAR) / (T_SER + T_PAR)

■  Linear speedup now becomes possible
■  Software needs to ensure that serial parts remain constant
■  Other models exist (e.g. Work-Span model, Karp-Flatt metric)

Page 65: OpenHPI - Parallel Programming Concepts - Week 6

Week 2: Concurrency vs. Parallelism

■  Concurrency means dealing with several things at once
  □  Programming concept for the developer
  □  In shared-memory systems, implemented by time sharing
■  Parallelism means doing several things at once
  □  Demands parallel hardware
■  “Parallel programming” is a misnomer
  □  Concurrent programming aiming at parallel execution
■  Any parallel software is concurrent software
  □  Note: some researchers disagree, most practitioners agree
■  Concurrent software is not always parallel software
  □  Many server applications achieve scalability by optimizing concurrency only (web server)

(Figure: parallelism as a subset of concurrency.)

Page 66: OpenHPI - Parallel Programming Concepts - Week 6

Parallelism [Mattson et al.]

■  Task
  □  A parallel program breaks a problem into tasks
■  Execution unit
  □  Representation of a concurrently running task (e.g. thread)
  □  Tasks are mapped to execution units
■  Processing element (PE)
  □  Hardware element running one execution unit
  □  Depends on the scenario - logical processor vs. core vs. machine
  □  Execution units run simultaneously on processing elements, controlled by some scheduler
■  Synchronization - mechanism to order the activities of parallel tasks
■  Race condition - the program result depends on the scheduling order

Page 67: OpenHPI - Parallel Programming Concepts - Week 6

Concurrency Issues

■  Mutual Exclusion
  □  The requirement that when one concurrent task is using a shared resource, no other shall be allowed to do that
■  Deadlock
  □  Two or more concurrent tasks are unable to proceed
  □  Each is waiting for one of the others to do something
■  Starvation
  □  A runnable task is overlooked indefinitely
  □  Although it is able to proceed, it is never chosen to run
■  Livelock
  □  Two or more concurrent tasks continuously change their states in response to changes in the other activities
  □  No global progress for the application

Page 68: OpenHPI - Parallel Programming Concepts - Week 6

Week 3: Parallel Programming for Shared Memory

■  Different programming models for concurrency with shared memory
■  Processes and threads are mapped to processing elements (cores)
■  The task model supports more fine-grained parallelization than native threads

(Figure: concurrent processes with explicitly shared memory, concurrent threads sharing their process’s memory, and concurrent tasks mapped onto a main thread and additional threads.)

Page 69: OpenHPI - Parallel Programming Concepts - Week 6

Task Parallelism and Data Parallelism

(Figure: input data is transformed into result data by parallel processing.)

Page 70: OpenHPI - Parallel Programming Concepts - Week 6

OpenMP

■  Programming with the fork-join model
  □  Master thread forks into declared tasks
  □  Runtime environment may run them in parallel, based on dynamic mapping to threads from a pool
  □  Worker task barrier before finalization (join)

[Wikipedia]
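A minimal fork-join sketch in C with OpenMP (assumption: an OpenMP-capable compiler; illustrative, not from the slides), matching the model described above: the master thread forks a team at the parallel region, the loop iterations are distributed over the team, and the implicit barrier at the end of the region is the join.

#include <stdio.h>
#include <omp.h>

int main(void) {
    long sum = 0;

    #pragma omp parallel for reduction(+:sum)   /* fork: team of threads */
    for (int i = 1; i <= 1000; i++)
        sum += i;
    /* implicit barrier here: all threads join before execution continues */

    printf("sum = %ld (computed by up to %d threads)\n",
           sum, omp_get_max_threads());
    return 0;
}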

Page 71: OpenHPI - Parallel Programming Concepts - Week 6

High-Level Concurrency

Examples: Microsoft Parallel Patterns Library, java.util.concurrent

Page 72: OpenHPI - Parallel Programming Concepts - Week 6

Partitioned Global Address Space

APGAS in X10: Places and Tasks [IBM]
  □  Place-shifting operations: at(p) S, at(p) e
  □  Distributed heap: GlobalRef[T], PlaceLocalHandle[T]
  □  Task parallelism: async S, finish S
  □  Concurrency control within a place: when(c) S, atomic S
(Figure: each place, from Place 0 to Place N, holds its own activities and local heap; global references can point across places.)

■  Parallel tasks, each operating in one place of the PGAS
  □  Direct variable access only in the local place
■  Implementation strategy is flexible
  □  One operating system process per place, manages a thread pool
  □  Work-stealing scheduler

Page 73: OpenHPI - Parallel Programming Concepts - Week 6

Week 4: Cheap Performance with Accelerators

■  Performance
■  Energy / Price
  □  Cheap to buy and to maintain
  □  GFLOPS per watt: Fermi 1.5 / Kepler 5 / Maxwell 15 (2014)

(Figure: Sudoku solver execution time in milliseconds, lower is faster, over the problem size in number of Sudoku places, for an Intel E8500 CPU, an AMD R800 GPU, and an NVIDIA GT200 GPU.)

GPU: Graphics Processing Unit (the processor of a graphics card)

Page 74: OpenHPI - Parallel Programming Concepts - Week 6

CPU vs. GPU Architecture

■  CPU (“multi-core”)
  □  Some huge threads
  □  Branch prediction
■  GPU (“many-core”)
  □  1000+ light-weight threads
  □  Memory latency hiding

(Figure: CPU with a few processing elements plus control logic, cache, and DRAM, versus GPU with many simple processing elements and its own DRAM.)

Page 75: OpenHPI - Parallel Programming Concepts - Week 6

OpenCL Platform Model

□  OpenCL exposes CPUs, GPUs, and other accelerators as “devices”
□  Each “device” contains one or more “compute units”, i.e. cores, SMs, ...
□  Each “compute unit” contains one or more SIMD “processing elements”

Page 76: OpenHPI - Parallel Programming Concepts - Week 6

Best Practices for Performance Tuning

•  Algorithm Design: Asynchronous, Recompute, Simple
•  Memory Transfer: Chaining, Overlap Transfer & Compute
•  Control Flow: Divergent Branching, Predication
•  Memory Types: Local Memory as Cache, rare resource
•  Memory Access: Coalescing, Bank Conflicts
•  Sizing: Execution Size, Evaluation
•  Instructions: Shifting, Fused Multiply, Vector Types
•  Precision: Native Math Functions, Build Options

Page 77: OpenHPI - Parallel Programming Concepts - Week 6

Week 5: Shared Nothing

■  Clusters: Stand-alone machines connected by a local network
  □  Cost-effective technique for a large-scale parallel computer
  □  Users are builders, have control over their system
  □  Synchronization much slower than in shared memory
  □  Task granularity becomes an issue

(Figure: two processing elements, each with its own local memory, running tasks that exchange messages.)

Page 78: OpenHPI - Parallel Programming Concepts - Week 6

Shared Nothing

■  Supercomputers / Massively Parallel Processing (MPP) systems
  □  (Hierarchical) cluster with a lot of processors
  □  Still standard hardware, but a specialized setup
  □  High-performance interconnection network
  □  For massive data-parallel applications, mostly simulations (weapons, climate, earthquakes, airplanes, car crashes, ...)
■  Examples (Nov 2013)
  □  BlueGene/Q, 1.5 million cores, 1.5 PB memory, 17.1 PFlops
  □  Tianhe-2, 3.1 million cores, 1 PB memory, 17,808 kW power, 33.86 PFlops (quadrillions of calculations per second)
■  Annual ranking with the TOP500 list (www.top500.org)

Page 79: OpenHPI - Parallel Programming Concepts - Week 6

Surface-To-Volume Effect

■  Fine-grained decomposition for using all processing elements?
■  Coarse-grained decomposition to reduce communication overhead?
■  A tradeoff question!

[nicerweb.com]

Page 80: OpenHPI - Parallel Programming Concepts - Week 6

Message Passing

■  Parallel programming paradigm for “shared nothing” environments
  □  Implementations for shared memory available, but typically not the best approach
■  Users submit their message passing program & data as a job
■  The cluster management system creates program instances

(Figure: a job with the application is handed from the submission host to the cluster management software, which starts instances 0-3 on the execution hosts.)

Page 81: OpenHPI - Parallel Programming Concepts - Week 6

Single Program Multiple Data (SPMD)


// … (determine rank and comm_size) …
int token;
if (rank != 0) {
    // Receive from your 'left' neighbor if you are not rank 0
    MPI_Recv(&token, 1, MPI_INT, rank - 1, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n",
           rank, token, rank - 1);
} else {
    // Set the token's value if you are rank 0
    token = -1;
}
// Send your local token value to your 'right' neighbor
MPI_Send(&token, 1, MPI_INT, (rank + 1) % comm_size, 0, MPI_COMM_WORLD);
// Now rank 0 can receive from the last rank.
if (rank == 0) {
    MPI_Recv(&token, 1, MPI_INT, comm_size - 1, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n",
           rank, token, comm_size - 1);
}

(Figure: the SPMD program above, together with the input data, is started as Instance 0 through Instance 4; every instance runs the same code and selects its behavior based on its rank.)

Page 82: OpenHPI - Parallel Programming Concepts - Week 6

Actor Model

■  Carl Hewitt, Peter Bishop and Richard Steiger: A Universal Modular Actor Formalism for Artificial Intelligence, IJCAI 1973
  □  Mathematical model for concurrent computation
  □  Actor as computational primitive
    ◊  Local decisions, concurrently sends / receives messages
    ◊  Has a mailbox for incoming messages
    ◊  Concurrently creates more actors
  □  Asynchronous one-way message sending
  □  Changing topology allowed, typically no order guarantees
    ◊  Recipient is identified by mailing address
    ◊  Actors can send their own identity to other actors
■  Available as a programming language extension or library in many environments

Page 83: OpenHPI - Parallel Programming Concepts - Week 6

Week 6: Patterns for Parallel Programming

■  Phases in creating a parallel program
  □  Finding Concurrency: Identify and analyze exploitable concurrency
  □  Algorithm Structure: Structure the algorithm to take advantage of potential concurrency
  □  Supporting Structures: Define program structures and data structures needed for the code
  □  Implementation Mechanisms: Threads, processes, messages, …
■  Each phase is a design space

Page 84: OpenHPI - Parallel Programming Concepts - Week 6

Popular Applications vs. Dwarfs

(Figure: heat map of the dwarfs versus popular application domains, from hot to cold.)

Page 85: OpenHPI - Parallel Programming Concepts - Week 6

Designing Parallel Algorithms [Foster]

■  Map workload problem on an execution environment
  □  Concurrency & locality for speedup, scalability
■  Four distinct stages of a methodological approach
■  A) Search for concurrency and scalability
  □  Partitioning – Decompose computation and data into small tasks
  □  Communication – Define necessary coordination of task execution
■  B) Search for locality and performance
  □  Agglomeration – Consider performance and implementation costs
  □  Mapping – Maximize processor utilization, minimize communication

Page 86: OpenHPI - Parallel Programming Concepts - Week 6


And that’s it …

Page 87: OpenHPI - Parallel Programming Concepts - Week 6

The End

■  Parallel programming is exciting again!
  □  From massively parallel hardware to complex software
  □  From abstract design patterns to specific languages
  □  From deadlock freedom to extreme performance tuning
■  Some general concepts are established
  □  Take this course as a starting point
  □  Learn from the high-performance computing community
■  Thanks for your participation
  □  Lively discussion, directly and in the forums; we learned a lot
  □  Sorry for technical flaws and content errors
■  Please use the feedback link

Page 88: OpenHPI - Parallel Programming Concepts - Week 6

Lecturer Contact

■  Operating Systems and Middleware Group at HPI
  http://www.dcl.hpi.uni-potsdam.de
■  Dr. Peter Tröger
  http://www.troeger.eu
  http://twitter.com/ptroeger
  http://www.linkedin.com/in/ptroeger
  [email protected]
■  M.Sc. Frank Feinbube
  http://www.feinbube.de
  http://www.linkedin.com/in/feinbube
  [email protected]