
Chapter 3 Parallel Programming Models

Upload: melvyn-anthony

Post on 02-Jan-2016


Page 1: Chapter 3 Parallel Programming Models. Abstraction Machine Level – Looks at hardware, OS, buffers Architectural models – Looks at interconnection network,

Chapter 3

Parallel Programming Models


Abstraction

• Machine level – looks at hardware, OS, buffers
• Architectural models – looks at interconnection network, memory organization, synchronous & asynchronous
• Computational model – cost models, algorithm complexity, RAM vs. PRAM
• Programming model – uses a programming-language description of processes


Control Flows

• Process – separate address spaces; distributed memory
• Thread – shares an address space; shared memory
• Created statically (as in MPI-1) or dynamically at run time (MPI-2 allows this, as do Pthreads)


Parallelization of a Program

• Decomposition of the computations
– Can be done at many levels (e.g., pipelining)
– Divide the work into tasks and identify the dependencies between tasks
– Can be done statically (at compile time) or dynamically (at run time)
– The number of tasks places an upper bound on the parallelism that can be exploited
– Granularity: the computation time of a task


Assignment of Tasks

• The number of processes or threads does not need to equal the number of processors
• Load balancing: each process/thread gets the same amount of work (computation, memory access, communication)
• Have tasks that use the same memory execute on the same thread (good cache use)
• Scheduling: the assignment of tasks to threads/processes


Assignment to Processors

• 1-to-1: map each process/thread to a unique processor
• Many-to-1: map several processes to a single processor (load-balancing issues)
• Done by the OS or by the programmer


Scheduling

• Precedence constraints – dependencies between tasks
• Capacity constraints – a fixed number of processors
• Goal: meet all constraints and finish in minimum time


Levels of Parallelism

• Instruction level
• Data level
• Loop level
• Functional level


Instruction Level

• Executing multiple instructions in parallel. May have problems with dependencies:
– Flow dependency – the next instruction needs a value computed by the previous instruction
– Anti-dependency – an instruction reads a value from a register or memory location while the next instruction stores a value into that place (so the order of the instructions cannot be reversed)
– Output dependency – two instructions store into the same location


Data Level

• The same operation applied to different elements of a large data structure
• If the operations are independent, the data can be distributed among the processors
• One single control flow
• SIMD


Loop Level

• If there are no dependencies between the iterations of a loop, then each iteration can be done independently, in parallel

• Similar to data parallelism
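A dependence-free loop can be handed directly to a pool of workers. A minimal sketch in Python (not from the slides; the loop body is a made-up example, and a thread pool stands in for the processors):

```python
from concurrent.futures import ThreadPoolExecutor

def body(i):
    # A loop body with no dependence on other iterations.
    return i * i

# Sequential version: results = [body(i) for i in range(8)]
# Parallel version: each iteration may run on a different worker thread.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(body, range(8)))
```

Because the iterations are independent, the parallel loop produces the same result as the sequential one, in the same order.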


Functional Level

• Look at the parts of a program and determine which parts can be done independently.
• Use a dependency graph to find the dependencies/independencies
• Static or dynamic assignment of tasks to processors
– A dynamic assignment would use a task pool


Explicit/Implicit Parallelism Expression

• Language dependent
• Some languages hide the parallelism in the language
• For other languages, you must explicitly state the parallelism


Parallelizing Compilers

• Takes a program in a sequential language and generates parallel code
– Must analyze the dependencies and not violate them
– Should provide good load balancing (difficult)
– Should minimize communication
• Functional programming languages
– Express computations as the evaluation of functions with no side effects
– This allows for parallel evaluation


More explicit/implicit

• Explicit parallelism / implicit distribution
– The language explicitly states the parallelism in the algorithm but lets the system assign the tasks to processors.
• Explicit assignment to processors
– The programmer maps tasks to processors but does not have to manage communication.
• Explicit communication and synchronization
– As in MPI: the programmer must additionally state communication and synchronization points explicitly.


Parallel Programming Patterns

• Process/thread creation
• Fork-join
• Parbegin-parend
• SPMD, SIMD
• Master-slave (worker)
• Client-server
• Pipelining
• Task pools
• Producer-consumer


Process/Thread Creation

• Static or dynamic
• Threads: traditionally dynamic
• Processes: traditionally static, but dynamic creation has recently become available


Fork-Join

• An existing thread can create a number of child threads with a fork.
• The child threads work in parallel.
• Join waits for all the forked children to terminate.
• Spawn/exit is similar.
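The fork-join pattern maps directly onto thread creation and join calls. A minimal sketch using Python's threading module (illustrative only; the per-child work, a partial sum, is invented):

```python
import threading

partial = [0] * 4  # one result slot per child thread

def work(tid):
    # Each child computes its share of the total in parallel.
    partial[tid] = sum(range(tid * 10, (tid + 1) * 10))

threads = [threading.Thread(target=work, args=(t,)) for t in range(4)]
for t in threads:
    t.start()      # "fork": children begin running in parallel
for t in threads:
    t.join()       # "join": wait for every child to terminate

total = sum(partial)  # safe to combine results only after the joins
```

The parent may only read the children's results after the joins; before that, a child may not have finished writing its slot.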


Parbegin-Parend

• Also called cobegin-coend
• Each statement (block/function call) in the cobegin-coend block is to be executed in parallel.
• Statements after coend are not executed until all the parallel statements are complete.


SPMD – SIMD

• Single Program, Multiple Data vs. Single Instruction, Multiple Data
• Both use a number of threads/processes that apply the same program to different data
• SIMD executes the statements synchronously on different data
• SPMD executes the statements asynchronously


Master-Slave

• One thread/process controls all the others
• With dynamic thread/process creation, the master is usually the one that creates the workers.
• The master “assigns” work to the workers, and the workers send their results back to the master.


Client-Server

• Multiple clients connect to a server that responds to requests
• The server can satisfy requests in parallel (multiple requests handled concurrently, or a parallel solution to a single involved request)
• The client also does some work with the response from the server.
• A very good model for heterogeneous systems


Pipelining

• The output of one thread is the input to another thread
• A special type of functional decomposition
• Another case where heterogeneous systems are useful


Task Pools

• Keep a collection of tasks to be done and the data to operate on
• A thread/process can generate new tasks to add to the pool, and obtains another task from the pool when it finishes its current one
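A task pool can be sketched with a shared work queue from which workers take tasks and into which they may push new ones. A Python illustration (the "halve the number" task is a made-up example; the sentinel-based shutdown is one common convention, not the only one):

```python
import queue
import threading

tasks = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    while True:
        n = tasks.get()
        if n is None:              # sentinel: shut this worker down
            tasks.task_done()
            return
        if n > 1:
            tasks.put(n // 2)      # a worker may add new tasks to the pool
        with results_lock:
            results.append(n)
        tasks.task_done()          # finish this task after queuing any children

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()
tasks.put(8)                       # seed the pool with one initial task
tasks.join()                       # wait until the pool is fully drained
for _ in workers:
    tasks.put(None)                # one sentinel per worker
for w in workers:
    w.join()
```

`task_done` is called only after a task's child tasks are queued, so `tasks.join()` cannot return while generated work is still outstanding.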


Producer-Consumer

• Producer threads create data used as input by the consumer threads
• Data is stored in a common buffer that is accessed by producers and consumers
• A producer cannot add if the buffer is full
• A consumer cannot remove if the buffer is empty
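A bounded buffer with blocking put/get gives exactly these rules. A sketch using Python's `queue.Queue`, whose `maxsize` makes `put` block when the buffer is full and `get` block when it is empty (the item values and the `None` end-of-stream sentinel are illustrative):

```python
import queue
import threading

buffer = queue.Queue(maxsize=2)   # bounded buffer shared by both threads
consumed = []

def producer():
    for item in range(5):
        buffer.put(item)          # blocks while the buffer is full
    buffer.put(None)              # sentinel: production finished

def consumer():
    while True:
        item = buffer.get()       # blocks while the buffer is empty
        if item is None:
            break
        consumed.append(item)

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()
```

The queue's internal lock and condition variables implement the full/empty blocking, so neither thread needs to poll the buffer.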


Array Data Distributions

• 1-D
– Blockwise
• Each process gets ceil(n/p) elements of A, except for the last process, which gets n - (p-1)*ceil(n/p) elements
• Alternatively, the first n mod p processes get ceil(n/p) elements while the rest get floor(n/p) elements.
– Cyclic
• Process i gets elements i, i+p, i+2p, … (i.e., element k*p+i for k = 0, 1, … while k*p+i < n)
– Block-cyclic
• Distribute blocks of size b to the processes in a cyclic manner
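Each 1-D distribution can be expressed as a small owner function mapping an element index to a process rank. A sketch (the function names are made up for illustration; the blockwise variant is the first one above, where the last process gets the remainder):

```python
import math

def block_owner(i, n, p):
    # Blockwise: processes own contiguous chunks of ceil(n/p) elements.
    return i // math.ceil(n / p)

def cyclic_owner(i, n, p):
    # Cyclic: element i goes to process i mod p.
    return i % p

def block_cyclic_owner(i, n, p, b):
    # Block-cyclic: blocks of size b are dealt out round-robin.
    return (i // b) % p

# n = 8 elements over p = 3 processes:
block = [block_owner(i, 8, 3) for i in range(8)]
cyclic = [cyclic_owner(i, 8, 3) for i in range(8)]
bcyclic = [block_cyclic_owner(i, 8, 3, 2) for i in range(8)]
```

For n = 8, p = 3 this yields owners [0,0,0,1,1,1,2,2] (blockwise), [0,1,2,0,1,2,0,1] (cyclic), and [0,0,1,1,2,2,0,0] (block-cyclic with b = 2).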


2-D Array Distribution

• Blockwise distribution of rows or columns
• Cyclic distribution of rows or columns
• Blockwise-cyclic distribution of rows or columns


Checkerboard

• Take an array of size n x m
• Overlay a grid of size g x f
– g <= n
– f <= m
– Most easily seen if n is a multiple of g and m is a multiple of f
• Blockwise checkerboard
– Assign each n/g x m/f submatrix to a processor


Cyclic Checkerboard

• Take each item in an n/g x m/f submatrix and assign it in a cyclic manner.
• Block-cyclic checkerboard
– Take each n/g x m/f submatrix and assign all the data in the submatrix to a processor in a cyclic fashion


Information Exchange

• Shared variables
– Used in shared memory
– When thread T1 wants to share information with thread T2, T1 writes the information into a variable that is shared with T2
– Must avoid two or more threads reading and writing the same variable at the same time (race condition)
– Races lead to non-deterministic behavior.


Critical Sections

• Sections of code where there may be concurrent accesses to shared variables
• Must make these sections mutually exclusive
– Only one process can execute the section at any one time
• A lock mechanism is used to keep sections mutually exclusive
– A process checks whether the section is “open”
– If it is, it “locks” the section and executes it (unlocking when done)
– If not, it waits until the section is unlocked
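The check-lock-execute-unlock protocol is what a mutex provides. A sketch with Python's `threading.Lock` guarding a shared counter (the counter and iteration counts are illustrative):

```python
import threading

counter = 0
lock = threading.Lock()

def increment(times):
    global counter
    for _ in range(times):
        with lock:          # critical section: at most one thread inside
            counter += 1    # read-modify-write of a shared variable

threads = [threading.Thread(target=increment, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock, the final count is exactly 4 * 10000; without it, interleaved read-modify-write sequences could lose updates.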


Communication Operations

• Single transfer – Pi sends a message to Pj
• Single broadcast – one process sends the same data to all other processes
• Single accumulation – many values are combined by an operation into a single value that is placed in the root
• Gather – each process provides a block of data to a single common process
• Scatter – the root process sends a separate block of a large data structure to every other process


More Communications

• Multi-broadcast – every process sends its data to every other process, so every process ends up with all the data that was spread out among the processes
• Multi-accumulation – like accumulation, but every process gets the result
• Total exchange – each process provides p data blocks; its i-th block is sent to process Pi. Each process receives p blocks and builds a structure from them in index order.


Applications

• Parallel matrix-vector product
– c = Ab, where A is n x m, b has length m, and c has length n
– Want A to be in contiguous memory
• A single array, not an array of arrays
– Have blocks of rows, together with all of b, calculate a block of c
• Used if A is stored row-wise
– Have blocks of columns, with a block of b, compute partial column results that must then be summed
• Used if A is stored column-wise
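The row-blockwise variant can be sketched as follows. The p "processes" are simulated sequentially here, since the point is only which block of c each one computes from its rows and all of b (the function name and data are illustrative):

```python
def matvec_row_blocks(A, b, p):
    # A is n x m, stored as a list of rows. Each of the p "processes"
    # owns a contiguous block of ceil(n/p) rows plus all of b, and
    # computes the corresponding block of c = A b.
    n = len(A)
    block = -(-n // p)                 # ceil(n / p) without math.ceil
    c = [0] * n
    for proc in range(p):
        for i in range(proc * block, min((proc + 1) * block, n)):
            c[i] = sum(A[i][j] * b[j] for j in range(len(b)))
    return c

A = [[1, 2], [3, 4], [5, 6]]   # n = 3, m = 2
b = [1, 1]
c = matvec_row_blocks(A, b, 2)
```

Each block of c is computed without communication; in the column-blockwise variant, by contrast, the per-process partial vectors would have to be summed at the end.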


Processes and Threads

• Process – a program in execution
– Includes the code, program data on the stack or heap, register values, and the PC
– Assigned to a processor or core for execution
– If there are more processes than resources (processors or memory) for all of them, they execute in a round-robin, time-shared manner
– Context switch – changing which process is executing on a processor


Fork

• The Unix fork command
– Creates a new process
– Makes a copy of the program
– The copy starts at the statement after the fork
– NOT a shared-memory model – a distributed-memory model
– Can take a while to execute


Threads

• Share a single address space
• Work best with physically shared memory
• Easier to start than a process – no copy of the code space
• Two types
– Kernel threads – managed by the OS
– User threads – managed by a thread library


Thread Execution

• If user threads are executed by a thread library/scheduler (no OS support for threads), then all the threads are part of one process that is scheduled by the OS
– Only one thread executes at a time, even if there are multiple processors
• If the OS has thread management, then threads can be scheduled by the OS and multiple threads can execute concurrently
• Alternatively, a thread scheduler can map user threads to kernel threads (several user threads may map to one kernel thread)


Thread States

• Newly generated
• Executable
• Running
• Waiting
• Finished
• Threads transition from state to state based on events (start, interrupt, end, block, unblock, assign-to-processor)


Synchronization

• Locks
– A process “locks” a shared variable at the beginning of a critical section
• The lock allows the process to proceed if the shared variable is unlocked
• The process is blocked if the variable is locked, until it is unlocked
• Locking is an “atomic” operation


Semaphore

• Usually a binary type, but can be an integer
• wait(s)
– Waits until the value of s is 1 (or greater)
– When it is, decrements s by 1 and continues
• signal(s)
– Increments s by 1
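Python's `threading.Semaphore` provides wait/signal under the names `acquire`/`release`. A sketch showing that no more threads than the semaphore's initial value can be "inside" at once (the `active`/`peak` bookkeeping is illustrative, not part of the semaphore):

```python
import threading

s = threading.Semaphore(2)   # counting semaphore initialized to 2
active = 0                   # threads currently past the wait
peak = 0                     # highest value active ever reached
guard = threading.Lock()     # protects the bookkeeping variables

def task():
    global active, peak
    s.acquire()              # wait(s): block until s > 0, then decrement
    with guard:
        active += 1
        peak = max(peak, active)
    with guard:
        active -= 1
    s.release()              # signal(s): increment s

threads = [threading.Thread(target=task) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

However the five threads interleave, `peak` can never exceed 2, because a third thread blocks in `acquire` until some holder calls `release`.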


Barrier Synchronization

• A way to have every process wait until all processes have reached a certain point
• Guarantees the state of every process before certain code is executed
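A sketch using Python's `threading.Barrier`: no thread's post-barrier work starts until every thread has arrived (the logging is illustrative):

```python
import threading

barrier = threading.Barrier(3)   # all 3 threads must arrive before any proceeds
log = []
log_lock = threading.Lock()

def phase_worker(tid):
    with log_lock:
        log.append(("before", tid))   # pre-barrier phase
    barrier.wait()                    # block until all 3 threads reach this line
    with log_lock:
        log.append(("after", tid))    # post-barrier phase

threads = [threading.Thread(target=phase_worker, args=(t,)) for t in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Whatever order the threads run in, every "before" entry appears in the log ahead of every "after" entry, which is exactly the state guarantee described above.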


Condition Synchronization

• A thread is blocked until a given condition is established
– If the condition is not true, the thread is put into the blocked state
– When the condition becomes true, the thread is moved from blocked to ready (not necessarily directly onto a processor)
– Since other threads may execute in the meantime, by the time this thread gets a processor the condition may no longer be true
• So the thread must re-check the condition after being woken
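The re-check requirement is why condition variables are always used with a while loop, not an if. A sketch with Python's `threading.Condition` (the shared list and the value 42 are illustrative):

```python
import threading

cond = threading.Condition()
items = []                        # shared state guarded by cond's lock

def waiter(out):
    with cond:
        while not items:          # re-check: the condition may be false
            cond.wait()           # again by the time this thread runs
        out.append(items.pop())

def setter():
    with cond:
        items.append(42)          # establish the condition
        cond.notify()             # move a blocked thread to ready

got = []
w = threading.Thread(target=waiter, args=(got,))
s = threading.Thread(target=setter)
w.start()
s.start()
w.join()
s.join()
```

If another thread consumed the item between the wakeup and the waiter actually running, the while loop would simply put the waiter back to sleep instead of popping an empty list.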


Efficient Thread Programs

• Use the proper number of threads
– Consider the degree of parallelism in the application
– The number of processors
– The size of the shared cache
• Avoid synchronization as much as possible
– Make critical sections as small as possible
• Watch for deadlock conditions


Memory Access

• Must consider writes to shared memory that is also held in local caches
• False sharing
– Consider two processes writing to different memory locations
– This SHOULD not be an issue, since no location is shared between the two caches
– HOWEVER, if the memory locations are close to each other, they may fall in the same cache line; that line is then held in both caches, and each write invalidates the other cache's copy even though the locations are distinct