
Chapter 3 Parallel Programming Models

Upload: melvyn-anthony

Post on 02-Jan-2016


Page 1: Chapter 3 Parallel Programming Models. Abstraction Machine Level – Looks at hardware, OS, buffers Architectural models – Looks at interconnection network,

Chapter 3

Parallel Programming Models


Abstraction

• Machine level – looks at hardware, OS, buffers
• Architectural models – looks at interconnection network, memory organization, synchronous & asynchronous
• Computational model – cost models, algorithm complexity, RAM vs. PRAM
• Programming model – uses a programming-language description of processes


Control Flows

• Process – separate address spaces; distributed memory
• Thread – shares an address space; shared memory
• Created statically (as in MPI-1) or dynamically at run time (MPI-2 allows this, as do Pthreads)


Parallelization of a Program

• Decomposition of the computations
– Can be done at many levels (e.g., pipelining)
– Divide the work into tasks and identify the dependencies between tasks
– Can be done statically (at compile time) or dynamically (at run time)
– The number of tasks places an upper bound on the parallelism that can be exploited
– Granularity: the computation time of a task


Assignment of Tasks

• The number of processes or threads does not need to equal the number of processors
• Load balancing: each process/thread gets the same amount of work (computation, memory access, communication)
• Have tasks that use the same memory execute on the same thread (good cache use)
• Scheduling: the assignment of tasks to threads/processes


Assignment to Processors

• 1-to-1: map each process/thread to a unique processor
• Many-to-1: map several processes to a single processor (load-balancing issues)
• Done by the OS or by the programmer


Scheduling

• Precedence constraints – dependencies between tasks
• Capacity constraints – a fixed number of processors
• Goal: meet all constraints and finish in minimum time


Levels of Parallelism

• Instruction level
• Data level
• Loop level
• Functional level


Instruction Level

• Executing multiple instructions in parallel. May have problems with dependencies:
– Flow dependency – the next instruction needs a value computed by the previous instruction
– Anti-dependency – an instruction reads a value from a register or memory location while the next instruction stores a value into that place (so the order of the instructions cannot be reversed)
– Output dependency – two instructions store into the same location


Data Level

• The same operation applied to different elements of a large data structure
• If the operations are independent, the data can be distributed among the processors
• One single control flow
• SIMD


Loop Level

• If there are no dependencies between the iterations of a loop, then each iteration can be done independently, in parallel

• Similar to data parallelism
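A dependence-free loop can be handed directly to a pool of workers. A minimal sketch in Python (not from the slides; the loop body is a made-up example, and a thread pool stands in for the processors):

```python
from concurrent.futures import ThreadPoolExecutor

def body(i):
    # A loop body with no dependence on other iterations.
    return i * i

# Sequential version: results = [body(i) for i in range(8)]
# Parallel version: each iteration may run on a different worker thread.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(body, range(8)))
```

Because the iterations are independent, the parallel loop produces the same result as the sequential one, in the same order.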


Functional Level

• Look at the parts of a program and determine which parts can be done independently.
• Use a dependency graph to find the dependencies/independencies
• Static or dynamic assignment of tasks to processors
– A dynamic assignment would use a task pool


Explicit/Implicit Parallelism Expression

• Language dependent
• Some languages hide the parallelism in the language
• For other languages, you must explicitly state the parallelism


Parallelizing Compilers

• Takes a program in a sequential language and generates parallel code
– Must analyze the dependencies and not violate them
– Should provide good load balancing (difficult)
– Should minimize communication
• Functional programming languages
– Express computations as the evaluation of functions with no side effects
– This allows for parallel evaluation


More explicit/implicit

• Explicit parallelism / implicit distribution
– The language explicitly states the parallelism in the algorithm but lets the system assign the tasks to processors.
• Explicit assignment to processors
– The programmer maps tasks to processors but does not have to manage communication.
• Explicit communication and synchronization
– As in MPI: the programmer must additionally state communication and synchronization points explicitly.


Parallel Programming Patterns

• Process/thread creation
• Fork-join
• Parbegin-parend
• SPMD, SIMD
• Master-slave (worker)
• Client-server
• Pipelining
• Task pools
• Producer-consumer


Process/Thread Creation

• Static or dynamic
• Threads: traditionally dynamic
• Processes: traditionally static, but dynamic creation has recently become available


Fork-Join

• An existing thread can create a number of child threads with a fork.
• The child threads work in parallel.
• Join waits for all the forked children to terminate.
• Spawn/exit is similar.
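The fork-join pattern maps directly onto thread creation and join calls. A minimal sketch using Python's threading module (illustrative only; the per-child work, a partial sum, is invented):

```python
import threading

partial = [0] * 4  # one result slot per child thread

def work(tid):
    # Each child computes its share of the total in parallel.
    partial[tid] = sum(range(tid * 10, (tid + 1) * 10))

threads = [threading.Thread(target=work, args=(t,)) for t in range(4)]
for t in threads:
    t.start()      # "fork": children begin running in parallel
for t in threads:
    t.join()       # "join": wait for every child to terminate

total = sum(partial)  # safe to combine results only after the joins
```

The parent may only read the children's results after the joins; before that, a child may not have finished writing its slot.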


Parbegin-Parend

• Also called cobegin-coend
• Each statement (block/function call) in the cobegin-coend block is to be executed in parallel.
• Statements after coend are not executed until all the parallel statements are complete.


SPMD – SIMD

• Single Program, Multiple Data vs. Single Instruction, Multiple Data
• Both use a number of threads/processes that apply the same program to different data
• SIMD executes the statements synchronously on different data
• SPMD executes the statements asynchronously


Master-Slave

• One thread/process controls all the others
• With dynamic thread/process creation, the master is usually the one that creates the workers.
• The master “assigns” work to the workers, and the workers send their results back to the master.


Client-Server

• Multiple clients connect to a server that responds to requests
• The server can satisfy requests in parallel (multiple requests handled concurrently, or a parallel solution to a single involved request)
• The client also does some work with the response from the server.
• A very good model for heterogeneous systems


Pipelining

• The output of one thread is the input to another thread
• A special type of functional decomposition
• Another case where heterogeneous systems are useful


Task Pools

• Keep a collection of tasks to be done and the data to operate on
• A thread/process can generate new tasks to add to the pool, and obtains another task from the pool when it finishes its current one
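A task pool can be sketched with a shared work queue from which workers take tasks and into which they may push new ones. A Python illustration (the "halve the number" task is a made-up example; the sentinel-based shutdown is one common convention, not the only one):

```python
import queue
import threading

tasks = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    while True:
        n = tasks.get()
        if n is None:              # sentinel: shut this worker down
            tasks.task_done()
            return
        if n > 1:
            tasks.put(n // 2)      # a worker may add new tasks to the pool
        with results_lock:
            results.append(n)
        tasks.task_done()          # finish this task after queuing any children

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()
tasks.put(8)                       # seed the pool with one initial task
tasks.join()                       # wait until the pool is fully drained
for _ in workers:
    tasks.put(None)                # one sentinel per worker
for w in workers:
    w.join()
```

`task_done` is called only after a task's child tasks are queued, so `tasks.join()` cannot return while generated work is still outstanding.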


Producer-Consumer

• Producer threads create data used as input by the consumer threads
• Data is stored in a common buffer that is accessed by producers and consumers
• A producer cannot add if the buffer is full
• A consumer cannot remove if the buffer is empty
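A bounded buffer with blocking put/get gives exactly these rules. A sketch using Python's `queue.Queue`, whose `maxsize` makes `put` block when the buffer is full and `get` block when it is empty (the item values and the `None` end-of-stream sentinel are illustrative):

```python
import queue
import threading

buffer = queue.Queue(maxsize=2)   # bounded buffer shared by both threads
consumed = []

def producer():
    for item in range(5):
        buffer.put(item)          # blocks while the buffer is full
    buffer.put(None)              # sentinel: production finished

def consumer():
    while True:
        item = buffer.get()       # blocks while the buffer is empty
        if item is None:
            break
        consumed.append(item)

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()
```

The queue's internal lock and condition variables implement the full/empty blocking, so neither thread needs to poll the buffer.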


Array Data Distributions

• 1-D
– Blockwise
• Each process gets ceil(n/p) elements of A, except for the last process, which gets n - (p-1)*ceil(n/p) elements
• Alternatively, the first n mod p processes get ceil(n/p) elements while the rest get floor(n/p) elements.
– Cyclic
• Process i gets elements i, i+p, i+2p, … (i.e., element k*p+i for k = 0, 1, … while k*p+i < n)
– Block-cyclic
• Distribute blocks of size b to the processes in a cyclic manner
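Each 1-D distribution can be expressed as a small owner function mapping an element index to a process rank. A sketch (the function names are made up for illustration; the blockwise variant is the first one above, where the last process gets the remainder):

```python
import math

def block_owner(i, n, p):
    # Blockwise: processes own contiguous chunks of ceil(n/p) elements.
    return i // math.ceil(n / p)

def cyclic_owner(i, n, p):
    # Cyclic: element i goes to process i mod p.
    return i % p

def block_cyclic_owner(i, n, p, b):
    # Block-cyclic: blocks of size b are dealt out round-robin.
    return (i // b) % p

# n = 8 elements over p = 3 processes:
block = [block_owner(i, 8, 3) for i in range(8)]
cyclic = [cyclic_owner(i, 8, 3) for i in range(8)]
bcyclic = [block_cyclic_owner(i, 8, 3, 2) for i in range(8)]
```

For n = 8, p = 3 this yields owners [0,0,0,1,1,1,2,2] (blockwise), [0,1,2,0,1,2,0,1] (cyclic), and [0,0,1,1,2,2,0,0] (block-cyclic with b = 2).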


2-D Array Distribution

• Blockwise distribution of rows or columns
• Cyclic distribution of rows or columns
• Blockwise-cyclic distribution of rows or columns


Checkerboard

• Take an array of size n x m
• Overlay a grid of size g x f
– g <= n
– f <= m
– Most easily seen if n is a multiple of g and m is a multiple of f
• Blockwise checkerboard
– Assign each n/g x m/f submatrix to a processor


Cyclic Checkerboard

• Take each item in an n/g x m/f submatrix and assign it in a cyclic manner.
• Block-cyclic checkerboard
– Take each n/g x m/f submatrix and assign all the data in the submatrix to a processor in a cyclic fashion


Information Exchange

• Shared variables
– Used in shared memory
– When thread T1 wants to share information with thread T2, T1 writes the information into a variable that is shared with T2
– Must avoid two or more threads reading and writing the same variable at the same time (race condition)
– Races lead to non-deterministic behavior.


Critical Sections

• Sections of code where there may be concurrent accesses to shared variables
• Must make these sections mutually exclusive
– Only one process can execute the section at any one time
• A lock mechanism is used to keep sections mutually exclusive
– A process checks whether the section is “open”
– If it is, it “locks” the section and executes it (unlocking when done)
– If not, it waits until the section is unlocked
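The check-lock-execute-unlock protocol is what a mutex provides. A sketch with Python's `threading.Lock` guarding a shared counter (the counter and iteration counts are illustrative):

```python
import threading

counter = 0
lock = threading.Lock()

def increment(times):
    global counter
    for _ in range(times):
        with lock:          # critical section: at most one thread inside
            counter += 1    # read-modify-write of a shared variable

threads = [threading.Thread(target=increment, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock, the final count is exactly 4 * 10000; without it, interleaved read-modify-write sequences could lose updates.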


Communication Operations

• Single transfer – Pi sends a message to Pj
• Single broadcast – one process sends the same data to all other processes
• Single accumulation – many values are combined by an operation into a single value that is placed in the root
• Gather – each process provides a block of data to a single common process
• Scatter – the root process sends a separate block of a large data structure to every other process


More Communications

• Multi-broadcast – every process sends its data to every other process, so every process ends up with all the data that was spread out among the processes
• Multi-accumulation – like accumulation, but every process gets the result
• Total exchange – each process provides p data blocks; its i-th block is sent to process Pi. Each process receives p blocks and builds a structure from them in index order.


Applications

• Parallel matrix-vector product
– c = Ab, where A is n x m, b has length m, and c has length n
– Want A to be in contiguous memory
• A single array, not an array of arrays
– Have blocks of rows, together with all of b, calculate a block of c
• Used if A is stored row-wise
– Have blocks of columns, with a block of b, compute partial column results that must then be summed
• Used if A is stored column-wise
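The row-blockwise variant can be sketched as follows. The p "processes" are simulated sequentially here, since the point is only which block of c each one computes from its rows and all of b (the function name and data are illustrative):

```python
def matvec_row_blocks(A, b, p):
    # A is n x m, stored as a list of rows. Each of the p "processes"
    # owns a contiguous block of ceil(n/p) rows plus all of b, and
    # computes the corresponding block of c = A b.
    n = len(A)
    block = -(-n // p)                 # ceil(n / p) without math.ceil
    c = [0] * n
    for proc in range(p):
        for i in range(proc * block, min((proc + 1) * block, n)):
            c[i] = sum(A[i][j] * b[j] for j in range(len(b)))
    return c

A = [[1, 2], [3, 4], [5, 6]]   # n = 3, m = 2
b = [1, 1]
c = matvec_row_blocks(A, b, 2)
```

Each block of c is computed without communication; in the column-blockwise variant, by contrast, the per-process partial vectors would have to be summed at the end.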


Processes and Threads

• Process – a program in execution
– Includes the code, program data on the stack or heap, register values, and the PC
– Assigned to a processor or core for execution
– If there are more processes than resources (processors or memory) for all of them, they execute in a round-robin, time-shared manner
– Context switch – changing which process is executing on a processor


Fork

• The Unix fork command
– Creates a new process
– Makes a copy of the program
– The copy starts at the statement after the fork
– NOT a shared-memory model – a distributed-memory model
– Can take a while to execute


Threads

• Share a single address space
• Work best with physically shared memory
• Easier to start than a process – no copy of the code space
• Two types
– Kernel threads – managed by the OS
– User threads – managed by a thread library


Thread Execution

• If user threads are executed by a thread library/scheduler (no OS support for threads), then all the threads are part of one process that is scheduled by the OS
– Only one thread executes at a time, even if there are multiple processors
• If the OS has thread management, then threads can be scheduled by the OS and multiple threads can execute concurrently
• Alternatively, a thread scheduler can map user threads to kernel threads (several user threads may map to one kernel thread)


Thread States

• Newly generated
• Executable
• Running
• Waiting
• Finished
• Threads transition from state to state based on events (start, interrupt, end, block, unblock, assign-to-processor)


Synchronization

• Locks
– A process “locks” a shared variable at the beginning of a critical section
• The lock allows the process to proceed if the shared variable is unlocked
• The process is blocked if the variable is locked, until it is unlocked
• Locking is an “atomic” operation


Semaphore

• Usually a binary type, but can be an integer
• wait(s)
– Waits until the value of s is 1 (or greater)
– When it is, decrements s by 1 and continues
• signal(s)
– Increments s by 1
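Python's `threading.Semaphore` provides wait/signal under the names `acquire`/`release`. A sketch showing that no more threads than the semaphore's initial value can be "inside" at once (the `active`/`peak` bookkeeping is illustrative, not part of the semaphore):

```python
import threading

s = threading.Semaphore(2)   # counting semaphore initialized to 2
active = 0                   # threads currently past the wait
peak = 0                     # highest value active ever reached
guard = threading.Lock()     # protects the bookkeeping variables

def task():
    global active, peak
    s.acquire()              # wait(s): block until s > 0, then decrement
    with guard:
        active += 1
        peak = max(peak, active)
    with guard:
        active -= 1
    s.release()              # signal(s): increment s

threads = [threading.Thread(target=task) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

However the five threads interleave, `peak` can never exceed 2, because a third thread blocks in `acquire` until some holder calls `release`.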


Barrier Synchronization

• A way to have every process wait until all processes have reached a certain point
• Guarantees the state of every process before certain code is executed
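A sketch using Python's `threading.Barrier`: no thread's post-barrier work starts until every thread has arrived (the logging is illustrative):

```python
import threading

barrier = threading.Barrier(3)   # all 3 threads must arrive before any proceeds
log = []
log_lock = threading.Lock()

def phase_worker(tid):
    with log_lock:
        log.append(("before", tid))   # pre-barrier phase
    barrier.wait()                    # block until all 3 threads reach this line
    with log_lock:
        log.append(("after", tid))    # post-barrier phase

threads = [threading.Thread(target=phase_worker, args=(t,)) for t in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Whatever order the threads run in, every "before" entry appears in the log ahead of every "after" entry, which is exactly the state guarantee described above.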


Condition Synchronization

• A thread is blocked until a given condition is established
– If the condition is not true, the thread is put into the blocked state
– When the condition becomes true, the thread is moved from blocked to ready (not necessarily directly onto a processor)
– Since other threads may execute in the meantime, by the time this thread gets a processor the condition may no longer be true
• So the thread must re-check the condition after being woken
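The re-check requirement is why condition variables are always used with a while loop, not an if. A sketch with Python's `threading.Condition` (the shared list and the value 42 are illustrative):

```python
import threading

cond = threading.Condition()
items = []                        # shared state guarded by cond's lock

def waiter(out):
    with cond:
        while not items:          # re-check: the condition may be false
            cond.wait()           # again by the time this thread runs
        out.append(items.pop())

def setter():
    with cond:
        items.append(42)          # establish the condition
        cond.notify()             # move a blocked thread to ready

got = []
w = threading.Thread(target=waiter, args=(got,))
s = threading.Thread(target=setter)
w.start()
s.start()
w.join()
s.join()
```

If another thread consumed the item between the wakeup and the waiter actually running, the while loop would simply put the waiter back to sleep instead of popping an empty list.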


Efficient Thread Programs

• Use the proper number of threads
– Consider the degree of parallelism in the application
– The number of processors
– The size of the shared cache
• Avoid synchronization as much as possible
– Make critical sections as small as possible
• Watch for deadlock conditions


Memory Access

• Must consider writes to shared memory that is also held in local caches
• False sharing
– Consider two processes writing to different memory locations
– This SHOULD not be an issue, since no location is shared between the two caches
– HOWEVER, if the memory locations are close to each other, they may fall in the same cache line; that line is then held in both caches, and each write invalidates the other cache's copy even though the locations are distinct