Seminar: Shared Memory Programming
DESCRIPTION
Seminar: Shared Memory Programming, Week 7_2. Nguyễn Thị Xuân Mai (50501620), Phạm Thị Trúc Linh (50501491), Vũ Thị Mai Phương (50502194), Nguyễn Minh Quân (50502771), Phí Văn Tuấn (50503347). Topics: programming with shared memory; creating concurrent processes; threads.
TRANSCRIPT
Seminar
Shared memory Programming
Week 7 _ 2
Nguyễn Thị Xuân Mai 50501620 · Phạm Thị Trúc Linh 50501491 · Vũ Thị Mai Phương 50502194 · Nguyễn Minh Quân 50502771 · Phí Văn Tuấn 50503347
Programming with shared memory
1. Shared memory multiprocessors
2. Constructs for specifying parallelism
   - Creating concurrent processes
   - Threads
3. Sharing data
   - Creating shared data
   - Accessing shared data: locks, deadlock, semaphores, monitors, condition variables
   - Language constructs for parallelism
4. Dependency analysis
5. Shared data in systems with caches
Shared Memory Multiprocessors
• In a shared memory system, any memory location can be accessed by any of the processors.
• A single address space exists, meaning that each memory location is given a unique address within a single range of addresses.
• Shared-memory behavior is determined by both program order and memory access order.
Shared memory multiprocessor
Shared Memory Multiprocessors
• For a small number of processors, a common architecture is the single-bus architecture, in which all processors and memory modules attach to the same set of wires (the bus).
Shared Memory Multiprocessors
[Figure] (a) A uniprocessor system: instructions issue in a single program order (PO) to memory. (b) A multiprocessor system: the program orders PO1, PO2, …, POn of the individual processors pass through a switch into shared memory, which imposes a single global memory order.
Constructs for Specifying Parallelism
1. Creating Concurrent Processes
2. Threads
Creating Concurrent Processes
• A construct for specifying concurrent processes is the FORK–JOIN group of statements.
• A FORK statement generates one new path for a concurrent process, and the concurrent processes use JOIN statements at their ends.
• When the JOIN statements have been reached, processing continues in a sequential fashion.
Creating Concurrent Processes (cont.)
UNIX Heavyweight Processes
• Operating systems such as UNIX are based upon the notion of a process.
• On a single-processor system, the processor has to be time-shared between processes, switching from one process to another.
• Time sharing also offers the opportunity to deschedule processes that are blocked from proceeding for some reason, such as waiting for an I/O operation to complete.
• On a multiprocessor, there is an opportunity to execute processes truly concurrently.
UNIX Heavyweight Processes
• The UNIX system call fork() creates a new process. The new process (child process) is an exact copy of the calling process, except that it has a unique process ID.
• On success, fork() returns 0 to the child process and returns the process ID of the child process to the parent process.
• Processes are "joined" with the system calls wait() and exit(), defined as:
wait(statusp) — delays the caller until a signal is received or one of its child processes terminates or stops
exit(status) — terminates a process
UNIX Heavyweight Processes
• Hence, a single child process can be created by:

pid = fork();            /* fork */
/* ... code to be executed by both child and parent ... */
if (pid == 0) exit(0);   /* join */
else wait(0);
UNIX Heavyweight Processes
• If the child is to execute different code, we could use:

pid = fork();
if (pid == 0) {
    /* ... code to be executed by slave ... */
} else {
    /* ... code to be executed by parent ... */
}
if (pid == 0) exit(0);
else wait(0);
UNIX Heavyweight Processes
• All variables in the original program are duplicated in each process, becoming local variables for the process. They are initially assigned the same values as the original variables.
• The parent will wait for the slave to finish if it reaches the "join" point first; if the slave reaches the "join" point first, it will terminate.
Threads
• Thread: a thread of execution is a fork of a computer program into two or more concurrently running tasks.
• Thread mechanism: allows the tasks to share the same memory space and global variables.
Processes vs. threads:
• Dependence: processes are typically independent, while threads exist as subsets of a process.
• State information: processes carry considerable state information, whereas multiple threads within a process share state as well as memory and other resources.
• Address space: processes have separate address spaces, whereas threads share their address space.
• Interaction: processes interact only through system-provided inter-process communication mechanisms, while threads interact through shared variables or by message passing.
• Context switching: context switching between threads in the same process is typically faster than context switching between processes.
Processes & Threads
[Figure] A process contains code, heap, files, and interrupt routines, plus an instruction pointer (IP) and stack; with threads, the threads of a process share its code, heap, files, and interrupt routines, while each thread has its own IP and stack.
Multithreaded Processor Model
Parameters for analyzing the performance of such a system:
• Latency (L): the communication latency experienced on a remote memory access, including network delay, cache-miss penalty, and delays caused by contention in split transactions.
• Number of threads (N): the number of threads that can be interleaved in a processor; the context of a thread is its PC, register set, and required context status word.
• Context switch overhead (C): the time lost in performing a context switch in a processor; the switch mechanism determines the number of processor states needed to maintain active threads.
• Interval between context switches: the run length, in cycles, between context switches triggered by remote references.
Multithreaded Computation
[Figure] Threads of a parallel computation sharing a variable, showing the initial scheduling overhead and the thread synchronization overhead.
The concept of multithreading in an MPP system, in terms of processor efficiency; a processor is:
• Busy: doing useful work;
• Context switching: suspending the current context and switching to another;
• Idle: when all available contexts are suspended (blocked).
Efficiency = busy / (busy + switching + idle)
Abstract Processor Model
[Figure] Multiple-context processor model with one thread per context: the register files hold N contexts, each with its own PC and PSW (one thread context); the ALU issues local memory references directly, while remote memory references leave the processor.
Context-Switching Policies
• Switch on cache miss: switch when encountering a cache miss.
• Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.
• Switch on every instruction: switch on every instruction, independent of whether or not it is a load.
• Switch on block of instructions: improves the cache-hit ratio due to preservation of some locality, and also benefits single-context performance.
Pthreads
• History: SUN Solaris, Windows NT, … are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.
• Standard: IEEE POSIX 1003.1c standard (1995).
Executing a Pthread Thread
int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);

Return value:
• success: a new thread is created, the function returns 0, and *thread contains the thread ID;
• failure: a nonzero error code is returned (the error is also available in the variable errno).

Arguments:
• thread: a pointer of type pthread_t; receives the ID of the new thread.
• attr: contains initial attributes for the thread; if attr = NULL, the attributes are initialized to default values.
• start_routine: a reference to a function defined by the user; this function contains the code to be executed by the new thread.
• arg: a single argument passed to start_routine.

pthread_t thread;   /* handle of the special Pthread datatype */
Executing a Pthread Thread (cont.)
• pthread_exit(void *status): terminates and destroys the calling thread.
• pthread_cancel(): the thread is destroyed by another thread.
• int pthread_join(pthread_t th, void **thread_return): forces the calling thread to suspend its execution and wait until the thread with the given thread ID terminates; *thread_return contains the return value (the value of the return statement or of pthread_exit(…)).
• Detached threads: there are cases in which threads can be terminated without the need for pthread_join().
Detached Threads
When detached threads terminate, they are destroyed and their resources released. => More efficient.
[Figure] The main program issues several pthread_create() calls; each detached thread proceeds to its own termination without being joined by the main program.
Thread Pools
A master thread could control a collection of slave threads; a work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.
Thread-Safe Routines
System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.
Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.
Thread-Safe Routines (cont.)
• Suppose that your application creates several threads, each of which makes a call to the same library routine.
• This library routine accesses/modifies a global structure or location in memory.
• As each thread calls this routine, it is possible that they may try to modify this global structure/memory location at the same time.
• If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.
Sharing Data
1. Creating shared data
2. Accessing shared data: locks, deadlock, semaphores, condition variables
3. Language constructs for parallelism
4. Dependency analysis
5. Shared data in systems with caches
Conditions for Deadlock
1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.
2. Hold-and-wait: a thread holds one resource while it waits for another.
3. No preemption: resources are released only voluntarily, after completion.
4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.
ALL four conditions MUST hold.
Handling Deadlock
• Deadlock prevention
• Deadlock avoidance
• Deadlock detection and recovery
• Ignore
Deadlock
[Figure] (a) Two-process deadlock. (b) n-process deadlock. R1, R2, …, Rn are resources; P1, P2, …, Pn are processes.
Example
Semaphore
Semaphore
• A positive integer operated upon by two operations, P and V.
• The value is the number of units of the resource which are free.
• A binary semaphore has value 0 or 1; a general semaphore can take on positive values other than 0 and 1.
Semaphore
• P and V operations are performed indivisibly.
• P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
• V(s): increments s by 1 to release one of the waiting processes (if any).
Semaphore
• The first process to reach its P(s) operation, or to be accepted, will set the semaphore to 0.
• Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
Semaphore
• When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes is allowed to proceed into the critical section.
Monitor
Disadvantages of semaphores:
• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (deadlock), since these errors happen only if some particular execution sequences take place, and these sequences do not always occur.
• Incorrect use may be caused by an honest programming error or an uncooperative programmer.
Monitor
Example of wrong semaphore use:
Right code: … wait(mutex); critical section; signal(mutex); …
Wrong code: … signal(mutex); critical section; wait(mutex); … — this wrong code violates the mutual exclusion requirement.
Monitor
Example of wrong semaphore use:
Right code: … wait(mutex); critical section; signal(mutex); …
Wrong code: … wait(mutex); critical section; wait(mutex); … — this wrong code causes deadlock.
Faculty of Computer Science & Engineering – HCMC University of Technology
Monitor
• If the programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock will occur.
• If both processes below are simultaneously active, acquiring the two semaphores in opposite orders, a deadlock occurs:
Process P1: … wait(S); wait(Q); critical section; signal(S); signal(Q);
Process P2: … wait(Q); wait(S); critical section; signal(Q); signal(S);
Monitor
• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
• A monitor is an approach to synchronizing operations on a shared resource. A monitor includes:
  - a suite of procedures that provides the only method to access the shared resource;
  - mutual exclusion among these procedures;
  - the variables associated with the shared resource;
  - invariants assumed in order to avoid conflicting events.
Monitor
• A type, or abstract data type, encapsulates private data with public methods to operate on that data.
• A monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.
• The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.
Monitor
• Structure of a monitor type:

monitor monitor_name {
    /* shared variable declarations */
    procedure P1 (...) { ... }
    procedure P2 (...) { ... }
    ...
    procedure Pn (...) { ... }
    initialization_code (...) { ... }
}
Monitor Structure
Monitor Usage
• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
• A monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; additional, "tailor-made" synchronization mechanisms need to be defined.
• Such additional "tailor-made" synchronization uses the condition construct.
Condition Type
• Declaration: condition x, y;
• The only operations that can be invoked on a condition variable are wait() and signal():
  - x.wait(): the process invoking this operation is suspended until another process invokes x.signal();
  - x.signal(): the process invoking this operation resumes exactly one suspended process.
• If no process is suspended, x.signal() has no effect.
Monitor Structure with Condition Types
Condition variables
Condition variables:
• Allow threads to synchronize based upon the actual value of data.
• Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met. This can be a very time-consuming and unproductive exercise, since the thread would be continuously busy in this activity.
• Always used in conjunction with a mutex lock.
Pthread Condition Variables

pthread_cond_t cond;   /* declare a condition variable */

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
— initializes the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialization, the state of the condition variable becomes initialized.

int pthread_cond_signal(pthread_cond_t *cond);
— unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
Pthread Condition Variables (cont.)

int pthread_cond_broadcast(pthread_cond_t *cond);
— unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
— destroys the given condition variable specified by cond; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
— block on a condition variable; the second form allows a timeout to be specified.
Sequence for Using a Condition Variable — Example
This simple example code demonstrates the use of several Pthread condition variable routines. The main routine creates three threads. Two of the threads perform work and update a count variable. The third thread waits until the count variable reaches a specified value.
Sequence for Using a Condition Variable — Example (cont.)

/* main */
#include <pthread.h>
int count = 0;                                 /* global var: DECLARE */
pthread_mutex_t count_mutex;
pthread_cond_t count_threshold_cv;
...
pthread_mutex_init(&count_mutex, NULL);        /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
...
pthread_create(...);                           /* CREATE THREADS TO DO WORK */

/* Threads 2 and 3 */
void *inc_count(void *t) {
    ...
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(...);
        ...
        pthread_mutex_unlock(&count_mutex);
        sleep(1);   /* do some work so threads can alternate on the mutex lock */
    }
    pthread_exit(NULL);
}

/* Thread 1 */
void *watch_count(void *t) {
    pthread_mutex_lock(&count_mutex);
    if (count < COUNT_LIMIT)
        pthread_cond_wait(...);
    count += 125;
    pthread_mutex_unlock(...);
    pthread_exit(NULL);
}
The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages, as in a message-passing environment.
Creating Shared Data
• Each process has its own virtual address space within the virtual memory management system.
• Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
• shmget(): creates a shared memory segment; the return value is the shared memory ID.
• shmat(): attaches the shared segment to the data segment of the calling process; returns the starting address of the segment.
Accessing Shared Data
• Accessing shared data needs careful control if the data is ever altered by a process.
• Problem: CONFLICT.
  - Reading the variable by different processes does not cause a conflict.
  - But writing a new value does.
• Example: consider two processes, each of which is to add 1 to a shared data item x:

Time        Process 1        Process 2
x = x + 1:  read x           read x
            compute x + 1    compute x + 1
            write to x       write to x
[Figure] Conflict in accessing shared variable x: both processes read x, both add 1, and both write the result back, so one of the increments is lost.
• The problem of accessing shared data can be generalized by considering shared resources.
Mechanism:
• A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.
• This mechanism is mutual exclusion.
Lock
• The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
• A lock is a variable containing the value 0 or 1:
  - lock = 1: a process has entered the critical section;
  - lock = 0: no process is in the critical section.
• The lock operates much like a door lock: suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section; it now has to wait until it is allowed to enter the critical section.

while (lock == 1) do_nothing;   /* no operation in while loop */
lock = 1;                        /* enter critical section */
/* ... critical section ... */
lock = 0;                        /* leave critical section */

Such a lock is called a spin lock.
This mechanism is known as busy waiting.
• In some cases it may be possible to deschedule the process from the processor and schedule another process instead:
  - there is overhead in saving and restoring process information;
  - it is necessary to choose the best or highest-priority process to enter the critical section.

Process 1:
while (lock == 1) do_nothing;
lock = 1;
/* critical section */
lock = 0;

Process 2:
while (lock == 1) do_nothing;   /* spins while Process 1 holds the lock */
lock = 1;
/* critical section */
lock = 0;
Pthread Lock Routines
• Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).
• They resolve synchronization problems between threads:
  - a mutex is used to share resources among threads in an orderly fashion;
  - it provides mutual exclusion between threads.
• Note: a mutex is only used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.
#include <pthread.h>   /* header containing the mutex functions */

Declaration:
pthread_mutex_t mutex;

Static initialization:
mutex = PTHREAD_MUTEX_INITIALIZER;
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function:
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex. 1:
forall (i = 0; i < 5; i++)
    a[i] = 0;
All instances can be executed simultaneously.

Ex. 2:
forall (i = 2; i < 6; i++) {
    x = i - 2*i + i*i;
    a[i] = a[x];
}
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's Conditions
• I(n) is the set of memory locations read by process P(n).
• O(m) is the set of memory locations altered by process P(m).
• If the three conditions

I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅

are all satisfied, the two processes can be executed concurrently.
Dependency analysis
Example 1: suppose the two statements are (in C):
a = x + y;
b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, so
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅,
and the two statements can be executed simultaneously.
Dependency analysis
Example 2: suppose the two statements are (in C):
a = x + y;
b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, so
I2 ∩ O1 = {a} ≠ ∅,
and the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, a variable might be declared as

shared int x;

(with C/C++, an ordinary global int x plays this role, since globals are shared among the threads of a process)
par Construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

par {
    s1;
    s2;
    ...
    sn;
}

The keyword par indicates that the statements in the body are to be executed concurrently. Multiple concurrent processes or threads could be specified by listing the routines that are to be executed concurrently:

par {
    proc1();
    proc2();
    ...
    procn();
}
forall Construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) {
    s1;
    s2;
    ...
    sm;
}

which generates n processes, each consisting of the statements forming the body of the for loop, s1, s2, …, sm. Each process uses a different value of i. For example:

forall (i = 0; i < 5; i++)
    a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Shared Data in Systems with Caches
Cache coherence protocols:
• In the update policy, copies of data in all caches are updated at the time one copy is altered.
• In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are only updated when the associated processor makes a reference to them.
Shared Data in Systems with Caches
False sharing:
• The key characteristic involved is that caches are organized in blocks of contiguous locations.
• False sharing occurs when different parts of a block are required by different processors, but not the same bytes.
[Figure] False sharing: two processors update different words that happen to lie in the same cache block, so the block is passed back and forth between their caches even though no byte is actually shared.
Shared Data in Systems with Caches
Solutions for false sharing:
• The compiler can alter the layout of the data stored in main memory, separating data only altered by one processor into different blocks.
• The only way to avoid false sharing completely would be to place each element in a different block, which would create significant wastage of storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
• A parallel computer has either a central memory or a distributed memory architecture.
• Distributed memory systems include NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.
• Central memory systems are also known as UMA (uniform memory access) systems.
Central memory versus distributed memory
• In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.
• UMA systems come in two types:
  - PVP: the parallel vector processor, also called a vector supercomputer;
  - SMP: the symmetric multiprocessor.
Distributed-Memory architecture
• A distributed memory computer contains multiple nodes, each having one or more processors and a local memory.
• Memories in other nodes are called remote memories.
• Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.
Distributed-Memory architecture
In a NORMA machine:
• the node memories have separate address spaces;
• a node can't directly access remote memory;
• the only way to access remote data is by passing messages.
Distributed-Memory architecture
In an NCC-NUMA machine:
• a typical example is the Cray T3E;
• besides the local memory, each node has a set of node-level registers called E-registers;
• other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-Memory architecture
In a COMA machine:
• all local memories are structured as caches (called COMA caches);
• a COMA cache has much larger capacity than the level-2 cache or the remote cache of a node;
• COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA COMA
• An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache coherency support in hardware.
• It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
• CC-NUMA: main memory consists of all the local memories.
• COMA: main memory consists of all the COMA caches.
• All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of the Five Distributed-Memory Architectures
LOGO
Programming with shared memory
Shared memory multiprocessor1
Constructs for specifying parallelism2
Creating Concurrent Processes
Threads2
Sharing data3
Programming with shared memory
Creating Shared Data
Accessing Shared Data
Locks Deadlock
Semaphores Monitor Condition Variables
Language Constructs for Parallelism
Dependency Analysis4
Shared Data in system with caches
Shared Memory Multiprocessors
bull In a shared memory system any memory location can be accessible by any of the processors
bull A single address space exists meaning that each memory location is given a unique address within a single range of addresses
bull Shared-memory behavior is determined by both program order and memory access order
Shared memory multiprocessor
Shared Memory Multiprocessors
bull For a small number of processors a common architecture is the single bus architecture in which all processors and memory modules attach to the same set of wires(the bus)
Shared memory multiprocessor
Shared Memory Multiprocessors
Shared memory multiprocessor
ln
l5l4l3l2l1
Memory
l5 l5l4l3l2l1
l6J5
J4
J3
J2
J1
K5
K4
K3
K2
K1
K6
Programorder(PO)
PO1 PO2POn
Share Memory(A global memory oder)
Switch
(a) A uniprocessor system (b) A multiprocessor system
Constructs for Specifying Parallelism
Creating Concurrent Processes1
Threads2
Constructs for specifying Parallelism
Creating Concurrent Proceses
bull A structure to specifying concurrent processes is the FORK ndash JOIN group statements
bull A FORK statement generates one new path for a concurrent process and the concurrent process use JOIN statements at their ends
bull When JOIN statements have been reached processing continues in a sequential fashion
Constructs for specifying Parallelism
Creating Concurrent Proceses(cont)
Constructs for specifying Parallelism
UNIX Heavyweight Processes
bull Operating systems such as UNIX are based upon the notion of a process
bull On a single processor system the processor has to be time shared between processes switching from one process to another
bull Time sharing also offer the opportunity to deschedule processes that are blocked from proceeding for some reason such as waiting for an IO operation to complete
bull On a multiprocessors there is an opportunity to execute process truly concurrently
Constructs for specifying Parallelism
UNIX Heavyweight Processes
bull The UNIX system call fork() creates a new process The new process (child process) is an exact copy of the calling process except that it has a unique process ID
bull On success fork() returns 0 to the child process ang returns the process ID of the child process to the parent process
bull Process are ldquojoinedrdquo with the system calls wait() and exit() defined as
wait(statusp)delays caller until signal received or one of
itshellipchild process terminates or stophellip
exit(status)terminates a process
Constructs for specifying Parallelism
UNIX Heavyweight Processes
bull Hence a single child process can be created by
pid = fork() fork
hellip Code to be excuted by both child and parenthellip
if (pid == 0) exit(0)else wait(0) join
Constructs for specifying Parallelism
UNIX Heavyweight Processes
bull If the child is to execute different code we could use
pid = fork() if (pid == 0)
hellip code to be executed by slave hellip else
hellip code to be executed by parent hellipif (pid == 0) exit (0) else wait (0)
Constructs for specifying Parallelism
UNIX Heavyweight Processes
• All variables in the original program are duplicated in each process, becoming local variables for the process. They are initially assigned the same values as the original variables.
• The parent will wait for the slave to finish if it reaches the "join" point first; if the slave reaches the "join" point first, it will terminate.
Constructs for specifying Parallelism
Threads
Thread: a thread of execution is a fork of a computer program into two or more concurrently running tasks.
Thread mechanism: allows the tasks to share the same memory space & global variables.
Constructs for specifying Parallelism
Dependence: processes are typically independent, while threads exist as subsets of a process.
State information: processes carry considerable state information, whereas multiple threads within a process share state as well as memory and other resources.
Address space: processes have separate address spaces, whereas threads share their address space.
Interaction: processes interact only through system-provided inter-process communication mechanisms, while threads interact through shared variables or by message passing.
Context switching: context switching between threads in the same process is typically faster than context switching between processes.
Processes & Threads
[Diagram: a process contains code, heap, files, interrupt routines, an instruction pointer (IP), and a stack; each thread within a process has its own IP and stack while sharing the process's code, heap, files, and interrupt routines.]
Constructs for specifying Parallelism
wwwthemegallerycom Company Logo
Multithreaded Processor Model
Analyze performance of the system:
- Latency (L): communication latency experienced with a remote memory access (network delay, cache-miss penalty, delays caused by contention in split transactions).
- Number of threads (N): the number of threads that can be interleaved in a processor. The context of a thread = PC, register set, required context status word, …
- Context-switch overhead (C): time lost in performing a context switch in a processor. The switch mechanism determines the number of processor states needed to maintain active threads.
- Interval between context switches: the run length (cycles between context switches triggered by remote references).
Multithreaded Computation
[Diagram: threads of a parallel computation operating on shared variables, with initial scheduling overhead and thread-synchronization overhead between the threads.]
The concept of multithreading in an MPP system.
Processor states: Busy (doing useful work); Context switch (suspending the current context & switching to another); Idle (when all available contexts are suspended/blocked).
Efficiency = busy / (busy + switching + idle)
Abstract Processor Model
Multiple-context processor model with one thread per context
[Diagram: N contexts, each holding one thread's context (its own PC and PSW); the contexts share the register files and the ALU, and memory references are either local or remote.]
Context-switching policies
Switch on cache miss: switch when encountering a cache miss.
Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.
Switch on every instruction: switch on every instruction, independent of whether or not it is a load.
Switch on a block of instructions: improves the cache-hit ratio due to preservation of some locality & also benefits single-context performance.
Pthread Thread
History: SUN Solaris, Windows NT, … are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.
Standard: IEEE POSIX 1003.1c standard (1995).
Constructs for specifying Parallelism
Executing a Pthread Thread
int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);
Return value:
- success (a new thread is created): 0, and *thread contains the new thread's ID;
- failure: a nonzero error code is returned.
Arguments:
- thread: a pointer of type pthread_t *; receives the ID of the new thread;
- attr: the initial attributes for the thread; if attr = NULL, the attributes are initialized to default values;
- start_routine: a reference to a function defined by the user; this function contains the code to be executed by the new thread;
- arg: a single argument passed to start_routine.
pthread_t thread;   /* handle of the special Pthread data type */
Executing a Pthread Thread (cont.)
pthread_exit(void *status): terminates & destroys a thread.
pthread_cancel(): a thread is cancelled (destroyed) by another thread.
int pthread_join(pthread_t th, void **thread_return): pthread_join() forces the calling thread to suspend its execution & wait until the thread with the given thread ID terminates. *thread_return contains the return value (the value of a return statement or of a pthread_exit(…) statement).
Detached threads: there are cases in which threads can be terminated without the need for pthread_join().
When detached threads terminate, they are destroyed & their resources are released
=> more efficient.
[Diagram: the main program issues pthread_create() calls to spawn threads; each thread runs and reaches its termination independently.]
Constructs for specifying Parallelism
Thread Pools
A master thread could control a collection of slave threads. A work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.
Constructs for specifying Parallelism
Thread-safe routines
System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.
Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.
Constructs for specifying Parallelism
Thread-safe routines (cont.)
Suppose that your application creates several threads, each of which makes a call to the same library routine:
This library routine accesses/modifies a global structure or location in memory.
As each thread calls this routine, it is possible that they may try to modify this global structure/memory location at the same time.
If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.
Constructs for specifying Parallelism
Sharing Data
1. Creating Shared Data
2. Accessing Shared Data
   Locks
   Deadlock
   Semaphores
   Monitors
   Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches
Sharing Data
Conditions for Deadlock
1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.
2. Hold-and-wait: a thread holds one resource while it waits for another.
3. No preemption: resources are released only voluntarily, after completion.
4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graphs.
ALL four conditions MUST hold.
wwwthemegallerycom Company Logo
Handling Deadlock
- Deadlock prevention
- Deadlock avoidance
- Deadlock detection and recovery
- Ignore the problem
Deadlock
[Figure: (a) deadlock between two processes; (b) n-process deadlock. R1, R2, …, Rn are resources; P1, P2, …, Pn are processes.]
Example
Semaphore
Semaphore
A non-negative integer operated upon by two operations, P & V. Its value is the number of units of the resource which are free. A binary semaphore has value 0 or 1; a general semaphore can take on positive values other than 0 and 1.
Semaphore
P & V operations are performed indivisibly:
P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
V(s): increments s by 1 to release one of the waiting processes (if any).
Semaphore
The first process to reach its P(s) operation (or to be accepted) will set the semaphore to 0. Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
Semaphore
When the process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes is allowed to proceed into the critical section.
Monitor
Disadvantages of semaphores:
• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.
• Incorrect use may be caused by an honest programming error or an uncooperative programmer.
Monitor
Example of incorrect semaphore use:
Correct code: … wait(mutex); critical section; signal(mutex); …
Wrong code: … signal(mutex); critical section; wait(mutex); … — this incorrect code violates mutual exclusion.
Monitor
Example of incorrect semaphore use:
Correct code: … wait(mutex); critical section; signal(mutex); …
Wrong code: … wait(mutex); critical section; wait(mutex); … — this incorrect code causes deadlock.
Khoa Khoa Học & Kĩ thuật Máy tính - Đại học Bách Khoa TpHCM
Monitor
• If the programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock will occur.
• If both processes are simultaneously active, the following causes a deadlock (each holds one semaphore while waiting for the other):
Process P1: … wait(S); wait(Q); critical section; signal(S); signal(Q); …
Process P2: … wait(Q); wait(S); critical section; signal(Q); signal(S); …
Monitor
• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
• A monitor is an approach to synchronizing operations on a shared resource. A monitor comprises: a suite of procedures that provide the only method of accessing the shared resource; mutual exclusion among those procedures; the variables associated with the shared resource; and invariants that are assumed in order to avoid conflicting accesses.
Monitor
• A type, or abstract data type, encapsulates private data with public methods to operate on that data.
• The monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.
• The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.
Monitor
• The structure of a monitor type:
monitor monitor_name {
    /* shared variable declarations */
    procedure P1 (…) { … }
    procedure P2 (…) { … }
    …
    procedure Pn (…) { … }
    initialization_code (…) { … }
}
Structure of a monitor
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Usage of a monitor
• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
• The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; additional "tailor-made" synchronization mechanisms need to be defined.
• Some additional "tailor-made" synchronization uses the condition construct.
Condition type
• Declaration: condition x, y;
• The only operations that can be invoked on a condition variable are wait() and signal():
x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
x.signal(): the process invoking this operation resumes exactly one suspended process.
• If no process is suspended, x.signal() has no effect.
Structure of a monitor with condition variables
Condition variables
Condition variables allow threads to synchronize based upon the actual value of data. Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met. This can be a very time-consuming & unproductive exercise, since the thread would be continuously busy in this activity. A condition variable is always used in conjunction with a mutex lock.
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t *cond);
Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
Initialises the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.
pthread_cond_t cond;   /* declare a condition variable */
Sharing Data
Pthread Condition Variables (cont.)
int pthread_cond_broadcast(pthread_cond_t *cond);
Unblocks all threads currently blocked on the specified condition variable cond.
int pthread_cond_destroy(pthread_cond_t *cond);
Destroys the given condition variable specified by cond; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
Block on a condition variable; the second call allows a timeout to be specified.
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines. The main routine creates three threads. Two of the threads perform work and update a count variable. The third thread waits until the count variable reaches a specified value.
Sharing Data
Sequence for using condition variable - example (cont.)

main:
#include <pthread.h>
int count = 0;                            /* global variable */
pthread_mutex_t count_mutex;              /* DECLARE */
pthread_cond_t count_threshold_cv;
…
pthread_mutex_init(&count_mutex, NULL);   /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
…
pthread_create(…);                        /* CREATE THREADS TO DO WORK */

Threads 2 & 3:
void *inc_count(void *t) {
    …
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(…);
        …
        pthread_mutex_unlock(&count_mutex);
        sleep(1);   /* do some work so the threads can alternate on the mutex lock */
    }
    pthread_exit(NULL);
}

Thread 1:
void *watch_count(void *t) {
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)     /* re-check after waking: wakeups can be spurious */
        pthread_cond_wait(…);
    count += 125;
    pthread_mutex_unlock(…);
    pthread_exit(NULL);
}
Sharing Data
The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual memory management system.
Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
shmget(): creates a shared memory segment; the return value is the shared memory ID.
shmat(): attaches the shared segment to the data segment of the calling process and returns the starting address of the segment.
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
The problem: conflict.
o Reading the variable by different processes does not cause a conflict.
o But writing new values can cause a conflict.
Example: consider two processes, each of which is to add 1 to a shared data item x.
Instruction: x = x + 1
Process 1: read x; compute x + 1; write to x
Process 2: read x; compute x + 1; write to x
(time flows downward; the two sequences can interleave)
Conflict in accessing shared data
[Diagram: both processes read the shared variable x, each computes x + 1, and both write back — one of the two updates is lost.]
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish the sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.
This mechanism: mutual exclusion.
Lock
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
A lock is a variable containing the value 0 or 1:
lock = 1: a process has entered the critical section;
lock = 0: no process is in the critical section.
The lock operates much like a door lock.
Suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section. It now has to wait until it is allowed to enter the critical section:
while (lock == 1) do_nothing;   /* no operation in while loop */
lock = 1;                       /* enter critical section */
… critical section …
lock = 0;                       /* leave critical section */
Such a lock is called a spin lock.
This mechanism is busy waiting. In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead, but this incurs:
- overhead in saving and restoring process information;
- the need to choose the best or highest-priority process to enter the critical section.
Process 1: while (lock == 1) do_nothing; lock = 1; /* critical section */ lock = 0;
Process 2: while (lock == 1) do_nothing; lock = 1; /* critical section */ lock = 0;
(Process 2 spins in its while loop while Process 1 holds the lock.)
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual-exclusion lock variables (mutexes), which resolve synchronization problems between threads.
A mutex is used to share resources among threads in turn, and provides mutual exclusion between threads.
Note: a mutex is only used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.
#include <pthread.h>   /* the header containing the mutex functions */
Declaration: pthread_mutex_t mutex;
Static initialization:
mutex = PTHREAD_MUTEX_INITIALIZER;
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
The important functions:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together. Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex1: forall (i = 0; i < 5; i++) a[i] = 0;
All instances can be executed simultaneously.
Ex2: forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's conditions
- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered by process P(m).
If the three conditions
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
are all satisfied, the two processes can be executed concurrently.
Dependency analysis
Example 1: suppose the two statements are (in C):
a = x + y;
b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}.
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅ → the two statements can be executed simultaneously.
Dependency analysis
Example 2: suppose the two statements are (in C):
a = x + y;
b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}.
I2 ∩ O1 = {a} ≠ ∅ → the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared as:
shared int x;
(in contrast with an ordinary global C/C++ declaration: int x;)
par Construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
par { s1; s2; …; sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads could be specified by listing the routines that are to be executed concurrently:
par { proc1(); proc2(); …; procn(); }
forall Construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
forall (i = 0; i < n; i++) { s1; s2; …; sm; }
which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, …, sm. Each process uses a different value of i. For example:
forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Shared Data in Systems with Caches
Cache coherence protocols:
- In the update policy, copies of data in all caches are updated at the time one copy is altered.
- In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor makes a reference to the data.
Shared Data in Systems with Caches
False sharing: the key characteristic here is that caches are organized in blocks of contiguous locations. False sharing occurs when different processors require different parts of the same block, but not the same bytes.
Shared Data in Systems with Caches
Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered by only one processor into different blocks.
The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote memory access) architectures.
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA system, all memory locations are at an equal distance from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types:
- PVP, the parallel vector processor (also called a vector supercomputer);
- SMP, the symmetric multiprocessor.
Distributed-Memory architecture
A distributed memory computer contains multiple nodes, each having one or more processors and a local memory.
Memories in other nodes are called remote memories.
Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.
Distributed-Memory architecture
In a NORMA machine:
- the node memories have separate address spaces;
- a node can't directly access remote memory;
- the only way to access remote data is by passing messages.
Distributed-Memory architecture
In an NCC-NUMA machine:
- a typical example is the Cray T3E;
- besides the local memory, each node has a set of node-level registers called E-registers;
- other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-Memory architecture
In a COMA machine:
- all local memories are structured as caches (called COMA caches);
- such a cache has much larger capacity than the level-2 cache or the remote cache of a node;
- COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesn't have hardware support for cache coherence, but CC-NUMA and COMA provide cache-coherence support in hardware.
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
CC-NUMA: main memory consists of all the local memories.
COMA: main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of the five distributed-memory architectures
[Table not reproduced.]
LOGO
Sharing data3
Programming with shared memory
Creating Shared Data
Accessing Shared Data
Locks Deadlock
Semaphores Monitor Condition Variables
Language Constructs for Parallelism
Dependency Analysis4
Shared Data in system with caches
Shared Memory Multiprocessors
bull In a shared memory system any memory location can be accessible by any of the processors
bull A single address space exists meaning that each memory location is given a unique address within a single range of addresses
bull Shared-memory behavior is determined by both program order and memory access order
Shared memory multiprocessor
Shared Memory Multiprocessors
bull For a small number of processors a common architecture is the single bus architecture in which all processors and memory modules attach to the same set of wires(the bus)
Shared memory multiprocessor
Shared Memory Multiprocessors
Shared memory multiprocessor
ln
l5l4l3l2l1
Memory
l5 l5l4l3l2l1
l6J5
J4
J3
J2
J1
K5
K4
K3
K2
K1
K6
Programorder(PO)
PO1 PO2POn
Share Memory(A global memory oder)
Switch
(a) A uniprocessor system (b) A multiprocessor system
Constructs for Specifying Parallelism
Creating Concurrent Processes1
Threads2
Constructs for specifying Parallelism
Creating Concurrent Proceses
bull A structure to specifying concurrent processes is the FORK ndash JOIN group statements
bull A FORK statement generates one new path for a concurrent process and the concurrent process use JOIN statements at their ends
bull When JOIN statements have been reached processing continues in a sequential fashion
Constructs for specifying Parallelism
Creating Concurrent Proceses(cont)
Constructs for specifying Parallelism
UNIX Heavyweight Processes
bull Operating systems such as UNIX are based upon the notion of a process
bull On a single processor system the processor has to be time shared between processes switching from one process to another
bull Time sharing also offer the opportunity to deschedule processes that are blocked from proceeding for some reason such as waiting for an IO operation to complete
bull On a multiprocessors there is an opportunity to execute process truly concurrently
Constructs for specifying Parallelism
UNIX Heavyweight Processes
bull The UNIX system call fork() creates a new process The new process (child process) is an exact copy of the calling process except that it has a unique process ID
bull On success fork() returns 0 to the child process ang returns the process ID of the child process to the parent process
bull Process are ldquojoinedrdquo with the system calls wait() and exit() defined as
wait(statusp)delays caller until signal received or one of
itshellipchild process terminates or stophellip
exit(status)terminates a process
Constructs for specifying Parallelism
UNIX Heavyweight Processes
bull Hence a single child process can be created by
pid = fork() fork
hellip Code to be excuted by both child and parenthellip
if (pid == 0) exit(0)else wait(0) join
Constructs for specifying Parallelism
UNIX Heavyweight Processes
bull If the child is to execute different code we could use
pid = fork() if (pid == 0)
hellip code to be executed by slave hellip else
hellip code to be executed by parent hellipif (pid == 0) exit (0) else wait (0)
Constructs for specifying Parallelism
UNIX Heavyweight Processes
bull All variables in the original program are duplicated in each process becoming local variables for the process They are assigned the same values as the original variables initially
bull The parent will wait for the slave to finish if it reaches the ldquojoinrdquo point first if the slave reaches the ldquojoinrdquo point first it will terminate
Constructs for specifying Parallelism
Threads
Thread a thread of execution is a fork of a
computer program into two or more concurrently running tasks
Thread mechanism Allow to share the same memory space amp
global variables
Constructs for specifying Parallelism
Context Switching
Interaction
Address Space
State Infomation
Dependence bull processes are typically independent while threads exist as subsets of a process
bull processes carry considerable state information where multiple threads within a process share state as well as memory and other resources
bull processes have separate address spaces where threads share their address space
bullprocesses interact only through system-provided inter-process communication mechanisms while threads interact through shared variable or by message passing
bullContext switching between threads in the same process is typically faster than context switching between processes
Processes amp Threads
Interupt Routines
File
IP
Code Heap
Stack
IPStackInterupt Routines
File
IP
Code Heap
Stack Thread
Process
Constructs for specifying Parallelism
wwwthemegallerycom Company Logo
Multithreaded Processor Model
Analyze performance of system Latency L communication latency
experienced with remote memory access network delay cache-miss penalty delays caused by
contentions in split transactions
Number of threads N Number of thread that can be interleaved in a processor
Context of a thread =PCregister set required context status word hellip
Context switch overhead C time lost in performing context switch in a processor
Switch mechanism number of processor states needed to maintain active threads
Interval between context switches run length (cycles between context switch triggered by remote reference)
Multithreaded Computation
Initial Scheduling overhead Thread Synchronization overhead
Thread of Parallel Computation
Variable
Computation
The concept of multithreading in MPP system
Processor efficiency Busy do useful work Context switch suspend current
context amp switch to another Idle when all availble context
suspended (blocked)
Efficient = Busy (busy + switching + idle)
Abtract Processor Model
wwwthemegallerycom Company Logo
Multiple-context processor model with one thread per context
PC
PSW
PC
PSW
PC
PSW
ALU Local memory reference
Remote memory reference
Register Files
N Contexts
1 Thread context
Context-switching policies
wwwthemegallerycom Company Logo
Switch on cache miss when encoutering a cache miss
Switch on every load switching on every load operation independent of whether it will cause a miss or not
Switch on every instruction switching on every instruction insependent of whether or not it is a load
Switch on block of instruction will improve the cache-hit ratio due to preservation of some locality amp also benefit singe-context performance
Pthread Thread
History SUN Solaris Window NT hellip are examples the multithreaded operating systems allow users to employ threads in their programsbut each system is different
StandardIEEE POSIX 10031c
standard (1995)
Constructs for specifying Parallelism
Executing a Pthread Thread
int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg);
Return value: on success a new thread is created, the function returns 0, and *thread contains the new thread's ID; on failure a nonzero error code is returned.
Arguments:
• thread: a pointer of type pthread_t that receives the ID of the new thread.
• attr: initial attributes for the thread; if attr is NULL, the attributes are initialized to default values.
• start_routine: a reference to a user-defined function containing the code the new thread executes.
• arg: a single argument passed to start_routine.
pthread_t is the special Pthread data type used as a thread handle.
Executing a Pthread Thread (cont.)
pthread_exit(void *status): terminates & destroys the calling thread.
pthread_cancel(): a thread is destroyed by another thread.
int pthread_join(pthread_t th, void **thread_return);
pthread_join() forces the calling thread to suspend its execution & wait until the thread with the given ID terminates; *thread_return receives the return value (the value of a return statement or of pthread_exit(…)).
Detached Threads
There are cases in which threads can be terminated without the need for pthread_join().
When detached threads terminate, they are destroyed & their resources released.
=> More efficient.
(Figure: the main program issues pthread_create() calls; each detached thread runs and terminates independently, with no join.)
Constructs for specifying Parallelism
Thread Pools
A master thread could control a collection of slave threads. A work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.
Constructs for specifying Parallelism
Thread-safe routines
System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.
Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.
Constructs for specifying Parallelism
Thread-safe routines (cont.)
Suppose that your application creates several threads, each of which makes a call to the same library routine:
• This library routine accesses/modifies a global structure or location in memory.
• As each thread calls this routine, it is possible that they may try to modify this global structure/memory location at the same time.
• If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.
Constructs for specifying Parallelism
Sharing Data
1. Creating Shared Data
2. Accessing Shared Data
• Locks
• Deadlock
• Semaphores
• Monitor
• Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches
Sharing Data
Conditions for Deadlock
1. Mutual exclusion • the resource cannot be shared • requests are delayed until the resource is released
2. Hold-and-wait • a thread holds one resource while it waits for another
3. No preemption • resources are released only voluntarily, after completion
4. Circular wait • circular dependencies exist in the "waits-for" or "resource-allocation" graphs
ALL four conditions MUST hold.
Handling Deadlock
• Deadlock prevention
• Deadlock avoidance
• Deadlock detection and recovery
• Ignore it
Deadlock
(Figure (a): two-process deadlock; figure (b): n-process deadlock. R1, R2, …, Rn are resources; P1, P2, …, Pn are processes.)
Example
Semaphore
Semaphore
• A positive integer operated upon by two operations, P & V.
• The value is the number of units of the resource which are free.
• A binary semaphore has value 0 or 1; a general semaphore can take on positive values other than 0 and 1.
Semaphore
• P & V operations are performed indivisibly.
• P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
• V(s): increments s by 1 to release one of the waiting processes (if any).
Semaphore
• The first process to reach its P(s) operation and be accepted will set the semaphore to 0.
• Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
Semaphore
• When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes is allowed to proceed into the critical section.
Monitor
Disadvantages of Semaphores
• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.
• Incorrect use may be caused by an honest programming error or an uncooperative programmer.
Monitor
Example: wrong semaphore use
Right code: … wait(mutex); critical section; signal(mutex); …
Wrong code: … signal(mutex); critical section; wait(mutex); …
This wrong code violates the mutual-exclusion condition.
Monitor
Example: wrong semaphore use
Right code: … wait(mutex); critical section; signal(mutex); …
Wrong code: … wait(mutex); critical section; wait(mutex); …
This wrong code causes deadlock.
Faculty of Computer Science & Engineering - Ho Chi Minh City University of Technology
Monitor
• If the programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock will occur.
• With the two processes below simultaneously active, a deadlock can occur:
Process P1: … wait(S); wait(Q); critical section; signal(S); signal(Q);
Process P2: … wait(Q); wait(S); critical section; signal(Q); signal(S);
Monitor
• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
• A monitor is an approach to synchronizing operations on a computer when a shared resource is used. A monitor includes:
• a suite of procedures that provides the only method to access the shared resource;
• mutual exclusion between those procedures;
• the variables associated with the shared resource;
• some invariants, assumed to hold, that prevent conflicting events.
Monitor
• A type, or abstract data type, encapsulates private data with public methods to operate on that data.
• The monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.
• The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of procedures or functions that operate on those variables.
Monitor
• The structure of a monitor type:
monitor monitor_name {
    /* shared variable declarations */
    procedure P1(…) { … }
    procedure P2(…) { … }
    …
    procedure Pn(…) { … }
    initialization_code(…) { … }
}
Structure of a monitor (figure)
Usage of monitors
• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
• The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; additional "tailor-made" synchronization mechanisms need to be defined.
• These additional "tailor-made" synchronization mechanisms use the condition construct.
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure of a monitor with condition variables (figure)
Condition Variables
• Condition variables allow threads to synchronize based upon the actual value of data.
• Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met. This can be a very time-consuming & unproductive exercise, since the thread would be continuously busy in this activity.
• A condition variable is always used in conjunction with a mutex lock.
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t *cond);
This call unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
Initializes the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialization, the state of the condition variable becomes initialized.
pthread_cond_t cond; declares a condition variable.
Sharing Data
Pthread Condition Variables (cont.)
int pthread_cond_broadcast(pthread_cond_t *cond);
This call unblocks all threads currently blocked on the specified condition variable cond.
int pthread_cond_destroy(pthread_cond_t *cond);
Destroys the given condition variable specified by cond; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
Both block on a condition variable; the second one allows a timeout to be specified.
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
#include <pthread.h>

int count = 0;                              /* global variable: DECLARE */
pthread_mutex_t count_mutex;
pthread_cond_t count_threshold_cv;
…
pthread_mutex_init(&count_mutex, NULL);     /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
…
pthread_create(…);                          /* CREATE THREADS TO DO WORK (main) */

/* Threads 2 & 3 */
void *inc_count(void *t) {
    …
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(…);
        …
        pthread_mutex_unlock(&count_mutex);
        sleep(1);   /* do some work so threads can alternate on the mutex lock */
    }
    pthread_exit(NULL);
}

/* Thread 1 */
void *watch_count(void *t) {
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)
        pthread_cond_wait(…);
    count += 125;
    pthread_mutex_unlock(…);
    pthread_exit(NULL);
}
Sharing Data
The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass data in messages as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual memory management system.
Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
shmget(): creates a shared memory segment; the return value is the shared memory ID.
shmat(): attaches the shared segment to the data segment of the calling process; returns the starting address of the segment.
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
The problem: CONFLICT.
• Reading the variable by different processes does not cause a conflict.
• But writing new values does.
Example: consider two processes, each of which is to add 1 to a shared data item x. For each process, the instruction x = x + 1 decomposes into:
(time ↓)   Process 1        Process 2
           read x           read x
           compute x + 1    compute x + 1
           write to x       write to x
(Figure: conflict in accessing shared data - both processes read the shared variable x, add 1, and write back; one of the two updates is lost.)
The problem of accessing shared data can be generalized by considering shared resources.
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.
This mechanism is called mutual exclusion.
Lock
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
A lock is a variable containing the value 0 or 1:
• lock = 1: a process has entered the critical section;
• lock = 0: no process is in the critical section.
The lock operates much like a door lock: suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section. It now has to wait until it is allowed to enter.
while (lock == 1) ;   /* do nothing: no operation in the while loop */
lock = 1;             /* enter critical section */
… critical section …
lock = 0;             /* leave critical section */
Such a lock is called a spin lock.
This mechanism is known as busy waiting.
In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead, but this incurs:
• overhead in saving and restoring process information;
• the need to choose the best or highest-priority process to enter the critical section.
Process 1: while (lock == 1) ; lock = 1; critical section; lock = 0;
Process 2: while (lock == 1) ; lock = 1; critical section; lock = 0;
(Process 2 spins in its while loop until Process 1 sets lock back to 0.)
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes), which resolve synchronization problems between threads.
• A mutex is used to share resources among threads in an orderly way.
• It provides mutual exclusion between threads.
Note: a mutex is only used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.
#include <pthread.h>   /* the header containing the mutex functions */
Declare a variable: pthread_mutex_t mutex;
Static initialization:
mutex = PTHREAD_MUTEX_INITIALIZER;
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex. 1: forall (i = 0; i < 5; i++) a[i] = 0;
All instances can be executed simultaneously.
Ex. 2: forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's conditions
• I(n) is the set of memory locations read by process P(n).
• O(m) is the set of memory locations altered by process P(m).
If the following three conditions are all satisfied, the two processes can be executed concurrently:
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
Dependency analysis
Example 1: suppose the two statements are (in C):
a = x + y;
b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, so
I1 ∩ O2 = I2 ∩ O1 = O1 ∩ O2 = ∅
and the two statements can be executed simultaneously.
Dependency analysis
Example 2: suppose the two statements are (in C):
a = x + y;
b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, so
I2 ∩ O1 = {a} ≠ ∅
and the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared as
shared int x;
(in C/C++, a global int x plays this role).
par construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
par { s1; s2; …; sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
par { proc1(); proc2(); …; procn(); }
forall construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
forall (i = 0; i < n; i++) { s1; s2; …; sm; }
which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, …, sm. Each process uses a different value of i.
Example:
forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Shared Data in Systems with Caches
Cache coherence protocols:
• In the update policy, copies of data in all caches are updated at the time one copy is altered.
• In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are only updated when the associated processor makes reference to them.
Shared Data in Systems with Caches
False sharing:
• The key characteristic involved is that caches are organized in blocks of contiguous locations.
• Different parts of a block may be required by different processors, but not the same bytes.
Share Data in Systems with Caches
Shared Data in Systems with Caches
Solutions for false sharing:
• The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.
• The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central memory or a distributed memory architecture.
Distributed memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote memory access) architectures.
Central memory systems are also known as UMA (uniform memory access) systems.
Central memory versus distributed memory
In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types:
• PVP: the parallel vector processor, also called a vector supercomputer;
• SMP: the symmetric multiprocessor.
Distributed-Memory architecture
A distributed memory computer contains multiple nodes, each having one or more processors and a local memory.
Memories in other nodes are called remote memories.
Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.
Distributed-Memory architecture
Distributed-Memory architecture
In a NORMA machine:
• the node memories have separate address spaces;
• a node cannot directly access remote memory;
• the only way to access remote data is by passing messages.
Distributed-Memory architecture
In an NCC-NUMA machine:
• a typical example is the Cray T3E;
• besides the local memory, each node has a set of node-level registers called E-registers;
• other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-Memory architecture
In a COMA machine:
• all local memories are structured as caches (called COMA caches);
• such a cache has a much larger capacity than the level-2 cache or the remote cache of a node;
• COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA and COMA
An NCC-NUMA system does not have hardware support for cache coherency, but CC-NUMA and COMA provide cache coherency support in hardware.
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
CC-NUMA: main memory consists of all the local memories.
COMA: main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of the five distributed-memory architectures (comparison table)
Interval between context switches run length (cycles between context switch triggered by remote reference)
Multithreaded Computation
Initial Scheduling overhead Thread Synchronization overhead
Thread of Parallel Computation
Variable
Computation
The concept of multithreading in MPP system
Processor efficiency Busy do useful work Context switch suspend current
context amp switch to another Idle when all availble context
suspended (blocked)
Efficient = Busy (busy + switching + idle)
Abtract Processor Model
wwwthemegallerycom Company Logo
Multiple-context processor model with one thread per context
PC
PSW
PC
PSW
PC
PSW
ALU Local memory reference
Remote memory reference
Register Files
N Contexts
1 Thread context
Context-switching policies
wwwthemegallerycom Company Logo
Switch on cache miss when encoutering a cache miss
Switch on every load switching on every load operation independent of whether it will cause a miss or not
Switch on every instruction switching on every instruction insependent of whether or not it is a load
Switch on block of instruction will improve the cache-hit ratio due to preservation of some locality amp also benefit singe-context performance
Pthread Thread
History SUN Solaris Window NT hellip are examples the multithreaded operating systems allow users to employ threads in their programsbut each system is different
StandardIEEE POSIX 10031c
standard (1995)
Constructs for specifying Parallelism
Executing a Pthread Thread
int pthread_create(pthread_t thread pthread_attr_t attr
void (start_routine)(void ) void arg) Return value of function1048698success create a new thread 0 arg contain thread ID1048698fail ltgt 0 (error signature is contain in the variable of errno) Argumentsthread a pointer of type pthread_t contain ID of new thread1048698attr contain initial attr for thread if attr = NULL Attr are init by default value1048698start_routine la reference to a function defined by user This function contain
execute code for new thread1048698arg a single argument is passed for start_routine
Pthread_t thread Hanndle of specia Pthread datatype
Executing a Pthread Thread(cont)
pthread_exit(void status) Terminate amp destroy a thread
pthread_cancel() Thread is destroyed by another process
int pthread_join(pthread_t th void thread_return) pthread_join() force call thread suspend its execution amp wait until thread having
thread ID terminates Thread_return contain the return value (value of return statement or pthread_exit(hellip) statement)
Detached ThreadThere are cases in which threads can
be terminated without needed of pthread_join
Detached Thread
When Detached Thread teminate they are destroyed amp their resource released
=gt More efficient
Main program
Pthread_create()
Termination
Thread
Pthread_create()
Pthread_create() Termination
Termination
Thread
Thread
Constructs for specifying Parallelism
Thread Pools
A master thread could control a collection of slave threads A work pool of threads could be formed Threads can communicate through shared location or as we shall see = using signals
Constructs for specifying Parallelism
Thread ndash safe routines
System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results
Thread ndash safeness an applications ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions
Constructs for specifying Parallelism
Thread ndash safe routines (cont)
Suppose that your application creates several threads each of which makes a call to the same library routine
This library routine accessesmodifies a global structure or location in memory
As each thread calls this routine it is possible that they may try to modify this global structurememory location at the same time
If the routine does not employ some sort of synchronization constructs to prevent data corruption then it is not thread-safe
fe
Constructs for specifying Parallelism
Sharing DataCreating Shared Data1
Accessing Shared Data2
Locks
Deadlock
Semaphores
Deadlock
Condition Variables
Language Constructs for Parallelism3
Dependency Analysis4
Shared Data in system with caches5
Sharing Data
Text in here
Text in here
Conditions for Deadlock
1 Mutual exclusion bull Resource can not be shared bull Requests are delayed until resource is released
2 Hold-and-wait bull Thread holds one resource while waits for another
3 No preemption bull Resources are released voluntarily after completion
4 Circular wait bull Circular dependencies exist in ldquowaits-forrdquo or ldquoresource-allocationrdquo graphs
ALL four conditions MUST hold
wwwthemegallerycom Company Logo
Handing Deadlock
Text
Text
Text
Txt
Deadlock prevention Deadlock avoidance
Deadlock detection and recovery
Ignore
wwwthemegallerycom Company Logo
Deadlock
88b n-processes deadlock 88a 2 processes dealock
R1R2hellipRn lagrave tagravei nguyecircn P1P2hellipPn lagrave quaacute trigravenh
Example
LOGO
Semaphore
wwwthemegallerycom Company Logo
sEMaPHoRE
A positive integer operated upon by 2 operations P amp V
The value is the number of the units of the resource which are free
A binary semaphore has value 0 or 1A general semaphore can take on
positive values other than 0 and 1
wwwthemegallerycom Company Logo
sEMaPHoRE
P amp V operations are performed indivisibly
P(s) waits until s is greater than 0 then decrements s by 1 allows the process to continue
V(s) increments s by 1 to release one of the waiting processes (if any)
wwwthemegallerycom Company Logo
sEMaPHoRE
The first process reach its P(s) operation or to be accepted will set the semaphore to 0
Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore
wwwthemegallerycom Company Logo
sEMaPHoRE
When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section
Monitor
Disadvange of Semaphorebull Although semaphore provide a
convenient and effective mechanism for process synchronization using them incorrectly can result in timing error that are difficult to detect (deadlock) since these errors happen only if some particular execution sequences take place and these sequences do not always occur
bull Using incorrectly may be caused by an honest programming error or an uncooperative programmer
Monitor
Example of incorrect semaphore use:
Correct code: ... wait(mutex); critical section; signal(mutex); ...
Wrong code: ... signal(mutex); critical section; wait(mutex); ...
This incorrect code violates mutual exclusion.
Monitor
Example of incorrect semaphore use:
Correct code: ... wait(mutex); critical section; signal(mutex); ...
Wrong code: ... wait(mutex); critical section; wait(mutex); ...
This incorrect code causes deadlock.
Faculty of Computer Science & Engineering - Ho Chi Minh City University of Technology (Đại học Bách Khoa TpHCM)
Monitor
• If the programmer omits the wait() or the signal() around a critical section, or both, then either mutual exclusion is violated or a deadlock occurs.
• Deadlock can also occur when two processes that are simultaneously active acquire two semaphores in opposite orders:
Process P1: ... wait(S); wait(Q); critical section; signal(S); signal(Q); ...
Process P2: ... wait(Q); wait(S); critical section; signal(Q); signal(S); ...
If P1 holds S while P2 holds Q, each waits forever for the semaphore held by the other.
Monitor
• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
• A monitor is an approach to synchronizing operations on a shared resource. A monitor includes:
- a set of procedures that provide the only way to access the shared resource;
- mutual exclusion among those procedures (only one can be active at a time);
- the variables associated with the shared resource;
- invariants (assumptions that must hold) that rule out conflicting accesses.
Monitor
• A type, or abstract data type, encapsulates private data with public methods that operate on that data.
• The monitor type presents a set of programmer-defined operations that are provided with mutual exclusion within the monitor.
• The monitor type also contains the declarations of the variables whose values define the state of an instance of the type, along with the bodies of the procedures or functions that operate on those variables.
Monitor
• The structure of a monitor type:

monitor monitor_name {
    /* shared variable declarations */
    procedure P1(...) { ... }
    procedure P2(...) { ... }
    ...
    procedure Pn(...) { ... }
    initialization_code(...) { ... }
}
Structure of a monitor
Using monitors
• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
• The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; some additional, "tailor-made" synchronization mechanisms need to be defined.
• These additional tailor-made synchronization mechanisms use the condition construct.
Condition type
• Declaration: condition x, y;
• The only operations that can be invoked on a condition variable are wait() and signal():
x.wait(): the process invoking this operation is suspended until another process invokes x.signal();
x.signal(): the process invoking this operation resumes exactly one suspended process.
• If no process is suspended, x.signal() has no effect.
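C has no built-in monitor, but the construct described above is commonly emulated with a Pthreads mutex (playing the role of the monitor's implicit lock) plus condition variables (playing the role of the condition type). A sketch of a one-slot buffer written in that style; the deposit/fetch procedures and variable names are illustrative:

```c
/* Emulating a monitor in C: every "monitor procedure" takes the
 * mutex m on entry and releases it on exit, giving the implicit
 * mutual exclusion; the condition variables give x.wait()/x.signal(). */
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER; /* monitor lock */
static pthread_cond_t full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t empty = PTHREAD_COND_INITIALIZER;
static int slot;          /* monitor state variables */
static int has_item = 0;

void deposit(int v)       /* monitor procedure */
{
    pthread_mutex_lock(&m);              /* enter the monitor */
    while (has_item)
        pthread_cond_wait(&empty, &m);   /* empty.wait() */
    slot = v;
    has_item = 1;
    pthread_cond_signal(&full);          /* full.signal() */
    pthread_mutex_unlock(&m);            /* leave the monitor */
}

int fetch(void)           /* monitor procedure */
{
    pthread_mutex_lock(&m);
    while (!has_item)
        pthread_cond_wait(&full, &m);    /* full.wait() */
    int v = slot;
    has_item = 0;
    pthread_cond_signal(&empty);         /* empty.signal() */
    pthread_mutex_unlock(&m);
    return v;
}
```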
Structure of a monitor with the condition type
Condition variables
Condition variables allow threads to synchronize based upon the actual value of data.
Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be a very time-consuming and unproductive exercise, since the thread would be continuously busy with this activity.
A condition variable is always used in conjunction with a mutex lock.
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t *cond);
unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
initialises the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition-variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.
pthread_cond_t cond;   /* declare a condition variable */
Sharing Data
Pthread Condition Variables (cont.)
int pthread_cond_broadcast(pthread_cond_t *cond);
unblocks all threads currently blocked on the specified condition variable cond.
int pthread_cond_destroy(pthread_cond_t *cond);
destroys the condition variable specified by cond; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
Both block on a condition variable; the second additionally allows a timeout (abstime) to be specified.
Sharing Data
Sequence for using a condition variable - example
This simple example demonstrates several Pthread condition-variable routines. The main routine creates three threads: two of them perform work and update a count variable; the third waits until the count variable reaches a specified value.
Sharing Data
Sequence for using a condition variable - example (cont.)

#include <pthread.h>

int count = 0;                        /* global shared variable (DECLARE) */
pthread_mutex_t count_mutex;
pthread_cond_t count_threshold_cv;
...
/* main: INITIALIZE, then CREATE THREADS TO DO WORK */
pthread_mutex_init(&count_mutex, NULL);
pthread_cond_init(&count_threshold_cv, NULL);
...
pthread_create(...);

/* Threads 2 and 3 */
void *inc_count(void *t) {
    ...
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(&count_threshold_cv);
        ...
        pthread_mutex_unlock(&count_mutex);
        sleep(1);   /* do some work so the threads can alternate on the mutex lock */
    }
    pthread_exit(NULL);
}

/* Thread 1 */
void *watch_count(void *t) {
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)       /* a loop, not an if: guards against spurious wakeups */
        pthread_cond_wait(&count_threshold_cv, &count_mutex);
    count += 125;
    pthread_mutex_unlock(&count_mutex);
    pthread_exit(NULL);
}
Sharing Data
The key aspect of shared-memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages, as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual-memory management system.
Shared-memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
shmget(): creates a shared memory segment; its return value is the shared memory ID.
shmat(): attaches the shared segment to the data segment of the calling process; it returns the starting address of the segment.
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
The problem: conflict.
• Reading the variable from different processes does not cause a conflict.
• But writing new values can cause a conflict.
Example: consider two processes, each of which is to add 1 to a shared data item, x.
Instruction: x = x + 1;
Process 1: read x; compute x + 1; write to x
Process 2: read x; compute x + 1; write to x
(time runs downward; the two sequences can interleave)
Figure: conflict in accessing shared variable x. Both processes read the same value of x, each computes x + 1, and both write the result back, so one of the two increments is lost.
The problem of accessing shared data can be generalized by considering shared resources.
Mechanism: a way of ensuring that only one process accesses a particular resource at a time is to establish the sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.
This mechanism is called mutual exclusion.
Lock
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
A lock is a variable containing the value 0 or 1:
lock = 1: a process has entered the critical section;
lock = 0: no process is in the critical section.
The lock operates much like a door lock: suppose a process reaches a lock that is set, indicating that the process is excluded from the critical section. It then has to wait until it is allowed to enter the critical section.

while (lock == 1) ;   /* do nothing: no operation in the while loop */
lock = 1;             /* enter critical section */
/* ... critical section ... */
lock = 0;             /* leave critical section */

A lock that relies on such a waiting loop is called a spin lock.
Mechanism: busy waiting
In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead, but this incurs overhead in saving and restoring process information, and it then becomes necessary to choose the best or highest-priority process to enter the critical section.

Process 1:                Process 2:
while (lock == 1) ;       while (lock == 1) ;
lock = 1;                 lock = 1;
/* critical section */    /* critical section */
lock = 0;                 lock = 0;
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual-exclusion lock variables (mutexes), which resolve synchronization problems between threads.
A mutex is used to share resources among threads in an orderly way: it provides mutual exclusion between threads.
Note:
A mutex can only be used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.
#include <pthread.h>   /* the header that declares the mutex functions */
Declaration: pthread_mutex_t mutex;
Static initialization:
mutex = PTHREAD_MUTEX_INITIALIZER;
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
The important functions:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency analysis
One of the key issues in all parallel programming is identifying which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex 1: forall (i = 0; i < 5; i++) a[i] = 0;
All instances can be executed simultaneously.
Ex 2: forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's conditions
- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered by process P(m).
Two processes P1 and P2 can be executed concurrently if all three of the following conditions are satisfied:
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
Dependency analysis
Example 1: suppose the two statements are (in C):
a = x + y;
b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, and
I1 ∩ O2 = I2 ∩ O1 = O1 ∩ O2 = ∅,
so the two statements can be executed simultaneously.
Dependency analysis
Example 2: suppose the two statements are (in C):
a = x + y;
b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, and
I2 ∩ O1 = {a} ≠ ∅,
so the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared as:
shared int x;
(By contrast, in C/C++ with threads, a variable declared at global scope, e.g. int x;, is already shared by all threads of the process.)
par construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
par { s1; s2; ...; sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
par { proc1(); proc2(); ...; procn(); }
forall construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
forall (i = 0; i < n; i++) { s1; s2; ...; sm; }
which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, ..., sm. Each process uses a different value of i.
Example:
forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Shared Data in Systems with Caches
Cache coherence protocols:
• In the update policy, copies of data in all caches are updated at the time one copy is altered.
• In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated. These copies are updated only when the associated processor next makes a reference to the data.
Shared Data in Systems with Caches
False sharing:
The key characteristic involved is that caches are organized in blocks of contiguous locations.
False sharing occurs when different processors require different parts of the same block, but not the same bytes.
Shared Data in Systems with Caches
Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.
Within a single array, the only way to avoid false sharing would be to place each element in a different block, which would create significant wastage of storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central-memory or a distributed-memory architecture.
Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.
Central-memory systems are also known as UMA (uniform memory access) systems.
Central memory versus distributed memory
In a UMA system, all memory locations are at an equal distance from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types:
• PVP: the parallel vector processor, also called a vector supercomputer;
• SMP: the symmetric multiprocessor.
Distributed-Memory architecture
A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory. Memories in other nodes are called remote memories.
Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.
Distributed-memory architecture (figure)
Distributed-Memory architecture
In a NORMA machine:
• the node memories have separate address spaces;
• a node cannot directly access remote memory;
• the only way to access remote data is by passing messages.
Distributed-Memory architecture
In an NCC-NUMA machine (a typical example is the Cray T3E), besides the local memory, each node has a set of node-level registers called E-registers.
Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-Memory architecture
In a COMA machine:
• all local memories are structured as caches (called COMA caches);
• such a cache has a much larger capacity than the level-2 cache or the remote cache of a node;
• COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system does not have hardware support for cache coherence, whereas CC-NUMA and COMA provide cache-coherence support in hardware.
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
CC-NUMA: main memory consists of all the local memories.
COMA: main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of five distributed-memory architectures
Constructs for Specifying Parallelism
1. Creating Concurrent Processes
2. Threads
Constructs for specifying Parallelism
Creating Concurrent Processes
• A structure for specifying concurrent processes is the FORK-JOIN group of statements.
• A FORK statement generates one new path for a concurrent process, and the concurrent processes use JOIN statements at their ends.
• When the JOIN statements have been reached, processing continues in a sequential fashion.
Constructs for specifying Parallelism
Creating Concurrent Processes (cont.)
Constructs for specifying Parallelism
UNIX Heavyweight Processes
• Operating systems such as UNIX are based upon the notion of a process.
• On a single-processor system, the processor has to be time-shared between processes, switching from one process to another.
• Time sharing also offers the opportunity to deschedule processes that are blocked from proceeding for some reason, such as waiting for an I/O operation to complete.
• On a multiprocessor, there is an opportunity to execute processes truly concurrently.
Constructs for specifying Parallelism
UNIX Heavyweight Processes
• The UNIX system call fork() creates a new process. The new process (child process) is an exact copy of the calling process, except that it has a unique process ID.
• On success, fork() returns 0 to the child process and returns the process ID of the child process to the parent process.
• Processes are "joined" with the system calls wait() and exit(), defined as:
wait(statusp): delays the caller until a signal is received or one of its child processes terminates or stops;
exit(status): terminates the process.
Constructs for specifying Parallelism
UNIX Heavyweight Processes
• Hence, a single child process can be created by:

pid = fork();                          /* fork */
/* code to be executed by both child and parent */
if (pid == 0) exit(0); else wait(0);   /* join */
Constructs for specifying Parallelism
UNIX Heavyweight Processes
• If the child is to execute different code, we could use:

pid = fork();
if (pid == 0) {
    /* code to be executed by the child (slave) */
} else {
    /* code to be executed by the parent */
}
if (pid == 0) exit(0); else wait(0);   /* join */
Constructs for specifying Parallelism
UNIX Heavyweight Processes
• All variables in the original program are duplicated in each process, becoming local variables for that process. They are initially assigned the same values as the original variables.
• The parent will wait for the slave to finish if it reaches the "join" point first; if the slave reaches the "join" point first, it will terminate.
Constructs for specifying Parallelism
Threads
Thread: a thread of execution is a fork of a computer program into two or more concurrently running tasks.
Thread mechanism: allows the threads of a process to share the same memory space and global variables.
Constructs for specifying Parallelism
Processes versus threads:
• Dependence: processes are typically independent, while threads exist as subsets of a process.
• State information: processes carry considerable state information, whereas multiple threads within a process share state as well as memory and other resources.
• Address space: processes have separate address spaces, whereas threads share their address space.
• Interaction: processes interact only through system-provided inter-process communication mechanisms, while threads interact through shared variables or by message passing.
• Context switching: context switching between threads in the same process is typically faster than context switching between processes.
Processes & Threads
Figure: a process has its own code, heap, files, interrupt routines, instruction pointer (IP), and stack; in a multithreaded process, each thread has its own IP and stack, while the code, heap, files, and interrupt routines are shared.
Constructs for specifying Parallelism
Multithreaded Processor Model
To analyze the performance of such a system:
• Latency (L): the communication latency experienced with a remote memory access, i.e. network delay, cache-miss penalty, and delays caused by contention in split transactions.
• Number of threads (N): the number of threads that can be interleaved in a processor. The context of a thread is its PC, the required register set, the context status word, and so on.
• Context-switch overhead (C): the time lost in performing a context switch in a processor. The switch mechanism determines the number of processor states needed to maintain active threads.
• Interval between context switches (R): the run length, i.e. the number of cycles between context switches triggered by remote references.
Multithreaded Computation
Figure: a multithreaded computation consists of an initial scheduling overhead, the threads of parallel computation themselves, and thread synchronization overhead.
This is the concept of multithreading in MPP systems.
A processor is in one of three states:
• busy: doing useful work;
• context switching: suspending the current context and switching to another;
• idle: when all available contexts are suspended (blocked).
Efficiency = busy / (busy + switching + idle)
Abstract Processor Model
Figure: a multiple-context processor model with one thread per context. N contexts, each a register file with its own PC and PSW, feed an ALU that issues local and remote memory references.
Context-switching policies
• Switch on cache miss: switch when encountering a cache miss.
• Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.
• Switch on every instruction: switch on every instruction, independent of whether or not it is a load.
• Switch on a block of instructions: improves the cache-hit ratio due to preservation of some locality, and also benefits single-context performance.
Pthread Threads
History: SUN Solaris, Windows NT, and other multithreaded operating systems allow users to employ threads in their programs, but each system is different.
Standard: IEEE POSIX 1003.1c (1995).
Constructs for specifying Parallelism
Executing a Pthread Thread
int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg);
Return value:
• on success, a new thread is created and 0 is returned; *thread contains the thread ID;
• on failure, a nonzero error code is returned (the error is indicated by the returned value).
Arguments:
• thread: a pointer of type pthread_t that receives the ID of the new thread;
• attr: the initial attributes for the thread; if attr is NULL, the attributes are initialized to default values;
• start_routine: a reference to a function, defined by the user, containing the code to be executed by the new thread;
• arg: a single argument passed to start_routine.
pthread_t thread;   /* handle of the special Pthread datatype */
Executing a Pthread Thread (cont.)
pthread_exit(void *status): terminates and destroys the calling thread.
pthread_cancel(): the thread is destroyed by another thread.
int pthread_join(pthread_t th, void **thread_return);
pthread_join() forces the calling thread to suspend its execution and wait until the thread with ID th terminates. thread_return contains the return value (the value of the return statement or of the pthread_exit() call).
Detached threads: there are cases in which threads can be terminated without the need for pthread_join(). When detached threads terminate, they are destroyed and their resources are released immediately, which is more efficient.
Figure: the main program calls pthread_create() for each thread; each detached thread runs and terminates on its own, with no pthread_join().
Constructs for specifying Parallelism
Thread Pools
A master thread can control a collection of slave threads: a work pool of threads can be formed. The threads can communicate through shared locations or, as we shall see, by using signals.
Constructs for specifying Parallelism
Thread-safe routines
System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.
Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.
Constructs for specifying Parallelism
Thread-safe routines (cont.)
Suppose that your application creates several threads, each of which makes a call to the same library routine, and that this routine accesses or modifies a global structure or location in memory. As each thread calls the routine, they may try to modify this global structure or memory location at the same time. If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.
Constructs for specifying Parallelism
Sharing Data
1. Creating Shared Data
2. Accessing Shared Data
   • Locks
   • Deadlock
   • Semaphores
   • Monitor
   • Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches
Conditions for Deadlock
1 Mutual exclusion bull Resource can not be shared bull Requests are delayed until resource is released
2 Hold-and-wait bull Thread holds one resource while waits for another
3 No preemption bull Resources are released voluntarily after completion
4 Circular wait bull Circular dependencies exist in ldquowaits-forrdquo or ldquoresource-allocationrdquo graphs
ALL four conditions MUST hold
wwwthemegallerycom Company Logo
Handing Deadlock
Text
Text
Text
Txt
Deadlock prevention Deadlock avoidance
Deadlock detection and recovery
Ignore
wwwthemegallerycom Company Logo
Deadlock
88b n-processes deadlock 88a 2 processes dealock
R1R2hellipRn lagrave tagravei nguyecircn P1P2hellipPn lagrave quaacute trigravenh
Example
LOGO
Semaphore
wwwthemegallerycom Company Logo
sEMaPHoRE
A positive integer operated upon by 2 operations P amp V
The value is the number of the units of the resource which are free
A binary semaphore has value 0 or 1A general semaphore can take on
positive values other than 0 and 1
wwwthemegallerycom Company Logo
sEMaPHoRE
P amp V operations are performed indivisibly
P(s) waits until s is greater than 0 then decrements s by 1 allows the process to continue
V(s) increments s by 1 to release one of the waiting processes (if any)
wwwthemegallerycom Company Logo
sEMaPHoRE
The first process reach its P(s) operation or to be accepted will set the semaphore to 0
Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore
wwwthemegallerycom Company Logo
sEMaPHoRE
When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section
Monitor
Disadvange of Semaphorebull Although semaphore provide a
convenient and effective mechanism for process synchronization using them incorrectly can result in timing error that are difficult to detect (deadlock) since these errors happen only if some particular execution sequences take place and these sequences do not always occur
bull Using incorrectly may be caused by an honest programming error or an uncooperative programmer
Shared memory multiprocessor
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur
bull Both processes are simultaneously active will cause a deadlock
Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type
bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only
method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events
conflict
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A type or abstract data type encapsulates private data with public method to operate on that data
bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor
bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()
procedure P2()
procedure Pn()
initialization_code ()
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Usage Monitor
bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly
bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization
bull Some addiontional synchronization ldquotailor moderdquo use conditional construct
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor conditional type
Condition variables
Condition variablesAllow threads to synchronize based
upon the actual value of data Without condition variables the
programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity
Always used in conjunction with a mutex lock
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified
condition variable cond (if any threads are blocked on cond)
int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)
initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised
Pthread_cond_t cond Declare condition variable
Sharing Data
Pthread Condition Variables (cont.)
int pthread_cond_broadcast(pthread_cond_t *cond);
unblocks all threads currently blocked on the specified condition variable cond.
int pthread_cond_destroy(pthread_cond_t *cond);
destroys the given condition variable; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
block on a condition variable; the second one allows a timeout to be specified.
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines. The main routine creates three threads. Two of the threads perform work and update a count variable. The third thread waits until the count variable reaches a specified value.
Sharing Data
Sequence for using condition variable - example (cont)
#include <pthread.h>

int count = 0;                        /* global variable: DECLARE */
pthread_mutex_t count_mutex;
pthread_cond_t  count_threshold_cv;

/* main: INITIALIZE, then CREATE THREADS TO DO WORK */
pthread_mutex_init(&count_mutex, NULL);
pthread_cond_init(&count_threshold_cv, NULL);
...
pthread_create(...);

/* Threads 2 and 3 */
void *inc_count(void *t) {
    ...
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(...);
        pthread_mutex_unlock(&count_mutex);
        sleep(1);  /* do some work so threads can alternate on the mutex lock */
    }
    pthread_exit(NULL);
}

/* Thread 1 */
void *watch_count(void *t) {
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)       /* re-test the condition after each wakeup */
        pthread_cond_wait(...);
    count += 125;
    pthread_mutex_unlock(...);
    pthread_exit(NULL);
}
Sharing Data
The key aspect of shared-memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass data in messages, as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual memory management system.
Shared-memory system calls allow processes to attach a segment of physical memory to their virtual address space.
Creating Shared Data (cont.)
shmget()
Creates a shared memory segment
Return value is the shared memory ID
shmat()
Attaches the shared segment to the data segment of the calling process
Returns the starting address of the attached segment
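As a sketch of these two calls (assuming a System V IPC environment; the helper shm_roundtrip and the use of IPC_PRIVATE are illustrative, not from the slides), a parent and a forked child can share one segment directly:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a private shared segment, let a child store a value in it,
 * and return the value as read back by the parent. */
int shm_roundtrip(int value) {
    /* shmget(): create a shared memory segment; returns its ID */
    int shmid = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
    if (shmid < 0) { perror("shmget"); exit(1); }

    /* shmat(): attach the segment; returns its starting address */
    int *shared = (int *)shmat(shmid, NULL, 0);
    if (shared == (void *)-1) { perror("shmat"); exit(1); }

    if (fork() == 0) {        /* child: write into the shared segment */
        *shared = value;
        _exit(0);
    }
    wait(NULL);               /* parent: wait for the child, then read */
    int result = *shared;

    shmdt(shared);            /* detach and remove the segment */
    shmctl(shmid, IPC_RMID, NULL);
    return result;
}
```

Because both processes attached the same physical segment, the parent sees the child's write directly, with no message passing.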
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
Problem: CONFLICT
o Reading the variable by different processes does not cause conflict.
o But writing a new value does.
Ex: Consider two processes, each of which is to add 1 to a shared data item x:

Instruction    Process 1        Process 2
x = x + 1      read x           read x
               compute x + 1    compute x + 1
               write to x       write to x
(time runs downward)
Conflict in accessing shared data: both processes read the shared variable x, each computes x + 1, and both write back, so one of the two +1 updates is lost.
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This mechanism is called mutual exclusion.
lock
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
A lock is a variable containing the value 0 or 1:
lock = 1: a process has entered the critical section
lock = 0: no process is in the critical section
The lock operates much like a door lock:
Suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section. It then has to wait until it is allowed to enter the critical section.
while (lock == 1) do_nothing;   /* no operation in while loop */
lock = 1;                       /* enter critical section */
... critical section ...
lock = 0;                       /* leave critical section */
A lock that relies on such busy polling is called a spin lock.
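Taken literally, the while/assignment sequence above is itself racy: two processes can both see lock == 0 and both set it to 1. Practical spin locks close this gap with an atomic test-and-set; a minimal C11 sketch (the names spin_lock/spin_unlock are illustrative, not a standard API):

```c
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;   /* clear = unlocked */

/* Spin (busy-wait) until the flag was previously clear:
 * atomic_flag_test_and_set atomically sets the flag and returns
 * its old value, so the test and the set cannot be separated. */
void spin_lock(void) {
    while (atomic_flag_test_and_set(&lock))
        ;   /* do nothing: busy waiting */
}

void spin_unlock(void) {
    atomic_flag_clear(&lock);   /* lock = 0: leave critical section */
}
```

A thread brackets its critical section with spin_lock(); ... spin_unlock();, matching the lock = 1 / lock = 0 protocol above but without the race.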
This mechanism is known as busy waiting. In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead, but this has costs:
Overhead in saving and restoring process information
The need to choose the best or highest-priority process to enter the critical section
Process 1:
while (lock == 1) do_nothing;   /* finds lock clear, so it */
lock = 1;                       /* enters the critical section */
... critical section ...
lock = 0;

Process 2 (meanwhile):
while (lock == 1) do_nothing;   /* spins until Process 1 sets lock = 0 */
lock = 1;
... critical section ...
lock = 0;
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual-exclusion lock variables (mutexes)
They resolve synchronization problems among threads
A mutex is used to grant threads access to a shared resource in turn
It provides mutual exclusion between threads
Note
A mutex is only used to synchronize threads within a single process; it cannot synchronize threads belonging to different processes
#include <pthread.h>   /* header that declares the mutex functions */
Declare a variable: pthread_mutex_t mutex;
Static initialization:
mutex = PTHREAD_MUTEX_INITIALIZER;
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
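Putting these routines together, a minimal sketch (NTHREADS, ITERS, worker, and run_counter are illustrative names, not from the slides): several threads increment a shared counter, and pthread_mutex_lock()/pthread_mutex_unlock() make each read-modify-write atomic, so no increment is lost:

```c
#include <pthread.h>

#define NTHREADS 4
#define ITERS    100000L

static long count = 0;
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (long i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&count_lock);    /* enter critical section */
        count++;                            /* protected read-modify-write */
        pthread_mutex_unlock(&count_lock);  /* leave critical section */
    }
    return NULL;
}

/* Run NTHREADS workers and return the final counter value. */
long run_counter(void) {
    pthread_t tid[NTHREADS];
    count = 0;
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return count;
}
```

Without the lock/unlock pair, the final count would typically be less than NTHREADS * ITERS, which is exactly the conflict illustrated with x = x + 1 earlier.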
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex1: forall (i = 0; i < 5; i++) a[i] = 0;
All instances can be executed simultaneously.
Ex2: forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's conditions
I(n) is the set of memory locations read by process P(n).
O(m) is the set of memory locations altered by process P(m).
If the three conditions
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
are all satisfied, the two processes can be executed concurrently.
Dependency analysis
Example 1: Suppose the two statements are (in C):
a = x + y;
b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}.
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅, so the two statements can be executed simultaneously.
Dependency analysis
Example 2: Suppose the two statements are (in C):
a = x + y;
b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}.
I2 ∩ O1 = {a} ≠ ∅, so the two statements cannot be executed simultaneously.
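The two examples can be checked mechanically. A small sketch (illustrative, not from the slides) encodes each set I and O as a bitmask, one bit per variable, so the three intersections of Bernstein's conditions become bitwise ANDs:

```c
/* One bit per variable: a, b, x, y, z */
enum { A = 1 << 0, B = 1 << 1, X = 1 << 2, Y = 1 << 3, Z = 1 << 4 };

/* Bernstein's conditions: (I1 & O2), (I2 & O1), (O1 & O2) all empty. */
int can_run_concurrently(unsigned I1, unsigned O1, unsigned I2, unsigned O2) {
    return (I1 & O2) == 0 && (I2 & O1) == 0 && (O1 & O2) == 0;
}
```

Example 1 (a = x + y; b = x + z;) gives I1 = X|Y, O1 = A, I2 = X|Z, O2 = B and passes; Example 2 (a = x + y; b = a + b;) has I2 = A|B, so I2 ∩ O1 = {a} and it fails.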
Language Constructs for Parallelism
Shared data
In a parallel programming language supporting shared memory, a shared variable might be declared as:
shared int x;
(with C++: a global int x)
par construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
par { s1; s2; ...; sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads could be specified by listing the routines that are to be executed concurrently:
par { proc1(); proc2(); ...; procn(); }
forall construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
forall (i = 0; i < n; i++) { s1; s2; ...; sm; }
which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, ..., sm. Each process uses a different value of i.
Example:
forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
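forall is not standard C, but its effect can be sketched with the Pthreads routines used elsewhere in these slides (the helper names body and forall_clear are illustrative): one thread is created per value of i, then all are joined:

```c
#include <pthread.h>

#define N 5
static int a[N];

/* The loop body; each thread receives a different value of i. */
static void *body(void *arg) {
    int i = (int)(long)arg;
    a[i] = 0;
    return NULL;
}

/* Equivalent of: forall (i = 0; i < N; i++) a[i] = 0; */
void forall_clear(void) {
    pthread_t tid[N];
    for (int i = 0; i < N; i++)
        pthread_create(&tid[i], NULL, body, (void *)(long)i);
    for (int i = 0; i < N; i++)          /* implicit barrier at the end */
        pthread_join(tid[i], NULL);
}
```

The joins at the end mirror the implicit barrier of forall: execution continues sequentially only after every instance of the body has finished.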
Shared Data in Systems with Caches
Shared Data in Systems with Caches
Cache coherence protocols:
In the update policy, copies of data in all caches are updated at the time one copy is altered.
In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are only updated when the associated processor makes reference to them.
Shared Data in Systems with Caches
False sharing:
The key characteristic used is that caches are organized in blocks of contiguous locations.
False sharing occurs when different processors require different parts of the same block, but not the same bytes.
wwwthemegallerycom Company Logo
Shared Data in Systems with Caches
Solution for false sharing:
The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.
The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non-uniform memory access) and NORMA (no-remote memory access) architectures
Central memory systems are also known as UMA (uniform memory access) systems
Central memory versus distributed memory
In a UMA system, all memory locations are at an equal distance away from any processor, and all memory accesses take roughly the same amount of time
UMA systems come in two types: PVP, the parallel vector processor,
also called a vector supercomputer, and SMP, the symmetric multiprocessor
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA
Distributed-Memory architecture
Distributed-Memory architecture
In a NORMA machine:
The node memories have separate address spaces.
A node can't directly access remote memory.
The only way to access remote data is by passing messages.
Distributed-Memory architecture
In an NCC-NUMA machine:
A typical example is the Cray T3E. Besides the local memory, each node has a set of node-level registers called E-registers.
Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-Memory architecture
In a COMA machine:
All local memories are structured as caches (called COMA caches).
A COMA cache has much larger capacity than the level-2 cache or the remote cache of a node.
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesn't have hardware support for cache coherence, but CC-NUMA and COMA provide cache-coherence support in hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
CC-NUMA versus COMA
CC-NUMA: main memory consists of all the local memories.
COMA: main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of the five distributed-memory architectures
Constructs for specifying Parallelism
Creating Concurrent Processes
• A structure for specifying concurrent processes is the FORK-JOIN group of statements.
• A FORK statement generates one new path for a concurrent process, and the concurrent processes use JOIN statements at their ends.
• When the JOIN statements have been reached, processing continues in a sequential fashion.
Constructs for specifying Parallelism
Creating Concurrent Processes (cont.)
Constructs for specifying Parallelism
UNIX Heavyweight Processes
• Operating systems such as UNIX are based upon the notion of a process.
• On a single-processor system, the processor has to be time-shared between processes, switching from one process to another.
• Time sharing also offers the opportunity to deschedule processes that are blocked from proceeding for some reason, such as waiting for an I/O operation to complete.
• On a multiprocessor, there is an opportunity to execute processes truly concurrently.
Constructs for specifying Parallelism
UNIX Heavyweight Processes
• The UNIX system call fork() creates a new process. The new process (child process) is an exact copy of the calling process, except that it has a unique process ID.
• On success, fork() returns 0 to the child process and returns the process ID of the child process to the parent process.
• Processes are "joined" with the system calls wait() and exit(), defined as:
wait(statusp): delays the caller until a signal is received or one of its child processes terminates or stops.
exit(status): terminates a process.
Constructs for specifying Parallelism
UNIX Heavyweight Processes
• Hence, a single child process can be created by:
pid = fork();       /* fork */
... code to be executed by both child and parent ...
if (pid == 0) exit(0); else wait(0);    /* join */
Constructs for specifying Parallelism
UNIX Heavyweight Processes
• If the child is to execute different code, we could use:
pid = fork();
if (pid == 0) {
    ... code to be executed by slave ...
} else {
    ... code to be executed by parent ...
}
if (pid == 0) exit(0); else wait(0);
Constructs for specifying Parallelism
UNIX Heavyweight Processes
• All variables in the original program are duplicated in each process, becoming local variables for the process. They are initially assigned the same values as the original variables.
• The parent will wait for the slave to finish if it reaches the "join" point first; if the slave reaches the "join" point first, it will terminate.
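Assembled into one compilable routine (the function name and return convention are illustrative), the fork/join pattern above looks like this; the parent reaches the join and waits for the child:

```c
#include <sys/wait.h>
#include <unistd.h>

/* FORK-JOIN with a UNIX heavyweight process:
 * returns 0 if the parent successfully joined the child. */
int fork_join_demo(void) {
    pid_t pid = fork();                 /* fork */
    if (pid < 0) return -1;             /* fork failed */

    if (pid == 0) {
        /* code to be executed by the child (slave) */
        _exit(0);
    } else {
        /* code to be executed by the parent */
        int status;
        wait(&status);                  /* join */
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    }
}
```

Note that after fork() both processes run the same program text; the pid test is what routes each of them to its own branch, exactly as in the slide fragments.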
Constructs for specifying Parallelism
Threads
Thread: a thread of execution is a fork of a computer program into two or more concurrently running tasks.
Thread mechanism: allows tasks to share the same memory space and global variables.
Constructs for specifying Parallelism
Processes vs. threads:
• Dependence: processes are typically independent, while threads exist as subsets of a process.
• State information: processes carry considerable state information, whereas multiple threads within a process share state as well as memory and other resources.
• Address space: processes have separate address spaces, whereas threads share their address space.
• Interaction: processes interact only through system-provided inter-process communication mechanisms, while threads interact through shared variables or by message passing.
• Context switching: context switching between threads in the same process is typically faster than context switching between processes.
Processes & Threads
Figure: a process contains the code, heap, open files, and interrupt routines shared by all of its threads; each thread has its own stack and instruction pointer (IP).
Constructs for specifying Parallelism
Multithreaded Processor Model
Analyzing the performance of such a system uses these parameters:
Latency (L): the communication latency experienced on a remote memory access, including network delay, cache-miss penalty, and delays caused by contention in split transactions.
Number of threads (N): the number of threads that can be interleaved in a processor; the context of a thread is the PC, the register set, the required context status word, etc.
Context-switch overhead (C): the time lost in performing a context switch in a processor; the switch mechanism depends on the number of processor states needed to maintain active threads.
Interval between context switches (R): the run length, i.e., the cycles between context switches triggered by remote references.
Multithreaded Computation
Figure: threads of a parallel computation alternate computation phases on variables with initial scheduling overhead and thread synchronization overhead; this is the concept of multithreading in MPP systems.
Processor efficiency:
Busy: doing useful work.
Context switch: suspending the current context and switching to another.
Idle: when all available contexts are suspended (blocked).
Efficiency = busy / (busy + switching + idle)
Abstract Processor Model
Multiple-context processor model with one thread per context
Figure: N contexts, each with its own PC, PSW, and register file, share a single ALU; each context issues local or remote memory references (one thread per context).
Context-switching policies
Switch on cache miss: switch when encountering a cache miss.
Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.
Switch on every instruction: switch on every instruction, independent of whether or not it is a load.
Switch on block of instructions: improves the cache-hit ratio due to preservation of some locality, and also benefits single-context performance.
Pthread Threads
History: SUN Solaris, Windows NT, etc. are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.
Standard: IEEE POSIX 1003.1c (1995)
Constructs for specifying Parallelism
Executing a Pthread Thread
int pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine)(void *), void *arg);
Return value: 0 on success (a new thread is created and *thread contains its ID); a nonzero error code on failure.
Arguments:
thread: a pointer of type pthread_t; receives the ID of the new thread.
attr: contains initial attributes for the thread; if attr = NULL, the attributes are initialized to default values.
start_routine: a reference to a function defined by the user; this function contains the code executed by the new thread.
arg: a single argument passed to start_routine.
pthread_t thread;   /* handle of the special Pthread datatype */
Executing a Pthread Thread (cont.)
pthread_exit(void *status): terminates and destroys a thread.
pthread_cancel(): a thread is destroyed by another thread.
int pthread_join(pthread_t th, void **thread_return);
pthread_join() forces the calling thread to suspend its execution and wait until the thread with the given thread ID terminates. *thread_return contains the return value (the value of the return statement or of the pthread_exit() call).
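A minimal create/join sketch (the routine square and the wrapper run_square are illustrative names): the created thread passes its result to pthread_exit(), and pthread_join() suspends the caller and collects that value:

```c
#include <pthread.h>

/* Thread body: compute n*n and hand the result to the joiner. */
static void *square(void *arg) {
    long n = (long)arg;
    pthread_exit((void *)(n * n));   /* return value collected by join */
}

/* Create a thread computing n*n, join it, and return the result. */
long run_square(long n) {
    pthread_t th;
    void *result;
    if (pthread_create(&th, NULL, square, (void *)n) != 0)
        return -1;                   /* nonzero return code = failure */
    pthread_join(th, &result);       /* suspend until the thread ends */
    return (long)result;
}
```

Passing small integers through the void * argument and return value, as here, is a common idiom; larger results are usually returned through a pointer to heap or caller-owned storage.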
Detached threads
There are cases in which threads can be terminated without the need for pthread_join().
When detached threads terminate, they are destroyed and their resources are released immediately.
=> More efficient
Figure: the main program issues pthread_create() calls; each detached thread runs to its own termination, and no join is performed.
Constructs for specifying Parallelism
Thread Pools
A master thread could control a collection of slave threads. A work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.
Constructs for specifying Parallelism
Thread ndash safe routines
System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.
Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.
Constructs for specifying Parallelism
Thread ndash safe routines (cont)
Suppose that your application creates several threads, each of which makes a call to the same library routine:
This library routine accesses/modifies a global structure or location in memory.
As each thread calls this routine, it is possible that they may try to modify this global structure/memory location at the same time.
If the routine does not employ some sort of synchronization constructs to prevent data corruption, then it is not thread-safe.
Constructs for specifying Parallelism
Sharing Data
1. Creating Shared Data
2. Accessing Shared Data: Locks, Deadlock, Semaphores, Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches
Sharing Data
Conditions for Deadlock
1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.
2. Hold-and-wait: a thread holds one resource while it waits for another.
3. No preemption: resources are released only voluntarily, after completion.
4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.
ALL four conditions MUST hold.
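Since all four conditions must hold, breaking any one of them prevents deadlock. A common sketch (mutex and function names illustrative) breaks circular wait: every thread acquires the two mutexes in the same fixed order, so no cycle can appear in the waits-for graph:

```c
#include <pthread.h>

static pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;
static int shared = 0;

/* Every thread takes m1 before m2: circular wait is impossible. */
static void *task(void *arg) {
    for (int i = 0; i < 1000; i++) {
        pthread_mutex_lock(&m1);
        pthread_mutex_lock(&m2);
        shared++;                    /* needs both resources */
        pthread_mutex_unlock(&m2);
        pthread_mutex_unlock(&m1);
    }
    return NULL;
}

/* Run two contending threads; returns the final value of shared. */
int run_ordered(void) {
    pthread_t a, b;
    shared = 0;
    pthread_create(&a, NULL, task, NULL);
    pthread_create(&b, NULL, task, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return shared;
}
```

If one thread instead took m2 before m1, the two threads could each hold one mutex while waiting for the other, satisfying all four conditions at once.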
Handling Deadlock
Deadlock prevention
Deadlock avoidance
Deadlock detection and recovery
Ignore
Deadlock
Example: (a) two-process deadlock; (b) n-process deadlock.
R1, R2, ..., Rn are resources; P1, P2, ..., Pn are processes.
Semaphore
Semaphore
A semaphore is a positive integer operated upon by two operations, P and V.
Its value is the number of units of the resource which are free.
A binary semaphore has value 0 or 1; a general semaphore can take on positive values other than 0 and 1.
Semaphore
P and V operations are performed indivisibly:
P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
V(s): increments s by 1 to release one of the waiting processes (if any).
Semaphore
The first process to reach its P(s) operation, or to be accepted, will set the semaphore to 0.
Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
Semaphore
When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes is allowed to proceed into the critical section.
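With POSIX semaphores, P and V appear as sem_wait() and sem_post(). A minimal sketch (assuming Linux-style unnamed semaphores; sem_init is not available on all platforms, and the counts are illustrative) uses a binary semaphore initialized to 1 exactly as described above:

```c
#include <pthread.h>
#include <semaphore.h>

static sem_t s;            /* binary semaphore guarding the counter */
static int counter = 0;

static void *add_one(void *arg) {
    for (int i = 0; i < 10000; i++) {
        sem_wait(&s);      /* P(s): wait until s > 0, then decrement */
        counter++;         /* critical section */
        sem_post(&s);      /* V(s): increment s, releasing a waiter */
    }
    return NULL;
}

/* Two threads increment under P/V; returns the final counter. */
int run_sem_demo(void) {
    pthread_t t1, t2;
    counter = 0;
    sem_init(&s, 0, 1);    /* initial value 1: one unit free */
    pthread_create(&t1, NULL, add_one, NULL);
    pthread_create(&t2, NULL, add_one, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    sem_destroy(&s);
    return counter;
}
```

Initializing the semaphore to a value greater than 1 would turn it into a general (counting) semaphore, admitting that many processes at once.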
Monitor
Disadvantage of semaphores:
• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors (e.g., deadlock) that are difficult to detect, since these errors happen only if some particular execution sequences take place, and these sequences do not always occur.
• Incorrect use may be caused by an honest programming error or an uncooperative programmer.
Monitor
Example of incorrect semaphore use:
Right code: ... wait(mutex); critical section; signal(mutex); ...
Wrong code: ... signal(mutex); critical section; wait(mutex); ...
(This incorrect code violates the mutual-exclusion condition.)
Monitor
Example of incorrect semaphore use:
Right code: ... wait(mutex); critical section; signal(mutex); ...
Wrong code: ... wait(mutex); critical section; wait(mutex); ...
(This incorrect code causes a deadlock.)
Monitor
• If the programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock will occur.
• When both processes are simultaneously active, the following will cause a deadlock:
Process P1: ... wait(S); wait(Q); critical section; signal(S); signal(Q);
Process P2: ... wait(Q); wait(S); critical section; signal(Q); signal(S);
Monitor
• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
• A monitor is an approach to synchronizing operations on a computer when a shared resource is used. A monitor includes:
A suite of procedures that provides the only method to access the shared resource
Mutual exclusion among those procedures
Variables associated with the shared resource
Some invariants assumed in order to avoid conflicting events
Monitor
• A type, or abstract data type, encapsulates private data with public methods to operate on that data.
• The monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.
• The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of procedures or functions that operate on those variables.
Monitor
• The structure of a monitor type:
monitor monitor_name {
    /* shared variable declarations */
    procedure P1(...) { ... }
    procedure P2(...) { ... }
    ...
    procedure Pn(...) { ... }
    initialization_code(...) { ... }
}
Structure of a monitor
Usage of monitors
• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization
bull Some addiontional synchronization ldquotailor moderdquo use conditional construct
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor conditional type
Condition variables
Condition variablesAllow threads to synchronize based
upon the actual value of data Without condition variables the
programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity
Always used in conjunction with a mutex lock
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified
condition variable cond (if any threads are blocked on cond)
int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)
initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised
Pthread_cond_t cond Declare condition variable
Sharing Data
Pthread Condition Variables(cont)
int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable
cond
int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in
effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined
int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)
int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)
block on a condition variable the second one alow to appoint timeout
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK
main
Thread 23
void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)
void watch_count(void t) pthread_mutex_lock(ampcount_mutex)
if (countltCOUNT_LIMIT)
pthread_cond_wait (hellip)
count += 125
pthread_mutex_unlock(hellip) pthread_exit(NULL)
Thread 1
Sharing Data
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtualmemory management system
Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space
Creating Shared Data (tt)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach shared segment to the data segmentof the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
#include <pthread.h>   /* header declaring the mutex routines */
Declare the variable:
  pthread_mutex_t mutex;
Static initialization:
  mutex = PTHREAD_MUTEX_INITIALIZER;
  mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
  int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Key routines:
  int pthread_mutex_lock(pthread_mutex_t *mutex);
  int pthread_mutex_unlock(pthread_mutex_t *mutex);
  int pthread_mutex_trylock(pthread_mutex_t *mutex);
  int pthread_mutex_destroy(pthread_mutex_t *mutex);
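A minimal sketch putting the routines listed above together: several threads increment one shared counter under a mutex. The thread count and loop bound are arbitrary choices for illustration.

```c
#include <pthread.h>

static pthread_mutex_t count_mutex = PTHREAD_MUTEX_INITIALIZER;
static int count = 0;

/* Each thread adds 10000 to the shared counter, one increment at a time. */
static void *adder(void *arg) {
    for (int i = 0; i < 10000; i++) {
        pthread_mutex_lock(&count_mutex);   /* blocks while another thread holds it */
        count++;                            /* critical section */
        pthread_mutex_unlock(&count_mutex);
    }
    return NULL;
}

/* Returns the final count: 40000 when mutual exclusion works. */
int run_adders(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, adder, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return count;
}
```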
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex1: forall (i = 0; i < 5; i++) a[i] = 0;
All instances can be executed simultaneously.
Ex2: forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's Conditions
- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered by process P(m).
Two processes P1 and P2 can be executed concurrently if all three conditions are satisfied:
  I1 ∩ O2 = ∅
  I2 ∩ O1 = ∅
  O1 ∩ O2 = ∅
Dependency analysis
Example 1: Suppose the two statements are (in C):
  a = x + y;
  b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, so
  I1 ∩ O2 = ∅,  I2 ∩ O1 = ∅,  O1 ∩ O2 = ∅
and the two statements can be executed simultaneously.
Dependency analysis
Example 2: Suppose the two statements are (in C):
  a = x + y;
  b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, so
  I2 ∩ O1 = {a} ≠ ∅
and the two statements cannot be executed simultaneously.
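Bernstein's conditions are easy to mechanize by representing each read/write set as a bit mask. The sketch below is illustrative: the variable-to-bit encoding is an assumption, not part of the original slides.

```c
/* Represent each variable as one bit of a set (encoding is illustrative):
   x = 1, y = 2, z = 4, a = 8, b = 16. */
enum { X = 1, Y = 2, Z = 4, A = 8, B = 16 };

/* Bernstein's conditions: two statements may run concurrently
   iff I1∩O2, I2∩O1 and O1∩O2 are all empty. */
int can_run_concurrently(unsigned i1, unsigned o1, unsigned i2, unsigned o2) {
    return (i1 & o2) == 0 && (i2 & o1) == 0 && (o1 & o2) == 0;
}
```

For Example 1 (a = x + y; b = x + z) the call can_run_concurrently(X|Y, A, X|Z, B) yields 1; for Example 2 (a = x + y; b = a + b) the input set of the second statement contains a, so can_run_concurrently(X|Y, A, A|B, B) yields 0.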
Language Constructs for Parallelism
Shared data: In a parallel programming language supporting shared memory, a shared variable might be declared as
  shared int x;
instead of a plain C/C++ declaration such as int x;
par Construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
  par { s1; s2; ...; sn; }
The keyword par indicates that the statements in the body are to be executed concurrently. Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
  par { proc1(); proc2(); ...; procn(); }
forall Construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
  forall (i = 0; i < n; i++) { s1; s2; ...; sm; }
which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, ..., sm. Each process uses a different value of i. For example,
  forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
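C itself has no forall, but as an illustrative approximation the same effect can be obtained with an OpenMP parallel loop (an assumption: the compiler supports OpenMP, e.g. gcc -fopenmp; without it the pragma is simply ignored and the loop runs sequentially with the same result).

```c
/* Clear a[0..n-1] concurrently: each loop iteration may run on a
   different thread, like one instance of the forall body. */
void clear_array(int *a, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 0;
}
```

Note that this loop satisfies Bernstein's conditions trivially: iteration i writes only a[i] and reads nothing, so the instances are independent.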
Shared Data in Systems with Caches
Cache coherence protocols:
• In the update policy, copies of data in all caches are updated at the time one copy is altered.
• In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next references them.
Shared Data in Systems with Caches
False sharing: the key characteristic involved is that caches are organized in blocks of contiguous locations. False sharing occurs when different processors require different parts of the same block, but not the same bytes.
Shared Data in Systems with Caches
Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks. The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.
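A common way to apply this layout idea by hand is to pad per-processor data so each item occupies its own cache block. A minimal sketch, assuming a 64-byte block size (machine-dependent; the structure and names are illustrative):

```c
#define BLOCK_SIZE 64   /* assumed cache-block size; machine-dependent */

/* One counter per processor, padded so that no two counters share a
   cache block: a write by one processor then never invalidates the
   block holding another processor's counter. */
struct padded_counter {
    long value;
    char pad[BLOCK_SIZE - sizeof(long)];
};

struct padded_counter per_cpu_count[4];
```

The padding trades memory for speed, which is exactly the wastage-of-storage cost mentioned above.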
Different Types of Memory Architecture
Central memory versus distributed memory
A parallel computer has either a central-memory or a distributed-memory architecture. Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote memory access) architectures. Central-memory systems are also known as UMA (uniform memory access) systems.
Central memory versus distributed memory
In a UMA system all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time. UMA systems come in two types: the PVP (parallel vector processor, also called a vector supercomputer) and the SMP (symmetric multiprocessor).
Distributed-Memory architecture
A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory. Memories in other nodes are called remote memories. The types of distributed-memory architecture are NORMA, NCC-NUMA, CC-NUMA and COMA.
Distributed-Memory architecture
In a NORMA machine:
• The node memories have separate address spaces.
• A node can't directly access remote memory.
• The only way to access remote data is by passing messages.
Distributed-Memory architecture
In an NCC-NUMA machine (a typical example is the Cray T3E), besides the local memory each node has a set of node-level registers called E-registers. Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-Memory architecture
In a COMA machine, all local memories are structured as caches (called COMA caches). A COMA cache has much larger capacity than the level-2 cache or the remote cache of a node. COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA and COMA
An NCC-NUMA system doesn't have hardware support for cache coherency, whereas CC-NUMA and COMA provide cache-coherency support in hardware. It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
In CC-NUMA, main memory consists of all the local memories; in COMA, main memory consists of all the COMA caches. All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of the Five Distributed-Memory Architectures
Constructs for Specifying Parallelism
1. Creating Concurrent Processes
2. Threads
Constructs for specifying Parallelism
Creating Concurrent Processes
• A structure for specifying concurrent processes is the FORK-JOIN group of statements.
• A FORK statement generates one new path for a concurrent process, and the concurrent processes use JOIN statements at their ends.
• When the JOIN statements have been reached, processing continues in a sequential fashion.
Constructs for specifying Parallelism
Creating Concurrent Processes (cont.)
Constructs for specifying Parallelism
UNIX Heavyweight Processes
• Operating systems such as UNIX are based upon the notion of a process.
• On a single-processor system, the processor has to be time-shared between processes, switching from one process to another.
• Time-sharing also offers the opportunity to deschedule processes that are blocked from proceeding for some reason, such as waiting for an I/O operation to complete.
• On a multiprocessor there is an opportunity to execute processes truly concurrently.
Constructs for specifying Parallelism
UNIX Heavyweight Processes
• The UNIX system call fork() creates a new process. The new process (the child process) is an exact copy of the calling process except that it has a unique process ID.
• On success, fork() returns 0 to the child process and returns the process ID of the child process to the parent process.
• Processes are "joined" with the system calls wait() and exit(), defined as:
  wait(statusp);   /* delays the caller until a signal is received or one of its child processes terminates or stops */
  exit(status);    /* terminates the process */
Constructs for specifying Parallelism
UNIX Heavyweight Processes
• Hence a single child process can be created by:
  pid = fork();                           /* fork */
  /* ... code to be executed by both child and parent ... */
  if (pid == 0) exit(0); else wait(0);    /* join */
Constructs for specifying Parallelism
UNIX Heavyweight Processes
• If the child is to execute different code, we could use:
  pid = fork();
  if (pid == 0) {
      /* ... code to be executed by slave ... */
  } else {
      /* ... code to be executed by parent ... */
  }
  if (pid == 0) exit(0); else wait(0);
Constructs for specifying Parallelism
UNIX Heavyweight Processes
• All variables in the original program are duplicated in each process, becoming local variables of that process; they are initially assigned the same values as the original variables.
• The parent will wait for the slave to finish if it reaches the "join" point first; if the slave reaches the "join" point first, it will terminate.
Constructs for specifying Parallelism
Threads
A thread of execution is a fork of a computer program into two or more concurrently running tasks. The thread mechanism allows these tasks to share the same memory space and global variables.
Constructs for specifying Parallelism
Processes versus threads:
• Dependence: processes are typically independent, while threads exist as subsets of a process.
• State information: processes carry considerable state information, whereas multiple threads within a process share state as well as memory and other resources.
• Address space: processes have separate address spaces, whereas threads share their address space.
• Interaction: processes interact only through system-provided inter-process communication mechanisms, while threads interact through shared variables or by message passing.
• Context switching: context switching between threads in the same process is typically faster than context switching between processes.
Processes & Threads
[Figure: a process owns the code, heap, files and interrupt routines; each thread within the process has its own instruction pointer (IP) and stack.]
Constructs for specifying Parallelism
Multithreaded Processor Model
To analyze the performance of such a system, the model uses:
• Latency (L): the communication latency experienced on a remote memory access, including network delay, cache-miss penalty, and delays caused by contention in split transactions.
• Number of threads (N): the number of threads that can be interleaved in a processor. The context of a thread is the program counter, the register set, and the required context status word.
• Context-switch overhead (C): the time lost in performing a context switch in a processor. The switch mechanism determines the number of processor states needed to maintain active threads.
• Interval between context switches: the run length, i.e. the number of cycles between context switches triggered by remote references.
Multithreaded Computation
[Figure: a parallel computation as a collection of threads, showing the initial scheduling overhead and the thread-synchronization overhead between threads.]
The concept of multithreading in an MPP system
A processor is in one of three states: busy (doing useful work), context switching (suspending the current context and switching to another), or idle (all available contexts suspended/blocked). Processor efficiency is then
  efficiency = busy / (busy + switching + idle)
Abstract Processor Model
[Figure: a multiple-context processor model with one thread per context: N contexts, each holding its own PC and PSW in the register files, sharing one ALU; memory references are split into local and remote memory references.]
Context-Switching Policies
• Switch on cache miss: switch when encountering a cache miss.
• Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.
• Switch on every instruction: switch on every instruction, independent of whether or not it is a load.
• Switch on block of instructions: improves the cache-hit ratio due to preservation of some locality, and also benefits single-context performance.
Pthreads
History: SUN Solaris, Windows NT, etc. are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.
Standard: the IEEE POSIX 1003.1c standard (1995).
Constructs for specifying Parallelism
Executing a Pthread Thread
int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg);
Return value: 0 on success (a new thread has been created and *thread contains its ID); a nonzero error code on failure.
Arguments:
• thread: a pointer of type pthread_t that receives the ID of the new thread.
• attr: the initial attributes for the thread; if attr is NULL, the attributes are initialized to default values.
• start_routine: a reference to a function defined by the user, containing the code the new thread executes.
• arg: a single argument passed to start_routine.
pthread_t is an opaque Pthreads data type that serves as a thread handle.
Executing a Pthread Thread (cont.)
pthread_exit(void *status);   /* terminates and destroys the calling thread */
pthread_cancel();             /* the thread is destroyed by another thread */
int pthread_join(pthread_t th, void **thread_return);
pthread_join() forces the calling thread to suspend its execution and wait until the thread with the given thread ID terminates; *thread_return receives the return value (the value of the return statement or of pthread_exit()).
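A minimal sketch of pthread_create and pthread_join passing a result back through thread_return. Smuggling an integer through the void pointer is a common illustrative trick; it assumes pointers are at least as wide as long.

```c
#include <pthread.h>

/* The thread returns n*n through its void* return value,
   which pthread_join() delivers to the caller. */
static void *square(void *arg) {
    long n = (long)arg;
    return (void *)(n * n);
}

long run_square(long n) {
    pthread_t th;
    void *result = NULL;
    pthread_create(&th, NULL, square, (void *)n);
    pthread_join(th, &result);   /* suspends until the thread terminates */
    return (long)result;
}
```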
Detached Threads
There are cases in which threads can be terminated without the need for pthread_join(). When detached threads terminate, they are destroyed and their resources are released immediately, which is more efficient.
[Figure: the main program issues several pthread_create() calls; each detached thread runs to its own termination without being joined.]
Constructs for specifying Parallelism
Thread Pools
A master thread could control a collection of slave threads: a work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.
Constructs for specifying Parallelism
Thread-Safe Routines
System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results. Thread-safeness is an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.
Constructs for specifying Parallelism
Thread-Safe Routines (cont.)
Suppose that your application creates several threads, each of which makes a call to the same library routine, and this library routine accesses/modifies a global structure or location in memory. As each thread calls this routine, it is possible that they may try to modify this global structure or memory location at the same time. If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.
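A small sketch of the difference: the unsafe routine below keeps its result in a static buffer shared by all callers, while the safe variant keeps all state in caller-supplied storage. The function names are hypothetical, chosen for illustration.

```c
#include <stdio.h>

/* NOT thread-safe: every caller shares the single static buffer, so
   two threads calling at once can clobber each other's result. */
const char *greet_unsafe(const char *name) {
    static char buf[64];
    snprintf(buf, sizeof buf, "hello %s", name);
    return buf;
}

/* Thread-safe (reentrant) variant: all state lives in storage the
   caller supplies, so concurrent calls cannot interfere. */
const char *greet_safe(const char *name, char *buf, size_t len) {
    snprintf(buf, len, "hello %s", name);
    return buf;
}
```

This is the same design split seen in the C library itself, e.g. strtok (static state) versus strtok_r (caller-supplied state).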
Constructs for specifying Parallelism
Sharing Data
1. Creating Shared Data
2. Accessing Shared Data
   • Locks
   • Deadlock
   • Semaphores
   • Monitor
   • Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in Systems with Caches
Sharing Data
Conditions for Deadlock
1. Mutual exclusion: the resource cannot be shared, so requests are delayed until the resource is released.
2. Hold-and-wait: a thread holds one resource while it waits for another.
3. No preemption: resources are released only voluntarily, after completion.
4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.
All four conditions must hold for deadlock to occur.
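Because all four conditions must hold, breaking any one of them prevents deadlock. A common sketch is to break circular wait with a fixed global lock ordering, as below (the resource names and counts are illustrative):

```c
#include <pthread.h>

static pthread_mutex_t res1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t res2 = PTHREAD_MUTEX_INITIALIZER;
static int work_done = 0;

/* Both threads acquire res1 before res2: a fixed global ordering.
   This breaks condition 4 (circular wait), so deadlock is impossible.
   If one thread instead took res2 first, the program could hang. */
static void *two_resource_worker(void *arg) {
    for (int i = 0; i < 1000; i++) {
        pthread_mutex_lock(&res1);
        pthread_mutex_lock(&res2);
        work_done++;                 /* uses both resources */
        pthread_mutex_unlock(&res2);
        pthread_mutex_unlock(&res1);
    }
    return NULL;
}

int run_ordered_locking(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, two_resource_worker, NULL);
    pthread_create(&t2, NULL, two_resource_worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return work_done;
}
```

Lock ordering is a deadlock-prevention strategy, one of the handling options listed next.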
Handling Deadlock
• Deadlock prevention
• Deadlock avoidance
• Deadlock detection and recovery
• Ignore the problem
Deadlock
[Figure: (a) deadlock between two processes; (b) n-process deadlock. R1, R2, ..., Rn are resources; P1, P2, ..., Pn are processes.]
Semaphore
A positive integer operated upon by two operations, P and V. The value is the number of units of the resource that are free. A binary semaphore has the value 0 or 1; a general semaphore can take on positive values other than 0 and 1.
Semaphore
P and V operations are performed indivisibly:
• P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
• V(s): increments s by 1 to release one of the waiting processes (if any).
Semaphore
The first process to reach its P(s) operation, or to be accepted, will set the semaphore to 0. Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
Semaphore
When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes is allowed to proceed into the critical section.
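P and V map directly onto POSIX unnamed semaphores, where sem_wait is P and sem_post is V (an assumption: sem_init is available, as on Linux). The sketch initializes a binary semaphore and verifies that at most one thread is ever inside the critical section.

```c
#include <pthread.h>
#include <semaphore.h>

static sem_t s;            /* the semaphore */
static int in_cs = 0;      /* threads currently inside the critical section */
static int max_inside = 0; /* worst case observed */

static void *sem_user(void *arg) {
    sem_wait(&s);              /* P(s): wait until s > 0, then s-- */
    in_cs++;                   /* safe: at most one thread is here at a time */
    if (in_cs > max_inside)
        max_inside = in_cs;
    in_cs--;
    sem_post(&s);              /* V(s): s++, releasing one waiter */
    return NULL;
}

/* Initialize s as a binary semaphore (value 1) and run 4 threads;
   returns the maximum number of threads ever inside at once. */
int run_sem_demo(void) {
    pthread_t t[4];
    sem_init(&s, 0, 1);        /* second arg 0: shared between threads only */
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, sem_user, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    sem_destroy(&s);
    return max_inside;
}
```

Initializing with a value greater than 1 would turn this into a general semaphore admitting that many threads at once.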
Monitor
Disadvantages of semaphores:
• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.
• Incorrect use may be caused by an honest programming error or by an uncooperative programmer.
Monitor
Example of wrong semaphore use:
Correct code:  ... wait(mutex); critical section; signal(mutex); ...
Wrong code:    ... signal(mutex); critical section; wait(mutex); ...
This wrong code violates the mutual-exclusion condition.
Monitor
Example of wrong semaphore use:
Correct code:  ... wait(mutex); critical section; signal(mutex); ...
Wrong code:    ... wait(mutex); critical section; wait(mutex); ...
This wrong code causes deadlock.
Faculty of Computer Science & Engineering, Ho Chi Minh City University of Technology
Monitor
• If the programmer omits the wait() or the signal() in a critical section, or both, then either mutual exclusion is violated or a deadlock will occur.
• When both of the processes below are active simultaneously, a deadlock can occur (note the opposite acquisition orders):
  Process P1: ... wait(S); wait(Q); critical section; signal(S); signal(Q);
  Process P2: ... wait(Q); wait(S); critical section; signal(Q); signal(S);
Monitor
• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
• A monitor is an approach to synchronizing operations on a computer when shared resources are used. A monitor includes:
  - a suite of procedures that provides the only method of accessing a shared resource;
  - mutual exclusion among those procedures;
  - the variables associated with the shared resource;
  - invariants assumed in order to avoid conflicting events.
Monitor
• A type, or abstract data type, encapsulates private data with public methods that operate on that data.
• The monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.
• The monitor type also contains the declarations of the variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.
Monitor
• The structure of a monitor type:
  monitor monitor_name {
      /* shared variable declarations */
      procedure P1 (...) { ... }
      procedure P2 (...) { ... }
      ...
      procedure Pn (...) { ... }
      initialization_code (...) { ... }
  }
Structure of a Monitor
Usage of Monitors
• The monitor construct ensures that only one process at a time can be active within the monitor; consequently the programmer does not need to code this synchronization constraint explicitly.
• The monitor as defined so far is not sufficiently powerful to model some synchronization schemes, so some additional "tailor-made" synchronization mechanisms need to be defined.
• Such tailor-made synchronization uses the condition construct.
Condition Type
• Declaration: condition x, y;
• The only operations that can be invoked on a condition variable are wait() and signal():
  x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
  x.signal(): the process invoking this operation resumes exactly one suspended process; if no process is suspended, x.signal() has no effect.
Structure of a monitor with condition variables
Condition Variables
Condition variables allow threads to synchronize based upon the actual value of data. Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met; this can be very time-consuming and unproductive, since the thread would be continuously busy in this activity. A condition variable is always used in conjunction with a mutex lock.
Sharing Data
Pthread Condition Variables
pthread_cond_t cond;   /* declare a condition variable */
int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
  Initializes the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition-variable attributes are used; the effect is the same as passing the address of a default condition-variable attributes object. Upon successful initialization, the condition variable becomes initialized.
int pthread_cond_signal(pthread_cond_t *cond);
  Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
Sharing Data
Pthread Condition Variables (cont.)
int pthread_cond_broadcast(pthread_cond_t *cond);
  Unblocks all threads currently blocked on the specified condition variable cond.
int pthread_cond_destroy(pthread_cond_t *cond);
  Destroys the given condition variable; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition-variable object can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
  Block on a condition variable; the timedwait variant additionally allows a timeout to be specified.
Sharing Data
Sequence for using a condition variable - example
This simple example code demonstrates the use of several Pthread condition-variable routines. The main routine creates three threads: two of the threads perform work and update a count variable, and the third thread waits until the count variable reaches a specified value.
Sharing Data
Sequence for using a condition variable - example (cont.)

main:
  #include <pthread.h>
  int count = 0;                          /* global variable: DECLARE */
  pthread_mutex_t count_mutex;
  pthread_cond_t count_threshold_cv;
  ...
  pthread_mutex_init(&count_mutex, NULL);           /* INITIALIZE */
  pthread_cond_init(&count_threshold_cv, NULL);
  ...
  pthread_create(...);                    /* CREATE THREADS TO DO WORK */

Threads 2 and 3:
  void *inc_count(void *t) {
      ...
      for (i = 0; i < TCOUNT; i++) {
          pthread_mutex_lock(&count_mutex);
          count++;
          if (count == COUNT_LIMIT)
              pthread_cond_signal(...);
          ...
          pthread_mutex_unlock(&count_mutex);
          /* do some work so the threads can alternate on the mutex lock */
          sleep(1);
      }
      pthread_exit(NULL);
  }

Thread 1:
  void *watch_count(void *t) {
      pthread_mutex_lock(&count_mutex);
      if (count < COUNT_LIMIT)
          pthread_cond_wait(...);
      count += 125;
      pthread_mutex_unlock(...);
      pthread_exit(NULL);
  }
Sharing Data
The key aspect of shared-memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass data in messages, as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual-memory management system. Shared-memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
shmget(): creates a shared memory segment; its return value is the shared-memory ID.
shmat(): attaches the shared segment to the data segment of the calling process and returns the starting address of the data segment.
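A minimal System V shared-memory sketch of the two calls above (assuming a Unix-like system): create a segment, attach it, write and read one value, then detach and remove it. The value 123 and the 0600 permissions are arbitrary illustrative choices.

```c
#include <sys/ipc.h>
#include <sys/shm.h>

/* Returns the value read back from the shared segment, or -1 on error. */
int shm_roundtrip(void) {
    int id = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
    if (id < 0)
        return -1;
    int *p = (int *)shmat(id, 0, 0);   /* attach: map into our address space */
    if (p == (void *)-1)
        return -1;
    *p = 123;        /* any process attached to this segment would see 123 */
    int v = *p;
    shmdt(p);                          /* detach from our address space */
    shmctl(id, IPC_RMID, 0);           /* mark the segment for removal */
    return v;
}
```

In a real FORK-JOIN program, the shmget/shmat calls would be made before fork() so that parent and child share the segment while their ordinary variables remain private copies.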
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
Problem: conflict.
• Reading the variable by different processes does not cause conflict.
• But writing new values can conflict.
Example: consider two processes, each of which is to add 1 to a shared data item x:
  Instruction    Process 1        Process 2
  x = x + 1      read x           read x
                 compute x + 1    compute x + 1
                 write to x       write to x
(time increases downward: both processes read the old value of x, so one of the two increments is lost)
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
In a NORMA machine, the node memories have separate address spaces. A node cannot directly access remote memory; the only way to access remote data is by passing messages.
In an NCC-NUMA machine, a typical example being the Cray T3E, each node has, besides its local memory, a set of node-level registers called E-registers. Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
In a COMA machine, all local memories are structured as caches (called COMA caches). A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node. COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA and COMA
An NCC-NUMA system does not have hardware support for cache coherency, whereas CC-NUMA and COMA provide cache coherency support in hardware. It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
In CC-NUMA, main memory consists of all the local memories; in COMA, main memory consists of all the COMA caches. All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of the five distributed-memory architectures
Creating Concurrent Processes
• A structure for specifying concurrent processes is the FORK-JOIN group of statements.
• A FORK statement generates one new path for a concurrent process, and the concurrent processes use JOIN statements at their ends.
• When the JOIN statements have been reached, processing continues in a sequential fashion.
Constructs for Specifying Parallelism
UNIX Heavyweight Processes
• Operating systems such as UNIX are based upon the notion of a process.
• On a single-processor system, the processor has to be time-shared between processes, switching from one process to another.
• Time sharing also offers the opportunity to deschedule processes that are blocked from proceeding for some reason, such as waiting for an I/O operation to complete.
• On a multiprocessor there is an opportunity to execute processes truly concurrently.
• The UNIX system call fork() creates a new process. The new process (the child process) is an exact copy of the calling process, except that it has a unique process ID.
• On success, fork() returns 0 to the child process and returns the process ID of the child process to the parent process.
• Processes are "joined" with the system calls wait() and exit(), defined as:
wait(statusp): delays the caller until a signal is received or one of its child processes terminates or stops;
exit(status): terminates a process.
• Hence, a single child process can be created by:
pid = fork();                          /* fork */
/* ... code to be executed by both child and parent ... */
if (pid == 0) exit(0); else wait(0);   /* join */
• If the child is to execute different code, we could use:
pid = fork();
if (pid == 0) {
    /* ... code to be executed by slave ... */
} else {
    /* ... code to be executed by parent ... */
}
if (pid == 0) exit(0); else wait(0);
• All variables in the original program are duplicated in each process, becoming local variables for that process. They are initially assigned the same values as the original variables.
• The parent will wait for the slave to finish if it reaches the "join" point first; if the slave reaches the "join" point first, it will terminate.
Threads
Thread: a thread of execution is a fork of a computer program into two or more concurrently running tasks.
Thread mechanism: allows tasks to share the same memory space and global variables.
Processes & Threads:
• Dependence: processes are typically independent, while threads exist as subsets of a process.
• State information: processes carry considerable state information, whereas multiple threads within a process share state as well as memory and other resources.
• Address space: processes have separate address spaces, whereas threads share their address space.
• Interaction: processes interact only through system-provided inter-process communication mechanisms, while threads interact through shared variables or by message passing.
• Context switching: context switching between threads in the same process is typically faster than context switching between processes.
[Figure: a process (code, heap, files, interrupt routines, IP, stack) compared with a multithreaded process, in which each thread has its own IP and stack]
Multithreaded Processor Model
Analyzing the performance of such a system:
• Latency (L): the communication latency experienced on a remote memory access, including network delay, cache-miss penalty, and delays caused by contention in split transactions.
• Number of threads (N): the number of threads that can be interleaved in a processor. The context of a thread is its PC, register set, and required context status word.
• Context-switch overhead (C): the time lost in performing a context switch in a processor. The switch mechanism determines the number of processor states needed to maintain active threads.
• Interval between context switches (R): the run length, in cycles, between context switches triggered by remote references.
Multithreaded Computation
[Figure: the concept of multithreading in an MPP system: threads of parallel computation, showing initial scheduling overhead and thread synchronization overhead]
Processor efficiency: a processor is busy when doing useful work, context switching when suspending the current context and switching to another, and idle when all available contexts are suspended (blocked).
Efficiency = busy / (busy + switching + idle)
Abstract Processor Model
[Figure: multiple-context processor model with one thread per context: N contexts, each with its own PC and PSW, sharing the register files and ALU; memory references are either local or remote]
Context-switching policies
• Switch on cache miss: switch whenever a cache miss is encountered.
• Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.
• Switch on every instruction: switch on every instruction, independent of whether or not it is a load.
• Switch on a block of instructions: improves the cache-hit ratio by preserving some locality, and also benefits single-context performance.
Pthreads
History: SUN Solaris, Windows NT, … are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.
Standard: the IEEE POSIX 1003.1c standard (1995).
Executing a Pthread Thread
int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);
Return value:
• success: a new thread is created and 0 is returned; *thread contains the new thread's ID.
• failure: a nonzero error code is returned.
Arguments:
• thread: a pointer of type pthread_t that receives the ID of the new thread.
• attr: the initial attributes for the thread; if attr is NULL, the attributes are initialized to default values.
• start_routine: a reference to a function defined by the user, containing the code to be executed by the new thread.
• arg: a single argument passed to start_routine.
pthread_t thread;  /* handle of the special Pthread datatype */
Executing a Pthread Thread (cont.)
pthread_exit(void *status): terminates and destroys the calling thread.
pthread_cancel(): the thread is destroyed at the request of another thread.
int pthread_join(pthread_t th, void **thread_return);
pthread_join() forces the calling thread to suspend its execution and wait until the thread with the given thread ID terminates; *thread_return receives the return value (the value of the return statement or of the pthread_exit() call).
Detached threads: there are cases in which threads can be terminated without the need for pthread_join(). When detached threads terminate, they are destroyed and their resources released immediately, which is more efficient.
[Figure: the main program creates detached threads with pthread_create(); each thread terminates independently, with no join]
Thread Pools
A master thread can control a collection of slave threads; a work pool of threads can be formed. Threads can communicate through shared locations or, as we shall see, by using signals.
Thread-Safe Routines
System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.
Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.
Thread-Safe Routines (cont.)
Suppose that your application creates several threads, each of which makes a call to the same library routine. This library routine accesses/modifies a global structure or location in memory. As each thread calls the routine, it is possible that they may try to modify this global structure or memory location at the same time. If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.
Sharing Data
1. Creating Shared Data
2. Accessing Shared Data: locks, deadlock, semaphores, monitors, condition variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in Systems with Caches
Conditions for Deadlock
1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.
2. Hold-and-wait: a thread holds one resource while it waits for another.
3. No preemption: resources are released only voluntarily, after completion.
4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.
ALL four conditions MUST hold.
Handling Deadlock: deadlock prevention; deadlock avoidance; deadlock detection and recovery; or ignoring the problem.
Deadlock
[Figure 8.8: (a) two-process deadlock; (b) n-process deadlock. R1, R2, …, Rn are resources; P1, P2, …, Pn are processes.]
Semaphore
A semaphore is a positive integer operated upon by two operations, P and V. Its value is the number of units of the resource that are free. A binary semaphore has the value 0 or 1; a general semaphore can take on positive values other than 0 and 1.
P and V operations are performed indivisibly:
P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
V(s): increments s by 1, releasing one of the waiting processes (if any).
The first process to reach its P(s) operation (or to be accepted) will set the semaphore to 0. Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes is allowed to proceed into the critical section.
Monitor
Disadvantage of semaphores:
• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (e.g., deadlock), since these errors happen only when particular execution sequences take place, and these sequences do not always occur.
• Incorrect use may be caused by an honest programming error or an uncooperative programmer.
Example of incorrect semaphore use:
Correct code:  … wait(mutex); /* critical section */ signal(mutex); …
Wrong code:    … signal(mutex); /* critical section */ wait(mutex); …
This incorrect code violates the mutual exclusion condition.
Wrong code:    … wait(mutex); /* critical section */ wait(mutex); …
This incorrect code causes deadlock.
Faculty of Computer Science & Engineering, HCMC University of Technology
• If the programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock will occur.
• When both processes are simultaneously active, the following can deadlock (each holds one semaphore while waiting for the other):
Process P1: … wait(S); wait(Q); /* critical section */ signal(S); signal(Q);
Process P2: … wait(Q); wait(S); /* critical section */ signal(Q); signal(S);
• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
• A monitor is an approach to synchronizing operations on shared resources. A monitor consists of: a suite of procedures that provides the only method of access to a shared resource; mutual exclusion among those procedures; the variables associated with the shared resource; and invariants that must hold to avoid conflicts.
• A type, or abstract data type, encapsulates private data with public methods that operate on that data.
• The monitor type presents a set of programmer-defined operations that are provided with mutual exclusion within the monitor.
• The monitor type also contains the declarations of the variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.
• The structure of a monitor type:
monitor monitor_name {
    /* shared variable declarations */
    procedure P1(...) { ... }
    procedure P2(...) { ... }
    ...
    procedure Pn(...) { ... }
    initialization_code(...) { ... }
}
Using Monitors
• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
• The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; additional, tailor-made synchronization mechanisms need to be defined.
• These additional tailor-made synchronization mechanisms are provided by the condition construct.
The condition type
• Declaration: condition x, y;
• The only operations that can be invoked on a condition variable are wait() and signal():
x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
x.signal(): resumes exactly one suspended process.
• If no process is suspended, x.signal() has no effect.
Condition variables
Condition variables allow threads to synchronize based upon the actual value of data. Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met; this can be very time-consuming and unproductive, since the thread would be continuously busy with this activity. Condition variables are always used in conjunction with a mutex lock.
Pthread Condition Variables
pthread_cond_t cond;  /* declare a condition variable */
int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
Initializes the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialization, the state of the condition variable becomes initialized.
int pthread_cond_signal(pthread_cond_t *cond);
Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
Pthread Condition Variables (cont.)
int pthread_cond_broadcast(pthread_cond_t *cond);
Unblocks all threads currently blocked on the specified condition variable cond.
int pthread_cond_destroy(pthread_cond_t *cond);
Destroys the given condition variable; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
Block on a condition variable; the second form allows a timeout to be specified.
Sequence for using a condition variable: example
This simple example code demonstrates the use of several Pthread condition variable routines. The main routine creates three threads. Two of the threads perform work and update a count variable; the third thread waits until the count variable reaches a specified value.
#include <pthread.h>
int count = 0;                            /* global variable: DECLARE */
pthread_mutex_t count_mutex;
pthread_cond_t  count_threshold_cv;
...
pthread_mutex_init(&count_mutex, NULL);   /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
...
pthread_create(...);                      /* CREATE THREADS TO DO WORK */

/* Threads 2 and 3 */
void *inc_count(void *t) {
    ...
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(...);
        pthread_mutex_unlock(&count_mutex);
        sleep(1);  /* do some work so the threads can alternate on the mutex lock */
    }
    pthread_exit(NULL);
}

/* Thread 1 */
void *watch_count(void *t) {
    pthread_mutex_lock(&count_mutex);
    if (count < COUNT_LIMIT)
        pthread_cond_wait(...);
    count += 125;
    pthread_mutex_unlock(...);
    pthread_exit(NULL);
}
The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass data in messages, as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual memory management system. Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
shmget(): creates a shared memory segment; the return value is the shared memory ID.
shmat(): attaches the shared segment to the data segment of the calling process; returns the starting address of the segment.
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process. The problem is conflict:
• Reading the variable by different processes does not cause a conflict.
• But writing new values can conflict.
Example: consider two processes, each of which is to add 1 to a shared data item x:
Instruction    Process 1          Process 2
x = x + 1      read x             read x
               compute x + 1      compute x + 1
               write to x         write to x
(time runs downward)
[Figure: conflict in accessing shared data: both processes read the shared variable x, add 1, and write back, so one of the two increments is lost]
The problem of accessing shared data can be generalized by considering shared resources.
Mechanism: a way of ensuring that only one process accesses a particular resource at a time is to establish the sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time. This mechanism is mutual exclusion.
Locks
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock. A lock is a variable containing the value 0 or 1:
lock = 1: a process has entered the critical section;
lock = 0: no process is in the critical section.
The lock operates much like a door lock.
Suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section. It now has to wait until it is allowed to enter.
while (lock == 1) ;  /* do nothing: no operation in the while loop */
lock = 1;            /* enter critical section */
/* ... critical section ... */
lock = 0;            /* leave critical section */
A lock of this kind is a spin lock, and the mechanism is busy waiting.
In some cases it may be possible to deschedule the process from the processor and schedule another process instead, but this incurs overhead in saving and restoring process information, and it becomes necessary to choose the best or highest-priority process to enter the critical section.
Process 1:                     Process 2:
while (lock == 1) ;            while (lock == 1) ;
lock = 1;                      lock = 1;
/* critical section */         /* critical section */
lock = 0;                      lock = 0;
Pthread Lock Routines
• Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes), which resolve thread synchronization problems.
• A mutex is used to grant threads access to shared resources in turn, providing mutual exclusion between threads.
• Note: a mutex is used only to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.
#include <pthread.h>  /* header containing the mutex functions */
Declare the variable:
    pthread_mutex_t mutex;
Static initialization:
    mutex = PTHREAD_MUTEX_INITIALIZER;
    mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
    int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Key functions:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency Analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Examples:
Ex. 1: forall (i = 0; i < 5; i++) a[i] = 0;
All instances can be executed simultaneously.
Ex. 2: forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's conditions
I(n) is the set of memory locations read by process P(n); O(m) is the set of memory locations altered by process P(m). If the following three conditions are all satisfied, the two processes can be executed concurrently:
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
Example 1: suppose the two statements are (in C):
a = x + y;
b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}.
I1 ∩ O2 = I2 ∩ O1 = O1 ∩ O2 = ∅, so the two statements can be executed simultaneously.
Example 2: suppose the two statements are (in C):
a = x + y;
b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}.
I2 ∩ O1 = {a} ≠ ∅, so the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared as:
shared int x;
In C/C++, a global declaration gives every thread access to the variable:
int x;  /* global */
The par construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
par { s1; s2; ...; sn; }
The keyword par indicates that the statements in the body are to be executed concurrently. Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
par { proc1(); proc2(); ...; procn(); }
The forall construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
forall (i = 0; i < n; i++) { s1; s2; ...; sm; }
which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, …, sm; each process uses a different value of i. For example:
forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
LOGO
Creating Concurrent Proceses(cont)
Constructs for specifying Parallelism
UNIX Heavyweight Processes
bull Operating systems such as UNIX are based upon the notion of a process
bull On a single processor system the processor has to be time shared between processes switching from one process to another
bull Time sharing also offer the opportunity to deschedule processes that are blocked from proceeding for some reason such as waiting for an IO operation to complete
bull On a multiprocessors there is an opportunity to execute process truly concurrently
Constructs for specifying Parallelism
UNIX Heavyweight Processes
bull The UNIX system call fork() creates a new process The new process (child process) is an exact copy of the calling process except that it has a unique process ID
bull On success fork() returns 0 to the child process ang returns the process ID of the child process to the parent process
bull Process are ldquojoinedrdquo with the system calls wait() and exit() defined as
wait(statusp)delays caller until signal received or one of
itshellipchild process terminates or stophellip
exit(status)terminates a process
Constructs for specifying Parallelism
UNIX Heavyweight Processes
bull Hence a single child process can be created by
pid = fork() fork
hellip Code to be excuted by both child and parenthellip
if (pid == 0) exit(0)else wait(0) join
Constructs for specifying Parallelism
UNIX Heavyweight Processes
bull If the child is to execute different code we could use
pid = fork() if (pid == 0)
hellip code to be executed by slave hellip else
hellip code to be executed by parent hellipif (pid == 0) exit (0) else wait (0)
Constructs for specifying Parallelism
UNIX Heavyweight Processes
bull All variables in the original program are duplicated in each process becoming local variables for the process They are assigned the same values as the original variables initially
bull The parent will wait for the slave to finish if it reaches the ldquojoinrdquo point first if the slave reaches the ldquojoinrdquo point first it will terminate
Constructs for specifying Parallelism
Threads
Thread a thread of execution is a fork of a
computer program into two or more concurrently running tasks
Thread mechanism Allow to share the same memory space amp
global variables
Constructs for specifying Parallelism
Context Switching
Interaction
Address Space
State Infomation
Dependence bull processes are typically independent while threads exist as subsets of a process
bull processes carry considerable state information where multiple threads within a process share state as well as memory and other resources
bull processes have separate address spaces where threads share their address space
bullprocesses interact only through system-provided inter-process communication mechanisms while threads interact through shared variable or by message passing
bullContext switching between threads in the same process is typically faster than context switching between processes
Processes amp Threads
Interupt Routines
File
IP
Code Heap
Stack
IPStackInterupt Routines
File
IP
Code Heap
Stack Thread
Process
Constructs for specifying Parallelism
wwwthemegallerycom Company Logo
Multithreaded Processor Model
Analyze performance of system Latency L communication latency
experienced with remote memory access network delay cache-miss penalty delays caused by
contentions in split transactions
Number of threads N Number of thread that can be interleaved in a processor
Context of a thread =PCregister set required context status word hellip
Context switch overhead C time lost in performing context switch in a processor
Switch mechanism number of processor states needed to maintain active threads
Interval between context switches run length (cycles between context switch triggered by remote reference)
Multithreaded Computation
[Diagram: the concept of multithreading in an MPP system — threads of a parallel computation accessing shared variables, with initial scheduling overhead and thread-synchronization overhead.]
Processor efficiency — a processor is in one of three states:
• Busy: doing useful work;
• Switching: suspending the current context & switching to another;
• Idle: all available contexts are suspended (blocked).
Efficiency = busy / (busy + switching + idle)
Abstract Processor Model
[Diagram: multiple-context processor model with one thread per context — N contexts, each with its own PC and PSW, shared register files and ALU, issuing local and remote memory references.]
Context-switching policies
• Switch on cache miss: switch when encountering a cache miss.
• Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.
• Switch on every instruction: switch on every instruction, independent of whether or not it is a load.
• Switch on block of instructions: will improve the cache-hit ratio due to preservation of some locality, & also benefits single-context performance.
Pthreads
• History: SUN Solaris, Windows NT, … are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.
• Standard: the IEEE POSIX 1003.1c standard (1995).
Constructs for specifying Parallelism
Executing a Pthread Thread

int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);

Return value:
• success: a new thread is created, the function returns 0, and *thread contains the new thread's ID;
• failure: a nonzero error number is returned (Pthreads functions report errors through their return value, not through errno).

Arguments:
• thread: a pointer of type pthread_t that receives the ID of the new thread;
• attr: the initial attributes for the thread; if attr = NULL, the attributes are initialized to default values;
• start_routine: a reference to a user-defined function containing the code the new thread executes;
• arg: a single argument passed to start_routine.

pthread_t thread; — handle of the special Pthread datatype.
Executing a Pthread Thread (cont.)
• pthread_exit(void *status): terminates & destroys the calling thread.
• pthread_cancel(): a thread is destroyed at the request of another thread.
• int pthread_join(pthread_t th, void **thread_return): forces the calling thread to suspend its execution & wait until the thread with the given ID terminates; *thread_return receives the terminated thread's return value (the value of its return statement or of pthread_exit(…)).

Detached threads: there are cases in which threads can be terminated without the need for pthread_join.
Detached Threads
When detached threads terminate, they are destroyed & their resources released immediately => more efficient.
[Diagram: the main program issues several pthread_create() calls; each detached thread runs to termination independently, with no join.]
Constructs for specifying Parallelism
Thread Pools
A master thread could control a collection of slave threads. A work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.
Constructs for specifying Parallelism
Thread-safe routines
• System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.
• Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.
Constructs for specifying Parallelism
Thread-safe routines (cont.)
• Suppose that your application creates several threads, each of which makes a call to the same library routine.
• This library routine accesses/modifies a global structure or location in memory.
• As each thread calls this routine, it is possible that they may try to modify this global structure/memory location at the same time.
• If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.
Constructs for specifying Parallelism
Sharing Data
1. Creating Shared Data
2. Accessing Shared Data
   - Locks
   - Deadlock
   - Semaphores
   - Monitors
   - Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in Systems with Caches

Sharing Data
Conditions for Deadlock
1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.
2. Hold-and-wait: a thread holds one resource while it waits for another.
3. No preemption: resources are released only voluntarily, after completion.
4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.
ALL four conditions MUST hold.
Handling Deadlock
• Deadlock prevention
• Deadlock avoidance
• Deadlock detection and recovery
• Ignore the problem
Deadlock
[Figure 8.8: (a) two-process deadlock; (b) n-process deadlock. R1, R2, …, Rn are resources; P1, P2, …, Pn are processes.]
Semaphore
• A positive integer operated upon by two operations, P & V.
• Its value is the number of units of the resource that are free.
• A binary semaphore has value 0 or 1; a general semaphore can take on positive values other than 0 and 1.
• P & V operations are performed indivisibly.
• P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
• V(s): increments s by 1, releasing one of the waiting processes (if any).
• The first process to reach its P(s) operation and be accepted will set the semaphore to 0.
• Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
• When a process reaches its V(s) operation, it sets the semaphore s back to 1, and one of the waiting processes is allowed to proceed into the critical section.
Monitor
Disadvantages of semaphores:
• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors (e.g. deadlock) that are difficult to detect, since these errors happen only if particular execution sequences take place, and these sequences do not always occur.
• Incorrect use may be caused by an honest programming error or by an uncooperative programmer.
Monitor
Example of incorrect semaphore use:
  Right code:  … wait(mutex); critical section; signal(mutex); …
  Wrong code:  … signal(mutex); critical section; wait(mutex); …
This wrong code causes a violation of mutual exclusion.
Monitor
Example of incorrect semaphore use:
  Right code:  … wait(mutex); critical section; signal(mutex); …
  Wrong code:  … wait(mutex); critical section; wait(mutex); …
This wrong code causes deadlock.
Khoa Khoa học & Kỹ thuật Máy tính (Faculty of Computer Science & Engineering), Đại học Bách Khoa TpHCM
Monitor
• If the programmer omits the wait() or the signal() around a critical section (or both), either mutual exclusion is violated or a deadlock will occur.
• If the two processes below are simultaneously active, a deadlock can occur, because they acquire the two semaphores in opposite orders:
  Process P1: … wait(S); wait(Q); critical section; signal(S); signal(Q);
  Process P2: … wait(Q); wait(S); critical section; signal(Q); signal(S);
Monitor
• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
• A monitor is an approach to synchronizing operations on a shared resource. A monitor includes:
  - a suite of procedures that provides the only method to access the shared resource;
  - mutual exclusion among those procedures;
  - the variables associated with the shared resource;
  - invariants assumed in order to avoid conflicting accesses.
Monitor
• A monitor type is an abstract data type that encapsulates private data with public methods to operate on that data.
• A monitor type presents a set of programmer-defined operations that are provided with mutual exclusion within the monitor.
• The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.
Monitor
• Structure of a monitor type:

    monitor monitor_name {
        /* shared variable declarations */
        procedure P1(...) { ... }
        procedure P2(...) { ... }
        ...
        procedure Pn(...) { ... }
        initialization_code(...) { ... }
    }
Using Monitors
• The monitor construct ensures that only one process at a time can be active within the monitor; consequently, the programmer does not need to code this synchronization constraint explicitly.
• The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; additional "tailor-made" synchronization mechanisms must be defined.
• Such tailor-made synchronization uses the condition construct.
The Condition Type
• Declaration: condition x, y;
• The only operations that can be invoked on a condition variable are wait() and signal():
  - x.wait(): the process invoking this operation is suspended until another process invokes x.signal();
  - x.signal(): resumes exactly one suspended process; if no process is suspended, x.signal() has no effect.
[Diagram: structure of a monitor with condition variables.]
Condition Variables
• Condition variables allow threads to synchronize based upon the actual value of data.
• Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be very time-consuming & unproductive, since the thread would be continuously busy in this activity.
• A condition variable is always used in conjunction with a mutex lock.
Sharing Data
Pthread Condition Variables
• pthread_cond_t cond; — declares a condition variable.
• int pthread_cond_signal(pthread_cond_t *cond);
  Unblocks at least one of the threads blocked on the specified condition variable cond (if any threads are blocked on cond).
• int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
  Initializes the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition-variable attributes are used; the effect is the same as passing the address of a default attributes object. Upon successful initialization, the state of the condition variable becomes initialized.
Sharing Data
Pthread Condition Variables (cont.)
• int pthread_cond_broadcast(pthread_cond_t *cond);
  Unblocks all threads currently blocked on the specified condition variable cond.
• int pthread_cond_destroy(pthread_cond_t *cond);
  Destroys the given condition variable; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition-variable object can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.
• int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
  int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
  Block on a condition variable; the second form allows a timeout to be specified.
Sharing Data
Sequence for using a condition variable - example
This simple example demonstrates the use of several Pthread condition-variable routines. The main routine creates three threads. Two of the threads perform work and update a "count" variable; the third thread waits until the count variable reaches a specified value.
Sharing Data
Sequence for using a condition variable - example (cont.)

main:
    #include <pthread.h>
    int count = 0;                          /* global variable */
    pthread_mutex_t count_mutex;            /* DECLARE */
    pthread_cond_t count_threshold_cv;
    ...
    pthread_mutex_init(&count_mutex, NULL);           /* INITIALIZE */
    pthread_cond_init(&count_threshold_cv, NULL);
    ...
    pthread_create(...);                    /* CREATE THREADS TO DO WORK */

Threads 2 & 3:
    void *inc_count(void *t) {
        ...
        for (i = 0; i < TCOUNT; i++) {
            pthread_mutex_lock(&count_mutex);
            count++;
            if (count == COUNT_LIMIT)
                pthread_cond_signal(&count_threshold_cv);
            ...
            pthread_mutex_unlock(&count_mutex);
            sleep(1);   /* do some work so threads can alternate on the mutex */
        }
        pthread_exit(NULL);
    }

Thread 1:
    void *watch_count(void *t) {
        pthread_mutex_lock(&count_mutex);
        while (count < COUNT_LIMIT)         /* re-check: wakeups may be spurious */
            pthread_cond_wait(&count_threshold_cv, &count_mutex);
        count += 125;
        pthread_mutex_unlock(&count_mutex);
        pthread_exit(NULL);
    }
Sharing Data
The key aspect of shared-memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages, as in a message-passing environment.
Creating Shared Data
• Each process has its own virtual address space within the virtual-memory management system.
• Shared-memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
• shmget(): creates a shared memory segment; its return value is the shared memory ID.
• shmat(): attaches the shared segment to the data segment of the calling process; it returns the starting address of the segment.
Accessing Shared Data
• Accessing shared data needs careful control if the data is ever altered by a process.
• Problem: conflict.
  - Reading the variable by different processes does not cause a conflict.
  - But writing new values can conflict.
• Example: consider two processes, each of which is to add 1 to a shared data item x:

  Instruction    Process 1        Process 2
  x = x + 1      read x           read x
                 compute x + 1    compute x + 1
                 write to x       write to x
  (time runs downward)
Conflict in accessing shared data
[Figure: both processes read the shared variable x, each adds 1, and each writes its result back — one of the two updates is lost.]
• The problem of accessing shared data can be generalized by considering shared resources.
• Mechanism: a way of ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time. This mechanism is mutual exclusion.
Locks
• The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
• A lock is a variable containing the value 0 or 1:
  - lock = 1: a process has entered the critical section;
  - lock = 0: no process is in the critical section.
• The lock operates much like a door lock: suppose a process reaches a lock that is set, indicating that the process is excluded from the critical section; it then has to wait until it is allowed to enter.
    while (lock == 1) ;   /* do nothing: no operation in while loop */
    lock = 1;             /* enter critical section */
    ... critical section ...
    lock = 0;             /* leave critical section */

Such a lock is called a spin lock, and the mechanism is busy waiting.
• In some cases it may be possible to deschedule the process from the processor and schedule another process instead, but there is overhead in saving and restoring process information, and it is necessary to choose the best or highest-priority process to enter the critical section.
[Figure: two processes contending for the lock — Process 1 passes "while (lock == 1) do_nothing;", sets lock = 1, and executes its critical section while Process 2 spins; when Process 1 sets lock = 0, Process 2 sets lock = 1 and enters its own critical section.]
Pthread Lock Routines
• Locks are implemented in Pthreads with what are called mutual-exclusion lock variables (mutexes).
• Mutexes resolve synchronization problems among threads: a mutex shares resources among threads in turn and provides mutual exclusion between them.
• Note: a mutex can only be used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.
#include <pthread.h>   /* header containing the mutex functions */
• Declaration: pthread_mutex_t mutex;
• Static initialization:
    mutex = PTHREAD_MUTEX_INITIALIZER;
    mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
• Initialization by function:
    int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
    int pthread_mutex_lock(pthread_mutex_t *mutex);
    int pthread_mutex_unlock(pthread_mutex_t *mutex);
    int pthread_mutex_trylock(pthread_mutex_t *mutex);
    int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency Analysis
One of the key issues in all parallel programming is to identify which processes could be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex. 1:  forall (i = 0; i < 5; i++)
          a[i] = 0;
All instances can be executed simultaneously.

Ex. 2:  forall (i = 2; i < 6; i++) {
          x = i - 2*i + i*i;
          a[i] = a[x];
        }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's Conditions
• I(i) is the set of memory locations read by process Pi.
• O(j) is the set of memory locations altered by process Pj.
• If the following three conditions are all satisfied, the two processes can be executed concurrently:
    I1 ∩ O2 = ∅
    I2 ∩ O1 = ∅
    O1 ∩ O2 = ∅
Dependency Analysis
Example 1: suppose the two statements are (in C):
    a = x + y;
    b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, so
    I1 ∩ O2 = ∅,  I2 ∩ O1 = ∅,  O1 ∩ O2 = ∅
→ the two statements can be executed simultaneously.
Dependency Analysis
Example 2: suppose the two statements are (in C):
    a = x + y;
    b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, so
    I2 ∩ O1 = {a} ≠ ∅
→ the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared as
    shared int x;
With C/C++, a global declaration such as "int x;" yields a variable shared by all threads.
par Construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
    par {
        S1; S2; ...; Sn;
    }
The keyword par indicates that the statements in the body are to be executed concurrently. Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
    par {
        proc1(); proc2(); ...; procn();
    }
forall Construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
    forall (i = 0; i < n; i++) {
        S1; S2; ...; Sm;
    }
which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, …, Sm; each process uses a different value of i. For example,
    forall (i = 0; i < 5; i++)
        a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Cache coherence protocols:
• In the update policy, copies of data in all caches are updated at the time one copy is altered.
• In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next makes reference to them.
False sharing:
• The key characteristic used is that caches are organized in blocks of contiguous locations.
• False sharing arises when different processors require different parts of the same block, but not the same bytes.
Solution for false sharing:
• The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.
• The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.
Different Types of Memory Architecture
Central memory versus distributed memory:
• A parallel computer has either a central-memory or a distributed-memory architecture.
• Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.
• Central-memory systems are also known as UMA (uniform memory access) systems.
• In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.
• UMA systems come in two types:
  - PVP: the parallel vector processor, also called a vector supercomputer;
  - SMP: the symmetric multiprocessor.
Distributed-Memory Architecture
• A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory. Memories in other nodes are called remote memories.
• Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.
In a NORMA machine:
• the node memories have separate address spaces;
• a node cannot directly access remote memory;
• the only way to access remote data is by passing messages.
In an NCC-NUMA machine (a typical example is the Cray T3E):
• besides the local memory, each node has a set of node-level registers called E-registers;
• other NCC-NUMA systems may allow loading a remote value directly into a processor register.
In a COMA machine:
• all local memories are structured as caches (called COMA caches); such a cache has much larger capacity than the level-2 cache or the remote cache of a node;
• COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA and COMA
• An NCC-NUMA system does not have hardware support for cache coherency, but CC-NUMA and COMA provide cache-coherency support in hardware.
• It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
• In CC-NUMA, main memory consists of all the local memories.
• In COMA, main memory consists of all the COMA caches.
• All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of the Five Distributed-Memory Architectures
[Table: comparison of NORMA, NCC-NUMA, CC-NUMA, COMA, and UMA characteristics.]
bull All variables in the original program are duplicated in each process becoming local variables for the process They are assigned the same values as the original variables initially
bull The parent will wait for the slave to finish if it reaches the ldquojoinrdquo point first if the slave reaches the ldquojoinrdquo point first it will terminate
Constructs for specifying Parallelism
Threads
Thread a thread of execution is a fork of a
computer program into two or more concurrently running tasks
Thread mechanism Allow to share the same memory space amp
global variables
Constructs for specifying Parallelism
Context Switching
Interaction
Address Space
State Infomation
Dependence bull processes are typically independent while threads exist as subsets of a process
bull processes carry considerable state information where multiple threads within a process share state as well as memory and other resources
bull processes have separate address spaces where threads share their address space
bullprocesses interact only through system-provided inter-process communication mechanisms while threads interact through shared variable or by message passing
bullContext switching between threads in the same process is typically faster than context switching between processes
Processes amp Threads
Interupt Routines
File
IP
Code Heap
Stack
IPStackInterupt Routines
File
IP
Code Heap
Stack Thread
Process
Constructs for specifying Parallelism
wwwthemegallerycom Company Logo
Multithreaded Processor Model
Analyze performance of system Latency L communication latency
experienced with remote memory access network delay cache-miss penalty delays caused by
contentions in split transactions
Number of threads N Number of thread that can be interleaved in a processor
Context of a thread =PCregister set required context status word hellip
Context switch overhead C time lost in performing context switch in a processor
Switch mechanism number of processor states needed to maintain active threads
Interval between context switches run length (cycles between context switch triggered by remote reference)
Multithreaded Computation
Initial Scheduling overhead Thread Synchronization overhead
Thread of Parallel Computation
Variable
Computation
The concept of multithreading in MPP system
Processor efficiency Busy do useful work Context switch suspend current
context amp switch to another Idle when all availble context
suspended (blocked)
Efficient = Busy (busy + switching + idle)
Abtract Processor Model
wwwthemegallerycom Company Logo
Multiple-context processor model with one thread per context
PC
PSW
PC
PSW
PC
PSW
ALU Local memory reference
Remote memory reference
Register Files
N Contexts
1 Thread context
Context-switching policies
wwwthemegallerycom Company Logo
Switch on cache miss when encoutering a cache miss
Switch on every load switching on every load operation independent of whether it will cause a miss or not
Switch on every instruction switching on every instruction insependent of whether or not it is a load
Switch on block of instruction will improve the cache-hit ratio due to preservation of some locality amp also benefit singe-context performance
Pthread Thread
History SUN Solaris Window NT hellip are examples the multithreaded operating systems allow users to employ threads in their programsbut each system is different
StandardIEEE POSIX 10031c
standard (1995)
Constructs for specifying Parallelism
Executing a Pthread Thread
int pthread_create(pthread_t thread pthread_attr_t attr
void (start_routine)(void ) void arg) Return value of function1048698success create a new thread 0 arg contain thread ID1048698fail ltgt 0 (error signature is contain in the variable of errno) Argumentsthread a pointer of type pthread_t contain ID of new thread1048698attr contain initial attr for thread if attr = NULL Attr are init by default value1048698start_routine la reference to a function defined by user This function contain
execute code for new thread1048698arg a single argument is passed for start_routine
Pthread_t thread Hanndle of specia Pthread datatype
Executing a Pthread Thread(cont)
pthread_exit(void status) Terminate amp destroy a thread
pthread_cancel() Thread is destroyed by another process
int pthread_join(pthread_t th void thread_return) pthread_join() force call thread suspend its execution amp wait until thread having
thread ID terminates Thread_return contain the return value (value of return statement or pthread_exit(hellip) statement)
Detached ThreadThere are cases in which threads can
be terminated without needed of pthread_join
Detached Thread
When Detached Thread teminate they are destroyed amp their resource released
=gt More efficient
Main program
Pthread_create()
Termination
Thread
Pthread_create()
Pthread_create() Termination
Termination
Thread
Thread
Constructs for specifying Parallelism
Thread Pools
A master thread could control a collection of slave threads A work pool of threads could be formed Threads can communicate through shared location or as we shall see = using signals
Constructs for specifying Parallelism
Thread ndash safe routines
System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results
Thread ndash safeness an applications ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions
Constructs for specifying Parallelism
Thread ndash safe routines (cont)
Suppose that your application creates several threads each of which makes a call to the same library routine
This library routine accessesmodifies a global structure or location in memory
As each thread calls this routine it is possible that they may try to modify this global structurememory location at the same time
If the routine does not employ some sort of synchronization constructs to prevent data corruption then it is not thread-safe
fe
Constructs for specifying Parallelism
Sharing DataCreating Shared Data1
Accessing Shared Data2
Locks
Deadlock
Semaphores
Deadlock
Condition Variables
Language Constructs for Parallelism3
Dependency Analysis4
Shared Data in system with caches5
Sharing Data
Text in here
Text in here
Conditions for Deadlock
1 Mutual exclusion bull Resource can not be shared bull Requests are delayed until resource is released
2 Hold-and-wait bull Thread holds one resource while waits for another
3 No preemption bull Resources are released voluntarily after completion
4 Circular wait bull Circular dependencies exist in ldquowaits-forrdquo or ldquoresource-allocationrdquo graphs
ALL four conditions MUST hold
wwwthemegallerycom Company Logo
Handing Deadlock
Text
Text
Text
Txt
Deadlock prevention Deadlock avoidance
Deadlock detection and recovery
Ignore
wwwthemegallerycom Company Logo
Deadlock
88b n-processes deadlock 88a 2 processes dealock
R1R2hellipRn lagrave tagravei nguyecircn P1P2hellipPn lagrave quaacute trigravenh
Example
LOGO
Semaphore
wwwthemegallerycom Company Logo
sEMaPHoRE
A positive integer operated upon by 2 operations P amp V
The value is the number of the units of the resource which are free
A binary semaphore has value 0 or 1A general semaphore can take on
positive values other than 0 and 1
wwwthemegallerycom Company Logo
sEMaPHoRE
P amp V operations are performed indivisibly
P(s) waits until s is greater than 0 then decrements s by 1 allows the process to continue
V(s) increments s by 1 to release one of the waiting processes (if any)
wwwthemegallerycom Company Logo
sEMaPHoRE
The first process reach its P(s) operation or to be accepted will set the semaphore to 0
Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore
wwwthemegallerycom Company Logo
sEMaPHoRE
When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section
Monitor
Disadvantages of semaphores:
• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (e.g. deadlock), since these errors occur only for particular execution sequences, and those sequences do not always take place.
• Incorrect use may be caused by an honest programming error or by an uncooperative programmer.
Example of incorrect semaphore use:
Right code: … Wait(mutex); critical section; Signal(mutex); …
Wrong code: … Signal(mutex); critical section; Wait(mutex); …
This incorrect code violates mutual exclusion.
Example of incorrect semaphore use:
Right code: … Wait(mutex); critical section; Signal(mutex); …
Wrong code: … Wait(mutex); critical section; Wait(mutex); …
This incorrect code causes deadlock.
Faculty of Computer Science & Engineering, Ho Chi Minh City University of Technology
Monitor
• If the programmer omits the wait() or the signal() around a critical section, or both, then either mutual exclusion is violated or a deadlock will occur.
• A deadlock can also arise when both processes below are active simultaneously and acquire the semaphores in opposite orders:
Process P1: … Wait(S); Wait(Q); critical section; Signal(S); Signal(Q);
Process P2: … Wait(Q); Wait(S); critical section; Signal(Q); Signal(S);
Monitor
• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
• A monitor is an approach to synchronizing operations on a shared resource. A monitor includes:
• a set of procedures that provide the only way to access the shared resource;
• mutual exclusion among those procedures;
• the variables associated with the shared resource;
• invariants that are assumed in order to avoid conflicting accesses.
Monitor
• A type, or abstract data type, encapsulates private data with public methods that operate on that data.
• A monitor type presents a set of programmer-defined operations that are provided with mutual exclusion within the monitor.
• The monitor type also contains the declarations of variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.
Monitor
The structure of a monitor type:

    monitor monitor_name
    {
        /* shared variable declarations */
        procedure P1 (...) { ... }
        procedure P2 (...) { ... }
        ...
        procedure Pn (...) { ... }
        initialization_code (...) { ... }
    }
Structure of a monitor (figure omitted)
Using a Monitor
• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
• The monitor as defined so far is not sufficiently powerful to model some synchronization schemes; some additional "tailor-made" synchronization mechanisms need to be defined.
• These additional "tailor-made" synchronization mechanisms are provided by the condition construct.
Condition Type
• Declaration: condition x, y;
• The only operations that can be invoked on a condition variable are wait() and signal():
• x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
• x.signal(): the process invoking this operation resumes exactly one suspended process.
• If no process is suspended, x.signal() has no effect.
Structure of a monitor with condition variables (figure omitted)
Condition Variables
• Condition variables allow threads to synchronize based upon the actual value of data.
• Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be a very time-consuming and unproductive exercise, since the thread would be continuously busy with this activity.
• A condition variable is always used in conjunction with a mutex lock.
Pthread Condition Variables
pthread_cond_t cond;   declares a condition variable.
int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
  Initialises the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.
int pthread_cond_signal(pthread_cond_t *cond);
  Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
Pthread Condition Variables (cont.)
int pthread_cond_broadcast(pthread_cond_t *cond);
  Unblocks all threads currently blocked on the specified condition variable cond.
int pthread_cond_destroy(pthread_cond_t *cond);
  Destroys the condition variable specified by cond; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
  Both block on a condition variable; the second one allows a timeout (abstime) to be specified.
Sequence for Using a Condition Variable: Example
This simple example code demonstrates the use of several Pthread condition variable routines. The main routine creates three threads. Two of the threads perform work and update a count variable; the third thread waits until the count variable reaches a specified value.
Sequence for Using a Condition Variable: Example (cont.)

    #include <pthread.h>
    #include <unistd.h>

    #define TCOUNT       10   /* iterations per worker (value assumed) */
    #define COUNT_LIMIT  12   /* threshold the watcher waits for (value assumed) */

    int count = 0;                      /* global variable: DECLARE */
    pthread_mutex_t count_mutex;
    pthread_cond_t count_threshold_cv;

    /* Threads 2 and 3: do work and update count */
    void *inc_count(void *t)
    {
        for (int i = 0; i < TCOUNT; i++) {
            pthread_mutex_lock(&count_mutex);
            count++;
            if (count == COUNT_LIMIT)
                pthread_cond_signal(&count_threshold_cv);
            pthread_mutex_unlock(&count_mutex);
            sleep(1);  /* do some work so threads can alternate on the mutex */
        }
        pthread_exit(NULL);
    }

    /* Thread 1: wait until count reaches COUNT_LIMIT */
    void *watch_count(void *t)
    {
        pthread_mutex_lock(&count_mutex);
        while (count < COUNT_LIMIT)   /* while, not if: guards against spurious wakeups */
            pthread_cond_wait(&count_threshold_cv, &count_mutex);
        count += 125;
        pthread_mutex_unlock(&count_mutex);
        pthread_exit(NULL);
    }

    int main(void)
    {
        pthread_t threads[3];
        pthread_mutex_init(&count_mutex, NULL);         /* INITIALIZE */
        pthread_cond_init(&count_threshold_cv, NULL);
        pthread_create(&threads[0], NULL, watch_count, NULL);  /* CREATE THREADS */
        pthread_create(&threads[1], NULL, inc_count, NULL);
        pthread_create(&threads[2], NULL, inc_count, NULL);
        for (int i = 0; i < 3; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }
The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass data in messages, as in a message-passing environment.
Creating Shared Data
• Each process has its own virtual address space within the virtual memory management system.
• Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
shmget()
• Creates a shared memory segment.
• The return value is the shared memory ID.
shmat()
• Attaches the shared segment to the data segment of the calling process.
• Returns the starting address of the data segment.
Accessing Shared Data
• Accessing shared data needs careful control if the data is ever altered by a process.
• Problem: conflict.
  o Reading the variable from different processes does not cause a conflict.
  o But writing new values does: conflict.
Example: consider two processes, each of which is to add 1 to a shared data item x (time runs downwards):

    Instruction    Process 1        Process 2
    x = x + 1      read x           read x
                   compute x + 1    compute x + 1
                   write to x       write to x
Conflict in accessing shared data
[Figure: shared variable x; both processes read x, add 1, and write back, so one increment is lost.]
The problem of accessing shared data can be generalized by considering shared resources.
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time. This mechanism is called mutual exclusion.
Lock
• The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
• A lock is a variable containing the value 0 or 1:
  lock = 1: a process has entered the critical section;
  lock = 0: no process is in the critical section.
• The lock operates much like a door lock: suppose a process reaches a lock that is set, indicating that it is excluded from the critical section; it then has to wait until it is allowed to enter.

    while (lock == 1);   /* no operation in while loop (busy wait) */
    lock = 1;            /* enter critical section */
    /* ... critical section ... */
    lock = 0;            /* leave critical section */

A lock that spins in this way is called a spin lock. (Note that the read of lock and the following write are not atomic here; a real spin lock must use an atomic test-and-set instruction.)
Mechanism: Busy Waiting
• In some cases it may be possible to deschedule the waiting process from the processor and schedule another process in its place.
• This incurs overhead in saving and restoring process information.
• It is also necessary to choose the best or highest-priority process to enter the critical section.

    Process 1                       Process 2
    while (lock == 1);
    lock = 1;                       while (lock == 1);   /* spins */
      [critical section]
    lock = 0;                       lock = 1;
                                      [critical section]
                                    lock = 0;
Pthread Lock Routines
• Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).
• They resolve synchronization problems between threads.
• A mutex is used to share resources among threads in turn, providing mutual exclusion between the threads.
Note: a mutex is only used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.
#include <pthread.h>   /* header containing the mutex functions */
Declare the variable:
    pthread_mutex_t mutex;
Static initialization:
    mutex = PTHREAD_MUTEX_INITIALIZER;
    mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
    int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
    int pthread_mutex_lock(pthread_mutex_t *mutex);
    int pthread_mutex_unlock(pthread_mutex_t *mutex);
    int pthread_mutex_trylock(pthread_mutex_t *mutex);
    int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency Analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex. 1:
    forall (i = 0; i < 5; i++)
        a[i] = 0;
All instances can be executed simultaneously.
Ex. 2:
    forall (i = 2; i < 6; i++) {
        x = i - 2*i + i*i;
        a[i] = a[x];
    }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's Conditions
• I(n) is the set of memory locations read by process P(n).
• O(m) is the set of memory locations altered by process P(m).
For two processes P1 and P2, if the three conditions

    I1 ∩ O2 = ∅
    I2 ∩ O1 = ∅
    O1 ∩ O2 = ∅

are all satisfied, the two processes can be executed concurrently.
Dependency Analysis
Example 1: suppose the two statements are (in C):
    a = x + y;
    b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, and
    I1 ∩ O2 = ∅,  I2 ∩ O1 = ∅,  O1 ∩ O2 = ∅
so the two statements can be executed simultaneously.
Dependency Analysis
Example 2: suppose the two statements are (in C):
    a = x + y;
    b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, and
    I2 ∩ O1 = {a} ≠ ∅
so the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared as
    shared int x;
whereas in ordinary C/C++ a global declaration such as int x; plays this role.
par Construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
    par {
        S1;
        S2;
        ...
        Sn;
    }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
    par {
        proc1();
        proc2();
        ...
        procn();
    }
forall Construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
    forall (i = 0; i < n; i++) {
        S1;
        S2;
        ...
        Sm;
    }
which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, …, Sm. Each process uses a different value of i. For example,
    forall (i = 0; i < 5; i++)
        a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Cache coherence protocols:
• In the update policy, copies of data in all caches are updated at the time one copy is altered.
• In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next makes a reference to the data.
False sharing:
• The key characteristic used is that caches are organized in blocks of contiguous locations.
• False sharing occurs when different processors need different parts of the same block, but not the same bytes.
Solution for false sharing:
• The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.
• The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.
Different Types of Memory Architecture
Central memory versus distributed memory:
• A parallel computer has either a central-memory or a distributed-memory architecture.
• Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.
• Central-memory systems are also known as UMA (uniform memory access) systems.
Central Memory versus Distributed Memory
• In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.
• UMA systems come in two types:
  PVP: the parallel vector processor, also called a vector supercomputer;
  SMP: the symmetric multiprocessor.
Distributed-Memory Architecture
• A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory.
• Memories in other nodes are called remote memories.
• Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.
Distributed-Memory Architecture
In a NORMA machine:
• the node memories have separate address spaces;
• a node cannot directly access remote memory;
• the only way to access remote data is by passing messages.
Distributed-Memory Architecture
In an NCC-NUMA machine:
• a typical example is the Cray T3E;
• besides the local memory, each node has a set of node-level registers called E-registers;
• other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-Memory Architecture
In a COMA machine:
• all local memories are structured as caches (called COMA caches);
• a COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node;
• COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA and COMA
• An NCC-NUMA system does not have hardware support for cache coherency, whereas CC-NUMA and COMA provide cache coherency support in hardware.
• It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
• In CC-NUMA, main memory consists of all the local memories.
• In COMA, main memory consists of all the COMA caches.
• All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of the five distributed-memory architectures (table omitted)
UNIX Heavyweight Processes
• The UNIX system call fork() creates a new process. The new process (the child process) is an exact copy of the calling process, except that it has a unique process ID.
• On success, fork() returns 0 to the child process and returns the process ID of the child process to the parent process.
• Processes are "joined" with the system calls wait() and exit(), defined as:
  wait(statusp): delays the caller until a signal is received or one of its child processes terminates or stops;
  exit(status): terminates a process.
UNIX Heavyweight Processes
Hence a single child process can be created by:

    pid = fork();                          /* fork */
    /* ... code to be executed by both child and parent ... */
    if (pid == 0) exit(0); else wait(0);   /* join */
UNIX Heavyweight Processes
If the child is to execute different code, we could use:

    pid = fork();
    if (pid == 0) {
        /* ... code to be executed by the slave (child) ... */
    } else {
        /* ... code to be executed by the parent ... */
    }
    if (pid == 0) exit(0); else wait(0);
UNIX Heavyweight Processes
• All variables in the original program are duplicated in each process, becoming local variables for that process. They are initially assigned the same values as the original variables.
• The parent will wait for the slave to finish if it reaches the "join" point first; if the slave reaches the "join" point first, it will terminate there.
Threads
• A thread of execution is a fork of a computer program into two or more concurrently running tasks.
• The thread mechanism allows these tasks to share the same memory space and global variables.
Processes versus threads:
• Dependence: processes are typically independent, while threads exist as subsets of a process.
• State information: processes carry considerable state information, whereas multiple threads within a process share state as well as memory and other resources.
• Address space: processes have separate address spaces, whereas threads share their address space.
• Interaction: processes interact only through system-provided inter-process communication mechanisms, while threads interact through shared variables or by message passing.
• Context switching: switching between threads in the same process is typically faster than switching between processes.
Processes & Threads
[Figure: a single-threaded process (code, heap, stack, files, interrupt routines, instruction pointer) versus a multithreaded process, in which each thread has its own stack, instruction pointer, and interrupt routines while sharing the code, heap, and files.]
Multithreaded Processor Model
To analyze the performance of such a system:
• Latency (L): the communication latency experienced on a remote memory access, including network delay, cache-miss penalty, and delays caused by contention in split transactions.
• Number of threads (N): the number of threads that can be interleaved in a processor. The context of a thread is the PC, the register set, and the required context status words.
• Context-switch overhead (C): the time lost in performing a context switch in a processor. The switch mechanism determines the number of processor states needed to maintain active threads.
• Interval between context switches: the run length, i.e. the number of cycles between context switches triggered by remote references.
Multithreaded Computation
[Figure: threads of a parallel computation over time, showing the initial scheduling overhead and the thread synchronization overhead.]
The concept of multithreading in an MPP system
Processor efficiency: at any moment a processor is
• busy: doing useful work;
• switching: suspending the current context and switching to another;
• idle: when all available contexts are suspended (blocked).

    Efficiency = busy / (busy + switching + idle)
Abstract Processor Model
[Figure: a multiple-context processor model with one thread per context; N contexts each hold a PC and PSW and share the register files and ALU, issuing local and remote memory references.]
Context-Switching Policies
• Switch on cache miss: switch only when encountering a cache miss.
• Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.
• Switch on every instruction: switch on every instruction, independent of whether or not it is a load.
• Switch on block of instructions: improves the cache-hit ratio due to the preservation of some locality, and also benefits single-context performance.
Pthread Threads
• History: SUN Solaris, Windows NT, etc. are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.
• Standard: the IEEE POSIX 1003.1c standard (1995).
Executing a Pthread Thread
    int pthread_create(pthread_t *thread, pthread_attr_t *attr,
                       void *(*start_routine)(void *), void *arg);
• Return value: 0 on success (a new thread has been created); a nonzero error code on failure.
• Arguments:
  thread: a pointer of type pthread_t *; on return it contains the ID of the new thread;
  attr: the initial attributes for the thread; if attr is NULL, the attributes are initialized to their default values;
  start_routine: a reference to a function defined by the user, containing the code to be executed by the new thread;
  arg: a single argument that is passed to start_routine.
pthread_t thread;   /* handle of the special Pthread data type */
Executing a Pthread Thread (cont.)
• pthread_exit(void *status): terminates and destroys the calling thread.
• pthread_cancel(): requests that a thread be terminated by another thread.
• int pthread_join(pthread_t th, void **thread_return): forces the calling thread to suspend its execution and wait until the thread with ID th terminates; thread_return receives the return value (the value of the return statement or of the pthread_exit() call).
• Detached threads: there are cases in which threads can be terminated without the need for pthread_join(). When detached threads terminate, they are destroyed and their resources released immediately, which is more efficient.
[Figure: the main program calls pthread_create() to start each thread; each thread then runs to its own termination, independently of the others.]
Thread Pools
A master thread could control a collection of slave threads: a work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.
Thread-Safe Routines
• System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.
• Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.
Thread-Safe Routines (cont.)
• Suppose that your application creates several threads, each of which makes a call to the same library routine.
• This library routine accesses or modifies a global structure or location in memory.
• As each thread calls this routine, it is possible that they may try to modify this global structure or memory location at the same time.
• If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.
fe
Constructs for specifying Parallelism
Sharing DataCreating Shared Data1
Accessing Shared Data2
Locks
Deadlock
Semaphores
Deadlock
Condition Variables
Language Constructs for Parallelism3
Dependency Analysis4
Shared Data in system with caches5
Sharing Data
Text in here
Text in here
Conditions for Deadlock
1 Mutual exclusion bull Resource can not be shared bull Requests are delayed until resource is released
2 Hold-and-wait bull Thread holds one resource while waits for another
3 No preemption bull Resources are released voluntarily after completion
4 Circular wait bull Circular dependencies exist in ldquowaits-forrdquo or ldquoresource-allocationrdquo graphs
ALL four conditions MUST hold
wwwthemegallerycom Company Logo
Handing Deadlock
Text
Text
Text
Txt
Deadlock prevention Deadlock avoidance
Deadlock detection and recovery
Ignore
wwwthemegallerycom Company Logo
Deadlock
88b n-processes deadlock 88a 2 processes dealock
R1R2hellipRn lagrave tagravei nguyecircn P1P2hellipPn lagrave quaacute trigravenh
Example
LOGO
Semaphore
wwwthemegallerycom Company Logo
sEMaPHoRE
A positive integer operated upon by 2 operations P amp V
The value is the number of the units of the resource which are free
A binary semaphore has value 0 or 1A general semaphore can take on
positive values other than 0 and 1
wwwthemegallerycom Company Logo
sEMaPHoRE
P amp V operations are performed indivisibly
P(s) waits until s is greater than 0 then decrements s by 1 allows the process to continue
V(s) increments s by 1 to release one of the waiting processes (if any)
wwwthemegallerycom Company Logo
sEMaPHoRE
The first process reach its P(s) operation or to be accepted will set the semaphore to 0
Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore
wwwthemegallerycom Company Logo
sEMaPHoRE
When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section
Monitor
Disadvange of Semaphorebull Although semaphore provide a
convenient and effective mechanism for process synchronization using them incorrectly can result in timing error that are difficult to detect (deadlock) since these errors happen only if some particular execution sequences take place and these sequences do not always occur
bull Using incorrectly may be caused by an honest programming error or an uncooperative programmer
Shared memory multiprocessor
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur
bull Both processes are simultaneously active will cause a deadlock
Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type
bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only
method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events
conflict
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A type or abstract data type encapsulates private data with public method to operate on that data
bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor
bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()
procedure P2()
procedure Pn()
initialization_code ()
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Usage Monitor
bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly
bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization
bull Some addiontional synchronization ldquotailor moderdquo use conditional construct
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor conditional type
Condition variables
Condition variablesAllow threads to synchronize based
upon the actual value of data Without condition variables the
programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity
Always used in conjunction with a mutex lock
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified
condition variable cond (if any threads are blocked on cond)
int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)
initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised
Pthread_cond_t cond Declare condition variable
Sharing Data
Pthread Condition Variables(cont)
int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable
cond
int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in
effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined
int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)
int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)
block on a condition variable the second one alow to appoint timeout
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
main:

#include <pthread.h>
int count = 0;                                   /* global variable */
pthread_mutex_t count_mutex;                     /* DECLARE */
pthread_cond_t  count_threshold_cv;
...
pthread_mutex_init(&count_mutex, NULL);          /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
...
pthread_create(...);                             /* CREATE THREADS TO DO WORK */

Threads 2, 3:

void *inc_count(void *t) {
    ...
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal( ... );
        ...
        pthread_mutex_unlock(&count_mutex);
        sleep(1);   /* do some work so threads can alternate on the mutex lock */
    }
    pthread_exit(NULL);
}

Thread 1:

void *watch_count(void *t) {
    pthread_mutex_lock(&count_mutex);
    if (count < COUNT_LIMIT)
        pthread_cond_wait( ... );
    count += 125;
    pthread_mutex_unlock( ... );
    pthread_exit(NULL);
}
Sharing Data
The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages, as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual memory management system.
Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
shmget(): creates a shared memory segment; the return value is the shared memory ID.
shmat(): attaches the shared segment to the data segment of the calling process; returns the starting address of the segment.
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
Problem: CONFLICT.
o Reading the variable by different processes does not cause a conflict.
o But writing a new value can conflict.
Example: consider two processes, each of which is to add 1 to a shared data item x (time runs downwards):

Instruction     Process 1        Process 2
x = x + 1;      read x           read x
                compute x + 1    compute x + 1
                write to x       write to x
Conflict in accessing shared data
Diagram: both processes read the shared variable x, each adds 1, and both write the result back, so one of the two increments is lost.
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.
This mechanism is called mutual exclusion.
Locks
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
A lock is a variable containing the value 0 or 1:
lock = 1: a process has entered the critical section
lock = 0: no process is in the critical section
The lock operates much like a door lock: suppose a process reaches a lock that is set, indicating that the process is excluded from the critical section; it now has to wait until it is allowed to enter.

while (lock == 1) ;        /* do nothing: no operation in while loop */
lock = 1;                  /* enter critical section */
/* ... critical section ... */
lock = 0;                  /* leave critical section */

A lock of this kind is called a spin lock, and the mechanism is called busy waiting. In some cases it may be possible instead to deschedule the waiting process from the processor and schedule another process, but this incurs overhead in saving and restoring process information, and it then becomes necessary to choose the best or highest-priority process to enter the critical section.
Process 1                          Process 2
while (lock == 1) do_nothing;
lock = 1;
/* critical section */             while (lock == 1) do_nothing;
lock = 0;
                                   lock = 1;
                                   /* critical section */
                                   lock = 0;
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).
Mutexes resolve synchronization problems among threads.
A mutex is used to share resources among threads in an orderly way.
It provides mutual exclusion between threads.
Note
A mutex can only be used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.
wwwthemegallerycom
#include <pthread.h>   /* header containing the mutex functions */
Declare the variable: pthread_mutex_t mutex;
Static initialization:
mutex = PTHREAD_MUTEX_INITIALIZER;
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);

Important functions:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex. 1: forall (i = 0; i < 5; i++) a[i] = 0;
All instances can be executed simultaneously.
Ex. 2: forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's conditions
- I(n) is the set of memory locations read by process P(n)
- O(m) is the set of memory locations altered by process P(m)
If the three conditions
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
are all satisfied, the two processes can be executed concurrently.
Dependency analysis
Example 1: suppose the two statements are (in C):
a = x + y;
b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, so
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅
and the two statements can be executed simultaneously.

Example 2: suppose the two statements are (in C):
a = x + y;
b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, so
I2 ∩ O1 = {a} ≠ ∅
and the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared as
shared int x;
(with C++, this corresponds to a global int x)

par construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
par { s1; s2; ...; sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
par { proc1(); proc2(); ...; procn(); }

forall construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
forall (i = 0; i < n; i++) { s1; s2; ...; sm; }
which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, ..., sm. Each process uses a different value of i. For example,
forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Shared Data in Systems with Caches
Cache coherence protocols: in the update policy, copies of data in all caches are updated at the time one copy is altered. In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next references the data.
Share Data in Systems with Caches
False sharing: the key characteristic involved is that caches are organized in blocks of contiguous locations (cache lines). False sharing occurs when different processors require different parts of the same block, but not the same bytes.
Share Data in Systems with Caches
Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks. The only way to avoid false sharing entirely would be to place each element in a different block, which would waste significant storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote memory access) architectures.
Central memory systems are also known as UMA (uniform memory access) systems
Central memory versus distributed memory
In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types: PVP, the parallel vector processor (also called a vector supercomputer), and SMP, the symmetric multiprocessor.
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
Distributed-Memory architecture
In a NORMA machine:
- The node memories have separate address spaces.
- A node can't directly access remote memory.
- The only way to access remote data is by passing messages.
Distributed-Memory architecture
In an NCC-NUMA machine (a typical example is the Cray T3E), each node has, besides its local memory, a set of node-level registers called E-registers.
Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-Memory architecture
In a COMA machine, all local memories are structured as caches (called COMA caches). A COMA cache has much larger capacity than the level-2 cache or the remote cache of a node.
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesn't have hardware support for cache coherence, while CC-NUMA and COMA provide cache coherence support in hardware.
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
CC-NUMA: main memory consists of all the local memories.
COMA: main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of the five distributed-memory architectures
UNIX Heavyweight Processes
Hence a single child process can be created by:

pid = fork();                              /* fork */
/* ... code to be executed by both child and parent ... */
if (pid == 0) exit(0); else wait(0);       /* join */
Constructs for specifying Parallelism
UNIX Heavyweight Processes
If the child is to execute different code, we could use:

pid = fork();
if (pid == 0) {
    /* ... code to be executed by child (slave) ... */
} else {
    /* ... code to be executed by parent ... */
}
if (pid == 0) exit(0); else wait(0);
Constructs for specifying Parallelism
UNIX Heavyweight Processes
All variables in the original program are duplicated in each process, becoming local variables for the process. They are initially assigned the same values as the original variables.
The parent will wait for the slave to finish if it reaches the "join" point first; if the slave reaches the "join" point first, it will terminate.
Constructs for specifying Parallelism
Threads
Thread: a thread of execution is a fork of a computer program into two or more concurrently running tasks.
Thread mechanism: allows the tasks to share the same memory space and global variables.
Constructs for specifying Parallelism
Processes vs. threads:
- Dependence: processes are typically independent, while threads exist as subsets of a process.
- State information: processes carry considerable state information, whereas multiple threads within a process share state as well as memory and other resources.
- Address space: processes have separate address spaces, whereas threads share their address space.
- Interaction: processes interact only through system-provided inter-process communication mechanisms, while threads interact through shared variables or by message passing.
- Context switching: switching between threads in the same process is typically faster than switching between processes.
Processes & Threads
Diagram: a process contains code, heap, files, and interrupt routines, together with an instruction pointer (IP) and a stack; each thread within a process shares the code, heap, files, and interrupt routines, but has its own IP and stack.
Constructs for specifying Parallelism
Multithreaded Processor Model
To analyze the performance of such a system we use the following parameters:
- Latency (L): the communication latency experienced on a remote memory access, including network delay, cache-miss penalty, and delays caused by contention in split transactions.
- Number of threads (N): the number of threads that can be interleaved in a processor. The context of a thread is the PC, the required register set, context status words, etc.
- Context switch overhead (C): the time lost in performing a context switch in a processor. The switch mechanism determines the number of processor states needed to maintain active threads.
- Interval between context switches: the run length, in cycles, between context switches triggered by remote references.
Multithreaded Computation
Diagram: a parallel computation is divided into threads operating on shared variables; the overheads are the initial scheduling overhead and the thread synchronization overhead.
The concept of multithreading in an MPP system
Processor efficiency: the processor is busy when doing useful work, context switching when suspending the current context and switching to another, and idle when all available contexts are suspended (blocked). Thus:

efficiency = busy / (busy + switching + idle)
Abstract Processor Model
Multiple-context processor model with one thread per context
Diagram: each of the N contexts holds one thread's PC, PSW, and register file; the ALU services local memory references and issues remote memory references on behalf of the active context.
Context-switching policies
- Switch on cache miss: switch when encountering a cache miss.
- Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.
- Switch on every instruction: switch on every instruction, independent of whether or not it is a load.
- Switch on block of instructions: improves the cache-hit ratio due to preservation of some locality, and also benefits single-context performance.
Pthreads
History: SUN Solaris, Windows NT, etc. are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.
Standard: IEEE POSIX 1003.1c (1995).
Constructs for specifying Parallelism
Executing a Pthread Thread
int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);

Return value: on success a new thread is created and 0 is returned; on failure a non-zero error code is returned.
Arguments:
- thread: a pointer of type pthread_t that receives the ID of the new thread
- attr: initial attributes for the thread; if attr is NULL, the attributes are initialized to default values
- start_routine: a reference to a function defined by the user; this function contains the code executed by the new thread
- arg: a single argument passed to start_routine

pthread_t thread;   /* handle of the special Pthread datatype */
Executing a Pthread Thread(cont)
pthread_exit(void *status): terminates and destroys the calling thread.
pthread_cancel(): requests that a thread be terminated by another thread.
int pthread_join(pthread_t th, void **thread_return);
pthread_join() forces the calling thread to suspend its execution and wait until the thread with the given thread ID terminates. *thread_return receives that thread's return value (the value of its return statement or of its pthread_exit() call).
Detached threads: there are cases in which threads can be terminated without the need for a pthread_join.
When detached threads terminate, they are destroyed and their resources released immediately, which is more efficient.
Diagram: the main program issues pthread_create() calls to start several detached threads; each thread runs to its own termination independently, with no join.
Constructs for specifying Parallelism
Thread Pools
A master thread could control a collection of slave threads; a work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.
Constructs for specifying Parallelism
Thread-safe routines
System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results.
Thread safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.
Constructs for specifying Parallelism
Thread-safe routines (cont.)
Suppose that your application creates several threads, each of which makes a call to the same library routine.
This library routine accesses/modifies a global structure or location in memory.
As each thread calls this routine, it is possible that they may try to modify this global structure or memory location at the same time.
If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.
Constructs for specifying Parallelism
Sharing Data
1. Creating Shared Data
2. Accessing Shared Data: Locks, Deadlock, Semaphores, Monitors, Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches
Sharing Data
Conditions for Deadlock
1. Mutual exclusion: the resource cannot be shared, and requests are delayed until the resource is released.
2. Hold-and-wait: a thread holds one resource while it waits for another.
3. No preemption: resources are released only voluntarily, after completion.
4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graphs.
ALL four conditions MUST hold.
Handling Deadlock
Four approaches: deadlock prevention, deadlock avoidance, deadlock detection and recovery, or ignoring the problem.
Deadlock
Example. Figure (a) shows a deadlock between two processes; figure (b) an n-process deadlock, where R1, R2, ..., Rn are resources and P1, P2, ..., Pn are processes.
Semaphore
A semaphore is a positive integer (including zero) operated upon by two operations, P and V. Its value is the number of units of the resource which are free. A binary semaphore has the value 0 or 1; a general semaphore can take on positive values other than 0 and 1.

P and V operations are performed indivisibly:
- P(s) waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
- V(s) increments s by 1 to release one of the waiting processes (if any).

The first process to reach its P(s) operation, or to be accepted, will set the semaphore to 0. Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore. When a process reaches its V(s) operation, it sets the semaphore s back to 1, and one of the waiting processes is allowed to proceed into the critical section.
Monitor
Disadvantage of semaphores: although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (such as deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.
Incorrect use may be caused by an honest programming error or an uncooperative programmer.
Monitor
Example: wrong semaphore usage

Right code:  ... wait(mutex); critical section; signal(mutex); ...
Wrong code:  ... signal(mutex); critical section; wait(mutex); ...
             (this wrong code violates mutual exclusion)
Wrong code:  ... wait(mutex); critical section; wait(mutex); ...
             (this wrong code causes deadlock)
Faculty of Computer Science & Engineering, Ho Chi Minh City University of Technology
Monitor
If the programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock will occur.
If both processes below are simultaneously active, a deadlock can occur:

Process P1: ... Wait(S); Wait(Q); critical section; Signal(S); Signal(Q); ...
Process P2: ... Wait(Q); Wait(S); critical section; Signal(Q); Signal(S); ...
Monitor
To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
A monitor is an approach to synchronizing operations on a computer when shared resources are used. A monitor includes:
- a suite of procedures that provides the only method to access a shared resource
- a mutual-exclusion lock
- the variables associated with the shared resource
- invariants that must hold, to avoid conflicting events
Monitor
A type, or abstract data type, encapsulates private data with public methods to operate on that data.
The monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.
The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of procedures or functions that operate on those variables.
Monitor
A structure of the monitor type:

monitor monitor_name {
    /* shared variable declarations */
    procedure P1 (...) { ... }
    procedure P2 (...) { ... }
    ...
    procedure Pn (...) { ... }
    initialization_code (...) { ... }
}
Structure Monitor
Using the Monitor
The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; additional "tailor-made" synchronization mechanisms need to be defined.
Such additional "tailor-made" synchronization uses the condition construct.
The condition type
Declare: condition x, y;
The only operations that can be invoked on a condition variable are wait() and signal():
- x.wait(): the process invoking this operation is suspended until another process invokes x.signal()
- x.signal(): the process invoking this operation resumes exactly one suspended process; if no process is suspended, x.signal() has no effect
Structure Monitor conditional type
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified
condition variable cond (if any threads are blocked on cond)
int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)
initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised
Pthread_cond_t cond Declare condition variable
Sharing Data
Pthread Condition Variables(cont)
int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable
cond
int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in
effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined
int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)
int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)
block on a condition variable the second one alow to appoint timeout
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK
main
Thread 23
void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)
void watch_count(void t) pthread_mutex_lock(ampcount_mutex)
if (countltCOUNT_LIMIT)
pthread_cond_wait (hellip)
count += 125
pthread_mutex_unlock(hellip) pthread_exit(NULL)
Thread 1
Sharing Data
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtualmemory management system
Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space
Creating Shared Data (tt)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach shared segment to the data segmentof the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
Shared Data in Systems with Caches

Cache coherence protocols: In the update policy, copies of data in all caches are updated at the time one copy is altered. In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor makes reference to it.
False sharing: The key characteristic here is that caches are organized in blocks of contiguous locations. False sharing occurs when different parts of a block are required by different processors, but not the same bytes.
Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks. The only way to avoid false sharing entirely would be to place each element in a different block, which would waste significant storage for a large array.
Different Types of Memory Architecture

Central memory versus distributed memory: A parallel computer has either a central-memory or a distributed-memory architecture. Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote memory access) architectures. Central-memory systems are also known as UMA (uniform memory access) systems.
In a UMA system all memory locations are an equal distance from any processor, and all memory accesses take roughly the same amount of time. UMA systems come in two types: PVP (the parallel vector processor, also called a vector supercomputer) and SMP (the symmetric multiprocessor).
Distributed-Memory Architecture
A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory. Memories in other nodes are called remote memories. Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.
In a NORMA machine, the node memories have separate address spaces. A node can't directly access remote memory; the only way to access remote data is by passing messages.
In an NCC-NUMA machine (a typical example is the Cray T3E), each node has, besides its local memory, a set of node-level registers called E-registers. Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
In a COMA machine, all local memories are structured as caches (called COMA caches). A COMA cache has much larger capacity than the level-2 cache or the remote cache of a node. COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA and COMA: An NCC-NUMA system doesn't have hardware support for cache coherency, whereas CC-NUMA and COMA provide cache-coherency support in hardware. It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA: In CC-NUMA, main memory consists of all the local memories; in COMA, main memory consists of all the COMA caches. All this complexity makes a COMA system more expensive to implement than a CC-NUMA machine.
[Table: characteristics of the five distributed-memory architectures; not captured in the transcript]
UNIX Heavyweight Processes
• If the child is to execute different code, we could use:

pid = fork();
if (pid == 0) {
    ... /* code to be executed by slave */
} else {
    ... /* code to be executed by parent */
}
if (pid == 0) exit(0); else wait(0);
Constructs for specifying Parallelism
UNIX Heavyweight Processes
• All variables in the original program are duplicated in each process, becoming local variables for the process. Initially they are assigned the same values as the original variables.
• The parent will wait for the slave to finish if it reaches the "join" point first; if the slave reaches the "join" point first, it will terminate.
Constructs for specifying Parallelism
Threads
Thread: a thread of execution is a fork of a computer program into two or more concurrently running tasks.
Thread mechanism: threads share the same memory space & global variables.
Constructs for specifying Parallelism
Differences between processes and threads:
• Dependence: processes are typically independent, while threads exist as subsets of a process.
• State information: processes carry considerable state information, whereas multiple threads within a process share state as well as memory and other resources.
• Address space: processes have separate address spaces, whereas threads share their address space.
• Interaction: processes interact only through system-provided inter-process communication mechanisms, while threads interact through shared variables or by message passing.
• Context switching: context switching between threads in the same process is typically faster than context switching between processes.
Processes & Threads
[Figure: a process (code, heap, stack, files, interrupt routines, instruction pointer) versus a multithreaded process, in which each thread has its own stack and instruction pointer]
Constructs for specifying Parallelism
Multithreaded Processor Model
To analyze the performance of such a system we use:
• Latency (L): the communication latency experienced on a remote memory access, including network delay, cache-miss penalty, and delays caused by contention in split transactions.
• Number of threads (N): the number of threads that can be interleaved in a processor. The context of a thread is the PC, the register set, and the required context status words.
• Context-switch overhead (C): the time lost in performing a context switch in a processor, which depends on the switch mechanism and the number of processor states needed to maintain active threads.
• Interval between context switches: the run length, i.e. the cycles between context switches triggered by remote references.
Multithreaded Computation
[Figure: threads of a parallel computation, showing initial scheduling overhead and thread synchronization overhead]
The concept of multithreading in an MPP system: a processor is in one of three states:
• Busy: doing useful work.
• Context switching: suspending the current context & switching to another.
• Idle: when all available contexts are suspended (blocked).
Efficiency = busy / (busy + switching + idle)
Abstract Processor Model
[Figure: multiple-context processor model with one thread per context; N contexts, each with its own PC, PSW, and register file, share one ALU and issue local and remote memory references]
Context-Switching Policies
• Switch on cache miss: switch when encountering a cache miss.
• Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.
• Switch on every instruction: switch on every instruction, independent of whether or not it is a load.
• Switch on block of instructions: improves the cache-hit ratio due to preservation of some locality & also benefits single-context performance.
Pthreads
History: SUN Solaris, Windows NT, … are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.
Standard: IEEE POSIX 1003.1c standard (1995).
Constructs for specifying Parallelism
Executing a Pthread Thread

int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);

Return value: on success a new thread is created and 0 is returned; on failure a nonzero error code is returned.
Arguments:
• thread: a pointer of type pthread_t *; receives the ID of the new thread.
• attr: initial attributes for the thread; if attr is NULL, default attribute values are used.
• start_routine: a reference to a user-defined function containing the code the new thread executes.
• arg: a single argument passed to start_routine.

pthread_t thread; /* handle of the special Pthread datatype */
Executing a Pthread Thread (cont.)

void pthread_exit(void *status); terminates & destroys the calling thread.
int pthread_cancel(pthread_t thread); the thread is destroyed by another thread.
int pthread_join(pthread_t th, void **thread_return); forces the calling thread to suspend its execution & wait until the thread with the given ID terminates; *thread_return receives the return value (the value of the return statement or of pthread_exit()).
Detached Threads
There are cases in which threads can be terminated without the need for pthread_join. When detached threads terminate, they are destroyed & their resources released immediately => more efficient.
[Figure: the main program creates threads with pthread_create(); detached threads run to termination without being joined]
Constructs for specifying Parallelism
Thread Pools
A master thread could control a collection of slave threads; a work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.
Constructs for specifying Parallelism
Thread-safe routines
System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results.
Thread safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.
Constructs for specifying Parallelism
Thread-safe routines (cont.)
Suppose that your application creates several threads, each of which makes a call to the same library routine. This library routine accesses/modifies a global structure or location in memory. As each thread calls this routine, it is possible that they may try to modify this global structure/memory location at the same time. If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.
Constructs for specifying Parallelism
Sharing Data
1. Creating Shared Data
2. Accessing Shared Data
   - Locks
   - Deadlock
   - Semaphores
   - Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in Systems with Caches
Conditions for Deadlock
1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.
2. Hold-and-wait: a thread holds one resource while it waits for another.
3. No preemption: resources are released only voluntarily, after completion.
4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.
ALL four conditions MUST hold.
Handling Deadlock
• Deadlock prevention
• Deadlock avoidance
• Deadlock detection and recovery
• Ignore
Deadlock
[Figure 8.8: (a) deadlock between two processes; (b) n-process deadlock, where R1, R2, …, Rn are resources and P1, P2, …, Pn are processes]
Semaphore
A semaphore is a positive integer operated upon by two operations, P & V. Its value is the number of units of the resource that are free. A binary semaphore has value 0 or 1; a general semaphore can take on positive values other than 0 and 1.
P & V operations are performed indivisibly:
• P(s) waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
• V(s) increments s by 1 to release one of the waiting processes (if any).
The first process to reach its P(s) operation and be accepted will set the semaphore to 0. Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore. When a process reaches its V(s) operation, it sets the semaphore s to 1 and one of the waiting processes is allowed to proceed into the critical section.
Monitor
Disadvantage of semaphores: although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (e.g. deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur. Incorrect use may be caused by an honest programming error or an uncooperative programmer.
Example of incorrect semaphore use:

Correct code:
  wait(mutex);
  /* critical section */
  signal(mutex);

Wrong code:
  signal(mutex);
  /* critical section */
  wait(mutex);

This incorrect code violates mutual exclusion.
Wrong code:
  wait(mutex);
  /* critical section */
  wait(mutex);

This incorrect code causes deadlock.
Faculty of Computer Science & Engineering, Ho Chi Minh City University of Technology
• If the programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock will occur.
• With two semaphores acquired in opposite orders, both processes being simultaneously active can cause a deadlock:

Process P1:              Process P2:
  wait(S);                 wait(Q);
  wait(Q);                 wait(S);
  /* critical section */   /* critical section */
  signal(S);               signal(Q);
  signal(Q);               signal(S);
Monitor
• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
• A monitor is an approach to synchronizing operations on shared resources. A monitor includes:
  - a set of procedures that provide the only way to access the shared resource;
  - a mutual-exclusion lock over those procedures;
  - the variables associated with the shared resource;
  - invariants that must hold, to avoid conflicting events.
• A type, or abstract data type, encapsulates private data with public methods to operate on that data.
• The monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.
• The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of procedures or functions that operate on those variables.
• Structure of a monitor type:

monitor monitor_name {
    /* shared variable declarations */
    procedure P1 (...) { ... }
    procedure P2 (...) { ... }
    ...
    procedure Pn (...) { ... }
    initialization_code (...) { ... }
}

[Figure: schematic structure of a monitor]
Usage of a Monitor
• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
• A monitor as defined so far is not sufficiently powerful to model some synchronization schemes; additional "tailor-made" synchronization mechanisms need to be defined.
• Such additional "tailor-made" synchronization uses the condition construct.
Condition Type
• Declaration: condition x, y;
• The only operations that can be invoked on a condition variable are wait() and signal():
  - x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
  - x.signal(): the process invoking this operation resumes exactly one suspended process.
• If no process is suspended, x.signal() has no effect.
[Figure: structure of a monitor with condition variables]
Condition Variables
Condition variables allow threads to synchronize based upon the actual value of data. Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met; this can be very time-consuming & unproductive, since the thread would be continuously busy in this activity. A condition variable is always used in conjunction with a mutex lock.
Sharing Data
Pthread Condition Variables

pthread_cond_t cond;  /* declare a condition variable */

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
  Initializes the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition-variable attributes are used; the effect is the same as passing the address of a default attributes object. Upon successful initialization, the state of the condition variable becomes initialized.

int pthread_cond_signal(pthread_cond_t *cond);
  Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
Sharing Data
Pthread Condition Variables (cont.)

int pthread_cond_broadcast(pthread_cond_t *cond);
  Unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
  Destroys the given condition variable; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex,
                           const struct timespec *abstime);
  Block on a condition variable; the second form allows a timeout to be specified.
Sharing Data
Sequence for using a condition variable: example
This simple example code demonstrates the use of several Pthread condition-variable routines. The main routine creates three threads; two of the threads perform work and update a count variable, while the third thread waits until the count variable reaches a specified value.
Sharing Data
Sequence for using a condition variable: example (cont.)

#include <pthread.h>

int count = 0;                          /* global variable */
pthread_mutex_t count_mutex;            /* DECLARE */
pthread_cond_t count_threshold_cv;
...
/* main: INITIALIZE, then CREATE THREADS TO DO WORK */
pthread_mutex_init(&count_mutex, NULL);
pthread_cond_init(&count_threshold_cv, NULL);
...
pthread_create(...);

/* Threads 2 and 3 */
void *inc_count(void *t) {
    ...
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT) pthread_cond_signal(...);
        ...
        pthread_mutex_unlock(&count_mutex);
        sleep(1);    /* do some work so threads can alternate on the mutex */
    }
    pthread_exit(NULL);
}

/* Thread 1 */
void *watch_count(void *t) {
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)         /* loop, not if: guard against spurious wakeups */
        pthread_cond_wait(...);
    count += 125;
    pthread_mutex_unlock(...);
    pthread_exit(NULL);
}
Sharing Data
The key aspect of shared-memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass data in messages, as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual-memory management system. Shared-memory system calls allow processes to attach a segment of physical memory to their virtual address space.

Creating Shared Data (cont.)
shmget(): creates a shared memory segment; the return value is the shared memory ID.
shmat(): attaches the shared segment to the data segment of the calling process; returns the starting address of the segment.
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process. The problem is CONFLICT:
• Reading the variable by different processes does not cause conflict.
• But writing new values can conflict.
Example: consider two processes, each of which is to add 1 to a shared data item x (time runs downward):

Instruction   Process 1        Process 2
x = x + 1     read x           read x
              compute x + 1    compute x + 1
              write to x       write to x

[Figure: conflict in accessing shared variable x; both processes read, add 1, and write, so one update is lost]
The problem of accessing shared data can be generalized by considering shared resources.
Mechanism: Mutual Exclusion
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time. This mechanism is mutual exclusion.

Locks
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock. A lock is a variable containing the value 0 or 1:
• lock = 1: a process has entered the critical section;
• lock = 0: no process is in the critical section.
The lock operates much like a door lock.
Suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section. It now has to wait until it is allowed to enter:

while (lock == 1);   /* do nothing: no operation in while loop */
lock = 1;            /* enter critical section */
/* ... critical section ... */
lock = 0;            /* leave critical section */

Such a lock is called a spin lock, and the mechanism is busy waiting.
In some cases it may be possible to deschedule the process from the processor and schedule another process, but there is overhead in saving and restoring process information, and it is necessary to choose the best or highest-priority process to enter the critical section.

Process 1                     Process 2
while (lock == 1);
lock = 1;                     while (lock == 1);  /* spins */
/* critical section */
lock = 0;                     lock = 1;
                              /* critical section */
                              lock = 0;
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual-exclusion lock variables (MUTEX). A mutex:
• resolves synchronization problems among threads;
• is used to share resources among threads in an orderly way;
• provides mutual exclusion between threads.
Note: a mutex is only used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.
#include <pthread.h>   /* header containing the mutex functions */

/* Declaration: */
pthread_mutex_t mutex;

/* Static initialization: */
mutex = PTHREAD_MUTEX_INITIALIZER;
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

/* Or initialization by function: */
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
The important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency Analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Examples
Ex1: forall (i = 0; i < 5; i++) a[i] = 0;
All instances can be executed simultaneously.
Ex2: forall (i = 2; i < 6; i++) {
       x = i - 2*i + i*i;
       a[i] = a[x];
     }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's Conditions
- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered by process P(m).
If the three conditions are all satisfied, the two processes can be executed concurrently:
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
UNIX Heavyweight Processes
bull All variables in the original program are duplicated in each process becoming local variables for the process They are assigned the same values as the original variables initially
bull The parent will wait for the slave to finish if it reaches the ldquojoinrdquo point first if the slave reaches the ldquojoinrdquo point first it will terminate
Constructs for specifying Parallelism
Threads
Thread a thread of execution is a fork of a
computer program into two or more concurrently running tasks
Thread mechanism Allow to share the same memory space amp
global variables
Constructs for specifying Parallelism
Context Switching
Interaction
Address Space
State Infomation
Dependence bull processes are typically independent while threads exist as subsets of a process
bull processes carry considerable state information where multiple threads within a process share state as well as memory and other resources
bull processes have separate address spaces where threads share their address space
bullprocesses interact only through system-provided inter-process communication mechanisms while threads interact through shared variable or by message passing
bullContext switching between threads in the same process is typically faster than context switching between processes
Processes amp Threads
Interupt Routines
File
IP
Code Heap
Stack
IPStackInterupt Routines
File
IP
Code Heap
Stack Thread
Process
Constructs for specifying Parallelism
wwwthemegallerycom Company Logo
Multithreaded Processor Model
Analyze performance of system Latency L communication latency
experienced with remote memory access network delay cache-miss penalty delays caused by
contentions in split transactions
Number of threads N Number of thread that can be interleaved in a processor
Context of a thread =PCregister set required context status word hellip
Context switch overhead C time lost in performing context switch in a processor
Switch mechanism number of processor states needed to maintain active threads
Interval between context switches run length (cycles between context switch triggered by remote reference)
Multithreaded Computation
Initial Scheduling overhead Thread Synchronization overhead
Thread of Parallel Computation
Variable
Computation
The concept of multithreading in MPP system
Processor efficiency Busy do useful work Context switch suspend current
context amp switch to another Idle when all availble context
suspended (blocked)
Efficient = Busy (busy + switching + idle)
Abtract Processor Model
wwwthemegallerycom Company Logo
Multiple-context processor model with one thread per context
PC
PSW
PC
PSW
PC
PSW
ALU Local memory reference
Remote memory reference
Register Files
N Contexts
1 Thread context
Context-switching policies
wwwthemegallerycom Company Logo
Switch on cache miss when encoutering a cache miss
Switch on every load switching on every load operation independent of whether it will cause a miss or not
Switch on every instruction switching on every instruction insependent of whether or not it is a load
Switch on block of instruction will improve the cache-hit ratio due to preservation of some locality amp also benefit singe-context performance
Pthread Thread
History SUN Solaris Window NT hellip are examples the multithreaded operating systems allow users to employ threads in their programsbut each system is different
StandardIEEE POSIX 10031c
standard (1995)
Constructs for specifying Parallelism
Executing a Pthread Thread
int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);

Return value:
- success: a new thread is created, the function returns 0, and *thread contains the new thread's ID
- failure: a nonzero error code is returned
Arguments:
- thread: a pointer of type pthread_t that receives the ID of the new thread
- attr: initial attributes for the thread; if attr = NULL, attributes are initialized to default values
- start_routine: a reference to a user-defined function containing the code the new thread executes
- arg: a single argument passed to start_routine

pthread_t thread; — handle of the special (opaque) Pthread datatype
Executing a Pthread Thread (cont.)
pthread_exit(void *status); — terminates (and destroys) the calling thread
pthread_cancel(pthread_t thread); — requests that another thread be terminated
int pthread_join(pthread_t th, void **thread_return); — pthread_join() forces the calling thread to suspend its execution and wait until the thread with the given thread ID terminates; *thread_return contains the return value (the value of the return statement or of pthread_exit(…))
Detached Threads
There are cases in which threads can be terminated without the need for pthread_join().
When detached threads terminate, they are destroyed and their resources are released immediately.
=> More efficient
[Figure: the main program issues pthread_create() calls; each created thread runs to its own termination independently of the others.]
Constructs for specifying Parallelism
Thread Pools
A master thread could control a collection of slave threads. A work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.
Constructs for specifying Parallelism
Thread-safe routines
System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.
Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.
Constructs for specifying Parallelism
Thread-safe routines (cont.)
Suppose that your application creates several threads, each of which makes a call to the same library routine:
- This library routine accesses/modifies a global structure or location in memory.
- As each thread calls this routine, it is possible that they may try to modify this global structure/memory location at the same time.
- If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.
Constructs for specifying Parallelism
Sharing Data
1. Creating Shared Data
2. Accessing Shared Data
   - Locks
   - Deadlock
   - Semaphores
   - Monitor
   - Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches
Sharing Data
Conditions for Deadlock
1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.
2. Hold-and-wait: a thread holds one resource while it waits for another.
3. No preemption: resources are released only voluntarily, after completion.
4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.
ALL four conditions MUST hold.
Handling Deadlock
- Deadlock prevention
- Deadlock avoidance
- Deadlock detection and recovery
- Ignore
Deadlock
[Figure: (a) two-process deadlock; (b) n-process deadlock — R1, R2, …, Rn are resources; P1, P2, …, Pn are processes.]
Semaphore
Semaphore
A positive integer operated upon by two operations, P and V.
Its value is the number of units of the resource which are free.
A binary semaphore has value 0 or 1. A general semaphore can take on positive values other than 0 and 1.
Semaphore
P and V operations are performed indivisibly:
P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
V(s): increments s by 1 to release one of the waiting processes (if any).
Semaphore
The first process to reach its P(s) operation (or to be accepted) will set the semaphore to 0.
Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
Semaphore
When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes is allowed to proceed into the critical section.
Monitor
Disadvantages of semaphores:
- Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.
- Incorrect use may be caused by an honest programming error or an uncooperative programmer.
Monitor
Example of wrong semaphore use:
Right code:  … Wait(mutex); critical section; Signal(mutex); …
Wrong code:  … Signal(mutex); critical section; Wait(mutex); …
This wrong code violates the mutual-exclusion condition.
Monitor
Example of wrong semaphore use:
Right code:  … Wait(mutex); critical section; Signal(mutex); …
Wrong code:  … Wait(mutex); critical section; Wait(mutex); …
This wrong code causes deadlock.
Faculty of Computer Science & Engineering - HCMC University of Technology
Monitor
- If the programmer omits the wait() or the signal() around a critical section, or both, either mutual exclusion is violated or a deadlock will occur.
- A deadlock can also occur when both processes are active simultaneously and acquire the semaphores in opposite orders:
Process P1: … Wait(S); Wait(Q); critical section; Signal(S); Signal(Q);
Process P2: … Wait(Q); Wait(S); critical section; Signal(Q); Signal(S);
Monitor
- To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
- A monitor is an approach to synchronizing operations on shared resources. A monitor includes:
  - a suite of procedures that provide the only way to access the shared resource
  - mutual exclusion among those procedures
  - the variables associated with the shared resource
  - invariants that are assumed to hold in order to avoid conflicts
Monitor
- A type, or abstract data type, encapsulates private data with public methods to operate on that data.
- The monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.
- The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.
Monitor
A structure of the monitor type:

monitor monitor_name {
    /* shared variable declarations */
    procedure P1(...) { ... }
    procedure P2(...) { ... }
    ...
    procedure Pn(...) { ... }
    initialization_code(...) { ... }
}

[Figure: structure of a monitor]
Using Monitors
- The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
- The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; some additional "tailor-made" synchronization mechanisms need to be defined.
- These additional "tailor-made" mechanisms use the condition construct.
Condition type
- Declaration: condition x, y;
- The only operations that can be invoked on a condition variable are wait() and signal():
  - x.wait(): the process invoking this operation is suspended until another process invokes x.signal()
  - x.signal(): the process invoking this operation resumes exactly one suspended process
- If no process is suspended, x.signal() has no effect.
[Figure: structure of a monitor with condition variables]
Condition variables
- Condition variables allow threads to synchronize based upon the actual value of data.
- Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met. This can be a very time-consuming and unproductive exercise, since the thread would be continuously busy in this activity.
- A condition variable is always used in conjunction with a mutex lock.
Sharing Data
Pthread Condition Variables

pthread_cond_t cond; — declares a condition variable

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
Initialises the condition variable referenced by cond with attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.

int pthread_cond_signal(pthread_cond_t *cond);
Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
Sharing Data
Pthread Condition Variables (cont.)

int pthread_cond_broadcast(pthread_cond_t *cond);
Unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
Destroys the given condition variable specified by cond; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
Block on a condition variable; the second one allows a timeout to be specified.
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines. The main routine creates three threads. Two of the threads perform work and update a count variable. The third thread waits until the count variable reaches a specified value.
Sharing Data
Sequence for using condition variable - example (cont.)

main:
    #include <pthread.h>
    int count = 0;                                /* global variable */
    pthread_mutex_t count_mutex;                  /* DECLARE */
    pthread_cond_t count_threshold_cv;
    ...
    pthread_mutex_init(&count_mutex, NULL);       /* INITIALIZE */
    pthread_cond_init(&count_threshold_cv, NULL);
    ...
    pthread_create(...);                          /* CREATE THREADS TO DO WORK */

Threads 2, 3:
    void *inc_count(void *t) {
        ...
        for (i = 0; i < TCOUNT; i++) {
            pthread_mutex_lock(&count_mutex);
            count++;
            if (count == COUNT_LIMIT)
                pthread_cond_signal(&count_threshold_cv);
            pthread_mutex_unlock(&count_mutex);
            sleep(1);  /* do some work so threads can alternate on the mutex lock */
        }
        pthread_exit(NULL);
    }

Thread 1:
    void *watch_count(void *t) {
        pthread_mutex_lock(&count_mutex);
        if (count < COUNT_LIMIT)
            pthread_cond_wait(&count_threshold_cv, &count_mutex);
        count += 125;
        pthread_mutex_unlock(&count_mutex);
        pthread_exit(NULL);
    }
Sharing Data
The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages as in a message-passing environment.
Creating Shared Data
- Each process has its own virtual address space within the virtual memory management system.
- Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
shmget():
- Creates a shared memory segment.
- Return value is the shared memory ID.
shmat():
- Attaches the shared segment to the data segment of the calling process.
- Returns the starting address of the segment.
Accessing Shared Data
- Accessing shared data needs careful control if the data is ever altered by a process.
- Problem: conflict.
  - Reading the variable by different processes does not cause conflict.
  - But writing new values can conflict.
- Example: consider two processes, each of which is to add 1 to a shared data item x:

  Instruction   Process 1       Process 2
  x = x + 1     read x          read x
                compute x + 1   compute x + 1
                write to x      write to x      (time runs downward)
[Figure: conflict in accessing shared variable x — both processes read x, compute x + 1, and write back, so one increment is lost.]
The problem of accessing shared data can be generalized by considering shared resources.
Mechanism:
- A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.
- This mechanism is called mutual exclusion.
Lock
- The simplest mechanism for ensuring mutual exclusion of critical sections is by the use of a lock.
- A lock is a variable containing the value 0 or 1:
  - lock = 1: a process has entered the critical section
  - lock = 0: no process is in the critical section
- The lock operates much like that of a door lock.
- Suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section. It now has to wait until it is allowed to enter the critical section:

  while (lock == 1);   /* no operation in the while loop */
  lock = 1;            /* enter critical section */
  ... critical section ...
  lock = 0;            /* leave critical section */

- A lock of this kind is called a spin lock (note that reading and setting the lock must be done atomically, e.g. with a test-and-set instruction, for this to be safe).
Mechanism: Busy waiting
- In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead, but this has overhead in saving and restoring process information.
- It is also necessary to choose the best or highest-priority process to enter the critical section.
[Figure: two processes contending for a critical section — Process 2 spins in "while (lock == 1) do_nothing;" until Process 1 leaves its critical section and resets lock = 0; Process 2 then sets lock = 1 and enters its own critical section.]
Pthread Lock Routines
- Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).
- Mutexes resolve synchronization problems between threads:
  - A mutex is used to share resources among threads in an orderly way.
  - It provides mutual exclusion between threads.
- Note: by default, a mutex synchronizes only threads within the same process; it cannot synchronize threads belonging to different processes.
#include <pthread.h>   /* header containing the mutex functions */
Declare the variable:
    pthread_mutex_t mutex;
Static initialization:
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
    int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
    int pthread_mutex_lock(pthread_mutex_t *mutex);
    int pthread_mutex_unlock(pthread_mutex_t *mutex);
    int pthread_mutex_trylock(pthread_mutex_t *mutex);
    int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex1:  forall (i = 0; i < 5; i++)
          a[i] = 0;
All instances can be executed simultaneously.
Ex2:  forall (i = 2; i < 6; i++) {
          x = i - 2*i + i*i;
          a[i] = a[x];
      }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's conditions
- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered by process P(m).
If the three conditions below are all satisfied, the two processes can be executed concurrently:
    I1 ∩ O2 = ∅
    I2 ∩ O1 = ∅
    O1 ∩ O2 = ∅
Dependency analysis
Example 1: suppose the two statements are (in C):
    a = x + y;
    b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, so
    I1 ∩ O2 = ∅,  I2 ∩ O1 = ∅,  O1 ∩ O2 = ∅
→ the two statements can be executed simultaneously.
Dependency analysis
Example 2: suppose the two statements are (in C):
    a = x + y;
    b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, so
    I2 ∩ O1 = {a} ≠ ∅
→ the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data:
In a parallel programming language supporting shared memory, a variable might be declared as:
    shared int x;
(in C/C++, a global variable such as int x; is shared between threads)
par construct:
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
    par { S1; S2; ...; Sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
    par { proc1(); proc2(); ...; procn(); }
forall construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
    forall (i = 0; i < n; i++) { S1; S2; ...; Sm; }
which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, ..., Sm. Each process uses a different value of i. For example:
    forall (i = 0; i < 5; i++)
        a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Cache coherence protocols:
- In the update policy, copies of data in all caches are updated at the time one copy is altered.
- In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor makes reference to them.
Shared Data in Systems with Caches
False sharing:
- The key characteristic used is that caches are organized in blocks of contiguous locations.
- False sharing occurs when different parts of a block are required by different processors, but not the same bytes.
Shared Data in Systems with Caches
[Figure: false sharing in a block-organized cache]
Solution for false sharing:
- The compiler can alter the layout of the data stored in main memory, separating data altered by only one processor into different blocks.
- The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
- A parallel computer has either a central-memory or a distributed-memory architecture.
- Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote-memory access) architectures.
- Central-memory systems are also known as UMA (uniform memory access) systems.
Central memory versus distributed memory
- In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.
- UMA systems have two types:
  - PVP: the parallel vector processor, also called a vector supercomputer
  - SMP: the symmetric multiprocessor
Distributed-Memory architecture
- A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory.
- Memories in other nodes are called remote memories.
- Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.
[Figure: distributed-memory architectures]
Distributed-Memory architecture
In a NORMA machine:
- The node memories have separate address spaces.
- A node can't directly access remote memory.
- The only way to access remote data is by passing messages.
Distributed-Memory architecture
In an NCC-NUMA machine:
- A typical example is the Cray T3E.
- Besides the local memory, each node has a set of node-level registers called E-registers.
- Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-Memory architecture
In a COMA machine:
- All local memories are structured as caches (called COMA caches).
- A COMA cache has much larger capacity than the level-2 cache or the remote cache of a node.
- COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA and COMA
- An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache-coherency support in hardware.
- It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
- CC-NUMA: main memory consists of all the local memories.
- COMA: main memory consists of all the COMA caches.
- All the complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of five Distributed-Memory architectures
Switch mechanism number of processor states needed to maintain active threads
Interval between context switches run length (cycles between context switch triggered by remote reference)
Multithreaded Computation
Initial Scheduling overhead Thread Synchronization overhead
Thread of Parallel Computation
Variable
Computation
The concept of multithreading in MPP system
Processor efficiency Busy do useful work Context switch suspend current
context amp switch to another Idle when all availble context
suspended (blocked)
Efficient = Busy (busy + switching + idle)
Abtract Processor Model
wwwthemegallerycom Company Logo
Multiple-context processor model with one thread per context
PC
PSW
PC
PSW
PC
PSW
ALU Local memory reference
Remote memory reference
Register Files
N Contexts
1 Thread context
Context-switching policies
wwwthemegallerycom Company Logo
Switch on cache miss when encoutering a cache miss
Switch on every load switching on every load operation independent of whether it will cause a miss or not
Switch on every instruction switching on every instruction insependent of whether or not it is a load
Switch on block of instruction will improve the cache-hit ratio due to preservation of some locality amp also benefit singe-context performance
Pthread Thread
History SUN Solaris Window NT hellip are examples the multithreaded operating systems allow users to employ threads in their programsbut each system is different
StandardIEEE POSIX 10031c
standard (1995)
Constructs for specifying Parallelism
Executing a Pthread Thread
int pthread_create(pthread_t thread pthread_attr_t attr
void (start_routine)(void ) void arg) Return value of function1048698success create a new thread 0 arg contain thread ID1048698fail ltgt 0 (error signature is contain in the variable of errno) Argumentsthread a pointer of type pthread_t contain ID of new thread1048698attr contain initial attr for thread if attr = NULL Attr are init by default value1048698start_routine la reference to a function defined by user This function contain
execute code for new thread1048698arg a single argument is passed for start_routine
Pthread_t thread Hanndle of specia Pthread datatype
Executing a Pthread Thread(cont)
pthread_exit(void status) Terminate amp destroy a thread
pthread_cancel() Thread is destroyed by another process
int pthread_join(pthread_t th void thread_return) pthread_join() force call thread suspend its execution amp wait until thread having
thread ID terminates Thread_return contain the return value (value of return statement or pthread_exit(hellip) statement)
Detached ThreadThere are cases in which threads can
be terminated without needed of pthread_join
Detached Thread
When Detached Thread teminate they are destroyed amp their resource released
=gt More efficient
Main program
Pthread_create()
Termination
Thread
Pthread_create()
Pthread_create() Termination
Termination
Thread
Thread
Constructs for specifying Parallelism
Thread Pools
A master thread could control a collection of slave threads A work pool of threads could be formed Threads can communicate through shared location or as we shall see = using signals
Constructs for specifying Parallelism
Thread ndash safe routines
System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results
Thread ndash safeness an applications ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions
Constructs for specifying Parallelism
Thread ndash safe routines (cont)
Suppose that your application creates several threads each of which makes a call to the same library routine
This library routine accessesmodifies a global structure or location in memory
As each thread calls this routine it is possible that they may try to modify this global structurememory location at the same time
If the routine does not employ some sort of synchronization constructs to prevent data corruption then it is not thread-safe
fe
Constructs for specifying Parallelism
Sharing DataCreating Shared Data1
Accessing Shared Data2
Locks
Deadlock
Semaphores
Deadlock
Condition Variables
Language Constructs for Parallelism3
Dependency Analysis4
Shared Data in system with caches5
Sharing Data
Text in here
Text in here
Conditions for Deadlock
1 Mutual exclusion bull Resource can not be shared bull Requests are delayed until resource is released
2 Hold-and-wait bull Thread holds one resource while waits for another
3 No preemption bull Resources are released voluntarily after completion
4 Circular wait bull Circular dependencies exist in ldquowaits-forrdquo or ldquoresource-allocationrdquo graphs
ALL four conditions MUST hold
wwwthemegallerycom Company Logo
Handing Deadlock
Text
Text
Text
Txt
Deadlock prevention Deadlock avoidance
Deadlock detection and recovery
Ignore
wwwthemegallerycom Company Logo
Deadlock
88b n-processes deadlock 88a 2 processes dealock
R1R2hellipRn lagrave tagravei nguyecircn P1P2hellipPn lagrave quaacute trigravenh
Example
LOGO
Semaphore
wwwthemegallerycom Company Logo
sEMaPHoRE
A positive integer operated upon by 2 operations P amp V
The value is the number of the units of the resource which are free
A binary semaphore has value 0 or 1A general semaphore can take on
positive values other than 0 and 1
wwwthemegallerycom Company Logo
sEMaPHoRE
P amp V operations are performed indivisibly
P(s) waits until s is greater than 0 then decrements s by 1 allows the process to continue
V(s) increments s by 1 to release one of the waiting processes (if any)
wwwthemegallerycom Company Logo
sEMaPHoRE
The first process reach its P(s) operation or to be accepted will set the semaphore to 0
Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore
wwwthemegallerycom Company Logo
sEMaPHoRE
When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section
Monitor
Disadvange of Semaphorebull Although semaphore provide a
convenient and effective mechanism for process synchronization using them incorrectly can result in timing error that are difficult to detect (deadlock) since these errors happen only if some particular execution sequences take place and these sequences do not always occur
bull Using incorrectly may be caused by an honest programming error or an uncooperative programmer
Shared memory multiprocessor
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur
bull Both processes are simultaneously active will cause a deadlock
Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type
bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only
method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events
conflict
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A type or abstract data type encapsulates private data with public method to operate on that data
bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor
bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()
procedure P2()
procedure Pn()
initialization_code ()
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Usage Monitor
bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly
bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization
bull Some addiontional synchronization ldquotailor moderdquo use conditional construct
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Monitor structure with the condition type
Condition variables
• Condition variables allow threads to synchronize based upon the actual value of data.
• Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be very time-consuming and unproductive, since the thread would be continuously busy in this activity.
• A condition variable is always used in conjunction with a mutex lock.
Sharing Data
Pthread Condition Variables
pthread_cond_t cond; — declares a condition variable.
int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr); — initialises the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used. Upon successful initialisation, the condition variable becomes initialised.
int pthread_cond_signal(pthread_cond_t *cond); — unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
Sharing Data
Pthread Condition Variables (cont.)
int pthread_cond_broadcast(pthread_cond_t *cond); — unblocks all threads currently blocked on the specified condition variable cond.
int pthread_cond_destroy(pthread_cond_t *cond); — destroys the given condition variable; the object becomes, in effect, uninitialised. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
— both block on a condition variable; the second one allows a timeout to be specified.
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines. The main routine creates three threads. Two of the threads perform work and update a count variable; the third thread waits until the count variable reaches a specified value.
Sharing Data
Sequence for using condition variable - example (cont)
main:

#include <pthread.h>
int count = 0;                                 /* global variable */
pthread_mutex_t count_mutex;                   /* DECLARE */
pthread_cond_t  count_threshold_cv;
...
pthread_mutex_init(&count_mutex, NULL);        /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
...
pthread_create(...);                           /* CREATE THREADS TO DO WORK */

Threads 2 and 3:

void *inc_count(void *t) {
    ...
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(&count_threshold_cv);
        pthread_mutex_unlock(&count_mutex);
        sleep(1);   /* do some work so the threads can alternate on the mutex lock */
    }
    pthread_exit(NULL);
}

Thread 1:

void *watch_count(void *t) {
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)                /* loop guards against spurious wakeups */
        pthread_cond_wait(&count_threshold_cv, &count_mutex);
    count += 125;
    pthread_mutex_unlock(&count_mutex);
    pthread_exit(NULL);
}
Sharing Data
The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages, as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual memory management system. Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
shmget(): creates a shared memory segment; the return value is the shared memory ID.
shmat(): attaches the shared segment to the data segment of the calling process; returns the starting address of the attached segment.
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
Problem: CONFLICT.
• Reading the variable by different processes does not cause a conflict,
• but writing new values can. Example: consider two processes, each of which is to add 1 to a shared data item, x.

Instruction x = x + 1:
            Process 1        Process 2
time        read x           read x
  |         compute x + 1    compute x + 1
  v         write to x       write to x
[Figure: conflict in accessing shared data — both processes read the shared variable x, compute +1, and write back, so one of the two increments is lost.]
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time. This mechanism is called mutual exclusion.
Locks
• The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
• A lock is a variable containing the value 0 or 1:
    lock = 1: a process has entered the critical section;
    lock = 0: no process is in the critical section.
• The lock operates much like a door lock: suppose a process reaches a lock that is set, indicating that the process is excluded from the critical section. It then has to wait until it is allowed to enter the critical section.
while (lock == 1) ;   /* do nothing: no operation in the while loop */
lock = 1;             /* enter critical section */
... critical section ...
lock = 0;             /* leave critical section */

A lock that is waited on in this way is called a spin lock, and the mechanism is called busy waiting.
In some cases it may be possible to deschedule the process from the processor and schedule another process instead, but this incurs overhead in saving and restoring process information, and it is then necessary to choose the best (or highest-priority) process to enter the critical section.
Process 1:                     Process 2:
while (lock == 1) ;
lock = 1;                      while (lock == 1) ;   /* spins */
  critical section
lock = 0;
                               lock = 1;
                                 critical section
                               lock = 0;
Pthread Lock Routines
• Locks are implemented in Pthreads with what are called mutual-exclusion lock variables (mutexes), which resolve synchronization problems between threads.
• A mutex is used to let threads share resources in turn, and provides mutual exclusion between threads.
• Note: a mutex (with default attributes) only synchronizes threads within one process; it cannot synchronize threads belonging to different processes.
#include <pthread.h>   /* the header declaring the mutex functions */
Declare a variable: pthread_mutex_t mutex;
Static initialization:
    mutex = PTHREAD_MUTEX_INITIALIZER;
    mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
    int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
The important functions:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex. 1: forall (i = 0; i < 5; i++) a[i] = 0;
All instances of the body can be executed simultaneously.
Ex. 2: forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's conditions
• I(n) is the set of memory locations read by process P(n).
• O(m) is the set of memory locations altered by process P(m).
• If the three conditions below are all satisfied, the two processes can be executed concurrently:
    I1 ∩ O2 = ∅
    I2 ∩ O1 = ∅
    O1 ∩ O2 = ∅
Dependency analysis
Example 1. Suppose the two statements are (in C):
    a = x + y;
    b = x + z;
Then I1 = (x, y), I2 = (x, z), O1 = (a), O2 = (b), and
    I1 ∩ O2 = ∅,  I2 ∩ O1 = ∅,  O1 ∩ O2 = ∅
so the two statements can be executed simultaneously.
Dependency analysis
Example 2. Suppose the two statements are (in C):
    a = x + y;
    b = a + b;
Then I1 = (x, y), I2 = (a, b), O1 = (a), O2 = (b). Here
    I2 ∩ O1 = (a) ≠ ∅
so the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, a variable might be declared as
    shared int x;
(by contrast with an ordinary global declaration, int x, in C/C++).
par construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
    par { s1; s2; ...; sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
    par { proc1(); proc2(); ...; procn(); }
forall construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
    forall (i = 0; i < n; i++) { s1; s2; ...; sm; }
which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, ..., sm. Each process uses a different value of i. For example,
    forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Cache coherence protocols:
• In the update policy, copies of data in all caches are updated at the time one copy is altered.
• In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next makes a reference to the data.
Shared Data in Systems with Caches
False sharing:
• The key characteristic exploited is that caches are organized in blocks of contiguous locations.
• False sharing occurs when different parts of a block are required by different processors, but not the same bytes: each write by one processor invalidates the other processor's copy of the whole block, even though the processors never touch the same data.
Shared Data in Systems with Caches (cont.)
Solution for false sharing:
• The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.
• The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central-memory or a distributed-memory architecture.
Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.
Central-memory systems are also known as UMA (uniform memory access) systems.
Central memory versus distributed memory
In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types:
• PVP: the parallel vector processor, also called a vector supercomputer;
• SMP: the symmetric multiprocessor.
Distributed-Memory architecture
A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory. Memories in other nodes are called remote memories.
Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.
Distributed-Memory architecture
In a NORMA machine:
• The node memories have separate address spaces.
• A node can't directly access remote memory.
• The only way to access remote data is by passing messages.
Distributed-Memory architecture
In an NCC-NUMA machine:
• A typical example is the Cray T3E.
• Besides the local memory, each node has a set of node-level registers called E-registers.
• Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-Memory architecture
In a COMA machine:
• All local memories are structured as caches (called COMA caches).
• A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node.
• COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesn't have hardware support for cache coherence, whereas CC-NUMA and COMA provide cache-coherence support in hardware.
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
CC-NUMA: main memory consists of all the local memories.
COMA: main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of the five distributed-memory architectures [comparison table not reproduced in the transcript]
Processes & Threads
• Dependence: processes are typically independent, while threads exist as subsets of a process.
• State information: processes carry considerable state information, whereas multiple threads within a process share state as well as memory and other resources.
• Address space: processes have separate address spaces, whereas threads share their address space.
• Interaction: processes interact only through system-provided inter-process communication mechanisms, while threads interact through shared variables or by message passing.
• Context switching: context switching between threads in the same process is typically faster than context switching between processes.
[Figure: a single-threaded process (code, heap, files, interrupt routines, instruction pointer, stack) versus a multithreaded process, in which each thread has its own instruction pointer, stack and interrupt routines but shares the process's code, heap and files.]
Constructs for specifying Parallelism
Multithreaded Processor Model
To analyze the performance of such a system:
• Latency (L): the communication latency experienced on a remote memory access, including network delay, cache-miss penalty, and delays caused by contention in split transactions.
• Number of threads (N): the number of threads that can be interleaved in a processor; the context of a thread is its program counter, register set, and the required context status words.
• Context-switch overhead (C): the time lost in performing a context switch in a processor; the switch mechanism depends on the number of processor states needed to maintain active threads.
• Interval between context switches (R): the run length, i.e. the number of cycles between context switches triggered by remote references.
Multithreaded Computation
[Figure: threads of a parallel computation sharing variables, with the initial scheduling overhead and the thread-synchronization overhead marked.]
The concept of multithreading in an MPP system
Processor efficiency: a processor is
• busy when it is doing useful work;
• context switching when it suspends the current context and switches to another;
• idle when all available contexts are suspended (blocked).

Efficiency = busy / (busy + switching + idle)
Abstract Processor Model
Multiple-context processor model with one thread per context
[Figure: multiple-context processor model — N contexts, one thread per context, each with its own PC, PSW and register file; a shared ALU issues local and remote memory references.]
Context-switching policies
• Switch on cache miss: switch when encountering a cache miss.
• Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.
• Switch on every instruction: switch on every instruction, independent of whether or not it is a load.
• Switch on a block of instructions: improves the cache-hit ratio due to preservation of some locality, and also benefits single-context performance.
Pthread Threads
• History: SUN Solaris, Windows NT, etc. are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.
• Standard: the IEEE POSIX 1003.1c standard (1995).
Constructs for specifying Parallelism
Executing a Pthread Thread
int pthread_create(pthread_t *thread, pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);
• Return value: 0 on success (a new thread is created and *thread contains its ID); a nonzero error number on failure.
• Arguments:
    thread: a pointer of type pthread_t that receives the ID of the new thread;
    attr: the initial attributes for the thread; if attr is NULL, the attributes are initialized to default values;
    start_routine: a reference to a function defined by the user, containing the code the new thread executes;
    arg: a single argument passed to start_routine.
pthread_t thread; — a handle of the special Pthread datatype.
Executing a Pthread Thread (cont.)
• pthread_exit(void *status): terminates (and destroys) the calling thread.
• pthread_cancel(): a thread is destroyed (cancelled) by another thread.
• int pthread_join(pthread_t th, void **thread_return): forces the calling thread to suspend its execution and wait until the thread with ID th terminates; *thread_return receives the terminated thread's return value (the value of its return statement or of its pthread_exit(...) call).
Detached Threads
• There are cases in which threads can be terminated without the need for pthread_join().
• When detached threads terminate, they are destroyed and their resources are released immediately => more efficient.
[Figure: a main program issuing several pthread_create() calls; each spawned thread runs to its own termination independently, with no pthread_join().]
Constructs for specifying Parallelism
Thread Pools
A master thread could control a collection of slave threads: a work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.
Constructs for specifying Parallelism
Thread-safe routines
• System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results.
• Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.
Constructs for specifying Parallelism
Thread-safe routines (cont.)
• Suppose that your application creates several threads, each of which makes a call to the same library routine, and this library routine accesses/modifies a global structure or location in memory.
• As each thread calls this routine, it is possible that they may try to modify this global structure/memory location at the same time.
• If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.
Constructs for specifying Parallelism
Sharing Data
1. Creating Shared Data
2. Accessing Shared Data
   • Locks • Deadlock • Semaphores • Monitors • Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in Systems with Caches
Sharing Data
Conditions for Deadlock
1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.
2. Hold-and-wait: a thread holds one resource while it waits for another.
3. No preemption: resources are released only voluntarily, after completion.
4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.
ALL four conditions MUST hold simultaneously.
Handling Deadlock
• Deadlock prevention
• Deadlock avoidance
• Deadlock detection and recovery
• Ignore the problem
Deadlock
Example: [Figure: (a) deadlock between two processes; (b) n-process deadlock. R1, R2, …, Rn are resources; P1, P2, …, Pn are processes.]
Semaphore
• A semaphore is a positive integer (including zero) operated upon by two operations, P and V.
• Its value is the number of units of the resource that are free.
• A binary semaphore has the value 0 or 1; a general semaphore can take on positive values other than 0 and 1.
Semaphore (cont.)
• The P and V operations are performed indivisibly.
• P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
• V(s): increments s by 1, to release one of the waiting processes (if any).
Semaphore (cont.)
• The first process to reach its P(s) operation, or to be accepted, will set the semaphore to 0.
• Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
• When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the processes waiting is allowed to proceed into the critical section.
Monitor
Disadvantages of semaphores:
• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors (e.g. deadlock) that are difficult to detect, since these errors happen only if particular execution sequences take place, and these sequences do not always occur.
• Incorrect use may be caused by an honest programming error or by an uncooperative programmer.
Monitor
Example of wrong semaphore use (1):
Correct code:
    wait(mutex);
    /* critical section */
    signal(mutex);
Wrong code:
    signal(mutex);
    /* critical section */
    wait(mutex);
This wrong code violates the mutual-exclusion requirement.
Monitor
Example of wrong semaphore use (2):
Correct code:
    wait(mutex);
    /* critical section */
    signal(mutex);
Wrong code:
    wait(mutex);
    /* critical section */
    wait(mutex);
This wrong code causes a deadlock.
Monitor
• If the programmer omits the wait() or the signal() around a critical section, or both, either mutual exclusion is violated or a deadlock will occur.
• Deadlock can also arise when both processes below are simultaneously active and acquire the semaphores in opposite orders:
    Process P1:              Process P2:
    wait(S);  wait(Q);       wait(Q);  wait(S);
    /* critical section */   /* critical section */
    signal(S); signal(Q);    signal(Q); signal(S);
Monitor
• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
• A monitor is an approach to synchronizing operations on a computer when a shared resource is used. A monitor includes: a set of procedures that provide the only method of accessing the shared resource; a mutual-exclusion lock; the variables associated with the shared resource; and invariants that must hold to avoid race conditions.
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A type or abstract data type encapsulates private data with public method to operate on that data
bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor
bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()
procedure P2()
procedure Pn()
initialization_code ()
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Usage Monitor
bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly
bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization
bull Some addiontional synchronization ldquotailor moderdquo use conditional construct
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor conditional type
Condition variables
Condition variablesAllow threads to synchronize based
upon the actual value of data Without condition variables the
programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity
Always used in conjunction with a mutex lock
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified
condition variable cond (if any threads are blocked on cond)
int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)
initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised
Pthread_cond_t cond Declare condition variable
Sharing Data
Pthread Condition Variables(cont)
int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable
cond
int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in
effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined
int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)
int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)
block on a condition variable the second one alow to appoint timeout
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK
main
Thread 23
void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)
void watch_count(void t) pthread_mutex_lock(ampcount_mutex)
if (countltCOUNT_LIMIT)
pthread_cond_wait (hellip)
count += 125
pthread_mutex_unlock(hellip) pthread_exit(NULL)
Thread 1
Sharing Data
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtualmemory management system
Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space
Creating Shared Data (tt)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach shared segment to the data segmentof the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1: Suppose the two statements are (in C):
a = x + y;
b = x + z;
Then I1 = (x, y), I2 = (x, z), O1 = (a), O2 = (b), so
I1 ∩ O2 = I2 ∩ O1 = O1 ∩ O2 = ∅: the two statements can be executed simultaneously.
Dependency analysis
Example 2: Suppose the two statements are (in C):
a = x + y;
b = a + b;
Then I1 = (x, y), I2 = (a, b), O1 = (a), O2 = (b), and I2 ∩ O1 = (a) ≠ ∅: the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, a variable might be declared as
shared int x;
In C/C++ there is no such keyword; a global variable (int global_x;) plays this role.
par construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
par { s1; s2; … sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
par { proc1(); proc2(); … procn(); }
forall construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
forall (i = 0; i < n; i++) { s1; s2; … sm; }
which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, …, sm. Each process uses a different value of i.
Example:
forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Shared Data in Systems with Caches
Cache coherence protocols: In the update policy, copies of data in all caches are updated at the time one copy is altered.
In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next references the data.
Shared Data in Systems with Caches
False sharing: the key characteristic involved is that caches are organized in blocks of contiguous locations. False sharing occurs when different processors require different parts of the same block, but not the same bytes.
Shared Data in Systems with Caches
Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.
For an array, the only way to avoid false sharing entirely would be to place each element in a different block, which would waste a significant amount of storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non-uniform memory access) and NORMA (no-remote-memory access) architectures.
Central memory systems are also known as UMA (uniform memory access) systems
Central memory versus distributed memory
In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types: PVP, the parallel vector processor (also called a vector supercomputer), and SMP, the symmetric multiprocessor.
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architectures: NORMA, NCC-NUMA, CC-NUMA, and COMA.
Distributed-Memory Architecture
In a NORMA machine: the node memories have separate address spaces; a node can't directly access remote memory; the only way to access remote data is by passing messages.
Distributed-Memory Architecture
In an NCC-NUMA machine (a typical example is the Cray T3E): besides its local memory, each node has a set of node-level registers called E-registers. Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-Memory Architecture
In a COMA machine: all local memories are structured as caches (called COMA caches). A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node. COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesn't have hardware support for cache coherency, whereas CC-NUMA and COMA provide cache coherency support in hardware.
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
CC-NUMA versus COMA
In CC-NUMA, main memory consists of all the local memories. In COMA, main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of five distributed-memory architectures
Multithreaded Processor Model
To analyze the performance of such a system:
Latency (L): the communication latency experienced on a remote memory access, covering network delay, cache-miss penalty, and delays caused by contention in split transactions.
Number of threads (N): the number of threads that can be interleaved in a processor. The context of a thread is its PC, register set, required context status word, etc.
Context-switch overhead (C): the time lost in performing a context switch in a processor. The switch mechanism determines the number of processor states needed to maintain active threads.
Interval between context switches: the run length, in cycles, between context switches triggered by remote references.
Multithreaded Computation
[Figure: a multithreaded computation: threads of parallel computation separated by initial scheduling overhead and thread synchronization overhead.]
The concept of multithreading in an MPP system
Processor states: busy (doing useful work); context switching (suspending the current context and switching to another); idle (all available contexts are suspended/blocked).
Efficiency = busy / (busy + switching + idle)
Abstract Processor Model
Multiple-context processor model with one thread per context
[Figure: N contexts, each with its own PC, PSW, and register file, sharing one ALU that issues local and remote memory references; one thread per context.]
Context-switching policies
Switch on cache miss: switch when encountering a cache miss.
Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.
Switch on every instruction: switch on every instruction, independent of whether or not it is a load.
Switch on block of instructions: switching on blocks of instructions improves the cache-hit ratio due to preservation of some locality, and also benefits single-context performance.
Pthread Threads
History: SUN Solaris, Windows NT, … are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.
Standard: IEEE POSIX 1003.1c (1995).
Constructs for specifying Parallelism
Executing a Pthread Thread
int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg);
Return value: 0 on success (a new thread is created and its ID stored in *thread); a nonzero error code on failure.
Arguments: thread is a pointer of type pthread_t that receives the ID of the new thread; attr contains the initial attributes for the thread (if attr is NULL, default attribute values are used); start_routine is a reference to a function defined by the user, containing the code the new thread executes; arg is a single argument passed to start_routine.
pthread_t thread; — handle of the opaque Pthreads data type.
Executing a Pthread Thread (cont.)
pthread_exit(void *status); terminates and destroys the calling thread.
pthread_cancel(); a thread is destroyed by another thread.
int pthread_join(pthread_t th, void **thread_return); pthread_join() forces the calling thread to suspend its execution and wait until the thread with the given thread ID terminates. *thread_return contains the return value (the value of the return statement or of the pthread_exit(…) call).
Detached Threads
There are cases in which threads can be terminated without the need for pthread_join().
When detached threads terminate, they are destroyed and their resources released immediately, which is more efficient.
[Figure: the main program calls pthread_create() several times; each detached thread runs and terminates on its own, without pthread_join().]
Constructs for specifying Parallelism
Thread Pools
A master thread can control a collection of slave threads; a work pool of threads can be formed. Threads can communicate through shared locations or, as we shall see, by using signals.
Constructs for specifying Parallelism
Thread-safe routines
System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.
Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.
Constructs for specifying Parallelism
Thread-safe routines (cont.)
Suppose that your application creates several threads, each of which makes a call to the same library routine. This library routine accesses/modifies a global structure or location in memory. As each thread calls the routine, it is possible that they may try to modify this global structure/memory location at the same time. If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.
Constructs for specifying Parallelism
Sharing Data
1. Creating Shared Data
2. Accessing Shared Data: Locks, Deadlock, Semaphores, Monitors, Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches
Conditions for Deadlock
1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.
2. Hold-and-wait: a thread holds one resource while it waits for another.
3. No preemption: resources are released only voluntarily, after completion.
4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.
ALL four conditions must hold.
Handling Deadlock
Deadlock prevention; deadlock avoidance; deadlock detection and recovery; or ignore the problem.
Deadlock
Example: Figure 8.8(a) shows a two-process deadlock; Figure 8.8(b) an n-process deadlock. R1, R2, …, Rn are resources; P1, P2, …, Pn are processes.
Semaphore
Semaphore
A semaphore is a positive integer operated upon by two operations, P and V. Its value is the number of units of the resource that are free.
A binary semaphore has the value 0 or 1; a general semaphore can take on positive values other than 0 and 1.
Semaphore
P and V operations are performed indivisibly.
P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
V(s): increments s by 1 to release one of the waiting processes (if any).
Semaphore
The first process to reach its P(s) operation, or to be accepted, will set the semaphore to 0. Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
Semaphore
When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes (if any) is allowed to proceed into the critical section.
Monitor
Disadvantages of semaphores:
Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (such as deadlock), since these errors happen only if particular execution sequences take place, and those sequences do not always occur.
Incorrect use may be caused by an honest programming error or an uncooperative programmer.
Monitor
Example of incorrect semaphore use:
Correct code: … wait(mutex); critical section; signal(mutex); …
Wrong code: … signal(mutex); critical section; wait(mutex); … This incorrect code violates the mutual-exclusion requirement.
Monitor
Example of incorrect semaphore use:
Correct code: … wait(mutex); critical section; signal(mutex); …
Wrong code: … wait(mutex); critical section; wait(mutex); … This incorrect code causes deadlock.
Faculty of Computer Science & Engineering, Ho Chi Minh City University of Technology
Monitor
If the programmer omits the wait() or the signal() around the critical section, or both, either mutual exclusion is violated or a deadlock occurs.
For example, if the two processes below are active simultaneously, each can hold one semaphore while waiting for the other, causing a deadlock:
Process P1: … wait(S); wait(Q); critical section; signal(S); signal(Q); …
Process P2: … wait(Q); wait(S); critical section; signal(Q); signal(S); …
Monitor
To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
A monitor is an approach to synchronizing operations on a computer that uses shared resources. A monitor comprises: a suite of procedures that provides the only method of accessing the shared resource; mutual exclusion among those procedures; the variables associated with the shared resource; and invariants assumed to hold, to avoid conflicting events.
Monitor
A monitor type is an abstract data type that encapsulates private data with public methods to operate on that data.
A monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.
The monitor type also contains the declarations of variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.
Monitor
Structure of a monitor type:
monitor monitor_name {
    /* shared variable declarations */
    procedure P1 (…) { … }
    procedure P2 (…) { … }
    …
    procedure Pn (…) { … }
    initialization_code (…) { … }
}
[Figure: structure of a monitor]
Usage Monitor
The monitor construct ensures that only one process at a time can be active within the monitor; consequently, the programmer does not need to code this synchronization constraint explicitly.
A monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; some additional "tailor-made" synchronization mechanisms need to be defined.
Such tailor-made synchronization uses the condition construct.
Condition type
Declaration: condition x, y;
The only operations that can be invoked on a condition variable are wait() and signal():
x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
x.signal(): the process invoking this operation resumes exactly one suspended process; if no process is suspended, x.signal() has no effect.
[Figure: structure of a monitor with condition variables]
Condition variables
Condition variables allow threads to synchronize based upon the actual value of data. Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met; this can be a very time-consuming and unproductive exercise, since the thread would be continuously busy in that activity.
A condition variable is always used in conjunction with a mutex lock.
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t *cond); unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr); initializes the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default attributes object. Upon successful initialization, the state of the condition variable becomes initialized.
pthread_cond_t cond; declares a condition variable.
Sharing Data
Pthread Condition Variables (cont.)
int pthread_cond_broadcast(pthread_cond_t *cond); unblocks all threads currently blocked on the specified condition variable cond.
int pthread_cond_destroy(pthread_cond_t *cond); destroys the condition variable specified by cond; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
These block on a condition variable; the second form allows a timeout to be specified.
Sharing Data
Sequence for using condition variable - example
This simple example demonstrates the use of several Pthreads condition variable routines. The main routine creates three threads: two of them perform work and update a count variable; the third waits until the count variable reaches a specified value.
Sharing Data
Sequence for using condition variable - example (cont.)
/* main */
#include <pthread.h>
int count = 0;                                   /* global var: DECLARE */
pthread_mutex_t count_mutex;
pthread_cond_t  count_threshold_cv;
…
pthread_mutex_init(&count_mutex, NULL);          /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
…
pthread_create(…);                 /* CREATE THREADS TO DO WORK */

/* Threads 2 and 3 */
void *inc_count(void *t)
{
    …
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(…);
        …
        pthread_mutex_unlock(&count_mutex);
        /* Do some work so threads can alternate on the mutex lock */
        sleep(1);
    }
    pthread_exit(NULL);
}

/* Thread 1 */
void *watch_count(void *t)
{
    pthread_mutex_lock(&count_mutex);
    if (count < COUNT_LIMIT)
        pthread_cond_wait(…);
    count += 125;
    pthread_mutex_unlock(…);
    pthread_exit(NULL);
}
Sharing Data
The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass data in messages, as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual memory management system.
Shared-memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
shmget(): creates a shared memory segment; its return value is the shared memory ID.
shmat(): attaches the shared segment to the address space of the calling process; returns the starting address of the attached segment.
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
The problem is CONFLICT:
reading the variable by different processes does not cause conflict, but writing new values can.
Example: consider two processes, each of which is to add 1 to a shared data item x. Each must read x, compute x + 1, and write the result back:

Instruction   Process 1        Process 2
x = x + 1     read x           read x
              compute x + 1    compute x + 1
              write to x       write to x
(time runs downward)

[Figure: conflict in accessing shared variable x: both processes read the same value, each adds 1, and both write back, so one of the increments is lost.]
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish the sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.
This mechanism is mutual exclusion.
Lock
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock: a variable containing the value 0 or 1.
lock = 1: a process has entered the critical section.
lock = 0: no process is in the critical section.
The lock operates much like a door lock. Suppose a process reaches a lock that is set, indicating that it is excluded from the critical section; it then has to wait until it is allowed to enter:

while (lock == 1);   /* no operation in while loop: busy wait */
lock = 1;            /* enter critical section */
/* ... critical section ... */
lock = 0;            /* leave critical section */

A lock implemented this way is called a spin lock.
This mechanism is busy waiting. In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead, at the cost of the overhead of saving and restoring process information; it is then also necessary to choose the best or highest-priority process to enter the critical section.
[Figure: two processes contending for a critical section with a spin lock: Process 1 sets lock = 1 and enters its critical section while Process 2 spins in while (lock == 1); when Process 1 resets lock = 0, Process 2 sets lock = 1 and enters.]
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual-exclusion lock variables (mutexes), which resolve synchronization problems between threads.
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
Multithreaded Computation
Initial Scheduling overhead Thread Synchronization overhead
Thread of Parallel Computation
Variable
Computation
The concept of multithreading in MPP system
Processor efficiency Busy do useful work Context switch suspend current
context amp switch to another Idle when all availble context
suspended (blocked)
Efficient = Busy (busy + switching + idle)
Abtract Processor Model
wwwthemegallerycom Company Logo
Multiple-context processor model with one thread per context
PC
PSW
PC
PSW
PC
PSW
ALU Local memory reference
Remote memory reference
Register Files
N Contexts
1 Thread context
Context-switching policies
wwwthemegallerycom Company Logo
Switch on cache miss when encoutering a cache miss
Switch on every load switching on every load operation independent of whether it will cause a miss or not
Switch on every instruction switching on every instruction insependent of whether or not it is a load
Switch on block of instruction will improve the cache-hit ratio due to preservation of some locality amp also benefit singe-context performance
Pthread Thread
History SUN Solaris Window NT hellip are examples the multithreaded operating systems allow users to employ threads in their programsbut each system is different
StandardIEEE POSIX 10031c
standard (1995)
Constructs for specifying Parallelism
Executing a Pthread Thread
int pthread_create(pthread_t thread pthread_attr_t attr
void (start_routine)(void ) void arg) Return value of function1048698success create a new thread 0 arg contain thread ID1048698fail ltgt 0 (error signature is contain in the variable of errno) Argumentsthread a pointer of type pthread_t contain ID of new thread1048698attr contain initial attr for thread if attr = NULL Attr are init by default value1048698start_routine la reference to a function defined by user This function contain
execute code for new thread1048698arg a single argument is passed for start_routine
Pthread_t thread Hanndle of specia Pthread datatype
Executing a Pthread Thread(cont)
pthread_exit(void status) Terminate amp destroy a thread
pthread_cancel() Thread is destroyed by another process
int pthread_join(pthread_t th void thread_return) pthread_join() force call thread suspend its execution amp wait until thread having
thread ID terminates Thread_return contain the return value (value of return statement or pthread_exit(hellip) statement)
Detached ThreadThere are cases in which threads can
be terminated without needed of pthread_join
Detached Thread
When Detached Thread teminate they are destroyed amp their resource released
=gt More efficient
Main program
Pthread_create()
Termination
Thread
Pthread_create()
Pthread_create() Termination
Termination
Thread
Thread
Constructs for specifying Parallelism
Thread Pools
A master thread could control a collection of slave threads A work pool of threads could be formed Threads can communicate through shared location or as we shall see = using signals
Constructs for specifying Parallelism
Thread ndash safe routines
System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results
Thread ndash safeness an applications ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions
Constructs for specifying Parallelism
Thread ndash safe routines (cont)
Suppose that your application creates several threads, each of which makes a call to the same library routine:
This library routine accesses/modifies a global structure or location in memory.
As each thread calls this routine, it is possible that they may try to modify this global structure/memory location at the same time.
If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.
Constructs for specifying Parallelism
Sharing Data
1. Creating Shared Data
2. Accessing Shared Data
• Locks
• Deadlock
• Semaphores
• Monitors
• Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches
Sharing Data
Conditions for Deadlock
1. Mutual exclusion • the resource cannot be shared • requests are delayed until the resource is released
2. Hold-and-wait • a thread holds one resource while it waits for another
3. No preemption • resources are released only voluntarily, after completion
4. Circular wait • circular dependencies exist in the "waits-for" or "resource-allocation" graph
ALL four conditions must hold.
Handling Deadlock
• Deadlock prevention
• Deadlock avoidance
• Deadlock detection and recovery
• Ignore the problem
Deadlock
[Figures: (a) deadlock between two processes; (b) n-process deadlock. R1, R2, …, Rn are resources; P1, P2, …, Pn are processes.]
Semaphore
Semaphore
A semaphore is a positive integer operated upon by two operations, P and V.
Its value is the number of units of the resource that are free.
A binary semaphore has the value 0 or 1; a general (counting) semaphore can take on positive values other than 0 and 1.
Semaphore
The P and V operations are performed indivisibly.
P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
V(s): increments s by 1 to release one of the waiting processes (if any).
Semaphore
The first process to reach its P(s) operation, and be accepted, sets the semaphore to 0.
Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
Semaphore
When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes (if any) is then allowed to proceed into the critical section.
Monitor
Disadvantages of semaphores:
• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors (such as deadlock) that are difficult to detect, since these errors happen only if particular execution sequences take place, and these sequences do not always occur.
• Incorrect use may be caused by an honest programming error or by an uncooperative programmer.
Monitor
Example of incorrect semaphore use:
Correct code: … Wait(mutex); critical section; Signal(mutex); …
Wrong code: … Signal(mutex); critical section; Wait(mutex); … This incorrect code violates mutual exclusion.
Monitor
Example of incorrect semaphore use:
Correct code: … Wait(mutex); critical section; Signal(mutex); …
Wrong code: … Wait(mutex); critical section; Wait(mutex); … This incorrect code causes deadlock.
Faculty of Computer Science & Engineering, Ho Chi Minh City University of Technology (Đại học Bách Khoa TpHCM)
Monitor
• If the programmer omits the wait() or the signal() around a critical section, or both, then either mutual exclusion is violated or a deadlock occurs.
• If both processes below are simultaneously active, a deadlock can occur: P1 holds S while waiting for Q, and P2 holds Q while waiting for S.
Process P1: … Wait(S); Wait(Q); critical section; Signal(S); Signal(Q);
Process P2: … Wait(Q); Wait(S); critical section; Signal(Q); Signal(S);
Monitor
• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
• A monitor is an approach to synchronizing operations on shared resources. A monitor comprises: a suite of procedures that provide the only way to access a shared resource; mutual exclusion among those procedures; the variables associated with the shared resource; and invariants that are assumed to hold in order to avoid conflicts.
Monitor
• A type, or abstract data type, encapsulates private data with public methods that operate on that data.
• The monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.
• The monitor type also contains the declarations of variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.
Monitor
• The structure of a monitor type:
monitor monitor_name {
    /* shared variable declarations */
    procedure P1 (…) { … }
    procedure P2 (…) { … }
    …
    procedure Pn (…) { … }
    initialization_code (…) { … }
}
Monitor structure (figure)
Using monitors
• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
• Monitors as defined so far are not sufficiently powerful for modeling some synchronization schemes; additional "tailor-made" synchronization mechanisms are needed.
• Such additional tailor-made synchronization uses the condition construct.
Condition type
• Declaration: condition x, y;
• The only operations that can be invoked on a condition variable are wait() and signal():
x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
x.signal(): the process invoking this operation resumes exactly one suspended process.
• If no process is suspended, x.signal() has no effect.
[Figure: structure of a monitor with condition variables]
Condition variables
Condition variables allow threads to synchronize based upon the actual value of data.
Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be very time-consuming and unproductive, since the thread would be continuously busy in this activity.
A condition variable is always used in conjunction with a mutex lock.
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t *cond);
This call unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
Initialises the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.
pthread_cond_t cond; /* declare a condition variable */
Sharing Data
Pthread Condition Variables (cont.)
int pthread_cond_broadcast(pthread_cond_t *cond);
This call unblocks all threads currently blocked on the specified condition variable cond.
int pthread_cond_destroy(pthread_cond_t *cond);
Destroys the condition variable specified by cond; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
Both block on a condition variable; the second one also allows a timeout to be specified.
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
#include <pthread.h>

int count = 0;                               /* global var: DECLARE */
pthread_mutex_t count_mutex;
pthread_cond_t count_threshold_cv;
…
/* main: INITIALIZE, then CREATE THREADS TO DO WORK */
pthread_mutex_init(&count_mutex, NULL);
pthread_cond_init(&count_threshold_cv, NULL);
…
pthread_create(…);

/* Threads 2 and 3 */
void *inc_count(void *t)
{
    …
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(…);
        …
        pthread_mutex_unlock(&count_mutex);
        /* Do some work so threads can alternate on the mutex lock */
        sleep(1);
    }
    pthread_exit(NULL);
}

/* Thread 1 */
void *watch_count(void *t)
{
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)              /* a loop, to guard against spurious wakeups */
        pthread_cond_wait(…);
    count += 125;
    pthread_mutex_unlock(…);
    pthread_exit(NULL);
}
Sharing Data
The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual memory management system.
Shared-memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
shmget(): creates a shared memory segment; the return value is the shared memory ID.
shmat(): attaches the shared segment to the address space of the calling process and returns the starting address of the segment.
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
The problem: CONFLICT.
o Reading the variable from different processes does not cause a conflict.
o But writing a new value does.
Example: consider two processes, each of which is to add 1 to a shared data item x. The instruction x = x + 1 requires each process to read x, compute x + 1, and write the result back:

Time        Process 1        Process 2
  |         read x           read x
  |         compute x + 1    compute x + 1
  v         write to x       write to x
[Figure: conflict in accessing the shared variable x. Both processes read x, add 1, and write the result back, so one of the two updates is lost.]
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This mechanism is called mutual exclusion.
Lock
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
A lock is a variable containing the value 0 or 1:
lock = 1: a process has entered the critical section;
lock = 0: no process is in the critical section.
The lock operates much like a door lock.
Suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section. It now has to wait until it is allowed to enter:

while (lock == 1) ;   /* do nothing: no operation in the while loop */
lock = 1;             /* enter critical section */
/* … critical section … */
lock = 0;             /* leave critical section */

A lock used this way is called a spin lock.
This mechanism is known as busy waiting. In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead, but this has costs:
• overhead in saving and restoring process information;
• the need to choose the best or highest-priority process to enter the critical section.
Process 1:  while (lock == 1) ;  lock = 1;  [critical section]  lock = 0;
Process 2:  while (lock == 1) ;  lock = 1;  [critical section]  lock = 0;
(Process 2 spins in its while loop while Process 1 holds the lock.)
Note that this simple code is safe only if the test (while (lock == 1)) and the set (lock = 1) are performed as one atomic operation; otherwise both processes can observe lock == 0 and enter the critical section together.
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes), which resolve synchronization problems among threads.
A mutex is used to share resources among threads in an orderly fashion and provides mutual exclusion between threads.
Note: a mutex is used to synchronize threads within a single process; it cannot synchronize threads belonging to different processes.
#include <pthread.h>   /* header declaring the mutex functions */
Declaration: pthread_mutex_t mutex;
Static initialization:
mutex = PTHREAD_MUTEX_INITIALIZER;
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex1: forall (i = 0; i < 5; i++) a[i] = 0;
All instances can be executed simultaneously.
Ex2: forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's conditions
- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered by process P(m).
If the following three conditions are all satisfied, the two processes can be executed concurrently:
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
Dependency analysis
Example 1: suppose the two statements are (in C):
a = x + y;
b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}.
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, and O1 ∩ O2 = ∅, so the two statements can be executed simultaneously.
Dependency analysis
Example 2: suppose the two statements are (in C):
a = x + y;
b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}.
I2 ∩ O1 = {a} ≠ ∅, so the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, a variable might be declared as
shared int x;
whereas in C/C++ one would simply use a global variable: int x;
par construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
par { s1; s2; …; sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
par { proc1(); proc2(); …; procn(); }
forall construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
forall (i = 0; i < n; i++) { s1; s2; …; sm; }
which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, …, sm. Each process uses a different value of i. For example,
forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Shared Data in Systems with Caches
Cache coherence protocols:
In the update policy, copies of data in all caches are updated at the time one copy is altered.
In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next makes a reference to the data.
Shared Data in Systems with Caches
False sharing:
The key characteristic exploited here is that caches are organized in blocks of contiguous locations.
False sharing occurs when different parts of a block are required by different processors, but not the same bytes.
Shared Data in Systems with Caches
Solutions for false sharing:
The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.
The only way to avoid false sharing entirely would be to place each element in a different block, which would waste significant storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central-memory or a distributed-memory architecture.
Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.
Central-memory systems are also known as UMA (uniform memory access) systems.
Central memory versus distributed memory
In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types:
PVP: the parallel vector processor, also called a vector supercomputer;
SMP: the symmetric multiprocessor.
Distributed-Memory architecture
A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory.
Memories in other nodes are called remote memories.
Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.
Distributed-Memory architecture
In a NORMA machine:
• The node memories have separate address spaces.
• A node cannot directly access remote memory.
• The only way to access remote data is by passing messages.
Distributed-Memory architecture
In an NCC-NUMA machine (a typical example is the Cray T3E):
• Besides the local memory, each node has a set of node-level registers called E-registers.
• Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-Memory architecture
In a COMA machine:
• All local memories are structured as caches (called COMA caches).
• A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node.
• COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA and COMA
An NCC-NUMA system does not have hardware support for cache coherence, whereas CC-NUMA and COMA provide cache-coherence support in hardware.
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
CC-NUMA: main memory consists of all the local memories.
COMA: main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a NUMA machine.
[Table: characteristics of the five distributed-memory architectures]
Abstract Processor Model
[Figure: multiple-context processor model with one thread per context. The register files hold N contexts, each with its own PC and PSW; the ALU issues both local and remote memory references.]
Context-switching policies
• Switch on cache miss: switch contexts when encountering a cache miss.
• Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.
• Switch on every instruction: switch on every instruction, independent of whether or not it is a load.
• Switch on a block of instructions: improves the cache-hit ratio by preserving some locality, and also benefits single-context performance.
Pthread Thread
History SUN Solaris Window NT hellip are examples the multithreaded operating systems allow users to employ threads in their programsbut each system is different
StandardIEEE POSIX 10031c
standard (1995)
Constructs for specifying Parallelism
Executing a Pthread Thread
int pthread_create(pthread_t thread pthread_attr_t attr
void (start_routine)(void ) void arg) Return value of function1048698success create a new thread 0 arg contain thread ID1048698fail ltgt 0 (error signature is contain in the variable of errno) Argumentsthread a pointer of type pthread_t contain ID of new thread1048698attr contain initial attr for thread if attr = NULL Attr are init by default value1048698start_routine la reference to a function defined by user This function contain
execute code for new thread1048698arg a single argument is passed for start_routine
Pthread_t thread Hanndle of specia Pthread datatype
Executing a Pthread Thread(cont)
pthread_exit(void status) Terminate amp destroy a thread
pthread_cancel() Thread is destroyed by another process
int pthread_join(pthread_t th void thread_return) pthread_join() force call thread suspend its execution amp wait until thread having
thread ID terminates Thread_return contain the return value (value of return statement or pthread_exit(hellip) statement)
Detached ThreadThere are cases in which threads can
be terminated without needed of pthread_join
Detached Thread
When Detached Thread teminate they are destroyed amp their resource released
=gt More efficient
Main program
Pthread_create()
Termination
Thread
Pthread_create()
Pthread_create() Termination
Termination
Thread
Thread
Constructs for specifying Parallelism
Thread Pools
A master thread could control a collection of slave threads A work pool of threads could be formed Threads can communicate through shared location or as we shall see = using signals
Constructs for specifying Parallelism
Thread ndash safe routines
System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results
Thread ndash safeness an applications ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions
Constructs for specifying Parallelism
Thread ndash safe routines (cont)
Suppose that your application creates several threads each of which makes a call to the same library routine
This library routine accessesmodifies a global structure or location in memory
As each thread calls this routine it is possible that they may try to modify this global structurememory location at the same time
If the routine does not employ some sort of synchronization constructs to prevent data corruption then it is not thread-safe
fe
Constructs for specifying Parallelism
Sharing DataCreating Shared Data1
Accessing Shared Data2
Locks
Deadlock
Semaphores
Deadlock
Condition Variables
Language Constructs for Parallelism3
Dependency Analysis4
Shared Data in system with caches5
Sharing Data
Text in here
Text in here
Conditions for Deadlock
1 Mutual exclusion bull Resource can not be shared bull Requests are delayed until resource is released
2 Hold-and-wait bull Thread holds one resource while waits for another
3 No preemption bull Resources are released voluntarily after completion
4 Circular wait bull Circular dependencies exist in ldquowaits-forrdquo or ldquoresource-allocationrdquo graphs
ALL four conditions MUST hold
wwwthemegallerycom Company Logo
Handing Deadlock
Text
Text
Text
Txt
Deadlock prevention Deadlock avoidance
Deadlock detection and recovery
Ignore
wwwthemegallerycom Company Logo
Deadlock
88b n-processes deadlock 88a 2 processes dealock
R1R2hellipRn lagrave tagravei nguyecircn P1P2hellipPn lagrave quaacute trigravenh
Example
LOGO
Semaphore
wwwthemegallerycom Company Logo
sEMaPHoRE
A positive integer operated upon by 2 operations P amp V
The value is the number of the units of the resource which are free
A binary semaphore has value 0 or 1A general semaphore can take on
positive values other than 0 and 1
wwwthemegallerycom Company Logo
sEMaPHoRE
P amp V operations are performed indivisibly
P(s) waits until s is greater than 0 then decrements s by 1 allows the process to continue
V(s) increments s by 1 to release one of the waiting processes (if any)
wwwthemegallerycom Company Logo
sEMaPHoRE
The first process reach its P(s) operation or to be accepted will set the semaphore to 0
Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore
wwwthemegallerycom Company Logo
sEMaPHoRE
When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section
Monitor
Disadvange of Semaphorebull Although semaphore provide a
convenient and effective mechanism for process synchronization using them incorrectly can result in timing error that are difficult to detect (deadlock) since these errors happen only if some particular execution sequences take place and these sequences do not always occur
bull Using incorrectly may be caused by an honest programming error or an uncooperative programmer
Shared memory multiprocessor
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur
bull Both processes are simultaneously active will cause a deadlock
Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type
bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only
method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events
conflict
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A type or abstract data type encapsulates private data with public method to operate on that data
bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor
bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()
procedure P2()
procedure Pn()
initialization_code ()
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Usage Monitor
bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly
bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization
bull Some addiontional synchronization ldquotailor moderdquo use conditional construct
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor conditional type
Condition variables
Condition variablesAllow threads to synchronize based
upon the actual value of data Without condition variables the
programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity
Always used in conjunction with a mutex lock
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified
condition variable cond (if any threads are blocked on cond)
int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)
initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised
Pthread_cond_t cond Declare condition variable
Sharing Data
Pthread Condition Variables(cont)
int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable
cond
int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in
effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined
int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)
int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)
block on a condition variable the second one alow to appoint timeout
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK
main
Thread 23
void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)
void watch_count(void t) pthread_mutex_lock(ampcount_mutex)
if (countltCOUNT_LIMIT)
pthread_cond_wait (hellip)
count += 125
pthread_mutex_unlock(hellip) pthread_exit(NULL)
Thread 1
Sharing Data
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtualmemory management system
Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space
Creating Shared Data (tt)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach shared segment to the data segmentof the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
Problem: CONFLICT
• Reading the variable by different processes does not cause a conflict
• But writing a new value does
Ex: Consider two processes, each of which is to add 1 to a shared data item x.
Instruction     Process 1        Process 2
x = x + 1       read x           read x
                compute x + 1    compute x + 1
                write to x       write to x
(time increases downward; one of the two increments is lost)

Figure: conflict in accessing shared variable x (both processes read x, add 1, and write the result back)
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time. This mechanism is called mutual exclusion.
Lock
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
A lock is a variable containing the value 0 or 1:
• lock = 1: a process has entered the critical section
• lock = 0: no process is in the critical section
The lock operates much like a door lock. Suppose a process reaches a lock that is set, indicating that it is excluded from the critical section; it then has to wait until it is allowed to enter:

while (lock == 1) ;   /* do nothing: no operation in the while loop */
lock = 1;             /* enter critical section */
... critical section ...
lock = 0;             /* leave critical section */

A lock implemented this way is called a spin lock.
This mechanism is called busy waiting. In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead, but this has costs:
• overhead in saving and restoring process information
• it may be necessary to choose the best or highest-priority process to enter the critical section
Process 1                       Process 2
while (lock == 1) ;
lock = 1;
  [critical section]            while (lock == 1) ;   /* spins */
lock = 0;                       lock = 1;
                                  [critical section]
                                lock = 0;
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).
• A mutex resolves synchronization problems between threads
• A mutex is used to share resources among threads in an orderly way
• It provides mutual exclusion between threads
Note: a mutex is used only to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.
wwwthemegallerycom
#include <pthread.h>   /* header containing the mutex functions */
Declaration: pthread_mutex_t mutex;
Static initialization:
mutex = PTHREAD_MUTEX_INITIALIZER;
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex. 1:  forall (i = 0; i < 5; i++) a[i] = 0;
All instances can be executed simultaneously.
Ex. 2:  forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's conditions
• I(n) is the set of memory locations read by process P(n)
• O(m) is the set of memory locations altered by process P(m)
Two processes P1 and P2 can be executed concurrently if all three conditions are satisfied:
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
Dependency analysis
Example 1: Suppose the two statements are (in C):
a = x + y;
b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}.
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅, so the two statements can be executed simultaneously.
Dependency analysis
Example 2: Suppose the two statements are (in C):
a = x + y;
b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}.
I2 ∩ O1 = {a} ≠ ∅, so the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, a variable might be declared as
shared int x;
In C/C++, a global variable (int x; at file scope) is naturally shared among the threads of a process.
par construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
par { S1; S2; ...; Sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
par { proc1(); proc2(); ...; procn(); }
forall construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
forall (i = 0; i < n; i++) { S1; S2; ...; Sm; }
which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, ..., Sm. Each process uses a different value of i. For example,
forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Shared Data in Systems with Caches
Cache coherence protocols:
• In the update policy, copies of data in all caches are updated at the time one copy is altered.
• In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next references the data.
Shared Data in Systems with Caches
False sharing: the key characteristic here is that caches are organized in blocks of contiguous locations. False sharing occurs when different parts of a block are required by different processors, but not the same bytes.

Figure: false sharing (different processors update different bytes of the same cache block)
Shared Data in Systems with Caches
Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered by only one processor into different blocks.
The only complete way to avoid false sharing would be to place each element in a different block, which would create significant wastage of storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central memory or a distributed memory architecture.
Distributed memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote-memory-access) architectures.
Central memory systems are also known as UMA (uniform memory access) systems.
Central memory versus distributed memory
In a UMA system all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types:
• PVP, the parallel vector processor (also called a vector supercomputer)
• SMP, the symmetric multiprocessor
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.
Distributed-Memory architecture
Distributed-Memory architecture
In a NORMA machine:
• the node memories have separate address spaces
• a node can't directly access remote memory
• the only way to access remote data is by passing messages
Distributed-Memory architecture
In an NCC-NUMA machine:
• A typical example is the Cray T3E.
• Besides the local memory, each node has a set of node-level registers called E-registers.
• Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-Memory architecture
In a COMA machine:
• All local memories are structured as caches (called COMA caches).
• A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node.
• COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache coherency support in hardware.
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
CC-NUMA versus COMA
CC-NUMA: main memory consists of all the local memories.
COMA: main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of the five distributed-memory architectures
Context-switching policies
• Switch on cache miss: switch when encountering a cache miss.
• Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.
• Switch on every instruction: switch on every instruction, independent of whether or not it is a load.
• Switch on block of instructions: switching on blocks improves the cache-hit ratio due to the preservation of some locality, and also benefits single-context performance.
Pthread Thread
History: SUN Solaris, Windows NT, ... are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.
Standard: IEEE POSIX 1003.1c (1995).
Constructs for specifying Parallelism
Executing a Pthread Thread
int pthread_create(pthread_t *thread, pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);
Return value:
• success: a new thread is created and 0 is returned; *thread contains the thread ID
• failure: a nonzero error code is returned
Arguments:
• thread: a pointer of type pthread_t that receives the ID of the new thread
• attr: initial attributes for the thread; if attr is NULL, the attributes are initialized to default values
• start_routine: a reference to a function defined by the user, containing the code the new thread executes
• arg: a single argument passed to start_routine
pthread_t thread;   /* handle of the special Pthread datatype */
Executing a Pthread Thread (cont.)
pthread_exit(void *status);   /* terminates the calling thread */
pthread_cancel();             /* a thread is destroyed by another thread */
int pthread_join(pthread_t th, void **thread_return);
pthread_join() forces the calling thread to suspend its execution and wait until the thread with ID th terminates. *thread_return contains the return value (the value of the return statement or of pthread_exit()).
Detached threads: there are cases in which threads can be terminated without the need for pthread_join().
When detached threads terminate, they are destroyed and their resources released immediately => more efficient.
Figure: the main program issues pthread_create() calls to start several threads; each thread runs until its own termination.
Constructs for specifying Parallelism
Thread Pools
A master thread could control a collection of slave threads; a work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.
Constructs for specifying Parallelism
Thread-safe routines
System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.
Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.
Constructs for specifying Parallelism
Thread-safe routines (cont.)
Suppose that your application creates several threads, each of which makes a call to the same library routine. This library routine accesses/modifies a global structure or location in memory. As each thread calls this routine, it is possible that they may try to modify this global structure/memory location at the same time. If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.
Constructs for specifying Parallelism
Sharing Data
1. Creating Shared Data
2. Accessing Shared Data
   • Locks
   • Deadlock
   • Semaphores
   • Monitor
   • Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches
Sharing Data
Conditions for Deadlock
1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.
2. Hold-and-wait: a thread holds one resource while it waits for another.
3. No preemption: resources are released only voluntarily, after completion.
4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.
ALL four conditions MUST hold.
Handling Deadlock
• Deadlock prevention
• Deadlock avoidance
• Deadlock detection and recovery
• Ignore
Deadlock
Example. Figure: (a) deadlock between two processes; (b) n-process deadlock. R1, R2, ..., Rn are resources; P1, P2, ..., Pn are processes.
Semaphore
Semaphore
A positive integer operated upon by two operations, P and V. Its value is the number of units of the resource that are free.
A binary semaphore has the value 0 or 1; a general semaphore can take on positive values other than 0 and 1.
Semaphore (cont.)
P and V operations are performed indivisibly:
• P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
• V(s): increments s by 1, releasing one of the waiting processes (if any).
Semaphore (cont.)
The first process to reach its P(s) operation (or to be accepted) sets the semaphore to 0. Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
Semaphore (cont.)
When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes (if any) is allowed to proceed into the critical section.
Monitor
Disadvantages of semaphores:
• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (e.g. deadlock), since these errors happen only if particular execution sequences take place, and those sequences do not always occur.
• Incorrect use may be caused by an honest programming error or by an uncooperative programmer.
Monitor
Example of incorrect semaphore use:
Correct code:  ... wait(mutex); critical section; signal(mutex); ...
Wrong code:    ... signal(mutex); critical section; wait(mutex); ...
This incorrect code violates the mutual exclusion requirement.
Monitor
Example of incorrect semaphore use:
Correct code:  ... wait(mutex); critical section; signal(mutex); ...
Wrong code:    ... wait(mutex); critical section; wait(mutex); ...
This incorrect code causes deadlock.
Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology
Monitor
• If the programmer omits the wait() or the signal() in a critical section, or misplaces both, either mutual exclusion is violated or a deadlock will occur.
• If both processes below are simultaneously active, a deadlock occurs:
Process P1: ... wait(S); wait(Q); critical section; signal(S); signal(Q);
Process P2: ... wait(Q); wait(S); critical section; signal(Q); signal(S);
Monitor
• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
• A monitor is an approach to synchronizing operations on shared resources. A monitor includes:
  - a suite of procedures that provides the only method of accessing the shared resource
  - mutual exclusion among those procedures
  - the variables associated with the shared resource
  - invariants that must hold, to avoid conflicts
Monitor
• A type, or abstract data type, encapsulates private data with public methods to operate on that data.
• The monitor type provides a set of programmer-defined operations that are granted mutual exclusion within the monitor.
• The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.
Monitor
• A structure of the monitor type:

monitor monitor_name {
    /* shared variable declarations */
    procedure P1(...) { ... }
    procedure P2(...) { ... }
    ...
    procedure Pn(...) { ... }
    initialization_code(...) { ... }
}
Figure: structure of a monitor
Usage Monitor
• The monitor construct ensures that only one process at a time can be active within the monitor; consequently, the programmer does not need to code this synchronization constraint explicitly.
• The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; additional, tailored synchronization mechanisms are needed.
• Such additional tailored synchronization uses the condition construct.
Condition type
• Declaration: condition x, y;
• The only operations that can be invoked on a condition variable are wait() and signal():
  - x.wait(): the process invoking this operation is suspended until another process invokes x.signal()
  - x.signal(): resumes exactly one suspended process; if no process is suspended, x.signal() has no effect
Figure: structure of a monitor with condition variables
Condition variables
Condition variables allow threads to synchronize based upon the actual value of data.
Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be very time-consuming and unproductive, since the thread would be continuously busy in this activity.
A condition variable is always used in conjunction with a mutex lock.
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t *cond);
Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
Initializes the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialization, the state of the condition variable becomes initialized.
pthread_cond_t cond;   /* declare a condition variable */
Sharing Data
Pthread Condition Variables (cont.)
int pthread_cond_broadcast(pthread_cond_t *cond);
Unblocks all threads currently blocked on the specified condition variable cond.
int pthread_cond_destroy(pthread_cond_t *cond);
Destroys the given condition variable; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
Block on a condition variable; the second one allows a timeout to be specified.
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK
main
Thread 23
void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)
void watch_count(void t) pthread_mutex_lock(ampcount_mutex)
if (countltCOUNT_LIMIT)
pthread_cond_wait (hellip)
count += 125
pthread_mutex_unlock(hellip) pthread_exit(NULL)
Thread 1
Sharing Data
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtualmemory management system
Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space
Creating Shared Data (tt)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach shared segment to the data segmentof the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
Pthread Thread
History SUN Solaris Window NT hellip are examples the multithreaded operating systems allow users to employ threads in their programsbut each system is different
StandardIEEE POSIX 10031c
standard (1995)
Constructs for specifying Parallelism
Executing a Pthread Thread
int pthread_create(pthread_t thread pthread_attr_t attr
void (start_routine)(void ) void arg) Return value of function1048698success create a new thread 0 arg contain thread ID1048698fail ltgt 0 (error signature is contain in the variable of errno) Argumentsthread a pointer of type pthread_t contain ID of new thread1048698attr contain initial attr for thread if attr = NULL Attr are init by default value1048698start_routine la reference to a function defined by user This function contain
execute code for new thread1048698arg a single argument is passed for start_routine
Pthread_t thread Hanndle of specia Pthread datatype
Executing a Pthread Thread (cont.)
pthread_exit(void *status): terminates and destroys the calling thread.
pthread_cancel(): a thread is destroyed at the request of another thread.
int pthread_join(pthread_t th, void **thread_return): pthread_join() forces the calling thread to suspend its execution and wait until the thread with the given thread ID terminates; *thread_return receives the return value (the value of the return statement or of pthread_exit(...)).
Detached threads
There are cases in which threads can be terminated without the need for pthread_join.
When detached threads terminate, they are destroyed and their resources are released immediately.
=> More efficient
[Diagram: the main program calls pthread_create() several times; each created thread executes and reaches its own termination independently of the others.]
Constructs for specifying Parallelism
Thread Pools
A master thread could control a collection of slave threads; a work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.
Constructs for specifying Parallelism
Thread-safe routines
System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.
Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.
Constructs for specifying Parallelism
Thread-safe routines (cont.)
Suppose that your application creates several threads, each of which makes a call to the same library routine.
This library routine accesses/modifies a global structure or location in memory.
As each thread calls this routine, it is possible that they may try to modify this global structure/memory location at the same time.
If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.
Constructs for specifying Parallelism
Sharing Data
1. Creating Shared Data
2. Accessing Shared Data
   - Locks, Deadlock
   - Semaphores, Monitors, Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches
Sharing Data
Conditions for Deadlock
1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.
2. Hold-and-wait: a thread holds one resource while it waits for another.
3. No preemption: resources are released only voluntarily, after completion.
4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graphs.
ALL four conditions MUST hold.
Handling Deadlock
- Deadlock prevention
- Deadlock avoidance
- Deadlock detection and recovery
- Ignore the problem
Deadlock
[Figure 8.8(a): two-process deadlock. Figure 8.8(b): n-process deadlock. R1, R2, ..., Rn are resources; P1, P2, ..., Pn are processes.]
Semaphore
Semaphore
A semaphore is a positive integer operated upon by two operations, P and V.
Its value is the number of free units of the resource.
A binary semaphore has value 0 or 1; a general (counting) semaphore can take on positive values other than 0 and 1.
Semaphore
P and V operations are performed indivisibly:
P(s) waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
V(s) increments s by 1 to release one of the waiting processes (if any).
Semaphore
The first process to reach its P(s) operation and be accepted sets the semaphore to 0.
Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
Semaphore
When a process reaches its V(s) operation, it sets the semaphore s back to 1, and one of the waiting processes (if any) is allowed to proceed into the critical section.
Monitor
Disadvantage of semaphores:
- Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (e.g., deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.
- Incorrect use may be caused by an honest programming error or by an uncooperative programmer.
Monitor
Example of incorrect semaphore use:
Correct code: ... wait(mutex); critical section; signal(mutex); ...
Wrong code: ... signal(mutex); critical section; wait(mutex); ...
This incorrect code violates the mutual exclusion condition.
Monitor
Example of incorrect semaphore use:
Correct code: ... wait(mutex); critical section; signal(mutex); ...
Wrong code: ... wait(mutex); critical section; wait(mutex); ...
This incorrect code causes deadlock.
Faculty of Computer Science & Engineering - Ho Chi Minh City University of Technology
Monitor
- If the programmer omits the wait() or the signal() around a critical section, or both, then either mutual exclusion is violated or a deadlock will occur.
- With both processes simultaneously active, the following can deadlock (each process holds one semaphore while waiting for the other):
Process P1: ... Wait(S); Wait(Q); critical section; Signal(S); Signal(Q);
Process P2: ... Wait(Q); Wait(S); critical section; Signal(Q); Signal(S);
Monitor
- To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
- A monitor is an approach to synchronizing operations on a shared resource. A monitor includes: a set of procedures that provide the only way to access the shared resource; mutual exclusion between the threads that call them; the variables associated with the shared resource; and invariants that must hold to avoid race conditions.
Monitor
- A type, or abstract data type, encapsulates private data with public methods to operate on that data.
- The monitor type presents a set of programmer-defined operations that are provided with mutual exclusion within the monitor.
- The monitor type also contains the declarations of variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.
Monitor
bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()
procedure P2()
procedure Pn()
initialization_code ()
Structure of a monitor [figure]
Using monitors
- The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
- The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; additional, tailor-made synchronization mechanisms need to be defined.
- These additional tailor-made synchronization mechanisms use the condition construct.
Condition type
- Declaration: condition x, y;
- The only operations that can be invoked on a condition variable are wait() and signal():
  x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
  x.signal(): the process invoking this operation resumes exactly one suspended process.
- If no process is suspended, x.signal() has no effect.
Structure of a monitor with condition variables [figure]
Condition variables
Condition variables allow threads to synchronize based upon the actual value of data.
Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be a very time-consuming and unproductive exercise, since the thread would be continuously busy in this activity.
A condition variable is always used in conjunction with a mutex lock.
Sharing Data
Pthread Condition Variables
pthread_cond_t cond; /* declare a condition variable */
int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
Initialises the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the condition variable becomes initialised.
int pthread_cond_signal(pthread_cond_t *cond);
Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
Sharing Data
Pthread Condition Variables (cont.)
int pthread_cond_broadcast(pthread_cond_t *cond);
Unblocks all threads currently blocked on the specified condition variable cond.
int pthread_cond_destroy(pthread_cond_t *cond);
Destroys the given condition variable; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
Both block on a condition variable; the second one additionally allows a timeout to be specified.
Sharing Data
Sequence for using condition variable - example
This simple example demonstrates the use of several Pthread condition variable routines. The main routine creates three threads. Two of the threads perform work and update a count variable. The third thread waits until the count variable reaches a specified value.
Sharing Data
Sequence for using condition variable - example (cont)
main:
#include <pthread.h>
int count = 0;                         /* global variable: DECLARE */
pthread_mutex_t count_mutex;
pthread_cond_t  count_threshold_cv;
...
pthread_mutex_init(&count_mutex, NULL);        /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
...
pthread_create(...);                   /* CREATE THREADS TO DO WORK */

Threads 2, 3:
void *inc_count(void *t) {
    ...
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(...);
        ...
        pthread_mutex_unlock(&count_mutex);
        /* do some work so threads can alternate on the mutex lock */
        sleep(1);
    }
    pthread_exit(NULL);
}

Thread 1:
void *watch_count(void *t) {
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)
        pthread_cond_wait(...);
    count += 125;
    pthread_mutex_unlock(...);
    pthread_exit(NULL);
}
Sharing Data
The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages, as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual memory management system.
Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
shmget():
- Creates a shared memory segment.
- The return value is the shared memory ID.
shmat():
- Attaches the shared segment to the data segment of the calling process.
- Returns the starting address of the data segment.
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
Problem: CONFLICT
- Reading the variable by different processes does not cause a conflict.
- But writing a new value does.
Example: consider two processes, each of which is to add 1 to a shared data item x. The single instruction x = x + 1 becomes, for each process: read x, compute x + 1, write to x. Over time the two processes can interleave:
Process 1: read x; compute x + 1; write to x
Process 2: read x; compute x + 1; write to x
If both processes read x before either writes, one of the increments is lost.
[Figure: conflict in accessing shared data. Two processes each read the shared variable x, add 1, and write the result back; because both read before either writes, only one increment takes effect.]
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.
This mechanism is called mutual exclusion.
Locks
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
A lock is a variable containing the value 0 or 1:
lock = 1: a process has entered the critical section.
lock = 0: no process is in the critical section.
The lock operates much like a door lock: suppose a process reaches a lock that is set, indicating that the process is excluded from the critical section; it now has to wait until it is allowed to enter.
while (lock == 1);   /* do nothing: no operation in the while loop */
lock = 1;            /* enter critical section */
/* ... critical section ... */
lock = 0;            /* leave critical section */
A lock that is waited on in this way is called a spin lock, and the mechanism is known as busy waiting.
In some cases, it may be possible to deschedule the waiting process from the processor and schedule another process instead. This brings:
- Overhead in saving and restoring process information.
- The need to choose the best or highest-priority process to enter the critical section.
[Figure: Process 1 finds lock == 0, sets lock = 1, and enters its critical section; Process 2 meanwhile spins in its while loop. When Process 1 leaves and resets lock = 0, Process 2 sets lock = 1 and enters its critical section in turn.]
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).
They resolve synchronization problems between threads:
- A mutex is used to share resources among threads in an orderly fashion.
- It provides mutual exclusion between threads.
Note: a mutex is only used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.
#include <pthread.h>  /* header containing the mutex functions */
Declaration: pthread_mutex_t mutex;
Static initialization:
mutex = PTHREAD_MUTEX_INITIALIZER;
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex. 1: forall (i = 0; i < 5; i++) a[i] = 0;
All instances can be executed simultaneously.
Ex. 2: forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's conditions
- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered by process P(m).
Two processes P1 and P2 can be executed concurrently if all three conditions are satisfied:
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
Dependency analysis
Example 1: suppose the two statements are (in C):
a = x + y;
b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, and
I1 ∩ O2 = I2 ∩ O1 = O1 ∩ O2 = ∅
so the two statements can be executed simultaneously.
Dependency analysis
Example 2: suppose the two statements are (in C):
a = x + y;
b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, and
I2 ∩ O1 = {a} ≠ ∅
so the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared as:
shared int x;
In C/C++, a global variable such as int x; is shared among the threads of a process.
par Construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
par { s1; s2; ...; sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
par { proc1(); proc2(); ...; procn(); }
forall Construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
forall (i = 0; i < n; i++) { s1; s2; ...; sm; }
which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, ..., sm. Each process uses a different value of i.
For example,
forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Shared Data in Systems with Caches
Cache coherence protocols:
- In the update policy, copies of data in all caches are updated at the time one copy is altered.
- In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor makes reference to them.
Shared Data in Systems with Caches
False sharing:
- The key characteristic here is that caches are organized in blocks of contiguous locations.
- False sharing occurs when different parts of a block are required by different processors, but not the same bytes.
[Figure: false sharing — two processors update different words of the same cache block.]
Share Data in Systems with Caches
Solutions for false sharing:
- The compiler can alter the layout of the data stored in main memory, separating data altered by only one processor into different blocks.
- For an array, the only way to avoid false sharing would be to place each element in a different block, which would create significant wastage of storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central-memory or a distributed-memory architecture.
Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.
Central-memory systems are also known as UMA (uniform memory access) systems.
Central memory versus distributed memory
In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types:
- PVP: the parallel vector processor, also called a vector supercomputer.
- SMP: the symmetric multiprocessor.
Distributed-Memory architecture
A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory.
Memories in other nodes are called remote memories.
Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.
Distributed-Memory architecture
In a NORMA machine:
- The node memories have separate address spaces.
- A node can't directly access remote memory.
- The only way to access remote data is by passing messages.
Distributed-Memory architecture
In an NCC-NUMA machine:
- A typical example is the Cray T3E.
- Besides the local memory, each node has a set of node-level registers called E-registers.
- Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-Memory architecture
In a COMA machine:
- All local memories are structured as caches (called COMA caches).
- A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node.
- COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
CC-NUMA versus COMA
CC-NUMA: main memory consists of all the local memories.
COMA: main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of the five distributed-memory architectures
Executing a Pthread Thread
int pthread_create(pthread_t thread pthread_attr_t attr
void (start_routine)(void ) void arg) Return value of function1048698success create a new thread 0 arg contain thread ID1048698fail ltgt 0 (error signature is contain in the variable of errno) Argumentsthread a pointer of type pthread_t contain ID of new thread1048698attr contain initial attr for thread if attr = NULL Attr are init by default value1048698start_routine la reference to a function defined by user This function contain
execute code for new thread1048698arg a single argument is passed for start_routine
Pthread_t thread Hanndle of specia Pthread datatype
Executing a Pthread Thread(cont)
pthread_exit(void status) Terminate amp destroy a thread
pthread_cancel() Thread is destroyed by another process
int pthread_join(pthread_t th void thread_return) pthread_join() force call thread suspend its execution amp wait until thread having
thread ID terminates Thread_return contain the return value (value of return statement or pthread_exit(hellip) statement)
Detached ThreadThere are cases in which threads can
be terminated without needed of pthread_join
Detached Thread
When Detached Thread teminate they are destroyed amp their resource released
=gt More efficient
Main program
Pthread_create()
Termination
Thread
Pthread_create()
Pthread_create() Termination
Termination
Thread
Thread
Constructs for specifying Parallelism
Thread Pools
A master thread could control a collection of slave threads A work pool of threads could be formed Threads can communicate through shared location or as we shall see = using signals
Constructs for specifying Parallelism
Thread ndash safe routines
System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results
Thread ndash safeness an applications ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions
Constructs for specifying Parallelism
Thread ndash safe routines (cont)
Suppose that your application creates several threads each of which makes a call to the same library routine
This library routine accessesmodifies a global structure or location in memory
As each thread calls this routine it is possible that they may try to modify this global structurememory location at the same time
If the routine does not employ some sort of synchronization constructs to prevent data corruption then it is not thread-safe
fe
Constructs for specifying Parallelism
Sharing DataCreating Shared Data1
Accessing Shared Data2
Locks
Deadlock
Semaphores
Deadlock
Condition Variables
Language Constructs for Parallelism3
Dependency Analysis4
Shared Data in system with caches5
Sharing Data
Text in here
Text in here
Conditions for Deadlock
1 Mutual exclusion bull Resource can not be shared bull Requests are delayed until resource is released
2 Hold-and-wait bull Thread holds one resource while waits for another
3 No preemption bull Resources are released voluntarily after completion
4 Circular wait bull Circular dependencies exist in ldquowaits-forrdquo or ldquoresource-allocationrdquo graphs
ALL four conditions MUST hold
wwwthemegallerycom Company Logo
Handing Deadlock
Text
Text
Text
Txt
Deadlock prevention Deadlock avoidance
Deadlock detection and recovery
Ignore
wwwthemegallerycom Company Logo
Deadlock
88b n-processes deadlock 88a 2 processes dealock
R1R2hellipRn lagrave tagravei nguyecircn P1P2hellipPn lagrave quaacute trigravenh
Example
LOGO
Semaphore
wwwthemegallerycom Company Logo
sEMaPHoRE
A positive integer operated upon by 2 operations P amp V
The value is the number of the units of the resource which are free
A binary semaphore has value 0 or 1A general semaphore can take on
positive values other than 0 and 1
wwwthemegallerycom Company Logo
sEMaPHoRE
P amp V operations are performed indivisibly
P(s) waits until s is greater than 0 then decrements s by 1 allows the process to continue
V(s) increments s by 1 to release one of the waiting processes (if any)
wwwthemegallerycom Company Logo
sEMaPHoRE
The first process reach its P(s) operation or to be accepted will set the semaphore to 0
Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore
wwwthemegallerycom Company Logo
sEMaPHoRE
When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section
Monitor
Disadvange of Semaphorebull Although semaphore provide a
convenient and effective mechanism for process synchronization using them incorrectly can result in timing error that are difficult to detect (deadlock) since these errors happen only if some particular execution sequences take place and these sequences do not always occur
bull Using incorrectly may be caused by an honest programming error or an uncooperative programmer
Shared memory multiprocessor
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur
bull Both processes are simultaneously active will cause a deadlock
Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type
bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only
method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events
conflict
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A type or abstract data type encapsulates private data with public method to operate on that data
bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor
bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()
procedure P2()
procedure Pn()
initialization_code ()
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Usage Monitor
bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly
bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization
bull Some addiontional synchronization ldquotailor moderdquo use conditional construct
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor conditional type
Condition variables
Condition variablesAllow threads to synchronize based
upon the actual value of data Without condition variables the
programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity
Always used in conjunction with a mutex lock
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified
condition variable cond (if any threads are blocked on cond)
int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)
initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised
Pthread_cond_t cond Declare condition variable
Sharing Data
Pthread Condition Variables(cont)
int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable
cond
int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in
effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined
int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)
int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)
block on a condition variable the second one alow to appoint timeout
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK
main
Thread 23
void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)
void watch_count(void t) pthread_mutex_lock(ampcount_mutex)
if (countltCOUNT_LIMIT)
pthread_cond_wait (hellip)
count += 125
pthread_mutex_unlock(hellip) pthread_exit(NULL)
Thread 1
Sharing Data
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtual memory management system.
Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
shmget()
Creates a shared memory segment.
The return value is the shared memory ID.
shmat()
Attaches the shared segment to the data segment of the calling process.
Returns the starting address of the segment.
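A minimal sketch of these two calls in action (our own illustration, error handling omitted): a segment is created, attached, written by a child process, and the update is visible to the parent because both attach the same physical memory.

```c
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create one shared int, let a forked child update it, observe the
   update in the parent.  Sketch only: error handling omitted. */
int shm_demo(void)
{
    int shmid = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600); /* create segment */
    int *x = (int *)shmat(shmid, NULL, 0);   /* attach to this process's address space */
    *x = 0;
    if (fork() == 0) {                       /* child: sees the same physical memory */
        *x = 42;
        shmdt(x);
        _exit(0);
    }
    wait(NULL);                              /* parent: wait for the child's write */
    int result = *x;
    shmdt(x);                                /* detach ...                          */
    shmctl(shmid, IPC_RMID, NULL);           /* ... and mark the segment for removal */
    return result;
}
```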
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
Problem: CONFLICT
o Reading the variable by different processes does not cause a conflict.
o But writing a new value does cause a conflict.
Example: consider two processes, each of which is to add 1 to a shared data item, x.
Instruction    Process 1        Process 2
x = x + 1      read x           read x
               compute x + 1    compute x + 1
               write to x       write to x
(time increases downward)
Conflict in accessing shared data
[Figure: both processes read the shared variable x, each computes x + 1, and both write back, so one increment is lost]
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This mechanism is called mutual exclusion.
lock
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
A lock is a variable containing the value 0 or 1:
lock = 1: a process has entered the critical section
lock = 0: no process is in the critical section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
    while (lock == 1) ;   /* no operation in while loop: busy-wait */
    lock = 1;             /* enter critical section  */
    ... critical section ...
    lock = 0;             /* leave critical section  */
Such a lock is called a spin lock, and the mechanism is called busy waiting.
In some cases it may be possible to deschedule the process from the processor and schedule another process instead, but this has costs:
Overhead in saving and restoring process information.
It is necessary to choose the best or highest-priority process to enter the critical section.
Process 1                          Process 2
while (lock == 1) do_nothing;      while (lock == 1) do_nothing;
lock = 1;                          lock = 1;
  critical section                   critical section
lock = 0;                          lock = 0;
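As written, the busy-wait code is itself racy: two processes can both observe lock == 0 and both set it to 1, entering the critical section together. Real spin locks therefore use an atomic read-modify-write instruction. A sketch with C11 atomics (names are ours, not from the slides):

```c
#include <stdatomic.h>
#include <pthread.h>

atomic_int lock_var = 0;        /* 0 = free, 1 = held */
long counter = 0;               /* shared data protected by the lock */

/* atomic_exchange reads the old value and writes 1 in one indivisible
   step, so only one thread can see the 0 -> 1 transition. */
void spin_lock(atomic_int *l)   { while (atomic_exchange(l, 1) == 1) ; }
void spin_unlock(atomic_int *l) { atomic_store(l, 0); }

void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        spin_lock(&lock_var);
        counter++;              /* critical section */
        spin_unlock(&lock_var);
    }
    return NULL;
}
```

With two workers the final counter is exactly 200000; with the naive test-then-set loop, increments would be lost.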
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).
They resolve synchronization problems among threads.
A mutex is used to grant threads access to a shared resource in turn.
It provides mutual exclusion between threads.
Note
A mutex can only synchronize threads within the same process; it cannot synchronize threads belonging to different processes.
www.themegallery.com
#include <pthread.h>   /* header declaring the mutex functions */
Declare a variable: pthread_mutex_t mutex;
Static initialization:
mutex = PTHREAD_MUTEX_INITIALIZER;
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
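A minimal sketch of these routines in use (the variable and function names are our own): the mutex is locked around the critical section and unlocked afterwards, so concurrent increments are never lost.

```c
#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;  /* static initialization */
int shared = 0;                                 /* data protected by m   */

void *add_one(void *arg)
{
    pthread_mutex_lock(&m);     /* blocks until the mutex is free    */
    shared++;                   /* critical section                  */
    pthread_mutex_unlock(&m);   /* lets one waiting thread proceed   */
    return NULL;
}
```

pthread_mutex_trylock() behaves like pthread_mutex_lock() except that it returns immediately with an error code instead of blocking when the mutex is already held.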
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex1: forall (i = 0; i < 5; i++) a[i] = 0;
All instances can be executed simultaneously.
Ex2: forall (i = 2; i < 6; i++) {
         x = i - 2*i + i*i;
         a[i] = a[x];
     }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's conditions
I(n) is the set of memory locations read by process Pn.
O(m) is the set of memory locations altered by process Pm.
Two processes P1 and P2 can be executed concurrently if all three of the following conditions are satisfied:
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
Dependency analysis
Example 1: Suppose the two statements are (in C):
a = x + y;
b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}.
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅, so the two statements can be executed simultaneously.
Dependency analysis
Example 2: Suppose the two statements are (in C):
a = x + y;
b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}.
I2 ∩ O1 = {a} ≠ ∅, so the two statements cannot be executed simultaneously.
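The two examples can be checked mechanically. A small sketch (our own helper, not from the slides), with the read set I and write set O of each statement given as strings of one-letter variable names:

```c
#include <string.h>
#include <stdbool.h>

/* true if the two sets of one-letter names share no element */
static bool disjoint(const char *a, const char *b)
{
    for (; *a; a++)
        if (strchr(b, *a))
            return false;
    return true;
}

/* Bernstein's conditions: I1 n O2 = I2 n O1 = O1 n O2 = empty */
bool can_run_concurrently(const char *I1, const char *O1,
                          const char *I2, const char *O2)
{
    return disjoint(I1, O2) && disjoint(I2, O1) && disjoint(O1, O2);
}
```

For Example 1 this returns true; for Example 2 it returns false, because I2 and O1 share the variable a.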
Language Constructs for Parallelism
Shared data
In a parallel programming language supporting shared memory, a shared variable might be declared as:
shared int x;
(elsewhere written with a global qualifier: global int x;)
par construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
par { s1; s2; ...; sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
par { proc1(); proc2(); ...; procn(); }
forall construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
forall (i = 0; i < n; i++) { s1; s2; ...; sm; }
which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, ..., sm. Each process uses a different value of i.
forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
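C itself has no forall; with Pthreads, each iteration of the example above becomes a thread. A sketch (all names are ours), passing the loop index by value through the argument pointer:

```c
#include <pthread.h>

#define N 5
int a[N];

void *clear_elem(void *arg)
{
    long i = (long)arg;         /* index smuggled in through the pointer */
    a[i] = 0;                   /* body of the forall */
    return NULL;
}

/* equivalent of: forall (i = 0; i < 5; i++) a[i] = 0; */
void forall_clear(void)
{
    pthread_t t[N];
    for (long i = 0; i < N; i++)
        pthread_create(&t[i], NULL, clear_elem, (void *)i);
    for (int i = 0; i < N; i++)     /* forall joins implicitly at its end */
        pthread_join(t[i], NULL);
}
```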
Shared Data in Systems with Caches
Cache coherence protocols
In the update policy, copies of data in all caches are updated at the time one copy is altered.
In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next references them.
Shared Data in Systems with Caches
False sharing
The key characteristic exploited is that caches are organized in blocks of contiguous locations. False sharing occurs when different processors require different parts of the same block, but not the same bytes.
Shared Data in Systems with Caches
Solution for false sharing
The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.
The only way to avoid false sharing entirely would be to place each element in a different block, which would waste significant storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central-memory or a distributed-memory architecture.
Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote-memory access) architectures.
Central-memory systems are also known as UMA (uniform memory access) systems.
Central memory versus distributed memory
In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types: PVP, the parallel vector processor (also called a vector supercomputer), and SMP, the symmetric multiprocessor.
Distributed-memory architecture
A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory.
Memories in other nodes are called remote memories.
Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA and COMA.
Distributed-memory architecture
In a NORMA machine:
The node memories have separate address spaces.
A node can't directly access remote memory.
The only way to access remote data is by passing messages.
Distributed-memory architecture
In an NCC-NUMA machine:
A typical example is the Cray T3E.
Besides the local memory, each node has a set of node-level registers called E-registers.
Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-memory architecture
In a COMA machine:
All local memories are structured as caches (called COMA caches).
A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node.
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA and COMA
An NCC-NUMA system doesn't have hardware support for cache coherence, whereas CC-NUMA and COMA provide cache coherence support in hardware.
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
In CC-NUMA, main memory consists of all the local memories.
In COMA, main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of five distributed-memory architectures
Executing a Pthread Thread (cont.)
pthread_exit(void *status): terminates and destroys the calling thread.
pthread_cancel(): a thread is destroyed by another thread.
int pthread_join(pthread_t th, void **thread_return): forces the calling thread to suspend its execution and wait until the thread with the given thread ID terminates; *thread_return contains the return value (the value of a return statement or of a pthread_exit() call).
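A small sketch of pthread_join() collecting a thread's return value (the function names are ours):

```c
#include <pthread.h>

void *square(void *arg)
{
    long n = (long)arg;
    return (void *)(n * n);      /* delivered to *thread_return in pthread_join */
}

long run_square(long n)
{
    pthread_t th;
    void *ret;
    pthread_create(&th, NULL, square, (void *)n);
    pthread_join(th, &ret);      /* suspend until th terminates */
    return (long)ret;
}
```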
Detached Thread
There are cases in which threads can be terminated without the need for pthread_join.
When detached threads terminate, they are destroyed and their resources released.
=> More efficient
[Diagram: the main program issues pthread_create() three times; each created thread runs and terminates independently, without being joined]
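A sketch of a detached thread (our own illustration): since a detached thread cannot be joined, the main thread here waits on a semaphore that the detached thread posts, instead of waiting for a join.

```c
#include <pthread.h>
#include <semaphore.h>

sem_t done;                      /* posted by the detached thread */

void *task(void *arg)
{
    /* ... work ... */
    sem_post(&done);
    return NULL;                 /* resources reclaimed automatically: no join */
}

int run_detached(void)
{
    sem_init(&done, 0, 0);
    pthread_t th;
    pthread_create(&th, NULL, task, NULL);
    pthread_detach(th);          /* pthread_join is neither needed nor allowed */
    sem_wait(&done);             /* wait for the signal, not for a join */
    return 1;
}
```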
Constructs for specifying Parallelism
Thread Pools
A master thread could control a collection of slave threads. A work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.
Constructs for specifying Parallelism
Thread-safe routines
System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.
Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.
Constructs for specifying Parallelism
Thread-safe routines (cont.)
Suppose that your application creates several threads each of which makes a call to the same library routine
This library routine accesses/modifies a global structure or location in memory.
As each thread calls this routine it is possible that they may try to modify this global structurememory location at the same time
If the routine does not employ some sort of synchronization constructs to prevent data corruption then it is not thread-safe
Constructs for specifying Parallelism
Sharing Data
1. Creating Shared Data
2. Accessing Shared Data: Locks, Deadlock, Semaphores, Monitor, Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches
Sharing Data
Conditions for Deadlock
1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.
2. Hold-and-wait: a thread holds one resource while it waits for another.
3. No preemption: resources are released only voluntarily, after completion.
4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.
ALL four conditions MUST hold.
Handling Deadlock
Deadlock prevention
Deadlock avoidance
Deadlock detection and recovery
Ignore the problem
Deadlock
[Figure 8.8(a): deadlock between two processes; Figure 8.8(b): n-process deadlock. R1, R2, ..., Rn are resources; P1, P2, ..., Pn are processes.]
Example
Semaphore
A semaphore is a positive integer operated upon by two operations, P and V.
Its value is the number of units of the resource that are free.
A binary semaphore has value 0 or 1; a general semaphore can take on positive values other than 0 and 1.
Semaphore
P and V operations are performed indivisibly.
P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
V(s): increments s by 1 to release one of the waiting processes (if any).
Semaphore
The first process to reach its P(s) operation (or to be accepted) sets the semaphore to 0.
Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
Semaphore
When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes is allowed to proceed into the critical section.
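P and V can themselves be built from a mutex and a condition variable. A sketch of a general (counting) semaphore (the type and names are ours):

```c
#include <pthread.h>

typedef struct {
    int s;                      /* number of free units of the resource */
    pthread_mutex_t m;
    pthread_cond_t  c;
} sema_t;

void sema_init(sema_t *x, int value)
{
    x->s = value;
    pthread_mutex_init(&x->m, NULL);
    pthread_cond_init(&x->c, NULL);
}

void P(sema_t *x)               /* wait until s > 0, then decrement */
{
    pthread_mutex_lock(&x->m);
    while (x->s == 0)
        pthread_cond_wait(&x->c, &x->m);
    x->s--;
    pthread_mutex_unlock(&x->m);
}

void V(sema_t *x)               /* increment and wake one waiter */
{
    pthread_mutex_lock(&x->m);
    x->s++;
    pthread_cond_signal(&x->c);
    pthread_mutex_unlock(&x->m);
}
```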
Monitor
Disadvantages of semaphores
Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (e.g., deadlock), since these errors happen only when particular execution sequences take place, and these sequences do not always occur.
Incorrect use may be caused by an honest programming error or by an uncooperative programmer.
Monitor
Example of incorrect semaphore use
Correct code: ... wait(mutex); critical section; signal(mutex); ...
Wrong code: ... signal(mutex); critical section; wait(mutex); ...
This incorrect code violates mutual exclusion.
Monitor
Example of incorrect semaphore use
Correct code: ... wait(mutex); critical section; signal(mutex); ...
Wrong code: ... wait(mutex); critical section; wait(mutex); ...
This incorrect code causes deadlock.
Faculty of Computer Science & Engineering - Ho Chi Minh City University of Technology
Monitor
If the programmer omits the wait() or the signal() in a critical section, or both, then either mutual exclusion is violated or a deadlock will occur.
If both processes below are simultaneously active, acquiring S and Q in opposite orders, a deadlock occurs:
Process P1: ... wait(S); wait(Q); critical section; signal(S); signal(Q);
Process P2: ... wait(Q); wait(S); critical section; signal(Q); signal(S);
Monitor
To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
A monitor is an approach to synchronizing operations on a shared resource. A monitor includes:
a set of procedures that provide the only way to access the shared resource;
mutual exclusion among those procedures;
the variables associated with the shared resource;
invariants that must hold, to avoid race conditions.
Monitor
A type, or abstract data type, encapsulates private data with public methods to operate on that data.
The monitor type presents a set of programmer-defined operations that are provided with mutual exclusion within the monitor.
The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of procedures or functions that operate on those variables.
Monitor
A structure of the monitor type:
monitor monitor_name {
    /* shared variable declarations */
    procedure P1(...) { ... }
    procedure P2(...) { ... }
    ...
    procedure Pn(...) { ... }
    initialization_code(...) { ... }
}
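C has no monitor construct, but the idea can be sketched as a struct whose operations all acquire one internal mutex, so at most one thread is active "inside the monitor" at a time (our own minimal example):

```c
#include <pthread.h>

/* A tiny "monitor": the counter is private, and every public
   operation takes the internal lock, giving mutual exclusion. */
typedef struct {
    pthread_mutex_t lock;
    int value;
} counter_monitor;

void monitor_init(counter_monitor *m)
{
    pthread_mutex_init(&m->lock, NULL);
    m->value = 0;
}

void monitor_inc(counter_monitor *m)       /* procedure P1 */
{
    pthread_mutex_lock(&m->lock);
    m->value++;
    pthread_mutex_unlock(&m->lock);
}

int monitor_get(counter_monitor *m)        /* procedure P2 */
{
    pthread_mutex_lock(&m->lock);
    int v = m->value;
    pthread_mutex_unlock(&m->lock);
    return v;
}
```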
Structure of a monitor
Usage Monitor
The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; additional "tailor-made" synchronization mechanisms need to be defined.
Such tailor-made synchronization uses the condition construct.
Condition type
Declare: condition x, y;
The only operations that can be invoked on a condition variable are wait() and signal():
x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
x.signal(): the process invoking this operation resumes exactly one suspended process.
If no process is suspended, x.signal() has no effect.
Structure of a monitor with condition variables
int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in
effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined
int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)
int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)
block on a condition variable the second one alow to appoint timeout
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK
main
Thread 23
void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)
void watch_count(void t) pthread_mutex_lock(ampcount_mutex)
if (countltCOUNT_LIMIT)
pthread_cond_wait (hellip)
count += 125
pthread_mutex_unlock(hellip) pthread_exit(NULL)
Thread 1
Sharing Data
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtualmemory management system
Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space
Creating Shared Data (tt)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach shared segment to the data segmentof the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
Detached ThreadThere are cases in which threads can
be terminated without needed of pthread_join
Detached Thread
When Detached Thread teminate they are destroyed amp their resource released
=gt More efficient
Main program
Pthread_create()
Termination
Thread
Pthread_create()
Pthread_create() Termination
Termination
Thread
Thread
Constructs for specifying Parallelism
Thread Pools
A master thread could control a collection of slave threads A work pool of threads could be formed Threads can communicate through shared location or as we shall see = using signals
Constructs for specifying Parallelism
Thread ndash safe routines
System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results
Thread ndash safeness an applications ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions
Constructs for specifying Parallelism
Thread ndash safe routines (cont)
Suppose that your application creates several threads each of which makes a call to the same library routine
This library routine accessesmodifies a global structure or location in memory
As each thread calls this routine it is possible that they may try to modify this global structurememory location at the same time
If the routine does not employ some sort of synchronization constructs to prevent data corruption then it is not thread-safe
fe
Constructs for specifying Parallelism
Sharing DataCreating Shared Data1
Accessing Shared Data2
Locks
Deadlock
Semaphores
Deadlock
Condition Variables
Language Constructs for Parallelism3
Dependency Analysis4
Shared Data in system with caches5
Sharing Data
Text in here
Text in here
Conditions for Deadlock
1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.
2. Hold-and-wait: a thread holds one resource while it waits for another.
3. No preemption: resources are released only voluntarily, after completion.
4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.
ALL four conditions MUST hold.
Handling Deadlock
Approaches: deadlock prevention; deadlock avoidance; deadlock detection and recovery; or ignore the problem.
Deadlock
(Example figures: 8.8a, deadlock between two processes; 8.8b, n-process deadlock. R1, R2, …, Rn are resources; P1, P2, …, Pn are processes.)
Semaphore
Semaphore
A non-negative integer operated upon by two operations, P and V.
Its value is the number of units of the resource that are free.
A binary semaphore has value 0 or 1; a general semaphore can also take on positive values greater than 1.
Semaphore
P and V operations are performed indivisibly:
P(s) waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
V(s) increments s by 1 to release one of the waiting processes (if any).
Semaphore
The first process to reach its P(s) operation (i.e., to be accepted) will set the semaphore to 0.
Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
Semaphore
When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes (if any) is allowed to proceed into the critical section.
Monitor
Disadvantage of Semaphores
Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (e.g., deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.
Incorrect use may be caused by an honest programming error or an uncooperative programmer.
Monitor
Example: wrong semaphore use
Right code:
  … Wait(mutex); critical section; Signal(mutex); …
Wrong code:
  … Signal(mutex); critical section; Wait(mutex); …
This incorrect code violates mutual exclusion.
Monitor
Example: wrong semaphore use
Right code:
  … Wait(mutex); critical section; Signal(mutex); …
Wrong code:
  … Wait(mutex); critical section; Wait(mutex); …
This incorrect code causes a deadlock.
Khoa Khoa Học & Kĩ thuật Máy tính - Đại học Bách Khoa TpHCM
Monitor
If the programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock will occur.
When both processes are simultaneously active, acquiring the semaphores in opposite orders causes a deadlock:
Process P1: … Wait(S); Wait(Q); critical section; Signal(S); Signal(Q)
Process P2: … Wait(Q); Wait(S); critical section; Signal(Q); Signal(S)
Monitor
To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
A monitor is an approach to synchronizing operations on shared resources. A monitor includes:
- a suite of procedures that provide the only method of access to a shared resource;
- mutual exclusion among those procedures;
- variables associated with the shared resource;
- invariants that are assumed to hold, to avoid conflicts.
Monitor
A type, or abstract data type, encapsulates private data with public methods that operate on that data.
The monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.
The monitor type also contains the declarations of variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.
Monitor
A structure of the monitor type:
monitor monitor_name {
    /* shared variable declarations */
    procedure P1 (…) { … }
    procedure P2 (…) { … }
    …
    procedure Pn (…) { … }
    initialization_code (…) { … }
}
Structure Monitor
Monitor usage
The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; additional "tailor-made" synchronization mechanisms need to be defined.
Such additional tailor-made synchronization uses the condition construct.
Condition type
Declare: condition x, y;
The only operations that can be invoked on a condition variable are wait() and signal():
x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
x.signal(): the process invoking this operation resumes exactly one suspended process.
If no process is suspended, x.signal() has no effect.
Structure of a monitor with condition types (figure)
Condition variables
Condition variables allow threads to synchronize based upon the actual value of data. Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be a very time-consuming and unproductive exercise, since the thread would be continuously busy in this activity.
Condition variables are always used in conjunction with a mutex lock.
Sharing Data
Pthread Condition Variables
pthread_cond_t cond; — declares a condition variable.
int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
  Initialises the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.
int pthread_cond_signal(pthread_cond_t *cond);
  Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
Sharing Data
Pthread Condition Variables (cont.)
int pthread_cond_broadcast(pthread_cond_t *cond);
  Unblocks all threads currently blocked on the specified condition variable cond.
int pthread_cond_destroy(pthread_cond_t *cond);
  Destroys the given condition variable; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
  Both block on a condition variable; the second one allows a timeout to be specified.
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variables - example (cont.)

main:
  #include <pthread.h>
  int count = 0;                           /* global variable (DECLARE) */
  pthread_mutex_t count_mutex;
  pthread_cond_t count_threshold_cv;
  …
  pthread_mutex_init(&count_mutex, NULL);        /* INITIALIZE */
  pthread_cond_init(&count_threshold_cv, NULL);
  …
  pthread_create(…);                       /* CREATE THREADS TO DO WORK */

Threads 2, 3:
  void *inc_count(void *t) {
      …
      for (i = 0; i < TCOUNT; i++) {
          pthread_mutex_lock(&count_mutex);
          count++;
          if (count == COUNT_LIMIT)
              pthread_cond_signal(…);
          …
          pthread_mutex_unlock(&count_mutex);
          sleep(1);   /* do some work so threads can alternate on the mutex lock */
      }
      pthread_exit(NULL);
  }

Thread 1:
  void *watch_count(void *t) {
      pthread_mutex_lock(&count_mutex);
      while (count < COUNT_LIMIT)
          pthread_cond_wait(…);
      count += 125;
      pthread_mutex_unlock(…);
      pthread_exit(NULL);
  }
Sharing Data
The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages, as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual memory management system.
Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
shmget(): creates a shared memory segment; the return value is the shared memory ID.
shmat(): attaches the shared segment to the data segment of the calling process and returns the starting address of the segment.
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
Problem: CONFLICT.
o Reading the variable by different processes does not cause a conflict.
o But writing a new value does.
Example: Consider two processes, each of which is to add 1 to a shared data item x.
Instruction    Process 1        Process 2
x = x + 1      read x           read x
               compute x + 1    compute x + 1
               write to x       write to x
(time increases downward)
(Figure: conflict in accessing shared variable x — each process reads x, adds 1, and writes the result back.)
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish the sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.
This mechanism is called mutual exclusion.
lock
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
A lock is a variable containing the value 0 or 1:
lock = 1: a process has entered the critical section.
lock = 0: no process is in the critical section.
The lock operates much like a door lock: suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section. It then has to wait until it is allowed to enter.
while (lock == 1) ;   /* no operation in while loop: busy-wait */
lock = 1;             /* enter critical section */
  … critical section …
lock = 0;             /* leave critical section */
A lock of this kind is called a spin lock, and the mechanism is called busy waiting.
In some cases it may be possible to deschedule the process from the processor and schedule another process instead. This incurs overhead in saving and restoring process information, and it is then necessary to choose the best or highest-priority process to enter the critical section.
Process 1                          Process 2
while (lock == 1) ;  /* enters */
lock = 1;                          while (lock == 1) ;  /* spins */
  critical section
lock = 0;                          lock = 1;
                                     critical section
                                   lock = 0;
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes), which resolve synchronization problems between threads.
A mutex grants threads access to shared resources in turn, providing mutual exclusion between threads.
Note: a mutex can only synchronize threads within the same process; it cannot synchronize threads belonging to different processes.
#include <pthread.h>  /* the library containing the mutex functions */
Declare a variable: pthread_mutex_t mutex;
Static initialization:
  mutex = PTHREAD_MUTEX_INITIALIZER;
  mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
  int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
  int pthread_mutex_lock(pthread_mutex_t *mutex);
  int pthread_mutex_unlock(pthread_mutex_t *mutex);
  int pthread_mutex_trylock(pthread_mutex_t *mutex);
  int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex. 1: forall (i = 0; i < 5; i++) a[i] = 0;
All instances can be executed simultaneously.
Ex. 2: forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's conditions
I(n) is the set of memory locations read by process P(n).
O(m) is the set of memory locations altered by process P(m).
If the three conditions
  I1 ∩ O2 = ∅
  I2 ∩ O1 = ∅
  O1 ∩ O2 = ∅
are all satisfied, the two processes can be executed concurrently.
Dependency analysis
Example 1: Suppose the two statements are (in C):
  a = x + y;
  b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, so
  I1 ∩ O2 = I2 ∩ O1 = O1 ∩ O2 = ∅
and the two statements can be executed simultaneously.
Dependency analysis
Example 2: Suppose the two statements are (in C):
  a = x + y;
  b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, so
  I2 ∩ O1 = {a} ≠ ∅
and the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: In a parallel programming language supporting shared memory, a shared variable might be declared as:
  shared int x;
(compare an ordinary global int x in C/C++).
par Construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
  par { s1; s2; …; sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
  par { proc1(); proc2(); …; procn(); }
forall Construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
  forall (i = 0; i < n; i++) { s1; s2; …; sm; }
which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, …, sm. Each process uses a different value of i.
Example:
  forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Shared Data in Systems with Caches
Cache coherence protocols:
In the update policy, copies of data in all caches are updated at the time one copy is altered.
In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are only updated when the associated processor makes a reference to them.
Shared Data in Systems with Caches
False sharing:
The key characteristic used is that caches are organized in blocks of contiguous locations.
False sharing occurs when different parts of a block are required by different processors, but not the same bytes.
Shared Data in Systems with Caches
Solution for false sharing:
The compiler can alter the layout of the data stored in main memory, separating data altered by only one processor into different blocks.
The only way to avoid false sharing within a single array would be to place each element in a different block, which would create significant wastage of storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central memory or a distributed memory architecture.
Distributed memory systems include NUMA (non-uniform memory access) and NORMA (no-remote memory access) architectures.
Central memory systems are also known as UMA (uniform memory access) systems.
Central memory versus distributed memory
In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types:
PVP: the parallel vector processor, also called a vector supercomputer.
SMP: the symmetric multiprocessor.
Distributed-Memory architecture
A distributed memory computer contains multiple nodes, each having one or more processors and a local memory. Memories in other nodes are called remote memories.
Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.
Distributed-Memory architecture
In a NORMA machine:
The node memories have separate address spaces.
A node can't directly access remote memory.
The only way to access remote data is by passing messages.
Distributed-Memory architecture
In an NCC-NUMA machine:
A typical example is the Cray T3E.
Besides the local memory, each node has a set of node-level registers called E-registers.
Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-Memory architecture
In a COMA machine:
All local memories are structured as caches (called COMA caches).
A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node.
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA and COMA
An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache coherency support in hardware.
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
In CC-NUMA, main memory consists of all the local memories.
In COMA, main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of five Distributed-Memory architectures (table)
LOGO
Thread Pools
A master thread could control a collection of slave threads A work pool of threads could be formed Threads can communicate through shared location or as we shall see = using signals
Constructs for specifying Parallelism
Thread ndash safe routines
System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results
Thread ndash safeness an applications ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions
Constructs for specifying Parallelism
Thread ndash safe routines (cont)
Suppose that your application creates several threads each of which makes a call to the same library routine
This library routine accessesmodifies a global structure or location in memory
As each thread calls this routine it is possible that they may try to modify this global structurememory location at the same time
If the routine does not employ some sort of synchronization constructs to prevent data corruption then it is not thread-safe
fe
Constructs for specifying Parallelism
Sharing DataCreating Shared Data1
Accessing Shared Data2
Locks
Deadlock
Semaphores
Deadlock
Condition Variables
Language Constructs for Parallelism3
Dependency Analysis4
Shared Data in system with caches5
Sharing Data
Text in here
Text in here
Conditions for Deadlock
1 Mutual exclusion bull Resource can not be shared bull Requests are delayed until resource is released
2 Hold-and-wait bull Thread holds one resource while waits for another
3 No preemption bull Resources are released voluntarily after completion
4 Circular wait bull Circular dependencies exist in ldquowaits-forrdquo or ldquoresource-allocationrdquo graphs
ALL four conditions MUST hold
wwwthemegallerycom Company Logo
Handing Deadlock
Text
Text
Text
Txt
Deadlock prevention Deadlock avoidance
Deadlock detection and recovery
Ignore
wwwthemegallerycom Company Logo
Deadlock
88b n-processes deadlock 88a 2 processes dealock
R1R2hellipRn lagrave tagravei nguyecircn P1P2hellipPn lagrave quaacute trigravenh
Example
LOGO
Semaphore
wwwthemegallerycom Company Logo
sEMaPHoRE
A positive integer operated upon by 2 operations P amp V
The value is the number of the units of the resource which are free
A binary semaphore has value 0 or 1A general semaphore can take on
positive values other than 0 and 1
wwwthemegallerycom Company Logo
sEMaPHoRE
P amp V operations are performed indivisibly
P(s) waits until s is greater than 0 then decrements s by 1 allows the process to continue
V(s) increments s by 1 to release one of the waiting processes (if any)
wwwthemegallerycom Company Logo
sEMaPHoRE
The first process reach its P(s) operation or to be accepted will set the semaphore to 0
Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore
wwwthemegallerycom Company Logo
sEMaPHoRE
When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section
Monitor
Disadvange of Semaphorebull Although semaphore provide a
convenient and effective mechanism for process synchronization using them incorrectly can result in timing error that are difficult to detect (deadlock) since these errors happen only if some particular execution sequences take place and these sequences do not always occur
bull Using incorrectly may be caused by an honest programming error or an uncooperative programmer
Shared memory multiprocessor
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur
bull Both processes are simultaneously active will cause a deadlock
Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type
bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only
method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events
conflict
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A type or abstract data type encapsulates private data with public method to operate on that data
bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor
bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()
procedure P2()
procedure Pn()
initialization_code ()
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Usage Monitor
bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly
bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization
bull Some addiontional synchronization ldquotailor moderdquo use conditional construct
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor conditional type
Condition variables
Condition variablesAllow threads to synchronize based
upon the actual value of data Without condition variables the
programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity
Always used in conjunction with a mutex lock
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified
condition variable cond (if any threads are blocked on cond)
int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)
initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised
Pthread_cond_t cond Declare condition variable
Sharing Data
Pthread Condition Variables(cont)
int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable
cond
int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in
effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined
int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)
int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)
block on a condition variable the second one alow to appoint timeout
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK
main
Thread 23
void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)
void watch_count(void t) pthread_mutex_lock(ampcount_mutex)
if (countltCOUNT_LIMIT)
pthread_cond_wait (hellip)
count += 125
pthread_mutex_unlock(hellip) pthread_exit(NULL)
Thread 1
Sharing Data
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtualmemory management system
Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space
Creating Shared Data (tt)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach shared segment to the data segmentof the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex1: forall (i = 0; i < 5; i++) a[i] = 0;
All instances can be executed simultaneously
Ex2: forall (i = 2; i < 6; i++) {
    x = i - 2*i + i*i;
    a[i] = a[x];
}
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's Conditions
- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered by process P(m).
If the three following conditions are all satisfied, the two processes can be executed concurrently:
    I1 ∩ O2 = ∅
    I2 ∩ O1 = ∅
    O1 ∩ O2 = ∅
Dependency analysis
Example 1: suppose the two statements are (in C):
    a = x + y;
    b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, so
    I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅
and the two statements can be executed simultaneously.
Dependency analysis
Example 2: suppose the two statements are (in C):
    a = x + y;
    b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, so
    I2 ∩ O1 = {a} ≠ ∅
and the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, variables might be declared as shared, e.g.
    shared int x;
or, in a C++-like notation,
    int global x;
par Construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

    par {
        S1;
        S2;
        ...
        Sn;
    }

The keyword par indicates that the statements in the body are to be executed concurrently. Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

    par {
        proc1();
        proc2();
        ...
        procn();
    }
forall Construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

    forall (i = 0; i < n; i++) {
        S1;
        S2;
        ...
        Sm;
    }

which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, ..., Sm. Each process uses a different value of i. For example,

    forall (i = 0; i < 5; i++)
        a[i] = 0;

clears a[0], a[1], a[2], a[3], and a[4] to zero concurrently.
Shared Data in Systems with Caches
Shared Data in Systems with Caches
Cache coherence protocols:
- In the update policy, copies of data in all caches are updated at the time one copy is altered.
- In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are only updated when the associated processor makes a reference to the data.
Shared Data in Systems with Caches
False sharing: the key characteristic here is that caches are organized in blocks of contiguous locations. False sharing occurs when different processors require different parts of the same block, but not the same bytes.
Shared Data in Systems with Caches
Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks. The only way to avoid false sharing completely would be to place each element in a different block, which would create significant wastage of storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central memory or a distributed memory architecture. Distributed memory systems include NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures. Central memory systems are also known as UMA (uniform memory access) systems.
Central memory versus distributed memory
In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time. UMA systems come in two types:
- PVP: the parallel vector processor, also called a vector supercomputer;
- SMP: the symmetric multiprocessor.
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architectures: NORMA, NCC-NUMA, CC-NUMA, and COMA.
Distributed-Memory architecture
In a NORMA machine:
- The node memories have separate address spaces.
- A node cannot directly access remote memory.
- The only way to access remote data is by passing messages.
Distributed-Memory architecture
In an NCC-NUMA machine:
- A typical example is the Cray T3E.
- Besides the local memory, each node has a set of node-level registers called E-registers.
- Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-Memory architecture
In a COMA machine:
- All local memories are structured as caches (called COMA caches).
- A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node.
- COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA and COMA
An NCC-NUMA system does not have hardware support for cache coherency, but CC-NUMA and COMA provide cache coherency support in hardware. It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
CC-NUMA: main memory consists of all the local memories.
COMA: main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of five distributed-memory architectures
Thread-safe routines
System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.
Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.
Constructs for specifying Parallelism
Thread-safe routines (cont.)
Suppose that your application creates several threads, each of which makes a call to the same library routine. This library routine accesses/modifies a global structure or location in memory. As each thread calls the routine, it is possible that they may try to modify this global structure or memory location at the same time. If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.
Constructs for specifying Parallelism
Sharing Data
1. Creating Shared Data
2. Accessing Shared Data
   - Locks
   - Deadlock
   - Semaphores
   - Monitors
   - Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches
Sharing Data
Conditions for Deadlock
1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.
2. Hold-and-wait: a thread holds one resource while it waits for another.
3. No preemption: resources are released only voluntarily, after completion.
4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graphs.
ALL four conditions MUST hold for deadlock to occur.
Handling Deadlock
- Deadlock prevention
- Deadlock avoidance
- Deadlock detection and recovery
- Ignore the problem
Deadlock
Figure 8.8: (a) deadlock with two processes; (b) n-process deadlock. R1, R2, ..., Rn are resources; P1, P2, ..., Pn are processes.
Example
Semaphore
Semaphore
A semaphore is a positive integer (including zero) operated upon by two operations, P and V. Its value is the number of units of the resource which are free. A binary semaphore has the value 0 or 1; a general semaphore can take on positive values other than 0 and 1.
Semaphore
P and V operations are performed indivisibly (atomically):
- P(s) waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
- V(s) increments s by 1 to release one of the waiting processes (if any).
Semaphore
The first process to reach its P(s) operation (or to be accepted) will set the semaphore to 0. Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
Semaphore
When the process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes (if any) is allowed to proceed into the critical section.
Monitor
Disadvantage of semaphores:
- Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (e.g. deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.
- Incorrect use may be caused by an honest programming error or an uncooperative programmer.
Monitor
Example of incorrect semaphore use:

Right code:
    ...
    wait(mutex);
    /* critical section */
    signal(mutex);
    ...

Wrong code:
    ...
    signal(mutex);
    /* critical section */
    wait(mutex);
    ...

This incorrect code violates mutual exclusion.
Monitor
Right code:
    ...
    wait(mutex);
    /* critical section */
    signal(mutex);
    ...

Wrong code:
    ...
    wait(mutex);
    /* critical section */
    wait(mutex);
    ...

This incorrect code causes deadlock.
Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology
Monitor
- If the programmer omits the wait() or the signal() in a critical section, or both, then either mutual exclusion is violated or a deadlock will occur.
- In the example below, if both processes become active simultaneously, a deadlock can occur:

Process P1:
    ...
    wait(S); wait(Q);
    /* critical section */
    signal(S); signal(Q);

Process P2:
    ...
    wait(Q); wait(S);
    /* critical section */
    signal(Q); signal(S);
Monitor
- To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
- A monitor is an approach to synchronizing operations when a shared resource is used. A monitor includes:
  - a suite of procedures that provides the only method to access the shared resource;
  - mutual exclusion among those procedures;
  - variables associated with the shared resource;
  - invariants assumed to hold, to avoid conflicting events.
Monitor
- A monitor type is an abstract data type that encapsulates private data with public methods to operate on that data.
- The monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.
- The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of procedures or functions that operate on those variables.
Monitor
A structure of the monitor type:

    monitor monitor_name {
        /* shared variable declarations */
        procedure P1 (...) { ... }
        procedure P2 (...) { ... }
        ...
        procedure Pn (...) { ... }
        initialization_code (...) { ... }
    }
Structure of a Monitor
Monitor Usage
- The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
- The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; some additional "tailor-made" synchronization mechanisms need to be defined.
- These additional "tailor-made" synchronization mechanisms use the condition construct.
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure of a monitor with condition variables
Condition variables
- Condition variables allow threads to synchronize based upon the actual value of data.
- Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be a very time-consuming and unproductive exercise, since the thread would be continuously busy with this activity.
- A condition variable is always used in conjunction with a mutex lock.
Sharing Data
Pthread Condition Variables
pthread_cond_t cond;   /* declare a condition variable */

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
    Initializes the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialization, the state of the condition variable becomes initialized.

int pthread_cond_signal(pthread_cond_t *cond);
    Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
Sharing Data
Pthread Condition Variables (cont.)

int pthread_cond_broadcast(pthread_cond_t *cond);
    Unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
    Destroys the given condition variable specified by cond; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
    Block on a condition variable; the second one allows a timeout to be specified.
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines. The main routine creates three threads. Two of the threads perform work and update a count variable. The third thread waits until the count variable reaches a specified value.
Sharing Data
Sequence for using condition variable - example (cont)
    #include <pthread.h>
    #include <unistd.h>

    int count = 0;                        /* global variable: DECLARE */
    pthread_mutex_t count_mutex;
    pthread_cond_t count_threshold_cv;

    /* main: INITIALIZE, then CREATE THREADS TO DO WORK */
    ...
    pthread_mutex_init(&count_mutex, NULL);
    pthread_cond_init(&count_threshold_cv, NULL);
    ...
    pthread_create(...);

    /* Threads 2 and 3 */
    void *inc_count(void *t) {
        ...
        for (i = 0; i < TCOUNT; i++) {
            pthread_mutex_lock(&count_mutex);
            count++;
            if (count == COUNT_LIMIT)
                pthread_cond_signal(&count_threshold_cv);
            ...
            pthread_mutex_unlock(&count_mutex);
            sleep(1);   /* do some work so threads can alternate on the mutex lock */
        }
        pthread_exit(NULL);
    }

    /* Thread 1 */
    void *watch_count(void *t) {
        pthread_mutex_lock(&count_mutex);
        while (count < COUNT_LIMIT)
            pthread_cond_wait(&count_threshold_cv, &count_mutex);
        count += 125;
        pthread_mutex_unlock(&count_mutex);
        pthread_exit(NULL);
    }
Sharing Data
The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages, as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual memory management system. Shared memory system calls allow processes to attach a segment of physical memory to their virtual address space.
Creating Shared Data (cont.)
shmget()
- Creates a shared memory segment.
- The return value is the shared memory ID.
shmat()
- Attaches the shared segment to the data segment of the calling process.
- Returns the starting address of the segment.
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
The problem: conflict.
- Reading the variable by different processes does not cause a conflict.
- But writing new values can cause a conflict.
Example: consider two processes, each of which is to add 1 to a shared data item x (time runs downward):

    Instruction   Process 1        Process 2
    x = x + 1     read x           read x
                  compute x + 1    compute x + 1
                  write to x       write to x
Conflict in accessing shared data: both processes read the shared variable x, both compute x + 1, and both write the result back, so one of the two increments is lost.
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time. This mechanism is called mutual exclusion.
Lock
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock. A lock is a variable containing the value 0 or 1:
- lock = 1: a process has entered the critical section;
- lock = 0: no process is in the critical section.
The lock operates much like a door lock. Suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section; it then has to wait until it is allowed to enter.
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
Thread ndash safe routines (cont)
Suppose that your application creates several threads each of which makes a call to the same library routine
This library routine accessesmodifies a global structure or location in memory
As each thread calls this routine it is possible that they may try to modify this global structurememory location at the same time
If the routine does not employ some sort of synchronization constructs to prevent data corruption then it is not thread-safe
fe
Constructs for specifying Parallelism
Sharing DataCreating Shared Data1
Accessing Shared Data2
Locks
Deadlock
Semaphores
Deadlock
Condition Variables
Language Constructs for Parallelism3
Dependency Analysis4
Shared Data in system with caches5
Sharing Data
Text in here
Text in here
Conditions for Deadlock
1 Mutual exclusion bull Resource can not be shared bull Requests are delayed until resource is released
2 Hold-and-wait bull Thread holds one resource while waits for another
3 No preemption bull Resources are released voluntarily after completion
4 Circular wait bull Circular dependencies exist in ldquowaits-forrdquo or ldquoresource-allocationrdquo graphs
ALL four conditions MUST hold
wwwthemegallerycom Company Logo
Handing Deadlock
Text
Text
Text
Txt
Deadlock prevention Deadlock avoidance
Deadlock detection and recovery
Ignore
wwwthemegallerycom Company Logo
Deadlock
88b n-processes deadlock 88a 2 processes dealock
R1R2hellipRn lagrave tagravei nguyecircn P1P2hellipPn lagrave quaacute trigravenh
Example
LOGO
Semaphore
wwwthemegallerycom Company Logo
sEMaPHoRE
A positive integer operated upon by 2 operations P amp V
The value is the number of the units of the resource which are free
A binary semaphore has value 0 or 1A general semaphore can take on
positive values other than 0 and 1
wwwthemegallerycom Company Logo
sEMaPHoRE
P amp V operations are performed indivisibly
P(s) waits until s is greater than 0 then decrements s by 1 allows the process to continue
V(s) increments s by 1 to release one of the waiting processes (if any)
wwwthemegallerycom Company Logo
sEMaPHoRE
The first process reach its P(s) operation or to be accepted will set the semaphore to 0
Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore
wwwthemegallerycom Company Logo
sEMaPHoRE
When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section
Monitor
Disadvange of Semaphorebull Although semaphore provide a
convenient and effective mechanism for process synchronization using them incorrectly can result in timing error that are difficult to detect (deadlock) since these errors happen only if some particular execution sequences take place and these sequences do not always occur
bull Using incorrectly may be caused by an honest programming error or an uncooperative programmer
Shared memory multiprocessor
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur
bull Both processes are simultaneously active will cause a deadlock
Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type
bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only
method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events
conflict
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A type or abstract data type encapsulates private data with public method to operate on that data
bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor
bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()
procedure P2()
procedure Pn()
initialization_code ()
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Usage Monitor
bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly
bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization
bull Some addiontional synchronization ldquotailor moderdquo use conditional construct
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor conditional type
Condition variables
Condition variables allow threads to synchronize based upon the actual value of data. Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be a very time-consuming and unproductive exercise, since the thread would be continuously busy in this activity.
A condition variable is always used in conjunction with a mutex lock.
Sharing Data
Pthread Condition Variables
pthread_cond_t cond; — declares a condition variable.

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr); — initialises the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.

int pthread_cond_signal(pthread_cond_t *cond); — unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
Sharing Data
Pthread Condition Variables (cont.)
int pthread_cond_broadcast(pthread_cond_t *cond); — unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond); — destroys the condition variable specified by cond; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime); — both block on a condition variable; the second allows a timeout to be specified.
Sharing Data
Sequence for using condition variable - example
This simple example demonstrates the use of several Pthread condition variable routines. The main routine creates three threads. Two of the threads perform work and update a count variable; the third thread waits until the count variable reaches a specified value.
Sharing Data
Sequence for using condition variable - example (cont)
main:

    #include <pthread.h>
    int count = 0;                                  /* global shared variable (DECLARE) */
    pthread_mutex_t count_mutex;
    pthread_cond_t count_threshold_cv;
    ...
    pthread_mutex_init(&count_mutex, NULL);         /* INITIALIZE */
    pthread_cond_init(&count_threshold_cv, NULL);
    ...
    pthread_create(...);                            /* CREATE THREADS TO DO WORK */

Threads 2 and 3:

    void *inc_count(void *t) {
        ...
        for (i = 0; i < TCOUNT; i++) {
            pthread_mutex_lock(&count_mutex);
            count++;
            if (count == COUNT_LIMIT)
                pthread_cond_signal(&count_threshold_cv);
            ...
            pthread_mutex_unlock(&count_mutex);
            sleep(1);   /* do some work so threads can alternate on the mutex lock */
        }
        pthread_exit(NULL);
    }

Thread 1:

    void *watch_count(void *t) {
        pthread_mutex_lock(&count_mutex);
        while (count < COUNT_LIMIT)    /* while, not if: guards against spurious wakeups */
            pthread_cond_wait(&count_threshold_cv, &count_mutex);
        count += 125;
        pthread_mutex_unlock(&count_mutex);
        pthread_exit(NULL);
    }
Sharing Data
The key aspect of shared-memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass data in messages, as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual memory management system.
Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
shmget()
  - Creates a shared memory segment.
  - The return value is the shared memory ID.
shmat()
  - Attaches the shared segment to the data segment of the calling process.
  - Returns the starting address of the segment.
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
Problem: CONFLICT
  o Reading the variable by different processes does not cause a conflict.
  o But writing new values can cause a conflict.
Example: consider two processes, each of which is to add 1 to a shared data item x.
Instruction: x = x + 1

    Process 1        Process 2
    read x           read x
    compute x + 1    compute x + 1
    write to x       write to x

(time runs downward)
[Figure: conflict in accessing shared data — both processes read the shared variable x, add 1, and write the result back.]
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.
This mechanism is called mutual exclusion.
Lock
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
A lock is a variable containing the value 0 or 1:
  lock = 1: a process has entered the critical section;
  lock = 0: no process is in the critical section.
The lock operates much like a door lock. Suppose that a process reaches a lock that is set, indicating that it is excluded from the critical section; it then has to wait until it is allowed to enter.
    while (lock == 1) ;    /* no operation in the while loop: busy-wait */
    lock = 1;              /* enter critical section */
    ...critical section...
    lock = 0;              /* leave critical section */
A lock that spins in this way is called a spin lock, and the mechanism is known as busy waiting.
In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead, but this incurs overhead in saving and restoring process information, and it then becomes necessary to choose the best or highest-priority process to enter the critical section.
    Process 1                      Process 2
    while (lock == 1) ;
    lock = 1;
    [critical section]             while (lock == 1) ;   /* spins */
    lock = 0;
                                   lock = 1;
                                   [critical section]
                                   lock = 0;
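Note that the plain read-then-write sequence above is itself a race: both processes can see lock == 0 and enter together. A hardware atomic test-and-set is needed; a minimal sketch with C11 atomics (not part of the original slides) is:

```c
#include <stdatomic.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;   /* clear = unlocked */

void enter_critical(void) {
    /* atomically set the flag and return its previous value;
       spin while another thread already held it */
    while (atomic_flag_test_and_set(&lock_flag))
        ;   /* busy waiting: this is the spin lock */
}

void leave_critical(void) {
    atomic_flag_clear(&lock_flag);   /* lock = 0: leave critical section */
}
```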
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual-exclusion lock variables (mutexes), which resolve synchronization problems between threads.
A mutex is used to share resources among threads in an orderly way; it provides mutual exclusion between threads.
Note: a mutex is only used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.
#include <pthread.h> — the header containing the mutex functions.
Declare the variable: pthread_mutex_t mutex;
Static initialisation:
  mutex = PTHREAD_MUTEX_INITIALIZER;
  mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialisation by function:
  int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
  int pthread_mutex_lock(pthread_mutex_t *mutex);
  int pthread_mutex_unlock(pthread_mutex_t *mutex);
  int pthread_mutex_trylock(pthread_mutex_t *mutex);
  int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex. 1:  forall (i = 0; i < 5; i++) a[i] = 0;
All instances can be executed simultaneously.

Ex. 2:  forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's Conditions
- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered by process P(m).
If the following three conditions are all satisfied, the two processes can be executed concurrently:
  I1 ∩ O2 = ∅
  I2 ∩ O1 = ∅
  O1 ∩ O2 = ∅
Dependency analysis
Example 1. Suppose the two statements are (in C):
  a = x + y;
  b = x + z;
Then I1 = (x, y), I2 = (x, z), O1 = (a), O2 = (b), and
  I1 ∩ O2 = ∅,  I2 ∩ O1 = ∅,  O1 ∩ O2 = ∅
so the two statements can be executed simultaneously.
Dependency analysis
Example 2. Suppose the two statements are (in C):
  a = x + y;
  b = a + b;
Then I1 = (x, y), I2 = (a, b), O1 = (a), O2 = (b), and
  I2 ∩ O1 = (a) ≠ ∅
so the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared as, for example,
  shared int x;
(in C/C++, the rough equivalent is simply a global variable, e.g. int x;).
The par construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
  par { s1; s2; ...; sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
  par { proc1(); proc2(); ...; procn(); }
The forall construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
  forall (i = 0; i < n; i++) { s1; s2; ...; sm; }
which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, ..., sm. Each process uses a different value of i. For example,
  forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Shared Data in Systems with Caches
Cache coherence protocols:
In the update policy, copies of data in all caches are updated at the time one copy is altered.
In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next makes reference to the data.
Shared Data in Systems with Caches
False sharing:
The key characteristic involved is that caches are organized in blocks of contiguous locations. False sharing occurs when different processors require different parts of the same block, but not the same bytes.
Shared Data in Systems with Caches
Solution for false sharing:
The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.
The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central-memory or a distributed-memory architecture.
Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.
Central-memory systems are also known as UMA (uniform memory access) systems.
Central memory versus distributed memory
In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types:
  PVP, the parallel vector processor, also called a vector supercomputer;
  SMP, the symmetric multiprocessor.
Distributed-Memory architecture
A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory. Memories in other nodes are called remote memories.
Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.
Distributed-Memory architecture
In a NORMA machine:
  the node memories have separate address spaces;
  a node cannot directly access remote memory;
  the only way to access remote data is by passing messages.
Distributed-Memory architecture
In an NCC-NUMA machine (a typical example is the Cray T3E):
  besides the local memory, each node has a set of node-level registers called E-registers;
  other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-Memory architecture
In a COMA machine:
  all local memories are structured as caches (called COMA caches);
  such a cache has a much larger capacity than the level-2 cache or the remote cache of a node;
  COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA and COMA
An NCC-NUMA system does not have hardware support for cache coherency, whereas CC-NUMA and COMA provide cache-coherency support in hardware.
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
CC-NUMA: main memory consists of all the local memories.
COMA: main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a NUMA machine.
[Table: characteristics of the five distributed-memory architectures]
Sharing Data
1. Creating Shared Data
2. Accessing Shared Data
   - Locks
   - Deadlock
   - Semaphores
   - Monitor
   - Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches
Sharing Data
Conditions for Deadlock
1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.
2. Hold-and-wait: a thread holds one resource while it waits for another.
3. No preemption: resources are released only voluntarily, after completion.
4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.
ALL four conditions must hold.
Handling Deadlock
- Deadlock prevention
- Deadlock avoidance
- Deadlock detection and recovery
- Ignore the problem
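A concrete flavour of deadlock prevention in the Pthreads setting of this deck: impose a fixed global acquisition order on the locks, so a circular "waits-for" dependency can never form (the two-mutex scenario is illustrative):

```c
#include <pthread.h>

static pthread_mutex_t S = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t Q = PTHREAD_MUTEX_INITIALIZER;

/* Deadlock prevention by lock ordering: every thread acquires S before Q.
   Since no thread ever holds Q while waiting for S, no cycle can appear
   in the resource-allocation graph (circular wait is broken). */
void locked_pair_work(void (*work)(void)) {
    pthread_mutex_lock(&S);   /* fixed order: always S first ... */
    pthread_mutex_lock(&Q);   /* ... then Q */
    work();                   /* critical section needing both resources */
    pthread_mutex_unlock(&Q);
    pthread_mutex_unlock(&S);
}
```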
Deadlock
Figure 8.8(a): two-process deadlock. Figure 8.8(b): n-process deadlock.
R1, R2, ..., Rn are resources; P1, P2, ..., Pn are processes.
Example
Semaphore
Semaphore
A semaphore is a positive integer operated upon by two operations, P and V. Its value is the number of units of the resource that are free.
A binary semaphore has the value 0 or 1; a general semaphore can take on positive values other than 0 and 1.
Semaphore
P and V operations are performed indivisibly:
  P(s) waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
  V(s) increments s by 1 to release one of the waiting processes (if any).
Semaphore
The first process to reach its P(s) operation (i.e., to be accepted) sets the semaphore to 0. Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
Semaphore
When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes is allowed to proceed into the critical section.
Monitor
Disadvantages of semaphores:
• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (e.g., deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.
• Incorrect use may be caused by an honest programming error or by an uncooperative programmer.
Monitor
Example of incorrect semaphore use:

    Right code:             Wrong code:
      ...                     ...
      wait(mutex);            signal(mutex);
      critical section        critical section
      signal(mutex);          wait(mutex);
      ...                     ...

This wrong code violates the mutual exclusion condition.
Monitor
Example of incorrect semaphore use:

    Right code:             Wrong code:
      ...                     ...
      wait(mutex);            wait(mutex);
      critical section        critical section
      signal(mutex);          wait(mutex);
      ...                     ...

This wrong code causes deadlock.
Monitor
• If the programmer omits the wait() or the signal() of a critical section, or both, then either mutual exclusion is violated or a deadlock occurs.
• Two processes can also deadlock when both are simultaneously active and acquire two semaphores in opposite orders:

    Process P1:             Process P2:
      ...                     ...
      wait(S);                wait(Q);
      wait(Q);                wait(S);
      critical section        critical section
      signal(S);              signal(Q);
      signal(Q);              signal(S);
Monitor
• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
• A monitor is an approach to synchronizing operations on a shared resource. A monitor includes:
  - a suite of procedures that provides the only method of access to the shared resource;
  - mutual exclusion among those procedures;
  - the variables associated with the shared resource;
  - invariants that are assumed to hold, so that conflicting events are avoided.
Monitor
• A monitor type, like an abstract data type, encapsulates private data with public methods that operate on that data.
• The monitor type presents a set of programmer-defined operations that are provided with mutual exclusion within the monitor.
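C has no monitor construct, but the standard emulation pairs one mutex (the monitor's entry exclusion) with condition variables (its x.wait()/x.signal()). A one-slot buffer monitor, as an illustrative sketch:

```c
#include <pthread.h>

/* A tiny "monitor": the mutex gives one-process-at-a-time entry;
   the condition variables play the role of x.wait()/x.signal(). */
static pthread_mutex_t monitor = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
static int slot, full = 0;      /* state variables of the monitor instance */

void deposit(int v) {                       /* monitor procedure P1 */
    pthread_mutex_lock(&monitor);
    while (full)
        pthread_cond_wait(&not_full, &monitor);   /* suspend until signalled */
    slot = v;
    full = 1;
    pthread_cond_signal(&not_empty);              /* resume one suspended process */
    pthread_mutex_unlock(&monitor);
}

int fetch(void) {                           /* monitor procedure P2 */
    pthread_mutex_lock(&monitor);
    while (!full)
        pthread_cond_wait(&not_empty, &monitor);
    int v = slot;
    full = 0;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&monitor);
    return v;
}
```

Callers never touch the mutex or the state variables directly; deposit() and fetch() are the only way in, which is exactly the encapsulation the monitor type provides.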
bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()
procedure P2()
procedure Pn()
initialization_code ()
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Usage Monitor
bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly
bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization
bull Some addiontional synchronization ldquotailor moderdquo use conditional construct
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor conditional type
Condition variables
Condition variablesAllow threads to synchronize based
upon the actual value of data Without condition variables the
programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity
Always used in conjunction with a mutex lock
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified
condition variable cond (if any threads are blocked on cond)
int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)
initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised
Pthread_cond_t cond Declare condition variable
Sharing Data
Pthread Condition Variables(cont)
int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable
cond
int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in
effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined
int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)
int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)
block on a condition variable the second one alow to appoint timeout
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK
main
Thread 23
void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)
void watch_count(void t) pthread_mutex_lock(ampcount_mutex)
if (countltCOUNT_LIMIT)
pthread_cond_wait (hellip)
count += 125
pthread_mutex_unlock(hellip) pthread_exit(NULL)
Thread 1
Sharing Data
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtualmemory management system
Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space
Creating Shared Data (tt)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach shared segment to the data segmentof the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
Sharing Data
Text in here
Text in here
Conditions for Deadlock
1 Mutual exclusion bull Resource can not be shared bull Requests are delayed until resource is released
2 Hold-and-wait bull Thread holds one resource while waits for another
3 No preemption bull Resources are released voluntarily after completion
4 Circular wait bull Circular dependencies exist in ldquowaits-forrdquo or ldquoresource-allocationrdquo graphs
ALL four conditions MUST hold
wwwthemegallerycom Company Logo
Handing Deadlock
Text
Text
Text
Txt
Deadlock prevention Deadlock avoidance
Deadlock detection and recovery
Ignore
wwwthemegallerycom Company Logo
Deadlock
88b n-processes deadlock 88a 2 processes dealock
R1R2hellipRn lagrave tagravei nguyecircn P1P2hellipPn lagrave quaacute trigravenh
Example
LOGO
Semaphore
wwwthemegallerycom Company Logo
Semaphore
A semaphore is a positive integer operated upon by two operations, P and V.
Its value is the number of units of the resource that are free.
A binary semaphore has the value 0 or 1; a general semaphore can take on positive values other than 0 and 1.
Semaphore
P and V operations are performed indivisibly:
P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
V(s): increments s by 1 to release one of the waiting processes (if any).
Semaphore
The first process to reach its P(s) operation will set the semaphore to 0.
Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
Semaphore
When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes is then allowed to proceed into its critical section.
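The P and V semantics above can be sketched in C using a pthread mutex and condition variable. This is an illustrative sketch only: the names csem, csem_init, sem_P and sem_V are invented here, not part of any standard API.

```c
/* A minimal counting semaphore built from a pthread mutex and a
   condition variable. Illustrative sketch; names are not standard. */
#include <pthread.h>

typedef struct {
    int value;                 /* number of free resource units */
    pthread_mutex_t m;
    pthread_cond_t  c;
} csem;

void csem_init(csem *s, int initial) {
    s->value = initial;
    pthread_mutex_init(&s->m, NULL);
    pthread_cond_init(&s->c, NULL);
}

/* P(s): wait until s > 0, then decrement s by 1 and continue. */
void sem_P(csem *s) {
    pthread_mutex_lock(&s->m);
    while (s->value == 0)              /* wait while no unit is free */
        pthread_cond_wait(&s->c, &s->m);
    s->value--;
    pthread_mutex_unlock(&s->m);
}

/* V(s): increment s by 1 and release one waiting process, if any. */
void sem_V(csem *s) {
    pthread_mutex_lock(&s->m);
    s->value++;
    pthread_cond_signal(&s->c);
    pthread_mutex_unlock(&s->m);
}
```

The indivisibility of P and V comes from holding the mutex around the test-and-decrement and the increment-and-signal.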
Monitor
Disadvantage of semaphores:
• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors (such as deadlock) that are difficult to detect, since these errors happen only when particular execution sequences take place, and these sequences do not always occur.
• Incorrect use may be caused by an honest programming error or by an uncooperative programmer.
Monitor
Example of wrong semaphore use:
Right code: … Wait(mutex); critical section; Signal(mutex); …
Wrong code: … Signal(mutex); critical section; Wait(mutex); …
This incorrect code violates mutual exclusion.
Monitor
Example of wrong semaphore use:
Right code: … Wait(mutex); critical section; Signal(mutex); …
Wrong code: … Wait(mutex); critical section; Wait(mutex); …
This incorrect code causes deadlock.
Khoa Khoa học & Kỹ thuật Máy tính - Đại học Bách Khoa TP.HCM
Monitor
• If the programmer omits the wait() or the signal() around a critical section, or both, either mutual exclusion is violated or a deadlock occurs.
• Example: if the two processes below are simultaneously active, they can deadlock, since each may hold one semaphore while waiting for the other:
Process P1: … Wait(S); Wait(Q); critical section; Signal(S); Signal(Q);
Process P2: … Wait(Q); Wait(S); critical section; Signal(Q); Signal(S);
Monitor
• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
• A monitor is an approach to synchronizing operations on a shared resource. A monitor comprises: a set of procedures that provide the only way to access the shared resource; mutual exclusion between those procedures; the variables associated with the shared resource; and invariants that must hold to avoid conflicts.
Monitor
• A type, or abstract data type, encapsulates private data with public methods that operate on that data.
• The monitor type presents a set of programmer-defined operations that are provided with mutual exclusion within the monitor.
• The monitor type also contains the declarations of the variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.
Monitor
• A structure of the monitor type:
monitor monitor_name {
    /* shared variable declarations */
    procedure P1(...) { ... }
    procedure P2(...) { ... }
    ...
    procedure Pn(...) { ... }
    initialization_code(...) { ... }
}
Structure of a monitor (figure)
Usage of the Monitor
• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
• The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; additional, tailor-made synchronization mechanisms need to be defined.
• Such additional tailor-made synchronization uses the condition construct.
Condition type
• Declaration: condition x, y;
• The only operations that can be invoked on a condition variable are wait() and signal():
x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
x.signal(): the process invoking this operation resumes exactly one suspended process.
• If no process is suspended, x.signal() has no effect.
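A monitor with condition variables can be emulated in C with Pthreads: one mutex plays the role of the monitor lock ("one process active in the monitor at a time"), and x.wait()/x.signal() map onto pthread_cond_wait()/pthread_cond_signal(). The single-slot buffer monitor below is an illustrative sketch; the names deposit and fetch are assumptions, not from the slides.

```c
/* Emulating a monitor in C: the mutex `mon` guards all monitor state,
   and each condition variable pairs with it. */
#include <pthread.h>

static pthread_mutex_t mon = PTHREAD_MUTEX_INITIALIZER;  /* monitor lock */
static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  nonfull  = PTHREAD_COND_INITIALIZER;
static int slot, full = 0;            /* monitor state variables */

void deposit(int v) {                 /* monitor procedure */
    pthread_mutex_lock(&mon);
    while (full)                      /* x.wait(): suspend until signalled */
        pthread_cond_wait(&nonfull, &mon);
    slot = v;
    full = 1;
    pthread_cond_signal(&nonempty);   /* x.signal(): resume one waiter */
    pthread_mutex_unlock(&mon);
}

int fetch(void) {                     /* monitor procedure */
    pthread_mutex_lock(&mon);
    while (!full)
        pthread_cond_wait(&nonempty, &mon);
    full = 0;
    int v = slot;
    pthread_cond_signal(&nonfull);
    pthread_mutex_unlock(&mon);
    return v;
}
```

Because every monitor procedure acquires `mon` first, at most one thread is ever active inside the "monitor".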
Structure of a monitor with condition variables (figure)
Condition variables
Condition variables allow threads to synchronize based upon the actual value of data.
Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be a very time-consuming and unproductive exercise, since the thread would be continuously busy in this activity.
A condition variable is always used in conjunction with a mutex lock.
Pthread Condition Variables
pthread_cond_t cond;   /* declare a condition variable */
int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
initialises the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition-variable attributes are used; the effect is the same as passing the address of a default condition-variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.
int pthread_cond_signal(pthread_cond_t *cond);
unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
Pthread Condition Variables (cont.)
int pthread_cond_broadcast(pthread_cond_t *cond);
unblocks all threads currently blocked on the specified condition variable cond.
int pthread_cond_destroy(pthread_cond_t *cond);
destroys the given condition variable; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition-variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
block on a condition variable; the second one additionally allows a timeout to be specified.
Sequence for using a condition variable: example
This simple example code demonstrates the use of several Pthread condition-variable routines. The main routine creates three threads. Two of the threads perform work and update a count variable; the third thread waits until the count variable reaches a specified value.
Sequence for using a condition variable: example (cont.)

/* main */
#include <pthread.h>
int count = 0;                              /* global variable: DECLARE */
pthread_mutex_t count_mutex;
pthread_cond_t  count_threshold_cv;
…
pthread_mutex_init(&count_mutex, NULL);     /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
…
pthread_create(…);                          /* CREATE THREADS TO DO WORK */

/* Threads 2 and 3 */
void *inc_count(void *t) {
    …
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(…);
        …
        pthread_mutex_unlock(&count_mutex);
        sleep(1);   /* do some work so threads can alternate on the mutex lock */
    }
    pthread_exit(NULL);
}

/* Thread 1 */
void *watch_count(void *t) {
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)   /* a loop guards against spurious wakeups */
        pthread_cond_wait(…);
    count += 125;
    pthread_mutex_unlock(…);
    pthread_exit(NULL);
}
The key aspect of shared-memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass data in messages, as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual memory management system.
Shared-memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
shmget(): creates a shared memory segment; the return value is the shared memory ID.
shmat(): attaches the shared segment to the data segment of the calling process; returns the starting address of the segment.
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
Problem: CONFLICT.
o Reading the variable from different processes does not cause conflict.
o But writing new values can conflict.
Example: consider two processes, each of which is to add 1 to a shared data item x:
Instruction        Process 1          Process 2
x = x + 1;         read x             read x
                   compute x + 1      compute x + 1
                   write to x         write to x
(time increases downward)
Figure: conflict in accessing shared variable x. Both processes read x, add 1, and write the result back, so one of the two increments is lost.
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.
This mechanism is called mutual exclusion.
Lock
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
A lock is a variable containing the value 0 or 1:
lock = 1: a process has entered the critical section;
lock = 0: no process is in the critical section.
The lock operates much like a door lock: suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section. It now has to wait until it is allowed to enter the critical section:
while (lock == 1) ;   /* do nothing: no operation in the while loop */
lock = 1;             /* enter critical section */
… critical section …
lock = 0;             /* leave critical section */
A lock that is waited on by spinning in this way is called a spin lock.
Mechanism: busy waiting
In some cases it may be possible to deschedule the process from the processor and schedule another process instead. This carries overhead in saving and restoring process information, and it is necessary to choose the best or highest-priority process to enter the critical section.
Process 1                        Process 2
while (lock == 1) ;
lock = 1;
  critical section               while (lock == 1) ;   /* spins */
lock = 0;
                                 lock = 1;
                                   critical section
                                 lock = 0;
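Note that the plain read-then-write sequence above is itself racy: both processes can pass the while (lock == 1) test before either sets lock = 1. A real spin lock therefore needs an indivisible test-and-set operation; a sketch using C11 atomics (the names enter_cs/leave_cs/increment are illustrative):

```c
/* A correct spin lock: atomic_flag_test_and_set() atomically reads the
   old value and sets the flag, so only one caller can see "0" and enter. */
#include <stdatomic.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;  /* clear: lock is free */
static int counter = 0;                           /* the shared data */

void enter_cs(void) {
    while (atomic_flag_test_and_set(&lock_flag))
        ;                          /* busy-wait (spin) until flag was clear */
}

void leave_cs(void) {
    atomic_flag_clear(&lock_flag); /* lock = 0: let another thread in */
}

void increment(void) {
    enter_cs();
    counter++;                     /* critical section */
    leave_cs();
}
```

This requires a C11 compiler; the hardware's atomic instruction is what the naive two-line while/assign sequence was missing.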
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutex).
They resolve synchronization problems between threads.
A mutex is used to share resources among threads in an orderly way; it provides mutual exclusion between threads.
Note: a mutex only synchronizes threads within the same process; it cannot synchronize threads belonging to different processes.
#include <pthread.h>   /* header containing the mutex functions */
Declare the variable: pthread_mutex_t mutex;
Static initialization:
mutex = PTHREAD_MUTEX_INITIALIZER;
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
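The usual lock/unlock pairing around a shared variable can be sketched as below; the helper names add_one and run_two_threads are illustrative, not from the slides.

```c
/* Two threads each add 1 to a shared variable; the mutex makes the
   read-modify-write indivisible, so the result is always 2. */
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static int shared_x = 0;

void *add_one(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);     /* enter critical section */
    shared_x = shared_x + 1;    /* safe read-modify-write */
    pthread_mutex_unlock(&m);   /* leave critical section */
    return NULL;
}

int run_two_threads(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, add_one, NULL);
    pthread_create(&t2, NULL, add_one, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return shared_x;
}
```

Without the mutex, this is exactly the x = x + 1 conflict shown earlier, and the result could be 1.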
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex. 1: forall (i = 0; i < 5; i++) a[i] = 0;
All instances of the body can be executed simultaneously.
Ex. 2: forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's conditions
I(n) is the set of memory locations read by process P(n).
O(m) is the set of memory locations altered by process P(m).
Two processes P1 and P2 can be executed concurrently if the following three conditions are all satisfied:
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
Dependency analysis
Example 1: suppose the two statements are (in C):
a = x + y;
b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}.
I1 ∩ O2 = I2 ∩ O1 = O1 ∩ O2 = ∅, so the two statements can be executed simultaneously.
Dependency analysis
Example 2: suppose the two statements are (in C):
a = x + y;
b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}.
I2 ∩ O1 = {a} ≠ ∅, so the two statements cannot be executed simultaneously.
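Bernstein's conditions are mechanical to check. A small sketch with read/write sets encoded as bitmasks (one bit per memory location; the location names X, Y, Z, A, B and the function bernstein_ok are illustrative):

```c
/* Read/write sets as bitmasks: bit i stands for one memory location.
   Two statements may run concurrently iff I1∩O2, I2∩O1 and O1∩O2
   are all empty, i.e. the bitwise ANDs are all zero. */
typedef unsigned int locset;

enum { X = 1, Y = 2, Z = 4, A = 8, B = 16 };   /* example locations */

int bernstein_ok(locset I1, locset O1, locset I2, locset O2) {
    return (I1 & O2) == 0 && (I2 & O1) == 0 && (O1 & O2) == 0;
}
```

For Example 1 (a = x + y; b = x + z;), bernstein_ok(X | Y, A, X | Z, B) yields 1; for Example 2 (a = x + y; b = a + b;), I2 ∩ O1 = {a}, so it yields 0.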
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared as:
shared int x;
(in C/C++, a global variable such as int x; is naturally shared among the threads of a process)
par construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
par { s1; s2; …; sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
par { proc1(); proc2(); …; procn(); }
forall construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
forall (i = 0; i < n; i++) { s1; s2; …; sm; }
which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, …, sm. Each process uses a different value of i.
Example:
forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
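In C, a forall over independent iterations maps naturally onto OpenMP's parallel for; a sketch (assumes a compiler with OpenMP support, e.g. gcc -fopenmp; the pragma is simply ignored otherwise and the loop runs sequentially):

```c
/* forall (i = 0; i < 5; i++) a[i] = 0; expressed with OpenMP. */
#define N 5
int a[N];

void clear_all(void) {
    #pragma omp parallel for     /* each iteration may run concurrently */
    for (int i = 0; i < N; i++)
        a[i] = 0;                /* iterations are independent: safe */
}
```

The iterations satisfy Bernstein's conditions pairwise (disjoint write sets, no reads of another iteration's writes), which is what makes the parallel form legal.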
Shared Data in Systems with Caches
Shared Data in Systems with Caches
Cache coherence protocols:
In the update policy, copies of data in all caches are updated at the time one copy is altered.
In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated. These copies are updated only when the associated processor next makes reference to the data.
Shared Data in Systems with Caches
False sharing:
The key characteristic used is that caches are organized in blocks of contiguous locations.
False sharing occurs when different parts of a block are required by different processors, but not the same bytes.
Shared Data in Systems with Caches
Solution for false sharing:
The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.
The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.
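The "separate blocks" layout can be sketched by padding each item out to a full cache block. The 64-byte block size and the names padded_counter/counts are assumptions for illustration:

```c
/* Per-processor counters padded so that no two counters share a cache
   block; updates by different processors then never false-share. */
#define CACHE_BLOCK 64                     /* assumed block size in bytes */

struct padded_counter {
    long value;
    char pad[CACHE_BLOCK - sizeof(long)];  /* push the next counter into
                                              a new block */
};

struct padded_counter counts[4];           /* one counter per processor */
```

This is exactly the storage trade-off the slide mentions: each long now occupies a whole 64-byte block.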
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central-memory or a distributed-memory architecture.
Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote memory access) architectures.
Central-memory systems are also known as UMA (uniform memory access) systems.
Central memory versus distributed memory
In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types:
PVP: the parallel vector processor, also called a vector supercomputer.
SMP: the symmetric multiprocessor.
Distributed-Memory architecture
A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory.
Memories in other nodes are called remote memories.
Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.
Distributed-Memory architecture
In a NORMA machine:
The node memories have separate address spaces.
A node cannot directly access remote memory.
The only way to access remote data is by passing messages.
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
wwwthemegallerycom Company Logo
Handing Deadlock
Text
Text
Text
Txt
Deadlock prevention Deadlock avoidance
Deadlock detection and recovery
Ignore
wwwthemegallerycom Company Logo
Deadlock
88b n-processes deadlock 88a 2 processes dealock
R1R2hellipRn lagrave tagravei nguyecircn P1P2hellipPn lagrave quaacute trigravenh
Example
LOGO
Semaphore
wwwthemegallerycom Company Logo
sEMaPHoRE
A positive integer operated upon by 2 operations P amp V
The value is the number of the units of the resource which are free
A binary semaphore has value 0 or 1A general semaphore can take on
positive values other than 0 and 1
wwwthemegallerycom Company Logo
sEMaPHoRE
P amp V operations are performed indivisibly
P(s) waits until s is greater than 0 then decrements s by 1 allows the process to continue
V(s) increments s by 1 to release one of the waiting processes (if any)
wwwthemegallerycom Company Logo
sEMaPHoRE
The first process reach its P(s) operation or to be accepted will set the semaphore to 0
Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore
wwwthemegallerycom Company Logo
sEMaPHoRE
When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section
Monitor
Disadvange of Semaphorebull Although semaphore provide a
convenient and effective mechanism for process synchronization using them incorrectly can result in timing error that are difficult to detect (deadlock) since these errors happen only if some particular execution sequences take place and these sequences do not always occur
bull Using incorrectly may be caused by an honest programming error or an uncooperative programmer
Shared memory multiprocessor
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur
bull Both processes are simultaneously active will cause a deadlock
Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type
bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only
method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events
conflict
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A type or abstract data type encapsulates private data with public method to operate on that data
bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor
bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()
procedure P2()
procedure Pn()
initialization_code ()
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Usage Monitor
bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly
bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization
bull Some addiontional synchronization ldquotailor moderdquo use conditional construct
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor conditional type
Condition variables
Condition variablesAllow threads to synchronize based
upon the actual value of data Without condition variables the
programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity
Always used in conjunction with a mutex lock
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified
condition variable cond (if any threads are blocked on cond)
int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)
initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised
Pthread_cond_t cond Declare condition variable
Sharing Data
Pthread Condition Variables(cont)
int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable
cond
int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in
effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined
int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)
int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)
block on a condition variable the second one alow to appoint timeout
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK
main
Thread 23
void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)
void watch_count(void t) pthread_mutex_lock(ampcount_mutex)
if (countltCOUNT_LIMIT)
pthread_cond_wait (hellip)
count += 125
pthread_mutex_unlock(hellip) pthread_exit(NULL)
Thread 1
Sharing Data
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtualmemory management system
Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space
Creating Shared Data (tt)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach shared segment to the data segmentof the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
Shared Data in Systems with Caches
Cache coherence protocols:
- In the update policy, copies of data in all caches are updated at the time one copy is altered.
- In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next references the data.
False sharing: the key characteristic involved is that caches are organized in blocks of contiguous locations. False sharing occurs when different processors require different parts of the same block, but not the same bytes.
Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks. The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.
Different Types of Memory Architecture
Central memory versus distributed memory:
- A parallel computer has either a central-memory or a distributed-memory architecture.
- Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.
- Central-memory systems are also known as UMA (uniform memory access) systems.
In a UMA system, all memory locations are at an equal distance from any processor, and all memory accesses take roughly the same amount of time. UMA systems come in two types: the PVP (parallel vector processor, also called a vector supercomputer) and the SMP (symmetric multiprocessor).
Distributed-Memory Architecture
- A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory.
- Memories in other nodes are called remote memories.
- The types of distributed-memory architecture are NORMA, NCC-NUMA, CC-NUMA, and COMA.
In a NORMA machine:
- The node memories have separate address spaces.
- A node cannot directly access remote memory.
- The only way to access remote data is by passing messages.
In an NCC-NUMA machine:
- A typical example is the Cray T3E. Besides the local memory, each node has a set of node-level registers called E-registers.
- Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
In a COMA machine:
- All local memories are structured as caches (called COMA caches).
- A COMA cache has much larger capacity than the level-2 cache or the remote cache of a node.
- COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA and COMA
- An NCC-NUMA system does not have hardware support for cache coherency, whereas CC-NUMA and COMA provide cache-coherency support in hardware.
- It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
- In CC-NUMA, main memory consists of all the local memories.
- In COMA, main memory consists of all the COMA caches.
- All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of the Five Distributed-Memory Architectures
(Comparison table not reproduced in the transcript.)
Deadlock
(Figures: (a) two-process deadlock; (b) n-process deadlock, where R1, R2, ..., Rn are resources and P1, P2, ..., Pn are processes.)
Semaphore
- A semaphore is a positive integer (including zero) operated upon by two operations, P and V.
- Its value is the number of units of the resource that are free.
- A binary semaphore has the value 0 or 1; a general semaphore can also take on positive values other than 0 and 1.
- The P and V operations are performed indivisibly:
    P(s) waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
    V(s) increments s by 1 to release one of the waiting processes (if any).
- The first process to reach its P(s) operation, or the first to be accepted, sets the semaphore to 0. Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
- When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes (if any) is allowed to proceed into the critical section.
Monitor
Disadvantages of semaphores:
- Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (such as deadlock), since these errors happen only when particular execution sequences take place, and these sequences do not always occur.
- Incorrect use may be caused by an honest programming error or by an uncooperative programmer.
Example of incorrect semaphore use. The correct code is:
    ...
    wait(mutex);
    /* critical section */
    signal(mutex);
    ...
Wrong code 1 (the operations are swapped):
    ...
    signal(mutex);
    /* critical section */
    wait(mutex);
    ...
This incorrect code causes a violation of the mutual-exclusion condition.
Wrong code 2 (signal is replaced by a second wait):
    ...
    wait(mutex);
    /* critical section */
    wait(mutex);
    ...
This incorrect code causes deadlock.
Khoa Khoa học & Kỹ thuật Máy tính (Faculty of Computer Science & Engineering) - Đại học Bách Khoa TP.HCM
- If the programmer omits the wait() or the signal() around a critical section, or both, either mutual exclusion is violated or a deadlock occurs.
- Two simultaneously active processes that take two semaphores in opposite orders can also deadlock:
    Process P1: ... wait(S); wait(Q); /* critical section */ signal(S); signal(Q); ...
    Process P2: ... wait(Q); wait(S); /* critical section */ signal(Q); signal(S); ...
- To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
- A monitor is an approach to synchronizing operations on a shared resource. A monitor includes: a suite of procedures that provides the only way to access the shared resource; a mutual-exclusion lock; the variables associated with the shared resource; and invariants that are assumed to hold, so that conflicting events are avoided.
- A type, or abstract data type, encapsulates private data with public methods that operate on that data.
- The monitor type presents a set of programmer-defined operations that are provided with mutual exclusion within the monitor.
- The monitor type also contains the declarations of variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.
- The structure of a monitor type:
    monitor monitor_name {
        /* shared variable declarations */
        procedure P1(...) { ... }
        procedure P2(...) { ... }
        ...
        procedure Pn(...) { ... }
        initialization_code(...) { ... }
    }
(Figure: structure of a monitor.)
Usage of a monitor:
- The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
- The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; additional "tailor-made" synchronization mechanisms need to be defined.
- Such tailor-made synchronization uses the condition construct.
The condition type:
- Declaration: condition x, y;
- The only operations that can be invoked on a condition variable are wait() and signal():
    x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
    x.signal(): the process invoking this operation resumes exactly one suspended process.
- If no process is suspended, x.signal() has no effect.
(Figure: structure of a monitor with the condition type.)
Condition variables:
- Allow threads to synchronize based upon the actual value of data.
- Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be very time-consuming and unproductive, since the thread would be continuously busy with this activity.
- A condition variable is always used in conjunction with a mutex lock.
Sharing Data
Pthread Condition Variables
    pthread_cond_t cond;   /* declare a condition variable */
    int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
Initializes the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition-variable attributes object. Upon successful initialization, the state of the condition variable becomes initialized.
    int pthread_cond_signal(pthread_cond_t *cond);
Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
Pthread Condition Variables (cont.)
    int pthread_cond_broadcast(pthread_cond_t *cond);
Unblocks all threads currently blocked on the specified condition variable cond.
    int pthread_cond_destroy(pthread_cond_t *cond);
Destroys the condition variable specified by cond; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.
    int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
    int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
Both block on a condition variable; the second allows a timeout to be specified.
Sequence for using condition variables: example
This simple example demonstrates the use of several Pthread condition variable routines. The main routine creates three threads. Two of the threads perform work and update a count variable; the third thread waits until the count variable reaches a specified value.

main:
    #include <pthread.h>
    int count = 0;                        /* global variable: DECLARE */
    pthread_mutex_t count_mutex;
    pthread_cond_t count_threshold_cv;
    ...
    pthread_mutex_init(&count_mutex, NULL);        /* INITIALIZE */
    pthread_cond_init(&count_threshold_cv, NULL);
    ...
    pthread_create(...);       /* CREATE THREADS TO DO WORK */

Threads 2 and 3:
    void *inc_count(void *t) {
        ...
        for (i = 0; i < TCOUNT; i++) {
            pthread_mutex_lock(&count_mutex);
            count++;
            if (count == COUNT_LIMIT)
                pthread_cond_signal( ... );
            ...
            pthread_mutex_unlock(&count_mutex);
            sleep(1);   /* do some work so the threads alternate on the mutex lock */
        }
        pthread_exit(NULL);
    }

Thread 1:
    void *watch_count(void *t) {
        pthread_mutex_lock(&count_mutex);
        while (count < COUNT_LIMIT)      /* re-check the condition after each wakeup */
            pthread_cond_wait( ... );
        count += 125;
        pthread_mutex_unlock( ... );
        pthread_exit(NULL);
    }
Sharing Data
The key aspect of shared-memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass data in messages, as in a message-passing environment.
Creating Shared Data
- Each process has its own virtual address space within the virtual memory management system.
- Shared-memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
shmget(): creates a shared memory segment; its return value is the shared memory ID.
shmat(): attaches the shared segment to the data segment of the calling process and returns the starting address of the segment.
Accessing Shared Data
- Accessing shared data needs careful control if the data is ever altered by a process.
- The problem is conflict: reading a variable by different processes does not cause conflict, but writing new values can.
Example: consider two processes, each of which is to add 1 to a shared data item x. The single statement x = x + 1 takes three steps in each process: read x, compute x + 1, write to x. If both processes read x before either writes, one update is lost:

    Instruction     Process 1          Process 2
    x = x + 1       read x
                                       read x
                    compute x + 1
                                       compute x + 1
                    write to x
                                       write to x        (time runs downward)

(Figure: conflict in accessing shared variable x; both processes read x, add 1, and write back.)
The problem of accessing shared data can be generalized by considering shared resources.
Mechanism: Mutual Exclusion
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish the sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time. This mechanism is known as mutual exclusion.
Locks
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
- A lock is a variable containing the value 0 or 1:
    lock = 1: a process has entered the critical section;
    lock = 0: no process is in the critical section.
- The lock operates much like a door lock. Suppose a process reaches a lock that is set, indicating that it is excluded from the critical section; it then has to wait until it is allowed to enter:

    while (lock == 1) ;   /* do nothing: no operation in the while loop */
    lock = 1;             /* enter critical section */
    /* ... critical section ... */
    lock = 0;             /* leave critical section */

A lock that is waited on in this way is called a spin lock, and the mechanism is called busy waiting.
In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead, but this incurs overhead in saving and restoring process information, and it then becomes necessary to choose the best, or highest-priority, process to enter the critical section.
Each process executes the same protocol, so only one at a time can be in its critical section:

    /* Process 1 */               /* Process 2 */
    while (lock == 1) ;           while (lock == 1) ;
    lock = 1;                     lock = 1;
    /* critical section */        /* critical section */
    lock = 0;                     lock = 0;
Pthread Lock Routines
- Locks are implemented in Pthreads with what are called mutual-exclusion lock variables (mutexes), which resolve synchronization problems between threads.
- A mutex is used to share resources among threads in an orderly fashion; it provides mutual exclusion between threads.
- Note: a mutex can only be used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.
    #include <pthread.h>   /* the library containing the mutex functions */
Declaration:
    pthread_mutex_t mutex;
Static initialization:
    mutex = PTHREAD_MUTEX_INITIALIZER;
    mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
    int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
The important functions:
    int pthread_mutex_lock(pthread_mutex_t *mutex);
    int pthread_mutex_unlock(pthread_mutex_t *mutex);
    int pthread_mutex_trylock(pthread_mutex_t *mutex);
    int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency Analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Examples:
Ex1:
    forall (i = 0; i < 5; i++)
        a[i] = 0;
All instances can be executed simultaneously.
Ex2:
    forall (i = 2; i < 6; i++) {
        x = i - 2*i + i*i;
        a[i] = a[x];
    }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
LOGO
Semaphore
wwwthemegallerycom Company Logo
sEMaPHoRE
A positive integer operated upon by 2 operations P amp V
The value is the number of the units of the resource which are free
A binary semaphore has value 0 or 1A general semaphore can take on
positive values other than 0 and 1
wwwthemegallerycom Company Logo
sEMaPHoRE
P amp V operations are performed indivisibly
P(s) waits until s is greater than 0 then decrements s by 1 allows the process to continue
V(s) increments s by 1 to release one of the waiting processes (if any)
wwwthemegallerycom Company Logo
sEMaPHoRE
The first process reach its P(s) operation or to be accepted will set the semaphore to 0
Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore
wwwthemegallerycom Company Logo
sEMaPHoRE
When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section
Monitor
Disadvange of Semaphorebull Although semaphore provide a
convenient and effective mechanism for process synchronization using them incorrectly can result in timing error that are difficult to detect (deadlock) since these errors happen only if some particular execution sequences take place and these sequences do not always occur
bull Using incorrectly may be caused by an honest programming error or an uncooperative programmer
Shared memory multiprocessor
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur
bull Both processes are simultaneously active will cause a deadlock
Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type
bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only
method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events
conflict
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A type or abstract data type encapsulates private data with public method to operate on that data
bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor
bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()
procedure P2()
procedure Pn()
initialization_code ()
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Usage Monitor
bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly
bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization
bull Some addiontional synchronization ldquotailor moderdquo use conditional construct
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor conditional type
Condition variables
Condition variablesAllow threads to synchronize based
upon the actual value of data Without condition variables the
programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity
Always used in conjunction with a mutex lock
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified
condition variable cond (if any threads are blocked on cond)
int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)
initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised
Pthread_cond_t cond Declare condition variable
Sharing Data
Pthread Condition Variables(cont)
int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable
cond
int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in
effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined
int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)
int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)
block on a condition variable the second one alow to appoint timeout
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK
main
Thread 23
void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)
void watch_count(void t) pthread_mutex_lock(ampcount_mutex)
if (countltCOUNT_LIMIT)
pthread_cond_wait (hellip)
count += 125
pthread_mutex_unlock(hellip) pthread_exit(NULL)
Thread 1
Sharing Data
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtualmemory management system
Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space
Creating Shared Data (tt)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach shared segment to the data segmentof the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 and Process 2 each execute the same sequence, competing for the lock:

  while (lock == 1) ;   /* spin */
  lock = 1;
  /* critical section */
  lock = 0;

While Process 1 holds the lock (lock = 1), Process 2 spins in its while loop; when Process 1 sets lock = 0, Process 2 can set lock = 1 and enter its critical section.
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).
They resolve synchronization problems between threads.
A mutex is used to grant threads access to shared resources in turn.
It provides mutual exclusion between threads.
Note
A mutex can only be used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.
www.themegallery.com
#include <pthread.h>   /* header containing the mutex functions */
Declare the variable: pthread_mutex_t mutex;
Static initialization:
  mutex = PTHREAD_MUTEX_INITIALIZER;
  mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
  int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex1: forall (i = 0; i < 5; i++) a[i] = 0;
All instances can be executed simultaneously.
Ex2: forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's Conditions
- I(i) is the set of memory locations read by process P(i).
- O(j) is the set of memory locations altered by process P(j).
Two processes P1 and P2 can be executed concurrently if all three conditions are satisfied:
  I1 ∩ O2 = ∅
  I2 ∩ O1 = ∅
  O1 ∩ O2 = ∅
Dependency analysis
Example 1: Suppose the two statements are (in C):
  a = x + y;
  b = x + z;
Then I1 = (x, y), I2 = (x, z), O1 = (a), O2 = (b), so
  I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅
and the two statements can be executed simultaneously.
Dependency analysis
Example 2: Suppose the two statements are (in C):
  a = x + y;
  b = a + b;
Then I1 = (x, y), I2 = (a, b), O1 = (a), O2 = (b), so
  I2 ∩ O1 = {a} ≠ ∅
and the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data
In a parallel programming language supporting shared memory, a shared variable might be declared as:
  shared int x;
(In C/C++ with threads, a global variable plays this role.)
par Construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
  par { s1; s2; ...; sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
  par { proc1(); proc2(); ...; procn(); }
forall Construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
  forall (i = 0; i < n; i++) { s1; s2; ...; sm; }
which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, ..., sm. Each process uses a different value of i.
  forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Shared Data in Systems with Caches
Cache coherence protocols
In the update policy, copies of data in all caches are updated at the time one copy is altered.
In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated. These copies are only updated when the associated processor next makes reference to them.
Shared Data in Systems with Caches
False sharing
The key characteristic used is that caches are organized in blocks of contiguous locations. False sharing occurs when different processors require different parts of the same block, but not the same bytes.
Shared Data in Systems with Caches
Solution for false sharing
The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.
The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central memory or a distributed memory architecture.
Distributed memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote memory access) architectures.
Central memory systems are also known as UMA (uniform memory access) systems.
Central memory versus distributed memory
In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types: PVP, the parallel vector processor (also called a vector supercomputer), and SMP, the symmetric multiprocessor.
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA and COMA.
Distributed-Memory architecture
In a NORMA machine:
- The node memories have separate address spaces.
- A node can't directly access remote memory.
- The only way to access remote data is by passing messages.
Distributed-Memory architecture
In an NCC-NUMA machine:
- A typical example is the Cray T3E.
- Besides the local memory, each node has a set of node-level registers called E-registers.
- Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-Memory architecture
In a COMA machine:
- All local memories are structured as caches (called COMA caches).
- A COMA cache has much larger capacity than the level-2 cache or the remote cache of a node.
- COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA and COMA
An NCC-NUMA system doesn't have hardware support for cache coherence, but CC-NUMA and COMA provide cache coherence support in hardware.
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
CC-NUMA: main memory consists of all the local memories.
COMA: main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of Five Distributed-Memory Architectures
Semaphore
A semaphore is a positive integer operated upon by two operations, P and V.
Its value is the number of units of the resource which are free.
A binary semaphore has value 0 or 1. A general semaphore can take on positive values other than 0 and 1.
Semaphore
P and V operations are performed indivisibly:
P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
V(s): increments s by 1 to release one of the waiting processes (if any).
Semaphore
The first process to reach its P(s) operation and be accepted will set the semaphore to 0.
Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
Semaphore
When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes (if any) is allowed to proceed into the critical section.
Monitor
Disadvantages of semaphores:
- Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (e.g. deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.
- Incorrect use may be caused by an honest programming error or an uncooperative programmer.
Monitor
Correct code:
  ...
  wait(mutex);
  /* critical section */
  signal(mutex);
  ...
Example of incorrect semaphore use:
  ...
  signal(mutex);
  /* critical section */
  wait(mutex);
  ...
This incorrect code violates the mutual exclusion condition.
Monitor
Correct code:
  ...
  wait(mutex);
  /* critical section */
  signal(mutex);
  ...
Example of incorrect semaphore use:
  ...
  wait(mutex);
  /* critical section */
  wait(mutex);
  ...
This incorrect code causes deadlock.
Khoa Khoa học & Kỹ thuật Máy tính, Đại học Bách Khoa TP.HCM (Faculty of Computer Science & Engineering, HCMC University of Technology)
Monitor
- If the programmer omits the wait() or the signal() around a critical section, or both, either mutual exclusion is violated or a deadlock occurs.
- Two processes that are simultaneously active and acquire the semaphores in opposite orders can deadlock:
Process P1: ... wait(S); wait(Q); /* critical section */ signal(S); signal(Q); ...
Process P2: ... wait(Q); wait(S); /* critical section */ signal(Q); signal(S); ...
Monitor
- To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
- A monitor is an approach to synchronizing operations on a shared resource. A monitor includes:
  A suite of procedures that provide the only method of access to the shared resource.
  Mutual exclusion among those procedures.
  Variables associated with the shared resource.
  Invariants that are assumed in order to avoid conflicts.
Monitor
- A type, or abstract data type, encapsulates private data with public methods that operate on that data.
- The monitor type presents a set of programmer-defined operations that are provided with mutual exclusion within the monitor.
- The monitor type also contains the declarations of variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.
Monitor
A structure of the monitor type:

  monitor monitor_name {
      /* shared variable declarations */
      procedure P1(...) { ... }
      procedure P2(...) { ... }
      ...
      procedure Pn(...) { ... }
      initialization_code(...) { ... }
  }
Structure Monitor
Usage of Monitors
- The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
- The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; some additional "tailor-made" synchronization mechanisms need to be defined.
- Such additional tailor-made synchronization uses the condition construct.
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Structure of a monitor with the condition type
Condition variables
Condition variables allow threads to synchronize based upon the actual value of data.
Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met. This can be a very time-consuming and unproductive exercise, since the thread would be continuously busy in this activity.
A condition variable is always used in conjunction with a mutex lock.
Sharing Data
Pthread Condition Variables
pthread_cond_t cond;  /* declare a condition variable */
int pthread_cond_signal(pthread_cond_t *cond);
  Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
  Initializes the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialization, the state of the condition variable becomes initialized.
Sharing Data
Pthread Condition Variables (cont.)
int pthread_cond_broadcast(pthread_cond_t *cond);
  Unblocks all threads currently blocked on the specified condition variable cond.
int pthread_cond_destroy(pthread_cond_t *cond);
  Destroys the given condition variable specified by cond; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
  Block on a condition variable; the second one allows a timeout (abstime) to be specified.
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
#include <pthread.h>

int count = 0;                        /* global shared variable: DECLARE */
pthread_mutex_t count_mutex;
pthread_cond_t  count_threshold_cv;
...
pthread_mutex_init(&count_mutex, NULL);           /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
...
pthread_create(...);                  /* CREATE THREADS TO DO WORK */

/* Threads 2 and 3 */
void *inc_count(void *t) {
    ...
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(...);
        ...
        pthread_mutex_unlock(&count_mutex);
        /* do some work so the threads can alternate on the mutex lock */
        sleep(1);
    }
    pthread_exit(NULL);
}

/* Thread 1 */
void *watch_count(void *t) {
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)       /* while, not if: guards against spurious wakeups */
        pthread_cond_wait(...);
    count += 125;
    pthread_mutex_unlock(...);
    pthread_exit(NULL);
}
Sharing Data
The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass data in messages, as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual memory management system.
Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
shmget()
  Creates a shared memory segment.
  Its return value is the shared memory ID.
shmat()
  Attaches the shared segment to the data segment of the calling process.
  Returns the starting address of the data segment.
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
wwwthemegallerycom Company Logo
sEMaPHoRE
P amp V operations are performed indivisibly
P(s) waits until s is greater than 0 then decrements s by 1 allows the process to continue
V(s) increments s by 1 to release one of the waiting processes (if any)
wwwthemegallerycom Company Logo
sEMaPHoRE
The first process reach its P(s) operation or to be accepted will set the semaphore to 0
Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore
wwwthemegallerycom Company Logo
sEMaPHoRE
When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section
Monitor
Disadvange of Semaphorebull Although semaphore provide a
convenient and effective mechanism for process synchronization using them incorrectly can result in timing error that are difficult to detect (deadlock) since these errors happen only if some particular execution sequences take place and these sequences do not always occur
bull Using incorrectly may be caused by an honest programming error or an uncooperative programmer
Shared memory multiprocessor
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur
bull Both processes are simultaneously active will cause a deadlock
Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type
bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only
method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events
conflict
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A type or abstract data type encapsulates private data with public method to operate on that data
bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor
bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()
procedure P2()
procedure Pn()
initialization_code ()
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Usage Monitor
bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly
bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization
bull Some addiontional synchronization ldquotailor moderdquo use conditional construct
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor conditional type
Condition variables
Condition variablesAllow threads to synchronize based
upon the actual value of data Without condition variables the
programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity
Always used in conjunction with a mutex lock
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified
condition variable cond (if any threads are blocked on cond)
int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)
initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised
Pthread_cond_t cond Declare condition variable
Sharing Data
Pthread Condition Variables(cont)
int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable
cond
int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in
effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined
int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)
int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)
block on a condition variable the second one alow to appoint timeout
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
#include <pthread.h>

int count = 0;                                 /* global variable */
pthread_mutex_t count_mutex;                   /* DECLARE */
pthread_cond_t count_threshold_cv;
...
pthread_mutex_init(&count_mutex, NULL);        /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
...
pthread_create(...);                           /* CREATE THREADS TO DO WORK */

/* main */

/* Threads 2, 3 */
void *inc_count(void *t)
{
    ...
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(...);
        pthread_mutex_unlock(&count_mutex);
        /* do some work so the threads can alternate on the mutex lock */
        sleep(1);
    }
    pthread_exit(NULL);
}

/* Thread 1 */
void *watch_count(void *t)
{
    pthread_mutex_lock(&count_mutex);
    if (count < COUNT_LIMIT)
        pthread_cond_wait(...);
    count += 125;
    pthread_mutex_unlock(...);
    pthread_exit(NULL);
}
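The fragments above can be assembled into a complete, compilable sketch. The values of TCOUNT and COUNT_LIMIT and the name run_demo() are assumptions chosen for this illustration; the watcher uses a while loop rather than an if so that spurious wakeups are handled correctly:

```c
#include <pthread.h>

#define TCOUNT 5        /* increments per worker (assumed value) */
#define COUNT_LIMIT 7   /* threshold the watcher waits for (assumed value) */

static int count = 0;
static pthread_mutex_t count_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t count_threshold_cv = PTHREAD_COND_INITIALIZER;

static void *inc_count(void *arg) {
    (void)arg;
    for (int i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        /* wake the watcher once the threshold is reached */
        if (count == COUNT_LIMIT)
            pthread_cond_signal(&count_threshold_cv);
        pthread_mutex_unlock(&count_mutex);
    }
    return NULL;
}

static void *watch_count(void *arg) {
    (void)arg;
    pthread_mutex_lock(&count_mutex);
    /* a while loop (not if) guards against spurious wakeups */
    while (count < COUNT_LIMIT)
        pthread_cond_wait(&count_threshold_cv, &count_mutex);
    count += 125;
    pthread_mutex_unlock(&count_mutex);
    return NULL;
}

int run_demo(void) {
    pthread_t w, t1, t2;
    pthread_create(&w, NULL, watch_count, NULL);
    pthread_create(&t1, NULL, inc_count, NULL);
    pthread_create(&t2, NULL, inc_count, NULL);
    pthread_join(w, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return count;   /* always 2*TCOUNT + 125 = 135, whatever the interleaving */
}
```

Whatever order the threads run in, the two workers contribute 10 increments and the watcher adds 125 after the threshold is reached, so the final count is deterministic.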
Sharing Data
The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual memory management system.
Shared-memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
shmget()
  Creates a shared memory segment
  Return value is the shared memory ID
shmat()
  Attaches the shared segment to the data segment of the calling process
  Returns the starting address of the data segment
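As a minimal sketch of these two calls in use (shm_roundtrip() is a name invented here; the snippet assumes a Unix system with System V shared memory available, and it detaches and removes the segment when done):

```c
#include <sys/ipc.h>
#include <sys/shm.h>
#include <string.h>

/* create a segment, attach it, write and read it back, then clean up;
   returns 0 on success, -1 on any failure */
int shm_roundtrip(void) {
    /* shmget(): create a 4 KiB private segment; return value is the shared memory ID */
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (shmid == -1) return -1;

    /* shmat(): attach the segment; returns its starting address in this process */
    char *addr = (char *)shmat(shmid, NULL, 0);
    if (addr == (char *)-1) return -1;

    strcpy(addr, "shared");                 /* write through the attached address */
    int ok = (strcmp(addr, "shared") == 0); /* read it back */

    shmdt(addr);                     /* detach from our address space */
    shmctl(shmid, IPC_RMID, NULL);   /* mark the segment for removal */
    return ok ? 0 : -1;
}
```

A child created with fork() after shmat() would inherit the attachment and see the same bytes, which is the point of the mechanism.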
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
Problem: CONFLICT
o Reading the variable by different processes does not cause a conflict
o But writing a new value does: CONFLICT
Example: consider two processes, each of which is to add 1 to a shared data item x.

Instruction    Process 1        Process 2
x = x + 1      read x           read x
               compute x + 1    compute x + 1
               write to x       write to x
(time increases downward)
(Figure: conflict in accessing shared variable x — both processes read x, add 1, and write back, so one of the two increments is lost.)
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.
This mechanism is called mutual exclusion.
Lock
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
A lock is a variable containing the value 0 or 1:
  lock = 1: a process has entered the critical section
  lock = 0: no process is in the critical section
The lock operates much like a door lock: suppose a process reaches a lock that is set, indicating that the process is excluded from the critical section. It then has to wait until it is allowed to enter.

  while (lock == 1);   /* no operation in while loop */
  lock = 1;            /* enter critical section */
    ...critical section...
  lock = 0;            /* leave critical section */

A lock used this way is called a spin lock.
This form of waiting is called busy waiting.
In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead, but this has costs:
  Overhead in saving and restoring process information
  The need to choose the best or highest-priority process to enter the critical section
Process 1:                          Process 2:
while (lock == 1);
lock = 1;
  /* critical section */            while (lock == 1);  /* spins while Process 1 holds the lock */
lock = 0;
                                    lock = 1;
                                      /* critical section */
                                    lock = 0;
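Note that the naive spin lock above is not safe on real hardware: the test (while) and the set (lock = 1) are two separate steps, so two processes can both see lock == 0 and both enter. A real spin lock needs an atomic test-and-set; the sketch below uses C11's atomic_flag for this (spin_demo and the other names are invented for this illustration):

```c
#include <stdatomic.h>
#include <pthread.h>

static atomic_flag lk = ATOMIC_FLAG_INIT;   /* clear = unlocked */
static long shared_x = 0;

static void *worker(void *arg) {
    long n = *(long *)arg;
    for (long i = 0; i < n; i++) {
        /* atomically set the flag and return its old value:
           spin (busy-wait) while it was already set */
        while (atomic_flag_test_and_set(&lk))
            ;
        shared_x++;                 /* critical section */
        atomic_flag_clear(&lk);     /* lock = 0: leave critical section */
    }
    return NULL;
}

/* run two workers, each incrementing shared_x n times under the spin lock */
long spin_demo(long n) {
    shared_x = 0;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, &n);
    pthread_create(&t2, NULL, worker, &n);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return shared_x;
}
```

Because test-and-set is a single atomic operation, no increment is lost and the final value is exactly 2n.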
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).
They resolve synchronization problems between threads:
  A mutex is used to share resources among threads in an ordered fashion
  It provides mutual exclusion between threads
Note:
A mutex is only used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.
#include <pthread.h>   /* header that declares the mutex functions */
Declare a variable: pthread_mutex_t mutex;
Static initialization:
  mutex = PTHREAD_MUTEX_INITIALIZER;
  mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function call:
  int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
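A minimal usage sketch of these routines (mutex_demo and add_many are names invented here): two threads increment a shared counter under the mutex, and trylock is shown probing an uncontended mutex afterwards.

```c
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static int counter = 0;

static void *add_many(void *arg) {
    int n = *(int *)arg;
    for (int i = 0; i < n; i++) {
        pthread_mutex_lock(&m);     /* blocks until the mutex is acquired */
        counter++;                  /* critical section */
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

int mutex_demo(int n) {
    counter = 0;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, add_many, &n);
    pthread_create(&t2, NULL, add_many, &n);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* trylock returns 0 when it acquires the mutex, EBUSY if it is held */
    if (pthread_mutex_trylock(&m) == 0)
        pthread_mutex_unlock(&m);
    return counter;   /* exactly 2n: no increment is lost */
}
```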
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex1: forall (i = 0; i < 5; i++)
       a[i] = 0;
All instances can be executed simultaneously.
Ex2: forall (i = 2; i < 6; i++) {
       x = i - 2*i + i*i;
       a[i] = a[x];
     }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's Conditions
- I(n) is the set of memory locations read by process P(n)
- O(m) is the set of memory locations altered by process P(m)
If the three conditions
  I1 ∩ O2 = ∅
  I2 ∩ O1 = ∅
  O1 ∩ O2 = ∅
are all satisfied, the two processes can be executed concurrently.
Dependency analysis
Example 1: Suppose the two statements are (in C):
  a = x + y;
  b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}.
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, and O1 ∩ O2 = ∅, so the two statements can be executed simultaneously.
Dependency analysis
Example 2: Suppose the two statements are (in C):
  a = x + y;
  b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}.
I2 ∩ O1 = {a} ≠ ∅, so the two statements cannot be executed simultaneously.
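Bernstein's conditions can be checked mechanically by representing each read/write set as a bitmask, one bit per variable; the encoding below is invented for illustration:

```c
/* one bit per variable: a set of variables is a bitwise OR of these */
enum { X = 1, Y = 2, Z = 4, A = 8, B = 16 };

/* Bernstein's conditions: all three intersections (bitwise ANDs) must be empty */
int can_run_concurrently(int I1, int O1, int I2, int O2) {
    return (I1 & O2) == 0 && (I2 & O1) == 0 && (O1 & O2) == 0;
}
```

For Example 1 this yields can_run_concurrently(X|Y, A, X|Z, B) == 1, while for Example 2, can_run_concurrently(X|Y, A, A|B, B) == 0, because I2 ∩ O1 = {a}.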
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, variables might be declared as:
  shared int x;
With C++:
  int global x;
par construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
  par {
    S1;
    S2;
    ...
    Sn;
  }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
  par {
    proc1();
    proc2();
    ...
    procn();
  }
forall construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
  forall (i = 0; i < n; i++) {
    S1;
    S2;
    ...
    Sm;
  }
which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, ..., Sm. Each process uses a different value of i.
Example:
  forall (i = 0; i < 5; i++)
    a[i] = 0;
clears a[0], a[1], a[2], a[3], and a[4] to zero concurrently.
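C itself has no forall, but the same idea can be expressed with OpenMP's parallel for pragma (compile with -fopenmp; without that flag the pragma is simply ignored and the loop runs sequentially, with the same result):

```c
/* the forall example above, expressed with OpenMP: each iteration is
   independent, so the instances may legitimately run concurrently */
void clear_all(int *a, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 0;
}
```

This works precisely because the loop body satisfies Bernstein's conditions across iterations: each iteration writes only its own a[i].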
Shared Data in Systems with Caches
Shared Data in Systems with Caches
Cache coherence protocols:
In the update policy, copies of data in all caches are updated at the time one copy is altered.
In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated. These copies are only updated when the associated processor makes reference to the data again.
Shared Data in Systems with Caches
False sharing: the key characteristic involved is that caches are organized in blocks of contiguous locations. False sharing occurs when different processors require different parts of the same block, but not the same bytes.
Share Data in Systems with Caches
Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.
The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.
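One common manual form of this layout change is padding each per-thread datum out to a full cache line, so that neighbouring counters land in different blocks; 64 bytes is an assumed line size, and the type name is invented for illustration:

```c
#define CACHE_LINE 64   /* assumed cache-line (block) size in bytes */

/* each counter occupies a whole line, so two threads updating
   neighbouring counters never ping-pong the same block */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];   /* spacer to fill the line */
};

struct padded_counter counters[4];   /* one per thread, one line each */
```

The trade-off is exactly the storage wastage mentioned above: each 8-byte counter now consumes 64 bytes.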
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central-memory or a distributed-memory architecture.
Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote-memory access) architectures.
Central-memory systems are also known as UMA (uniform memory access) systems.
Central memory versus distributed memory
In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types: PVP, the parallel vector processor (also called a vector supercomputer), and SMP, the symmetric multiprocessor.
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.
Distributed-Memory architecture
In a NORMA machine:
  The node memories have separate address spaces.
  A node can't directly access remote memory.
  The only way to access remote data is by passing messages.
Distributed-Memory architecture
In an NCC-NUMA machine:
  A typical example is the Cray T3E.
  Besides the local memory, each node has a set of node-level registers called E-registers.
  Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-Memory architecture
In a COMA machine:
  All local memories are structured as caches (called COMA caches).
  A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node.
  COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA and COMA
An NCC-NUMA system doesn't have hardware support for cache coherence, but CC-NUMA and COMA provide cache-coherence support in hardware.
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
CC-NUMA: main memory consists of all the local memories.
COMA: main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of the five distributed-memory architectures

Semaphore
The first process to reach its P(s) operation, or to be accepted, will set the semaphore to 0.
Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes is allowed to proceed into the critical section.
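P(s) and V(s) correspond to sem_wait() and sem_post() on POSIX unnamed semaphores; the sketch below (sem_demo and job are invented names, and a Linux-style sem_init is assumed) initializes s to 1 so it acts as a binary semaphore guarding the critical section:

```c
#include <semaphore.h>
#include <pthread.h>

static sem_t s;          /* binary semaphore guarding the critical section */
static int total = 0;

static void *job(void *arg) {
    int n = *(int *)arg;
    for (int i = 0; i < n; i++) {
        sem_wait(&s);    /* P(s): block until s > 0, then decrement to 0 */
        total++;         /* critical section */
        sem_post(&s);    /* V(s): set s back to 1, releasing one waiter */
    }
    return NULL;
}

int sem_demo(int n) {
    total = 0;
    sem_init(&s, 0, 1);  /* initial value 1: the section is initially free */
    pthread_t t1, t2;
    pthread_create(&t1, NULL, job, &n);
    pthread_create(&t2, NULL, job, &n);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    sem_destroy(&s);
    return total;        /* exactly 2n: mutual exclusion preserved */
}
```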
Monitor
Disadvantage of semaphores:
• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (e.g. deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.
• Incorrect use may be caused by an honest programming error or an uncooperative programmer.
Monitor
Example of incorrect semaphore use:
Right code:  ... wait(mutex); critical section; signal(mutex); ...
Wrong code:  ... signal(mutex); critical section; wait(mutex); ...
This incorrect code violates mutual exclusion.
Monitor
Example of incorrect semaphore use:
Right code:  ... wait(mutex); critical section; signal(mutex); ...
Wrong code:  ... wait(mutex); critical section; wait(mutex); ...
This incorrect code causes deadlock.
Khoa Khoa học & Kỹ thuật Máy tính - Đại học Bách Khoa TP.HCM (Faculty of Computer Science & Engineering, HCMC University of Technology)
Monitor
• If the programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock will occur.
• With the two processes below both simultaneously active, each can acquire its first semaphore and then block forever waiting for the other: deadlock.
Process P1: ... wait(S); wait(Q); critical section; signal(S); signal(Q);
Process P2: ... wait(Q); wait(S); critical section; signal(Q); signal(S);
Monitor
• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
• A monitor is an approach to synchronizing operations on a shared resource. A monitor includes:
  - a suite of procedures that provides the only method of access to the shared resource
  - mutual exclusion among those procedures
  - the variables associated with the shared resource
  - invariants that are assumed to hold, to avoid conflicts
Monitor
• A monitor type is an abstract data type: it encapsulates private data with public methods that operate on that data.
• A monitor type presents a set of programmer-defined operations that are provided with mutual exclusion within the monitor.
• The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.
Monitor
• The structure of a monitor type:

monitor monitor_name {
    /* shared variable declarations */
    procedure P1(...) { ... }
    procedure P2(...) { ... }
    ...
    procedure Pn(...) { ... }
    initialization_code(...) { ... }
}
Structure of a monitor
Usage of Monitors
• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
• The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; some additional "tailor-made" synchronization mechanisms need to be defined.
• Such additional "tailor-made" synchronization uses the condition construct.
Condition type
• Declaration: condition x, y;
• The only operations that can be invoked on a condition variable are wait() and signal():
  x.wait(): the process invoking this operation is suspended until another process invokes x.signal()
  x.signal(): the process invoking this operation resumes exactly one suspended process
• If no process is suspended, x.signal() has no effect.
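A monitor can be emulated in C with one mutex providing monitor-wide mutual exclusion and a pthread condition variable playing the role of x.wait()/x.signal(); counter_monitor and its operations are names invented for this sketch:

```c
#include <pthread.h>

/* a monitor emulated in C: the mutex makes every operation mutually
   exclusive; the condition variable provides x.wait()/x.signal() */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t nonzero;   /* signalled when value becomes positive */
    int value;
} counter_monitor;

void monitor_init(counter_monitor *m) {
    pthread_mutex_init(&m->lock, NULL);
    pthread_cond_init(&m->nonzero, NULL);
    m->value = 0;
}

void monitor_increment(counter_monitor *m) {
    pthread_mutex_lock(&m->lock);        /* enter the monitor */
    m->value++;
    pthread_cond_signal(&m->nonzero);    /* x.signal(): resume one waiter */
    pthread_mutex_unlock(&m->lock);      /* leave the monitor */
}

/* waits until the counter is positive, then decrements and
   returns the remaining value */
int monitor_decrement_when_positive(counter_monitor *m) {
    pthread_mutex_lock(&m->lock);
    while (m->value == 0)                         /* x.wait(): suspend */
        pthread_cond_wait(&m->nonzero, &m->lock);
    m->value--;
    int v = m->value;
    pthread_mutex_unlock(&m->lock);
    return v;
}
```

Only one thread is ever active "inside" these operations at a time, which is exactly the guarantee the monitor construct provides in the languages above.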
Structure of a monitor with a condition type
Condition variables
Condition variables allow threads to synchronize based upon the actual value of data.
Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This is very time-consuming and unproductive, since the thread would be continuously busy with this activity.
A condition variable is always used in conjunction with a mutex lock.
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t *cond); unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)
initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised
Pthread_cond_t cond Declare condition variable
Sharing Data
Pthread Condition Variables(cont)
int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable
cond
int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in
effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined
int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)
int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)
block on a condition variable the second one alow to appoint timeout
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK
main
Thread 23
void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)
void watch_count(void t) pthread_mutex_lock(ampcount_mutex)
if (countltCOUNT_LIMIT)
pthread_cond_wait (hellip)
count += 125
pthread_mutex_unlock(hellip) pthread_exit(NULL)
Thread 1
Sharing Data
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtualmemory management system
Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space
Creating Shared Data (tt)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach shared segment to the data segmentof the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
wwwthemegallerycom Company Logo
sEMaPHoRE
When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section
Monitor
Disadvange of Semaphorebull Although semaphore provide a
convenient and effective mechanism for process synchronization using them incorrectly can result in timing error that are difficult to detect (deadlock) since these errors happen only if some particular execution sequences take place and these sequences do not always occur
bull Using incorrectly may be caused by an honest programming error or an uncooperative programmer
Shared memory multiprocessor
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur
bull Both processes are simultaneously active will cause a deadlock
Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type
bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only
method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events
conflict
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A type or abstract data type encapsulates private data with public method to operate on that data
bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor
bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()
procedure P2()
procedure Pn()
initialization_code ()
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Usage Monitor
bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly
bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization
bull Some addiontional synchronization ldquotailor moderdquo use conditional construct
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor conditional type
Condition variables
Condition variablesAllow threads to synchronize based
upon the actual value of data Without condition variables the
programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity
Always used in conjunction with a mutex lock
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t *cond);
    Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
    Initialises the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.
pthread_cond_t cond;    /* declare a condition variable */
Sharing Data
Pthread Condition Variables (cont.)
int pthread_cond_broadcast(pthread_cond_t *cond);
    Unblocks all threads currently blocked on the specified condition variable cond.
int pthread_cond_destroy(pthread_cond_t *cond);
    Destroys the condition variable specified by cond; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
    Both block on a condition variable; the second also allows a timeout (abstime) to be specified.
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines. The main routine creates three threads: two of them perform work and update a count variable; the third waits until the count variable reaches a specified value.
Sharing Data
Sequence for using condition variable - example (cont)
main:

    #include <pthread.h>
    int count = 0;                                   /* global var: DECLARE */
    pthread_mutex_t count_mutex;
    pthread_cond_t  count_threshold_cv;
    ...
    pthread_mutex_init(&count_mutex, NULL);          /* INITIALIZE */
    pthread_cond_init(&count_threshold_cv, NULL);
    ...
    pthread_create(...);                             /* CREATE THREADS TO DO WORK */

Threads 2, 3:

    void *inc_count(void *t)
    {
        ...
        for (i = 0; i < TCOUNT; i++) {
            pthread_mutex_lock(&count_mutex);
            count++;
            if (count == COUNT_LIMIT)
                pthread_cond_signal(...);
            ...
            pthread_mutex_unlock(&count_mutex);
            sleep(1);    /* do some work so the threads can alternate on the mutex lock */
        }
        pthread_exit(NULL);
    }

Thread 1:

    void *watch_count(void *t)
    {
        pthread_mutex_lock(&count_mutex);
        while (count < COUNT_LIMIT)    /* a while loop, so the condition is rechecked on wakeup */
            pthread_cond_wait(...);
        count += 125;
        pthread_mutex_unlock(...);
        pthread_exit(NULL);
    }
Sharing Data
The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages, as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual memory management system.
Shared-memory system calls allow processes to attach a segment of physical memory to their virtual address space.
Creating Shared Data (cont.)
shmget()
    Creates a shared memory segment.
    The return value is the shared memory ID.
shmat()
    Attaches the shared segment to the data segment of the calling process.
    Returns the starting address of the segment.
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
Problem: CONFLICT
  o Reading the variable by different processes does not cause a conflict.
  o But writing new values does.
Example: consider two processes, each of which is to add 1 to a shared data item x:

    Instruction      Process 1        Process 2
    x = x + 1        read x           read x
                     compute x + 1    compute x + 1
                     write to x       write to x
                                                       (time runs downward)
[Figure: conflict in accessing shared variable x — both processes read x, add 1, and write the result back, so one increment is lost]
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This mechanism is called mutual exclusion.
lock
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
A lock is a variable holding the value 0 or 1:
    lock = 1: a process has entered the critical section
    lock = 0: no process is in the critical section
The lock operates much like a door lock.
Suppose a process reaches a lock that is set, indicating that the process is excluded from the critical section. It then has to wait until it is allowed to enter.
    while (lock == 1) do_nothing;    /* no operation in while loop */
    lock = 1;                        /* enter critical section     */
    ... critical section ...
    lock = 0;                        /* leave critical section     */

A lock used this way is called a spin lock.
This mechanism is called busy waiting.
In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead, but this has costs:
    Overhead in saving and restoring process information.
    It is necessary to choose the best or highest-priority process to enter the critical section.
    Process 1                         Process 2
    while (lock == 1) do_nothing;
    lock = 1;
      [critical section]             while (lock == 1) do_nothing;
    lock = 0;
                                     lock = 1;
                                       [critical section]
                                     lock = 0;
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).
They resolve synchronization problems between threads.
A mutex is used to share resources among threads in an orderly way.
It provides mutual exclusion between threads.
Note:
A mutex can only synchronize threads within the same process; it cannot synchronize threads belonging to different processes.
#include <pthread.h>    /* the header containing the mutex functions */
Declare a variable:
    pthread_mutex_t mutex;
Static initialization:
    mutex = PTHREAD_MUTEX_INITIALIZER;
    mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
    int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
    int pthread_mutex_lock(pthread_mutex_t *mutex);
    int pthread_mutex_unlock(pthread_mutex_t *mutex);
    int pthread_mutex_trylock(pthread_mutex_t *mutex);
    int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex1:  forall (i = 0; i < 5; i++)  a[i] = 0;
    All instances can be executed simultaneously.
Ex2:  forall (i = 2; i < 6; i++)  { x = i - 2*i + i*i;  a[i] = a[x]; }
    In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's conditions
 - I(n) is the set of memory locations read by process P(n)
 - O(m) is the set of memory locations altered by process P(m)
Two processes P1 and P2 can be executed concurrently if the three conditions
    I1 ∩ O2 = ∅
    I2 ∩ O1 = ∅
    O1 ∩ O2 = ∅
are all satisfied.
Dependency analysis
Example 1: suppose the two statements are (in C):
    a = x + y;
    b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, so
    I1 ∩ O2 = ∅,  I2 ∩ O1 = ∅,  O1 ∩ O2 = ∅
and the two statements can be executed simultaneously.
Dependency analysis
Example 2: suppose the two statements are (in C):
    a = x + y;
    b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, so
    I2 ∩ O1 = {a} ≠ ∅
and the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared as
    shared int x;
(in C/C++, a global declaration such as int x; plays this role).
The par construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
    par { s1; s2; ...; sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
    par { proc1(); proc2(); ...; procn(); }
The forall construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
    forall (i = 0; i < n; i++) { s1; s2; ...; sm; }
which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, ..., sm. Each process uses a different value of i. For example,
    forall (i = 0; i < 5; i++)  a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Cache coherence protocols:
In the update policy, copies of data in all caches are updated at the time one copy is altered.
In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated. These copies are updated only when the associated processor next references them.
Shared Data in Systems with Caches
False sharing:
The key characteristic involved is that caches are organized in blocks of contiguous locations.
False sharing occurs when different processors need different parts of the same block, but not the same bytes.
Shared Data in Systems with Caches
Solution for false sharing:
The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.
For a shared array, the only way to avoid false sharing would be to place each element in a different block, which would create significant wastage of storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central memory or a distributed memory architecture.
Distributed memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote-memory access) architectures.
Central memory systems are also known as UMA (uniform memory access) systems
Central memory versus distributed memory
In a UMA system, all memory locations are an equal distance from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types:
    PVP: the parallel vector processor, also called a vector supercomputer
    SMP: the symmetric multiprocessor
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA and COMA.
Distributed-Memory architecture
In a NORMA machine:
    The node memories have separate address spaces.
    A node can't directly access remote memory.
    The only way to access remote data is by passing messages.
Distributed-Memory architecture
In an NCC-NUMA machine:
    A typical example is the Cray T3E.
    Besides the local memory, each node has a set of node-level registers called E-registers.
    Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-Memory architecture
In a COMA machine:
    All local memories are structured as caches (called COMA caches).
    Such a cache has much larger capacity than the level-2 cache or the remote cache of a node.
    COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA and COMA
An NCC-NUMA system doesn't have hardware support for cache coherence, whereas CC-NUMA and COMA provide cache coherence support in hardware.
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
CC-NUMA versus COMA
CC-NUMA: main memory consists of all the local memories.
COMA: main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of the five distributed-memory architectures
Monitor
Disadvantages of semaphores:
• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (such as deadlock), since these errors happen only when particular execution sequences take place, and these sequences do not always occur.
• Incorrect use may be caused by an honest programming error or by an uncooperative programmer.
Monitor
Example of incorrect semaphore use:

Right code:
    ...
    wait(mutex);
    /* critical section */
    signal(mutex);
    ...

Wrong code:
    ...
    signal(mutex);
    /* critical section */
    wait(mutex);
    ...

This incorrect code violates the mutual exclusion requirement.
Monitor
Right code:
    ...
    wait(mutex);
    /* critical section */
    signal(mutex);
    ...

Wrong code:
    ...
    wait(mutex);
    /* critical section */
    wait(mutex);
    ...

This incorrect code causes deadlock.
Monitor
• If a programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock will occur.
• If both processes are simultaneously active and acquire the semaphores in opposite orders, a deadlock can occur:

Process P1:
    ...
    wait(S);
    wait(Q);
    /* critical section */
    signal(S);
    signal(Q);

Process P2:
    ...
    wait(Q);
    wait(S);
    /* critical section */
    signal(Q);
    signal(S);
Monitor
• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
• A monitor is an approach to synchronizing operations on a shared resource. A monitor includes:
    a suite of procedures that provides the only method of access to the shared resource;
    mutual exclusion among those procedures;
    the variables associated with the shared resource;
    invariants that are assumed to hold, so as to avoid conflicts.
Monitor
• A type, or abstract data type, encapsulates private data with public methods that operate on that data.
• The monitor type presents a set of programmer-defined operations that are provided with mutual exclusion within the monitor.
• The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()
procedure P2()
procedure Pn()
initialization_code ()
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Usage Monitor
bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly
bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization
bull Some addiontional synchronization ldquotailor moderdquo use conditional construct
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor conditional type
Condition variables
Condition variablesAllow threads to synchronize based
upon the actual value of data Without condition variables the
programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity
Always used in conjunction with a mutex lock
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified
condition variable cond (if any threads are blocked on cond)
int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)
initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised
Pthread_cond_t cond Declare condition variable
Sharing Data
Pthread Condition Variables(cont)
int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable
cond
int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in
effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined
int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)
int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)
block on a condition variable the second one alow to appoint timeout
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK
main
Thread 23
void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)
void watch_count(void t) pthread_mutex_lock(ampcount_mutex)
if (countltCOUNT_LIMIT)
pthread_cond_wait (hellip)
count += 125
pthread_mutex_unlock(hellip) pthread_exit(NULL)
Thread 1
Sharing Data
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtualmemory management system
Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space
Creating Shared Data (tt)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach shared segment to the data segmentof the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau
wwwthemegallerycom Company Logo
Monitor
Right codehellipWait(mutex)critical sectionSignal(mutex)hellip
Example wrong Semaphore
Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur
bull Both processes are simultaneously active will cause a deadlock
Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type
bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only
method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events
conflict
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A type or abstract data type encapsulates private data with public method to operate on that data
bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor
bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()
procedure P2()
procedure Pn()
initialization_code ()
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Usage Monitor
bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly
bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization
bull Some addiontional synchronization ldquotailor moderdquo use conditional construct
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Conditional type
• Declaration: condition x, y;
• The only operations that can be invoked on a condition variable are wait() and signal():
x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
x.signal(): the process invoking this operation resumes exactly one suspended process.
• If no process is suspended, x.signal() has no effect.
Monitor structure with the condition type
Condition variables
Condition variables allow threads to synchronize based upon the actual value of data. Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be a very time-consuming and unproductive exercise, since the thread would be continuously busy in this activity.
A condition variable is always used in conjunction with a mutex lock.
Sharing Data
Pthread Condition Variables
pthread_cond_t cond; Declare a condition variable.
int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
Initialises the condition variable referenced by cond with attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.
int pthread_cond_signal(pthread_cond_t *cond);
Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
Sharing Data
Pthread Condition Variables (cont.)
int pthread_cond_broadcast(pthread_cond_t *cond);
Unblocks all threads currently blocked on the specified condition variable cond.
int pthread_cond_destroy(pthread_cond_t *cond);
Destroys the given condition variable; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
Both block on a condition variable; the second one allows a timeout to be specified.
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines. The main routine creates three threads. Two of the threads perform work and update a count variable; the third thread waits until the count variable reaches a specified value.
Sharing Data
Sequence for using condition variable - example (cont)
main:

#include <pthread.h>

int count = 0;                                  /* global shared variable: DECLARE */
pthread_mutex_t count_mutex;
pthread_cond_t count_threshold_cv;
...
pthread_mutex_init(&count_mutex, NULL);         /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
...
pthread_create(...);                            /* CREATE THREADS TO DO WORK */

Threads 2 and 3:

void *inc_count(void *t) {
    ...
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(&count_threshold_cv);
        ...
        pthread_mutex_unlock(&count_mutex);
        sleep(1);   /* do some work so threads can alternate on the mutex lock */
    }
    pthread_exit(NULL);
}

Thread 1:

void *watch_count(void *t) {
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)                 /* re-test the condition after waking */
        pthread_cond_wait(&count_threshold_cv, &count_mutex);
    count += 125;
    pthread_mutex_unlock(&count_mutex);
    pthread_exit(NULL);
}
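The fragments above can be assembled into one compilable sketch. The constants and thread counts here are assumptions (TCOUNT = 10, COUNT_LIMIT = 12, two incrementing threads, no sleep), but with them the final count is deterministic: the watcher adds 125 exactly once after the threshold is crossed, so the result is 2*TCOUNT + 125 = 145 regardless of interleaving.

```c
#include <pthread.h>

#define TCOUNT 10       /* assumed: iterations per incrementing thread */
#define COUNT_LIMIT 12  /* assumed: threshold the watching thread waits for */

static int count = 0;
static pthread_mutex_t count_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t count_threshold_cv = PTHREAD_COND_INITIALIZER;

static void *inc_count(void *arg) {
    (void)arg;
    for (int i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)              /* threshold reached: wake the watcher */
            pthread_cond_signal(&count_threshold_cv);
        pthread_mutex_unlock(&count_mutex);
    }
    return NULL;
}

static void *watch_count(void *arg) {
    (void)arg;
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)                /* re-check after every wakeup */
        pthread_cond_wait(&count_threshold_cv, &count_mutex);
    count += 125;                              /* watcher's update once the threshold is met */
    pthread_mutex_unlock(&count_mutex);
    return NULL;
}

int run_count_demo(void) {                     /* returns the final value of count */
    pthread_t watcher, w1, w2;
    pthread_create(&watcher, NULL, watch_count, NULL);
    pthread_create(&w1, NULL, inc_count, NULL);
    pthread_create(&w2, NULL, inc_count, NULL);
    pthread_join(watcher, NULL);
    pthread_join(w1, NULL);
    pthread_join(w2, NULL);
    return count;                              /* 2*TCOUNT + 125 = 145 */
}
```

Note the while loop around pthread_cond_wait(): the condition must be re-tested after waking, both because the wakeup may arrive before the watcher ever waits and because spurious wakeups are permitted.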
Sharing Data
The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual memory management system.
Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
shmget()
Creates a shared memory segment.
The return value is the shared memory ID.
shmat()
Attaches the shared segment to the address space of the calling process.
Returns the starting address of the attached segment.
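A minimal sketch of this shmget()/shmat() sequence (System V shared memory on a POSIX system; the function name and message are illustrative): the parent creates and attaches a segment, a forked child writes into it, and the parent then reads the child's data directly, with no message passing.

```c
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 0 on success, -1 on any failure. */
int shm_demo(void) {
    /* shmget(): create a shared memory segment; returns the shared memory ID */
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (shmid < 0) return -1;

    /* shmat(): attach the segment; returns the starting address of the segment */
    char *addr = (char *) shmat(shmid, NULL, 0);
    if (addr == (char *) -1) return -1;

    if (fork() == 0) {                     /* the child inherits the attachment */
        strcpy(addr, "hello from child");  /* write into the shared segment */
        _exit(0);
    }
    wait(NULL);                            /* wait for the child to finish */

    int ok = (strcmp(addr, "hello from child") == 0);
    shmdt(addr);                           /* detach the segment */
    shmctl(shmid, IPC_RMID, NULL);         /* mark the segment for removal */
    return ok ? 0 : -1;
}
```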
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
Problem: CONFLICT.
o Reading the variable by different processes does not cause a conflict.
o But writing a new value does.
Ex: Consider two processes, each of which is to add 1 to a shared data item x.
Instruction    Process 1        Process 2
x = x + 1      read x           read x
               compute x + 1    compute x + 1
               write to x       write to x
(time runs downward)
Conflict in accessing shared data
[Figure: both processes read the shared variable x, compute x + 1, and write back; one of the two increments to x is lost.]
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This mechanism is called mutual exclusion.
Lock
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
A lock is a variable holding the value 0 or 1:
lock = 1: a process has entered the critical section.
lock = 0: no process is in the critical section.
The lock operates much like a door lock.
Suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section.
It now has to wait until it is allowed to enter the critical section.
while (lock == 1) ;   /* do nothing: no operation in the while loop */
lock = 1;             /* enter critical section */
  ... critical section ...
lock = 0;             /* leave critical section */
Such a lock is called a spin lock.
This mechanism is known as busy waiting.
In some cases it may be possible to deschedule the process from the processor and schedule another process instead. This incurs overhead in saving and restoring process information, and it is necessary to choose the best or highest-priority process to enter the critical section.
Process 1                     Process 2
while (lock == 1) ;
lock = 1;                     while (lock == 1) ;   /* spins */
  critical section
lock = 0;                     lock = 1;
                                critical section
                              lock = 0;
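One caveat worth making explicit: the read of lock and the later write lock = 1 in the code above are two separate operations, so two processes can both see lock == 0 and both enter the critical section. Practical spin locks use an atomic test-and-set. A minimal sketch using C11's atomic_flag (function and variable names are illustrative, not from the original slides):

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;
static int shared_counter = 0;

/* atomic_flag_test_and_set() returns the previous value, so the loop
   spins until this caller is the one that changed the flag from clear to set. */
static void spin_lock(void)   { while (atomic_flag_test_and_set(&lock_flag)) ; }
static void spin_unlock(void) { atomic_flag_clear(&lock_flag); }

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_lock();
        shared_counter++;        /* critical section */
        spin_unlock();
    }
    return NULL;
}

int run_spinlock_demo(void) {    /* returns the final counter value */
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    return shared_counter;       /* 4 * 100000 with a correct lock */
}
```

With the naive non-atomic version, the final counter would usually come out below 400000 because concurrent increments are lost.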
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).
They resolve synchronization problems between threads.
A mutex is used to share resources among threads in a controlled order.
It provides mutual exclusion between threads.
Note:
A mutex as used here only synchronizes threads within the same process; it cannot synchronize threads belonging to different processes.
#include <pthread.h>   /* header containing the mutex functions */
Declare a variable: pthread_mutex_t mutex;
Static initialization:
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
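A minimal usage sketch of these routines (the counter, iteration counts, and function names are illustrative): two threads increment a shared counter under a statically initialized mutex, so no increments are lost.

```c
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;  /* static initialization */
static long total = 0;

static void *add(void *arg) {
    (void)arg;
    for (int i = 0; i < 50000; i++) {
        pthread_mutex_lock(&m);      /* enter the critical section */
        total++;
        pthread_mutex_unlock(&m);    /* leave the critical section */
    }
    return NULL;
}

long run_mutex_demo(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, add, NULL);
    pthread_create(&t2, NULL, add, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return total;                    /* 2 * 50000 = 100000: no increments lost */
}
```

pthread_mutex_trylock() differs from pthread_mutex_lock() only in that it returns immediately (with EBUSY) instead of blocking when the mutex is already held.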
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex1: forall (i = 0; i < 5; i++) a[i] = 0;
All instances can be executed simultaneously.
Ex2: forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's conditions
- I(i) is the set of memory locations read by process P(i).
- O(j) is the set of memory locations altered by process P(j).
Two processes P1 and P2 can be executed concurrently if all three of the following conditions are satisfied:
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
Dependency analysis
Example 1: suppose the two statements are (in C):
a = x + y;
b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}.
I1 ∩ O2 = I2 ∩ O1 = O1 ∩ O2 = ∅, so the two statements can be executed simultaneously.
Example 2: suppose the two statements are (in C):
a = x + y;
b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}.
I2 ∩ O1 = {a} ≠ ∅, so the two statements cannot be executed simultaneously.
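The set manipulations in these examples can be mechanised. A small sketch, encoding each variable as one bit of a set (the encoding and the function name are illustrative):

```c
/* Each program variable is encoded as one bit of a set (illustrative). */
enum { X = 1, Y = 2, Z = 4, A = 8, B = 16 };
typedef unsigned VarSet;

/* Bernstein's conditions: the two statements may run concurrently iff
   I1 & O2, I2 & O1, and O1 & O2 are all empty. */
int bernstein_ok(VarSet I1, VarSet O1, VarSet I2, VarSet O2) {
    return (I1 & O2) == 0 && (I2 & O1) == 0 && (O1 & O2) == 0;
}
```

For Example 1, bernstein_ok(X|Y, A, X|Z, B) yields 1; for Example 2, bernstein_ok(X|Y, A, A|B, B) yields 0, because I2 & O1 = {a}.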
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared as:
shared int x;
(using a C/C++-style declaration)
par Construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
par { S1; S2; ...; Sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
par { proc1(); proc2(); ...; procn(); }
forall Construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
forall (i = 0; i < n; i++) { S1; S2; ...; Sm; }
which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, ..., Sm. Each process uses a different value of i. For example:
forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
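C itself has no forall construct, but it can be emulated by starting one thread per iteration and joining them all. A sketch with Pthreads (the function names are illustrative; in practice OpenMP's #pragma omp parallel for plays the same role):

```c
#include <pthread.h>

#define N 5
int a[N];

/* The loop body becomes a thread routine; arg carries that instance's i. */
static void *body(void *arg) {
    int i = (int)(long) arg;
    a[i] = 0;                 /* each instance touches a different element */
    return NULL;
}

/* Emulates: forall (i = 0; i < N; i++) a[i] = 0; */
void forall_clear(void) {
    pthread_t t[N];
    for (int i = 0; i < N; i++)
        pthread_create(&t[i], NULL, body, (void *)(long) i);
    for (int i = 0; i < N; i++)     /* implicit barrier: join all instances */
        pthread_join(t[i], NULL);
}
```

The join loop at the end corresponds to the implicit barrier at the end of a forall: execution continues only after every instance of the body has finished.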
Shared Data in Systems with Caches
Cache coherence protocols:
In the update policy, copies of data in all caches are updated at the time one copy is altered.
In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are only updated when the associated processor makes a reference to them.
False sharing:
The key characteristic used is that caches are organized in blocks of contiguous locations. False sharing occurs when different processors require different parts of the same block, but not the same bytes.
Solution for false sharing:
The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.
Within an array, the only way to avoid false sharing would be to place each element in a different block, which would create significant wastage of storage for a large array.
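The layout change, and the storage cost it implies, can be seen in the data declarations themselves. A sketch assuming a 64-byte cache block and an 8-byte long (both are platform assumptions):

```c
#include <stddef.h>

#define BLOCK 64   /* assumed cache-block (line) size in bytes */

/* Packed: the two counters share one block, so two processors updating
   them keep invalidating each other's copy (false sharing). */
struct packed { long a, b; };

/* Padded: each counter sits alone in its own block, as the compiler
   layout change described above would arrange. */
struct padded {
    long a;
    char pad_a[BLOCK - sizeof(long)];
    long b;
    char pad_b[BLOCK - sizeof(long)];
};
```

Under these assumptions the padded struct occupies 128 bytes instead of 16: b now starts exactly one block after a, which is precisely the storage wastage the slide warns about for large arrays.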
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central-memory or a distributed-memory architecture.
Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.
Central-memory systems are also known as UMA (uniform memory access) systems.
In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types:
PVP: the parallel vector processor, also called a vector supercomputer.
SMP: the symmetric multiprocessor.
Distributed-Memory architecture
A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory.
Memories in other nodes are called remote memories.
Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.
In a NORMA machine:
The node memories have separate address spaces.
A node cannot directly access remote memory.
The only way to access remote data is by passing messages.
In an NCC-NUMA machine:
A typical example is the Cray T3E.
Besides the local memory, each node has a set of node-level registers called E-registers.
Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
In a COMA machine:
All local memories are structured as caches (called COMA caches).
A COMA cache has much larger capacity than the level-2 cache or the remote cache of a node.
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA and COMA
An NCC-NUMA system does not have hardware support for cache coherency, while CC-NUMA and COMA provide cache coherency support in hardware.
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
In CC-NUMA, main memory consists of all the local memories.
In COMA, main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of five distributed-memory architectures
Monitor
Right code:
...
wait(mutex);
  critical section
signal(mutex);
...
Example of wrong semaphore use.
Wrong code:
...
wait(mutex);
  critical section
wait(mutex);
...
This wrong code causes a deadlock.
Monitor
• If the programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock will occur.
• With both processes simultaneously active, the following causes a deadlock:

Process P1:
...
Wait(S);
Wait(Q);
  critical section
Signal(S);
Signal(Q);
...

Process P2:
...
Wait(Q);
Wait(S);
  critical section
Signal(Q);
Signal(S);
...
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type
bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only
method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events
conflict
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A type or abstract data type encapsulates private data with public method to operate on that data
bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor
bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()
procedure P2()
procedure Pn()
initialization_code ()
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Usage Monitor
bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly
bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization
bull Some addiontional synchronization ldquotailor moderdquo use conditional construct
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor conditional type
Condition variables
Condition variablesAllow threads to synchronize based
upon the actual value of data Without condition variables the
programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity
Always used in conjunction with a mutex lock
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified
condition variable cond (if any threads are blocked on cond)
int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)
initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised
Pthread_cond_t cond Declare condition variable
Sharing Data
Pthread Condition Variables(cont)
int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable
cond
int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in
effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined
int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)
int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)
block on a condition variable the second one alow to appoint timeout
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK
main
Thread 23
void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)
void watch_count(void t) pthread_mutex_lock(ampcount_mutex)
if (countltCOUNT_LIMIT)
pthread_cond_wait (hellip)
count += 125
pthread_mutex_unlock(hellip) pthread_exit(NULL)
Thread 1
Sharing Data
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtualmemory management system
Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space
Creating Shared Data (tt)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach shared segment to the data segmentof the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur
bull Both processes are simultaneously active will cause a deadlock
Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type
bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only
method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events
conflict
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A type or abstract data type encapsulates private data with public method to operate on that data
bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor
bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()
procedure P2()
procedure Pn()
initialization_code ()
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Usage Monitor
bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly
bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization
bull Some addiontional synchronization ldquotailor moderdquo use conditional construct
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Condition type
• Declaration: condition x, y;
• The only operations that can be invoked on a condition variable are wait() and signal():
    x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
    x.signal(): the process invoking this operation resumes exactly one suspended process.
• If no process is suspended, x.signal() has no effect.
Structure of a monitor with condition variables
Condition variables
Condition variables allow threads to synchronize based upon the actual value of data. Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be a very time-consuming and unproductive exercise, since the thread would be continuously busy in this activity.
A condition variable is always used in conjunction with a mutex lock.
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t *cond);
    Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
    Initialises the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.
pthread_cond_t cond;
    Declares a condition variable.
Sharing Data
Pthread Condition Variables (cont.)
int pthread_cond_broadcast(pthread_cond_t *cond);
    Unblocks all threads currently blocked on the specified condition variable cond.
int pthread_cond_destroy(pthread_cond_t *cond);
    Destroys the given condition variable specified by cond; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
    Block on a condition variable; the second one allows specifying a timeout.
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont.)

main:
    #include <pthread.h>
    int count = 0;                                 /* global variable: DECLARE */
    pthread_mutex_t count_mutex;
    pthread_cond_t count_threshold_cv;
    ...
    pthread_mutex_init(&count_mutex, NULL);        /* INITIALIZE */
    pthread_cond_init(&count_threshold_cv, NULL);
    ...
    pthread_create(...);                           /* CREATE THREADS TO DO WORK */

Threads 2, 3:
    void *inc_count(void *t) {
        ...
        for (i = 0; i < TCOUNT; i++) {
            pthread_mutex_lock(&count_mutex);
            count++;
            if (count == COUNT_LIMIT)
                pthread_cond_signal(...);
            ...
            pthread_mutex_unlock(&count_mutex);
            /* do some work so threads can alternate on the mutex lock */
            sleep(1);
        }
        pthread_exit(NULL);
    }

Thread 1:
    void *watch_count(void *t) {
        pthread_mutex_lock(&count_mutex);
        if (count < COUNT_LIMIT)
            pthread_cond_wait(...);
        count += 125;
        pthread_mutex_unlock(...);
        pthread_exit(NULL);
    }
Sharing Data
The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages, as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual memory management system.
Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
shmget()
    Creates a shared memory segment.
    The return value is the shared memory ID.
shmat()
    Attaches the shared segment to the address space of the calling process.
    Returns the starting address of the shared segment.
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
Problem: CONFLICT
o Reading the variable by different processes does not cause conflict.
o But writing new values can cause conflict.
Ex: Consider two processes, each of which is to add 1 to a shared data item x.
Instruction     Process 1         Process 2
x = x + 1       read x            read x
                compute x + 1     compute x + 1
                write to x        write to x
(time increases downward)
[Figure: conflict in accessing shared variable x — both processes read x, compute x + 1, and write back, so one of the two updates is lost.]
The problem of accessing shared data can be generalized by considering shared resources.
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.
This mechanism is called mutual exclusion.
Lock
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
A lock is a variable containing the value 0 or 1:
    lock = 1: a process has entered the critical section
    lock = 0: no process is in the critical section
The lock operates much like a door lock: suppose a process reaches a lock that is set, indicating that the process is excluded from the critical section. It then has to wait until it is allowed to enter the critical section.
while (lock == 1) ;    /* do nothing: no operation in while loop */
lock = 1;              /* enter critical section */
... critical section ...
lock = 0;              /* leave critical section */

A lock used this way is called a spin lock.
Mechanism: Busy waiting
In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead. This involves:
- overhead in saving and restoring process information;
- the need to choose the best or highest-priority process to enter the critical section.
Process 1                              Process 2
while (lock == 1) do_nothing;
lock = 1;
/* critical section */                 while (lock == 1) do_nothing;  /* spins */
lock = 0;
                                       lock = 1;
                                       /* critical section */
                                       lock = 0;
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes). They resolve synchronization problems between threads:
- A mutex is used to share resources among threads in turn.
- It provides mutual exclusion between threads.
Note:
A mutex is only used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.
#include <pthread.h>    /* header containing the mutex functions */
Declaration: pthread_mutex_t mutex;
Static initialization:
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
    int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
    int pthread_mutex_lock(pthread_mutex_t *mutex);
    int pthread_mutex_unlock(pthread_mutex_t *mutex);
    int pthread_mutex_trylock(pthread_mutex_t *mutex);
    int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex1: forall (i = 0; i < 5; i++)
         a[i] = 0;
All instances can be executed simultaneously.
Ex2: forall (i = 2; i < 6; i++) {
         x = i - 2*i + i*i;
         a[i] = a[x];
     }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's conditions
- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered by process P(m).
If the following three conditions are all satisfied, the two processes can be executed concurrently:
    I1 ∩ O2 = ∅
    I2 ∩ O1 = ∅
    O1 ∩ O2 = ∅
Dependency analysis
Example 1: Suppose the two statements are (in C):
    a = x + y;
    b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, so
    I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅
and the two statements can be executed simultaneously.
Dependency analysis
Example 2: Suppose the two statements are (in C):
    a = x + y;
    b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, so
    I2 ∩ O1 = {a} ≠ ∅
and the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, a variable might be declared as:
    shared int x;
With C/C++, a global variable is shared between threads:
    int x;    /* global */
par Construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
    par {
        S1;
        S2;
        ...
        Sn;
    }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
    par {
        proc1();
        proc2();
        ...
        procn();
    }
forall Construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
    forall (i = 0; i < n; i++) {
        S1;
        S2;
        ...
        Sm;
    }
which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, ..., Sm. Each process uses a different value of i.
Example:
    forall (i = 0; i < 5; i++)
        a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Shared Data in Systems with Caches
Cache coherence protocols:
- In the update policy, copies of data in all caches are updated at the time one copy is altered.
- In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated. These copies are only updated when the associated processor makes a reference to them.
Shared Data in Systems with Caches
False sharing: the key characteristic here is that caches are organized in blocks of contiguous locations. False sharing arises when different processors require different parts of the same block, but not the same bytes.
Shared Data in Systems with Caches
Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.
The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central-memory or a distributed-memory architecture.
Distributed-memory systems include NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.
Central-memory systems are also known as UMA (uniform memory access) systems.
Central memory versus distributed memory
In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types:
- PVP: the parallel vector processor, also called a vector supercomputer.
- SMP: the symmetric multiprocessor.
Distributed-Memory architecture
A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory. Memories in other nodes are called remote memories.
Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.
Distributed-Memory architecture
In a NORMA machine:
- The node memories have separate address spaces.
- A node can't directly access remote memory.
- The only way to access remote data is by passing messages.
Distributed-Memory architecture
In an NCC-NUMA machine:
- A typical example is the Cray T3E.
- Besides the local memory, each node has a set of node-level registers called E-registers.
- Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-Memory architecture
In a COMA machine:
- All local memories are structured as caches (called COMA caches).
- A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node.
- COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA and COMA
An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache-coherency support in hardware.
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
CC-NUMA: main memory consists of all the local memories.
COMA: main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of five distributed-memory architectures
LOGO
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type
bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only
method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events
conflict
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A type or abstract data type encapsulates private data with public method to operate on that data
bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor
bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()
procedure P2()
procedure Pn()
initialization_code ()
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Usage Monitor
bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly
bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization
bull Some addiontional synchronization ldquotailor moderdquo use conditional construct
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor conditional type
Condition variables
Condition variablesAllow threads to synchronize based
upon the actual value of data Without condition variables the
programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity
Always used in conjunction with a mutex lock
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified
condition variable cond (if any threads are blocked on cond)
int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)
initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised
Pthread_cond_t cond Declare condition variable
Sharing Data
Pthread Condition Variables(cont)
int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable
cond
int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in
effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined
int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)
int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)
block on a condition variable the second one alow to appoint timeout
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK
main
Thread 23
void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)
void watch_count(void t) pthread_mutex_lock(ampcount_mutex)
if (countltCOUNT_LIMIT)
pthread_cond_wait (hellip)
count += 125
pthread_mutex_unlock(hellip) pthread_exit(NULL)
Thread 1
Sharing Data
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtualmemory management system
Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space
Creating Shared Data (tt)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach shared segment to the data segmentof the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A type or abstract data type encapsulates private data with public method to operate on that data
bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor
bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Monitor
bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()
procedure P2()
procedure Pn()
initialization_code ()
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Usage Monitor
bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly
bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization
bull Some addiontional synchronization ldquotailor moderdquo use conditional construct
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor conditional type
Condition variables
Condition variablesAllow threads to synchronize based
upon the actual value of data Without condition variables the
programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity
Always used in conjunction with a mutex lock
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified
condition variable cond (if any threads are blocked on cond)
int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)
initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised
Pthread_cond_t cond Declare condition variable
Pthread Condition Variables (cont.)

int pthread_cond_broadcast(pthread_cond_t *cond);
    unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
    destroys the condition variable specified by cond; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
    block on a condition variable; the second form allows a timeout to be specified.
Sequence for using condition variables - example
This simple example demonstrates the use of several Pthread condition variable routines. The main routine creates three threads. Two of the threads perform work and update a count variable; the third thread waits until the count variable reaches a specified value.
Sequence for using condition variables - example (cont.)

main:
    #include <pthread.h>
    int count = 0;                                /* global variable */
    pthread_mutex_t count_mutex;                  /* DECLARE */
    pthread_cond_t count_threshold_cv;
    …
    pthread_mutex_init(&count_mutex, NULL);       /* INITIALIZE */
    pthread_cond_init(&count_threshold_cv, NULL);
    …
    pthread_create(…);                            /* CREATE THREADS to do the work */

Threads 2 and 3:
    void *inc_count(void *t) {
        …
        for (i = 0; i < TCOUNT; i++) {
            pthread_mutex_lock(&count_mutex);
            count++;
            if (count == COUNT_LIMIT)
                pthread_cond_signal(…);
            …
            pthread_mutex_unlock(&count_mutex);
            /* do some work so threads can alternate on the mutex lock */
            sleep(1);
        }
        pthread_exit(NULL);
    }

Thread 1:
    void *watch_count(void *t) {
        pthread_mutex_lock(&count_mutex);
        if (count < COUNT_LIMIT)
            pthread_cond_wait(…);
        count += 125;
        pthread_mutex_unlock(…);
        pthread_exit(NULL);
    }
The key aspect of shared-memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages, as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual memory management system.
Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.
Creating Shared Data (cont.)
shmget():
    creates a shared memory segment; the return value is the shared memory ID.
shmat():
    attaches the shared segment to the data segment of the calling process; returns the starting address of the segment.
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
Problem: CONFLICT
• Reading the variable by different processes does not cause a conflict.
• But writing new values can cause a conflict.
Example: consider two processes, each of which is to add 1 to a shared data item, x.
Instruction     Process 1          Process 2
x = x + 1       read x             read x
                compute x + 1      compute x + 1
                write to x         write to x
(time runs downward)
[Figure: conflict in accessing shared data - both processes read the shared variable x, add 1, and write the result back, so one of the two increments is lost.]
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This mechanism is called mutual exclusion.
Locks
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock:
• A lock is a variable holding the value 0 or 1.
• lock = 1: a process has entered the critical section.
• lock = 0: no process is in the critical section.
The lock operates much like a door lock. Suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section. It now has to wait until it is allowed to enter the critical section:
while (lock == 1) do_nothing;   /* no operation in while loop */
lock = 1;                       /* enter critical section     */
…critical section…
lock = 0;                       /* leave critical section     */
Such a lock is called a spin lock, and the mechanism is called busy waiting.
In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead. This brings:
• overhead in saving and restoring process information;
• the need to choose the best or highest-priority process to enter the critical section.
Process 1                            Process 2
while (lock == 1) do_nothing;
lock = 1;
[critical section]                   while (lock == 1) do_nothing;   /* spins */
lock = 0;
                                     lock = 1;
                                     [critical section]
                                     lock = 0;
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual-exclusion lock variables (mutexes), which resolve thread-synchronization problems:
• A mutex is used to share resources among threads in an orderly fashion.
• It provides mutual exclusion between threads.
Note: a mutex only synchronizes threads within the same process; by default it cannot synchronize threads belonging to different processes.
#include <pthread.h>   /* header containing the mutex functions */
Declare the variable: pthread_mutex_t mutex;
Static initialization:
    mutex = PTHREAD_MUTEX_INITIALIZER;
    mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
    int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
    int pthread_mutex_lock(pthread_mutex_t *mutex);
    int pthread_mutex_unlock(pthread_mutex_t *mutex);
    int pthread_mutex_trylock(pthread_mutex_t *mutex);
    int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex1: forall (i = 0; i < 5; i++) a[i] = 0;
All instances can be executed simultaneously.

Ex2: forall (i = 2; i < 6; i++) {
         x = i - 2*i + i*i;
         a[i] = a[x];
     }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's conditions
- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered (written) by process P(m).
Two processes P1 and P2 can be executed concurrently if the three conditions
    I1 ∩ O2 = ∅,  I2 ∩ O1 = ∅,  O1 ∩ O2 = ∅
are all satisfied.
Example 1: Suppose the two statements are (in C):
    a = x + y;
    b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, so
    I1 ∩ O2 = ∅,  I2 ∩ O1 = ∅,  O1 ∩ O2 = ∅
and the two statements can be executed simultaneously.
Example 2: Suppose the two statements are (in C):
    a = x + y;
    b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}. Here I2 ∩ O1 = {a} ≠ ∅, so the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared as
    shared int x;
in contrast to an ordinary global declaration in C/C++, int x;
par construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
    par { s1; s2; … sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
    par { proc1(); proc2(); … procn(); }
forall construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
    forall (i = 0; i < n; i++) { s1; s2; … sm; }
which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, …, sm. Each process uses a different value of i. For example,
    forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Cache coherence protocols:
• In the update policy, copies of data in all caches are updated at the time one copy is altered.
• In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next references them.
False sharing: the key characteristic here is that caches are organized in blocks of contiguous locations (cache lines). False sharing occurs when different processors require different parts of the same block, but not the same bytes.
Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks. The only way to avoid false sharing entirely would be to place each element in a different block, which would waste significant storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central-memory or a distributed-memory architecture.
Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote-memory access) architectures.
Central-memory systems are also known as UMA (uniform memory access) systems.
In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types: the PVP (parallel vector processor, also called a vector supercomputer) and the SMP (symmetric multiprocessor).
Distributed-Memory architecture
A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory. Memories in other nodes are called remote memories.
Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.
In a NORMA machine:
• The node memories have separate address spaces.
• A node can't directly access remote memory.
• The only way to access remote data is by passing messages.
In an NCC-NUMA machine (a typical example is the Cray T3E):
• Besides the local memory, each node has a set of node-level registers called E-registers.
• Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
In a COMA machine:
• All local memories are structured as caches (called COMA caches).
• Such a cache has much larger capacity than the level-2 cache or the remote cache of a node.
• COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA and COMA
An NCC-NUMA system doesn't have hardware support for cache coherence, whereas CC-NUMA and COMA provide cache-coherence support in hardware. It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
• In CC-NUMA, main memory consists of all the local memories.
• In COMA, main memory consists of all the COMA caches.
• All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of the five distributed-memory architectures [table not reproduced]
Monitor
• A structure of monitor type:

monitor monitor_name {
    /* shared variable declarations */

    procedure P1 (…) { … }
    procedure P2 (…) { … }
        …
    procedure Pn (…) { … }

    initialization_code (…) { … }
}
Structure of a monitor [figure not reproduced]
Using monitors
• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
• A monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; additional "tailor-made" synchronization mechanisms need to be defined.
• Such tailor-made synchronization uses the condition construct.
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor conditional type
Condition variables
Condition variablesAllow threads to synchronize based
upon the actual value of data Without condition variables the
programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity
Always used in conjunction with a mutex lock
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified
condition variable cond (if any threads are blocked on cond)
int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)
initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised
Pthread_cond_t cond Declare condition variable
Sharing Data
Pthread Condition Variables(cont)
int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable
cond
int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in
effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined
int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)
int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)
block on a condition variable the second one alow to appoint timeout
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK
main
Thread 23
void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)
void watch_count(void t) pthread_mutex_lock(ampcount_mutex)
if (countltCOUNT_LIMIT)
pthread_cond_wait (hellip)
count += 125
pthread_mutex_unlock(hellip) pthread_exit(NULL)
Thread 1
Sharing Data
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtualmemory management system
Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space
Creating Shared Data (tt)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach shared segment to the data segmentof the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Usage Monitor
bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly
bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization
bull Some addiontional synchronization ldquotailor moderdquo use conditional construct
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor conditional type
Condition variables
Condition variablesAllow threads to synchronize based
upon the actual value of data Without condition variables the
programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity
Always used in conjunction with a mutex lock
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified
condition variable cond (if any threads are blocked on cond)
int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)
initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised
Pthread_cond_t cond Declare condition variable
Sharing Data
Pthread Condition Variables(cont)
int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable
cond
int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in
effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined
int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)
int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)
block on a condition variable the second one alow to appoint timeout
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK
main
Thread 23
void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)
void watch_count(void t) pthread_mutex_lock(ampcount_mutex)
if (countltCOUNT_LIMIT)
pthread_cond_wait (hellip)
count += 125
pthread_mutex_unlock(hellip) pthread_exit(NULL)
Thread 1
Sharing Data
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtualmemory management system
Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space
Creating Shared Data (tt)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach shared segment to the data segmentof the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
Such a lock is called a spin lock, and the mechanism is called busy waiting.
In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead, but this has costs:
- overhead in saving and restoring process information;
- it is necessary to choose the best or highest-priority process to enter the critical section.
Process 1                              Process 2
while (lock == 1) do_nothing;
lock = 1;
    (critical section)                 while (lock == 1) do_nothing;   /* spins */
lock = 0;
                                       lock = 1;
                                           (critical section)
                                       lock = 0;
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).
They resolve synchronization problems between threads: a mutex shares a resource among threads in turn, providing mutual exclusion between them.
Note: a mutex can only synchronize threads within the same process; it cannot synchronize threads belonging to different processes.
#include <pthread.h>   /* header containing the mutex functions */
Declare the variable: pthread_mutex_t mutex;
Static initialization:
mutex = PTHREAD_MUTEX_INITIALIZER;
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
The important functions:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency Analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Examples
Ex1: forall (i = 0; i < 5; i++) a[i] = 0;
All instances can be executed simultaneously.
Ex2: forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's Conditions
- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered by process P(m).
Two processes P1 and P2 can be executed concurrently if all three conditions are satisfied:
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
Dependency analysis
Example 1: suppose the two statements are (in C):
a = x + y;
b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, and
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅,
so the two statements can be executed simultaneously.
Example 2: suppose the two statements are (in C):
a = x + y;
b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, and
I2 ∩ O1 = {a} ≠ ∅,
so the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared as:
shared int x;
(compare an ordinary global declaration int x; in C/C++)
par Construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
par { s1; s2; ...; sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
par { proc1(); proc2(); ...; procn(); }
forall Construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
forall (i = 0; i < n; i++) { s1; s2; ...; sm; }
which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, ..., sm. Each process uses a different value of i. For example,
forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Cache coherence protocols:
In the update policy, copies of data in all caches are updated at the time one copy is altered.
In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are only updated when the associated processor makes reference to the data.
False sharing:
The key characteristic used is that caches are organized in blocks of contiguous locations. False sharing occurs when different parts of a block are required by different processors, but not the same bytes.
Solution for false sharing:
The compiler can alter the layout of the data stored in main memory, separating data altered by only one processor into different blocks.
The only way to avoid false sharing completely would be to place each element in a different block, which would create significant wastage of storage for a large array.
Different Types of Memory Architecture
Central memory versus distributed memory
A parallel computer has either a central memory or a distributed memory architecture.
Distributed memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote memory access) architectures.
Central memory systems are also known as UMA (uniform memory access) systems.
In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types: the PVP (parallel vector processor, also called a vector supercomputer) and the SMP (symmetric multiprocessor).
Distributed-Memory Architecture
A distributed memory computer contains multiple nodes, each having one or more processors and a local memory. Memories in other nodes are called remote memories.
Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.
In a NORMA machine:
- The node memories have separate address spaces.
- A node cannot directly access remote memory.
- The only way to access remote data is by passing messages.
In an NCC-NUMA machine:
- A typical example is the Cray T3E. Besides the local memory, each node has a set of node-level registers called E-registers.
- Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
In a COMA machine:
- All local memories are structured as caches (called COMA caches).
- A COMA cache has much larger capacity than the level-2 cache or the remote cache of a node.
- COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA and COMA
An NCC-NUMA system does not have hardware support for cache coherence, but CC-NUMA and COMA provide cache coherence support in hardware.
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
In CC-NUMA, main memory consists of all the local memories; in COMA, main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a CC-NUMA machine.
(Table: characteristics of five distributed-memory architectures.)
Khoa Khoa Học & Kĩ thuật Máy tính - Đại học Bách Khoa TpHCM (Faculty of Computer Science & Engineering, HCMC University of Technology)
Usage: Monitor
- The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
- The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; some additional "tailor-made" synchronization mechanisms need to be defined.
- Such tailor-made synchronization uses the condition construct.
Condition type
- Declaration: condition x, y;
- The only operations that can be invoked on a condition variable are wait() and signal().
- x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
- x.signal(): resumes exactly one suspended process; if no process is suspended, x.signal() has no effect.
(Figure: structure of a monitor with condition variables.)
Condition variables
Condition variables allow threads to synchronize based upon the actual value of data. Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met; this can be very time-consuming and unproductive, since the thread would be continuously busy in this activity.
A condition variable is always used in conjunction with a mutex lock.
Sharing Data
Pthread Condition Variables
pthread_cond_t cond;   /* declare a condition variable */
int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
initializes the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialization, the state of the condition variable becomes initialized.
int pthread_cond_signal(pthread_cond_t *cond);
unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
Pthread Condition Variables (cont.)
int pthread_cond_broadcast(pthread_cond_t *cond);
unblocks all threads currently blocked on the specified condition variable cond.
int pthread_cond_destroy(pthread_cond_t *cond);
destroys the given condition variable; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
block on a condition variable; the second one allows a timeout to be specified.
Sequence for using a condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines. The main routine creates three threads. Two of the threads perform work and update a count variable; the third thread waits until the count variable reaches a specified value.
Sequence for using a condition variable - example (cont.)

/* main */
#include <pthread.h>
int count = 0;                                  /* global var: DECLARE */
pthread_mutex_t count_mutex;
pthread_cond_t count_threshold_cv;
...
pthread_mutex_init(&count_mutex, NULL);         /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
...
pthread_create(...);                            /* CREATE THREADS TO DO WORK */

/* Threads 2, 3 */
void *inc_count(void *t) {
    ...
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal( ... );
        ...
        pthread_mutex_unlock(&count_mutex);
        /* do some work so threads can alternate on the mutex lock */
        sleep(1);
    }
    pthread_exit(NULL);
}

/* Thread 1 */
void *watch_count(void *t) {
    pthread_mutex_lock(&count_mutex);
    if (count < COUNT_LIMIT)
        pthread_cond_wait( ... );
    count += 125;
    pthread_mutex_unlock( ... );
    pthread_exit(NULL);
}
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtualmemory management system
Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space
Creating Shared Data (tt)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach shared segment to the data segmentof the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Conditional type
bull Declare conditional xybull Only operations that can be invoke on a
conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until
another process invokesxsignal()the process invoking this operation resumes exatly one
suspended processbull if no process is suspended xsignal() has
no effect
Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM
Structure Monitor conditional type
Condition variables
Condition variablesAllow threads to synchronize based
upon the actual value of data Without condition variables the
programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity
Always used in conjunction with a mutex lock
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified
condition variable cond (if any threads are blocked on cond)
int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)
initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised
Pthread_cond_t cond Declare condition variable
Sharing Data
Pthread Condition Variables(cont)
int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable
cond
int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in
effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined
int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)
int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)
block on a condition variable the second one alow to appoint timeout
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK
main
Thread 23
void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)
void watch_count(void t) pthread_mutex_lock(ampcount_mutex)
if (countltCOUNT_LIMIT)
pthread_cond_wait (hellip)
count += 125
pthread_mutex_unlock(hellip) pthread_exit(NULL)
Thread 1
Sharing Data
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtualmemory management system
Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space
Creating Shared Data (tt)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach shared segment to the data segmentof the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
Solution for false sharing: The compiler can alter the layout of the data stored in main memory, separating data altered by only one processor into different blocks. The only way to avoid false sharing completely would be to place each element in a different block, which would waste a significant amount of storage for a large array.
Different Types of Memory Architecture
Central memory versus distributed memory: A parallel computer has either a central-memory or a distributed-memory architecture. Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote memory access) architectures. Central-memory systems are also known as UMA (uniform memory access) systems.
In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time. UMA systems come in two types: the PVP (parallel vector processor, also called a vector supercomputer) and the SMP (symmetric multiprocessor).
Distributed-Memory Architecture
A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory. Memories in other nodes are called remote memories. The types of distributed-memory architecture are NORMA, NCC-NUMA, CC-NUMA, and COMA.
In a NORMA machine, the node memories have separate address spaces. A node cannot directly access remote memory; the only way to access remote data is by passing messages.
In an NCC-NUMA machine (a typical example is the Cray T3E), each node has, besides its local memory, a set of node-level registers called E-registers. Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
In a COMA machine, all local memories are structured as caches (called COMA caches). A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node. COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA and COMA: An NCC-NUMA system does not have hardware support for cache coherency, while CC-NUMA and COMA provide cache-coherency support in hardware. It is therefore easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA: In CC-NUMA, main memory consists of all the local memories; in COMA, main memory consists of all the COMA caches. All this extra complexity makes a COMA system more expensive to implement than a CC-NUMA machine.
Characteristics of the five distributed-memory architectures (summary table in the original slides).
Faculty of Computer Science & Engineering - Ho Chi Minh City University of Technology
Monitor structure with condition variables
Condition variables allow threads to synchronize based upon the actual value of data. Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This is a very time-consuming and unproductive exercise, since the thread would be continuously busy in this activity. A condition variable is always used in conjunction with a mutex lock.
Sharing Data
Pthread Condition Variables
pthread_cond_t cond;  /* declare a condition variable */
int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
initializes the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition-variable attributes are used; the effect is the same as passing the address of a default condition-variable attributes object. Upon successful initialization, the state of the condition variable becomes initialized.
int pthread_cond_signal(pthread_cond_t *cond);
unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
Sharing Data
Pthread Condition Variables (cont.)
int pthread_cond_broadcast(pthread_cond_t *cond);
unblocks all threads currently blocked on the specified condition variable cond.
int pthread_cond_destroy(pthread_cond_t *cond);
destroys the given condition variable; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
block on a condition variable; the second one additionally allows a timeout to be specified.
Sharing Data
Sequence for using a condition variable - example
This simple example code demonstrates the use of several Pthread condition-variable routines. The main routine creates three threads. Two of the threads perform work and update a count variable. The third thread waits until the count variable reaches a specified value.
Sharing Data
Sequence for using a condition variable - example (cont.)

main:
    #include <pthread.h>
    int count = 0;                        /* global variable: DECLARE */
    pthread_mutex_t count_mutex;
    pthread_cond_t count_threshold_cv;
    ...
    pthread_mutex_init(&count_mutex, NULL);        /* INITIALIZE */
    pthread_cond_init(&count_threshold_cv, NULL);
    ...
    pthread_create(...);                  /* CREATE THREADS TO DO WORK */

Threads 2, 3:
    void *inc_count(void *t) {
        ...
        for (i = 0; i < TCOUNT; i++) {
            pthread_mutex_lock(&count_mutex);
            count++;
            if (count == COUNT_LIMIT)
                pthread_cond_signal(...);
            ...
            pthread_mutex_unlock(&count_mutex);
            sleep(1);   /* do some work so threads can alternate on the mutex */
        }
        pthread_exit(NULL);
    }

Thread 1:
    void *watch_count(void *t) {
        pthread_mutex_lock(&count_mutex);
        if (count < COUNT_LIMIT)
            pthread_cond_wait(...);
        count += 125;
        pthread_mutex_unlock(...);
        pthread_exit(NULL);
    }
Sharing Data
The key aspect of shared-memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual-memory management system. Shared-memory system calls allow processes to attach a segment of physical memory to their virtual address space.
Creating Shared Data (cont.)
shmget(): creates a shared memory segment; its return value is the shared memory ID.
shmat(): attaches the shared segment to the data segment of the calling process and returns the starting address of the segment.
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
Problem: CONFLICT
o Reading the variable by different processes does not cause a conflict.
o But writing new values can conflict.
Example: Consider two processes, each of which is to add 1 to a shared data item x. To execute x = x + 1, each process must read x, compute x + 1, and write the result back to x. If Process 1 and Process 2 both read x before either writes, both compute the same x + 1 and one increment is lost.
Conflict in accessing shared data (figure): both processes read the shared variable x, add 1, and write the result back; the two writes overlap, so x ends up incremented only once.
The problem of accessing shared data can be generalized by considering shared resources.
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish the sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time. This mechanism is called mutual exclusion.
Lock
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock: a variable containing the value 0 or 1.
lock = 1: a process has entered the critical section.
lock = 0: no process is in the critical section.
The lock operates much like a door lock. Suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section. It then has to wait until it is allowed to enter:
while (lock == 1);   /* do nothing: no operation in while loop */
lock = 1;            /* enter critical section */
  ... critical section ...
lock = 0;            /* leave critical section */
A lock that relies on such busy waiting is called a spin lock.
Mechanism: Busy waiting
In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead. This brings overhead in saving and restoring process information, and it is then necessary to choose the best or highest-priority process to enter the critical section.
Process 1:                           Process 2:
while (lock == 1) do_nothing;
lock = 1;
  /* critical section */             while (lock == 1) do_nothing;
lock = 0;
                                     lock = 1;
                                       /* critical section */
                                     lock = 0;
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual-exclusion lock variables (mutexes), which resolve synchronization problems among threads.
A mutex is used to share resources among threads in an orderly way; it provides mutual exclusion between the threads.
Note: a mutex is used only to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.
#include <pthread.h>   /* header containing the mutex functions */
Declare the variable: pthread_mutex_t mutex;
Static initialization:
mutex = PTHREAD_MUTEX_INITIALIZER;
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency Analysis
One of the key issues in all parallel programming is identifying which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Examples
Ex1: forall (i = 0; i < 5; i++) a[i] = 0;
All instances of the body can be executed simultaneously.
Ex2: forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's conditions
- I(n) is the set of memory locations read by process Pn.
- O(m) is the set of memory locations altered by process Pm.
For two processes P1 and P2, the conditions are:
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
If the three conditions are all satisfied, the two processes can be executed concurrently.
Dependency analysis
Example 1: Suppose the two statements are (in C):
a = x + y;
b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, so
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, and O1 ∩ O2 = ∅: the two statements can be executed simultaneously.
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
Condition variables
Condition variablesAllow threads to synchronize based
upon the actual value of data Without condition variables the
programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity
Always used in conjunction with a mutex lock
Sharing Data
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified
condition variable cond (if any threads are blocked on cond)
int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)
initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised
Pthread_cond_t cond Declare condition variable
Sharing Data
Pthread Condition Variables(cont)
int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable
cond
int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in
effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined
int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)
int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)
block on a condition variable the second one alow to appoint timeout
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK
main
Thread 23
void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)
void watch_count(void t) pthread_mutex_lock(ampcount_mutex)
if (countltCOUNT_LIMIT)
pthread_cond_wait (hellip)
count += 125
pthread_mutex_unlock(hellip) pthread_exit(NULL)
Thread 1
Sharing Data
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtualmemory management system
Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space
Creating Shared Data (tt)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach shared segment to the data segmentof the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
Pthread Condition Variables
int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified
condition variable cond (if any threads are blocked on cond)
int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)
initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised
Pthread_cond_t cond Declare condition variable
Sharing Data
Pthread Condition Variables(cont)
int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable
cond
int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in
effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined
int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)
int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)
block on a condition variable the second one alow to appoint timeout
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK
main
Thread 23
void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)
void watch_count(void t) pthread_mutex_lock(ampcount_mutex)
if (countltCOUNT_LIMIT)
pthread_cond_wait (hellip)
count += 125
pthread_mutex_unlock(hellip) pthread_exit(NULL)
Thread 1
Sharing Data
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtualmemory management system
Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space
Creating Shared Data (cont.)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach the shared segment to the data segment of the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
Problem: CONFLICT
o Reading the variable by different processes does not cause a conflict.
o But writing a new value does.
Ex: Consider two processes, each of which is to add 1 to a shared data item x.
Instruction: x = x + 1, executed by both processes (time runs downward):

             Process 1        Process 2
             read x           read x
             compute x + 1    compute x + 1
             write to x       write to x
[Figure: conflict in accessing shared data — both processes read the shared variable x, add 1, and write the result back, so one of the two increments is lost.]
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This mechanism is called mutual exclusion.
Lock
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
A lock is a variable containing the value 0 or 1:
lock = 1 — a process has entered the critical section;
lock = 0 — no process is in the critical section.
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) ;   /* do nothing: no operation in while loop */
lock = 1;             /* enter critical section */
...critical section...
lock = 0;             /* leave critical section */
Such a lock is called a spin lock.
Mechanism Busy waiting
In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead, but this brings:
- overhead in saving and restoring process information;
- the need to choose the best or highest-priority process to enter the critical section.
Process 1: while (lock == 1) do_nothing;  lock = 1;  ...critical section...;  lock = 0;
Process 2: spins in while (lock == 1) do_nothing; until Process 1 resets lock to 0, then sets lock = 1, executes its own critical section, and finally resets lock = 0.
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual-exclusion lock variables (mutexes).
They resolve synchronisation problems between threads.
A mutex is used to share resources among threads in an orderly way.
It provides mutual exclusion between threads.
Note
A mutex can only be used to synchronise threads within the same process; it cannot synchronise threads belonging to different processes.
#include <pthread.h>   /* the header containing the mutex functions */
Declare a variable: pthread_mutex_t mutex;
Static initialisation:
mutex = PTHREAD_MUTEX_INITIALIZER;
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialisation by function:
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex1: forall (i = 0; i < 5; i++) a[i] = 0;
All instances of the body can be executed simultaneously.
Ex2: forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's Conditions
- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered by process P(m).
Two processes P1 and P2 can be executed concurrently if all three of the following conditions are satisfied:
I1 ∩ O2 = ∅,   I2 ∩ O1 = ∅,   O1 ∩ O2 = ∅
Dependency analysis
Example 1: Suppose the two statements are (in C):
a = x + y;
b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}.
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅ — the two statements can be executed simultaneously.
Dependency analysis
Example 2: Suppose the two statements are (in C):
a = x + y;
b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}.
I2 ∩ O1 = {a} ≠ ∅ — the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared as:
shared int x;
With C/C++ and threads, a global variable (int x; at file scope) is shared by all threads of the process.
par Construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
par { S1; S2; ...; Sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
par { proc1(); proc2(); ...; procn(); }
forall Construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
forall (i = 0; i < n; i++) { S1; S2; ...; Sm; }
which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, ..., Sm. Each process uses a different value of i. For example,
forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Cache coherence protocols:
- In the update policy, copies of data in all caches are updated at the time one copy is altered.
- In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next references them.
Shared Data in Systems with Caches
False sharing: the key characteristic exploited is that caches are organised in blocks of contiguous locations. False sharing occurs when different processors require different parts of the same block, but not the same bytes.
Shared Data in Systems with Caches
Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered by only one processor into different blocks.
The only way to avoid false sharing entirely would be to place each element in a different block, which would waste significant storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central-memory or a distributed-memory architecture.
Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.
Central-memory systems are also known as UMA (uniform memory access) systems.
Central memory versus distributed memory
In a UMA system, all memory locations are an equal distance from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types: PVP, the parallel vector processor (also called a vector supercomputer), and SMP, the symmetric multiprocessor.
Distributed-Memory architecture
A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory.
Memories in other nodes are called remote memories.
Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA and COMA.
Distributed-Memory architecture
In a NORMA machine:
- The node memories have separate address spaces.
- A node can't directly access remote memory.
- The only way to access remote data is by passing messages.
Distributed-Memory architecture
In an NCC-NUMA machine (a typical example is the Cray T3E), besides the local memory, each node has a set of node-level registers called E-registers.
Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-Memory architecture
In a COMA machine, all local memories are structured as caches (called COMA caches).
A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node.
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache-coherency support in hardware.
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
CC-NUMA versus COMA
CC-NUMA: main memory consists of all the local memories.
COMA: main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of the Five Distributed-Memory Architectures
Pthread Condition Variables(cont)
int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable
cond
int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in
effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined
int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)
int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)
block on a condition variable the second one alow to appoint timeout
Sharing Data
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK
main
Thread 23
void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)
void watch_count(void t) pthread_mutex_lock(ampcount_mutex)
if (countltCOUNT_LIMIT)
pthread_cond_wait (hellip)
count += 125
pthread_mutex_unlock(hellip) pthread_exit(NULL)
Thread 1
Sharing Data
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtualmemory management system
Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space
Creating Shared Data (tt)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach shared segment to the data segmentof the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
Sequence for using condition variable - example
This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value
Sharing Data
Sequence for using condition variable - example (cont)
include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK
main
Thread 23
void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)
void watch_count(void t) pthread_mutex_lock(ampcount_mutex)
if (countltCOUNT_LIMIT)
pthread_cond_wait (hellip)
count += 125
pthread_mutex_unlock(hellip) pthread_exit(NULL)
Thread 1
Sharing Data
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtualmemory management system
Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space
Creating Shared Data (tt)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach shared segment to the data segmentof the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
Sequence for using condition variables - example (cont.)

main:

    #include <pthread.h>

    int count = 0;                            /* global variable */

    /* DECLARE */
    pthread_mutex_t count_mutex;
    pthread_cond_t  count_threshold_cv;
    ...
    /* INITIALIZE */
    pthread_mutex_init(&count_mutex, NULL);
    pthread_cond_init(&count_threshold_cv, NULL);
    ...
    /* CREATE THREADS TO DO WORK */
    pthread_create(...);

Threads 2, 3:

    void *inc_count(void *t)
    {
        ...
        for (i = 0; i < TCOUNT; i++) {
            pthread_mutex_lock(&count_mutex);
            count++;
            if (count == COUNT_LIMIT)
                pthread_cond_signal(...);
            ...
            pthread_mutex_unlock(&count_mutex);
            /* Do some work so threads can alternate on the mutex lock */
            sleep(1);
        }
        pthread_exit(NULL);
    }

Thread 1:

    void *watch_count(void *t)
    {
        pthread_mutex_lock(&count_mutex);
        if (count < COUNT_LIMIT)
            pthread_cond_wait(...);
        count += 125;
        pthread_mutex_unlock(...);
        pthread_exit(NULL);
    }
Sharing Data
The key aspect of shared-memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages, as in a message-passing environment.
Creating Shared Data
Each process has its own virtual address space within the virtual-memory management system.
Shared-memory system calls allow processes to attach a segment of physical memory to their virtual address space.
Creating Shared Data (cont.)
shmget()
- Creates a shared-memory segment.
- The return value is the shared-memory ID.
shmat()
- Attaches the shared segment to the data segment of the calling process.
- Returns the starting address of the attached segment.
Accessing Shared Data
Accessing shared data needs careful control if the data is ever altered by a process.
Problem: CONFLICT.
o Reading the variable by different processes does not cause a conflict,
o but writing a new value does.
EX: Consider two processes, each of which is to add 1 to a shared data item x.

    Instruction    Process 1        Process 2
    x = x + 1      read x           read x
                   compute x + 1    compute x + 1
                   write to x       write to x
    (time increases downward)
[Figure: conflict in accessing shared data - both processes read the shared variable x, each computes x + 1, and both write back, so one of the two increments is lost.]
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.
This mechanism is called mutual exclusion.
Lock
The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
A lock is a variable containing the value 0 or 1:
    lock = 1: a process has entered the critical section;
    lock = 0: no process is in the critical section.
The lock operates much like a door lock. Suppose a process reaches a lock that is set, indicating that the process is excluded from the critical section. It then has to wait until it is allowed to enter the critical section.
    while (lock == 1);   /* no operation in while loop (spin) */
    lock = 1;            /* enter critical section */
    ... critical section ...
    lock = 0;            /* leave critical section */

Such a lock is called a spin lock.
This mechanism is called busy waiting. In some cases it may be possible instead to deschedule the waiting process from the processor and schedule another process, but this brings:
- overhead in saving and restoring process information;
- the need to choose the best or highest-priority process to enter the critical section.
    Process 1:                      Process 2:
    while (lock == 1);              while (lock == 1);
    lock = 1;                       lock = 1;
    /* critical section */          /* critical section */
    lock = 0;                       lock = 0;
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual-exclusion lock variables (mutexes). A mutex:
- resolves synchronization problems between threads;
- is used to share resources among threads in an orderly fashion;
- provides mutual exclusion between threads.
Note: a mutex is used only to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.
#include <pthread.h>   /* header that declares the mutex functions */
Declare a variable:
    pthread_mutex_t mutex;
Static initialization:
    mutex = PTHREAD_MUTEX_INITIALIZER;
    mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;
Initialization by function:
    int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
    int pthread_mutex_lock(pthread_mutex_t *mutex);
    int pthread_mutex_unlock(pthread_mutex_t *mutex);
    int pthread_mutex_trylock(pthread_mutex_t *mutex);
    int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex1:
    forall (i = 0; i < 5; i++)
        a[i] = 0;
All instances can be executed simultaneously.
Ex2:
    forall (i = 2; i < 6; i++) {
        x = i - 2*i + i*i;
        a[i] = a[x];
    }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's Conditions
- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered by process P(m).
If the three conditions
    I1 ∩ O2 = ∅
    I2 ∩ O1 = ∅
    O1 ∩ O2 = ∅
are all satisfied, the two processes can be executed concurrently.
Example 1: Suppose the two statements are (in C):
    a = x + y;
    b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, so
    I1 ∩ O2 = I2 ∩ O1 = O1 ∩ O2 = ∅
and the two statements can be executed simultaneously.
Example 2: Suppose the two statements are (in C):
    a = x + y;
    b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, so
    I2 ∩ O1 = {a} ≠ ∅
and the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, a variable might be declared as shared, e.g.:
    shared int x;
(The shared qualifier is a construct of such a language, not of standard C/C++.)
par Construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
    par {
        S1;
        S2;
        ...
        Sn;
    }
The keyword par indicates that the statements in the body are to be executed concurrently. Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
    par {
        proc1();
        proc2();
        ...
        procn();
    }
forall Construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
    forall (i = 0; i < n; i++) {
        S1;
        S2;
        ...
        Sm;
    }
which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, ..., Sm. Each process uses a different value of i. For example,
    forall (i = 0; i < 5; i++)
        a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Cache coherence protocols: in the update policy, copies of data in all caches are updated at the time one copy is altered. In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next makes reference to the data.
False sharing: the key characteristic involved is that caches are organized in blocks of contiguous locations. False sharing occurs when different processors require different parts of the same block, but not the same bytes.
Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks. The only way to avoid false sharing completely would be to place each element in a different block, which would create significant wastage of storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central-memory or a distributed-memory architecture.
Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote-memory access) architectures.
Central-memory systems are also known as UMA (uniform memory access) systems.
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment
Creating Shared Data
Each process has its own virtual address space within the virtualmemory management system
Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space
Creating Shared Data (tt)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach shared segment to the data segmentof the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
Creating Shared Data (tt)
shmget ()
Create shared memory segment
Return value is shared memory ID
shmat()
Attach shared segment to the data segmentof the calling process
Return the starting address of the data segment
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
Accessing Shared Data
Accessing Shared Data needs careful control if the data is everaltered by a process
Problem CONFLICT
o Reading the variable by different process does not cause CONFLICT
o But writing new value CONFLICT
EX Consider two processes each of which is to add 1 two a shareddata item x
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
#include <pthread.h>   /* the header that declares the mutex functions */
Declare the variable: pthread_mutex_t mutex;
Static initialization:
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER_NP;   /* recursive mutex, GNU extension */
Initialization by function:
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Example
Ex 1:
forall (i = 0; i < 5; i++)
    a[i] = 0;
All instances of the body can be executed simultaneously.
Ex 2:
forall (i = 2; i < 6; i++) {
    x = i - 2*i + i*i;
    a[i] = a[x];
}
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's Conditions
I(n) is the set of memory locations read by process P(n).
O(m) is the set of memory locations altered by process P(m).
For two processes P1 and P2, if the three conditions
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
are all satisfied, the two processes can be executed concurrently.
Dependency analysis
Example 1: Suppose the two statements are (in C):
a = x + y;
b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}.
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, and O1 ∩ O2 = ∅, so the two statements can be executed simultaneously.
Dependency analysis
Example 2: Suppose the two statements are (in C):
a = x + y;
b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}.
I2 ∩ O1 = {a} ≠ ∅, so the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared as:
shared int x;
(In C/C++ with threads, a global variable such as int x; is shared among the threads.)
par construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
par {
    S1;
    S2;
    ...
    Sn;
}
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
par {
    proc1();
    proc2();
    ...
    procn();
}
forall construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
forall (i = 0; i < n; i++) {
    S1;
    S2;
    ...
    Sm;
}
which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, ..., Sm. Each process uses a different value of i. For example,
forall (i = 0; i < 5; i++)
    a[i] = 0;
clears a[0], a[1], a[2], a[3], and a[4] to zero concurrently.
Shared Data in Systems with Caches
Cache coherence protocols:
In the update policy, copies of data in all caches are updated at the time one copy is altered.
In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated. These copies are updated only when the associated processor next makes a reference to the data.
False sharing:
The key characteristic involved is that caches are organized in blocks of contiguous memory locations.
False sharing occurs when different processors require different parts of the same block, but not the same bytes: a write by one processor then affects the copies of the whole block held by the other processors, even though they never use the bytes that were written.
Solution for false sharing:
The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.
For a shared array, the only way to avoid false sharing completely would be to place each element in a different block, which would waste a significant amount of storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central-memory or a distributed-memory architecture.
Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote memory access) architectures.
Central-memory systems are also known as UMA (uniform memory access) systems.
In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types:
PVP: the parallel vector processor, also called a vector supercomputer
SMP: the symmetric multiprocessor
Distributed-Memory architecture
A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory. Memories in other nodes are called remote memories.
Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.
In a NORMA machine:
The node memories have separate address spaces.
A node cannot directly access remote memory; the only way to access remote data is by passing messages.
In an NCC-NUMA machine:
A typical example is the Cray T3E. Besides its local memory, each node has a set of node-level registers called E-registers.
Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
In a COMA machine:
All local memories are structured as caches (called COMA caches). A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node.
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA and COMA
An NCC-NUMA system does not have hardware support for cache coherence, while CC-NUMA and COMA provide cache-coherence support in hardware.
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
In CC-NUMA, main memory consists of all the local memories.
In COMA, main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a CC-NUMA machine.
Characteristics of Five Distributed-Memory Architectures
LOGO
Instruction Process 1 Process 2
x = x + 1 read x read x
Compute x + 1 Compute x + 1
write to x write to xtime
Conflict in accessing shared data
Shared variable x
+1 +1
read read
write write
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
The problem of accessing shared data can be generalized by considering shared resources
Mechanism
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time
This Mechanism Mutual exclusion
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
lock
The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock
lock is variale contain value 0 1
lock = 1 process entered Critical Section
lock = 0 no process is in Critical Section
The lock operates much like that of a door lock
Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section
It now has to wait until it is allowed to enter the critical section
while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section
hellipcritical sectionhellip leave critical section
lock = 0
lock spin lock
Mechanism Busy waiting
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
CC-NUMA versus COMA
In CC-NUMA, main memory consists of all the local memories; in COMA, main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of the five Distributed-Memory architectures
Suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section.
It now has to wait until it is allowed to enter the critical section:

    while (lock == 1) do_nothing;  /* no operation in while loop */
    lock = 1;                      /* enter critical section */
        ... critical section ...
    lock = 0;                      /* leave critical section */

A lock implemented this way is called a spin lock, and the mechanism is busy waiting.
In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead. This incurs overhead in saving and restoring process information, and it is necessary to choose the best or highest-priority process to enter the critical section.
    Process 1                          Process 2
    while (lock == 1) do_nothing;      while (lock == 1) do_nothing;
    lock = 1;                          lock = 1;
    /* critical section */             /* critical section */
    lock = 0;                          lock = 0;

Because testing the lock and setting it are separate operations, both processes can read lock == 0 at the same time and then both enter the critical section together.
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual-exclusion lock variables (mutexes).
They resolve synchronization problems between threads: a mutex is used to share resources among threads in turn, providing mutual exclusion between threads.
Note: a mutex is used to synchronize threads within a single process; it cannot synchronize threads belonging to different processes.
    #include <pthread.h>   /* header containing the mutex functions */

Declare a variable:
    pthread_mutex_t mutex;

Initialize statically:
    mutex = PTHREAD_MUTEX_INITIALIZER;
    mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialize with a function:
    int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);
Important functions:
    int pthread_mutex_lock(pthread_mutex_t *mutex);
    int pthread_mutex_unlock(pthread_mutex_t *mutex);
    int pthread_mutex_trylock(pthread_mutex_t *mutex);
    int pthread_mutex_destroy(pthread_mutex_t *mutex);
Dependency analysis
One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.
Examples
Ex1: forall (i = 0; i < 5; i++) a[i] = 0;
All instances of the body can be executed simultaneously.
Ex2: forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.
Bernstein's conditions
- I(i) is the set of memory locations read by process P(i).
- O(j) is the set of memory locations altered by process P(j).
Two processes P1 and P2 can be executed concurrently if the three conditions are all satisfied:
    I1 ∩ O2 = ∅
    I2 ∩ O1 = ∅
    O1 ∩ O2 = ∅
Example 1: Suppose the two statements are (in C):
    a = x + y;
    b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}.
Since I1 ∩ O2 = I2 ∩ O1 = O1 ∩ O2 = ∅, the two statements can be executed simultaneously.
Example 2: Suppose the two statements are (in C):
    a = x + y;
    b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}.
Since I2 ∩ O1 = {a} ≠ ∅, the two statements cannot be executed simultaneously.
Language Constructs for Parallelism
Shared data: In a parallel programming language supporting shared memory, a shared variable might be declared as:
    shared int x;
(In C/C++, a global variable plays this role.)
par construct: Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
    par { s1; s2; ...; sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
    par { proc1(); proc2(); ...; procn(); }
forall construct: Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
    forall (i = 0; i < n; i++) { s1; s2; ...; sm; }
which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, ..., sm. Each process uses a different value of i.
For example,
    forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Cache coherence protocols: In the update policy, copies of data in all caches are updated at the time one copy is altered. In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next makes reference to them.
False sharing: the key characteristic here is that caches are organized in blocks of contiguous locations. False sharing occurs when different parts of a block are required by different processors, but not the same bytes.
Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered by only one processor into different blocks.
The only way to avoid false sharing entirely would be to place each element in a different block, which would waste significant storage for a large array.
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
In some case int may be possile to deschedule the process from the processor and schedule another process
Overhead in saving and reading process information
Necessary to choose the best or highest-priority process to enterthe critical section
Process 1 Process 2
while (lock == 1) do_nothinglock = 1
lock = 0
Critical Section
while (lock == 1) do_nothing
lock = 1
Lock = 0
Critical Section
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
Pthread Lock Routines
Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)
Resolve synchronous problem of threads
Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự
Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread
Note
Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
wwwthemegallerycom
includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex
Khai baacuteo biến pthread_mutex_t mutex
Khởi động trị ban đầu
mutex = PTHREAD_MUTEX_INITIALIZER
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER
Khởi động bằng hagravem
int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
wwwthemegallerycom
Caacutec hagravem quan trọng
int pthread_mutex_lock( pthread_mutex_t mutex)
int pthread_mutex_unlock( pthread_mutex_t mutex)
int pthread_mutex_trylock( pthread_mutex_t mutex)
int pthread_mutex_destroy( pthread_mutex_t mutex)
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par construct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
par { s1; s2; ...; sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.
Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
par { proc1(); proc2(); ...; procn(); }
forall construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
forall (i = 0; i < n; i++) { s1; s2; ...; sm; }
which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, ..., sm. Each process uses a different value of i. For example:
forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
Shared Data in Systems with Caches
Shared Data in Systems with Caches
Cache coherence protocols:
In the update policy, copies of data in all caches are updated at the time one copy is altered.
In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; those copies are updated only when the associated processor next references the data.
Shared Data in Systems with Caches
False sharing: The key characteristic here is that caches are organized in blocks (lines) of contiguous memory locations. False sharing occurs when different processors require different parts of the same block, but not the same bytes.
Shared Data in Systems with Caches
Solution for false sharing: the compiler can alter the layout of the data stored in main memory, placing data altered by only one processor into different blocks.
The only complete way to avoid false sharing in an array would be to place each element in a different block, which would waste significant storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central-memory or a distributed-memory architecture.
Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.
Central-memory systems are also known as UMA (uniform memory access) systems.
Central memory versus distributed memory
In a UMA system, all memory locations are equally distant from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types:
- PVP: the parallel vector processor, also called a vector supercomputer.
- SMP: the symmetric multiprocessor.
Distributed-memory architecture
A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory. Memories in other nodes are called remote memories.
Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.
Distributed-memory architecture
In a NORMA machine, the node memories have separate address spaces. A node cannot directly access remote memory; the only way to access remote data is by passing messages.
Distributed-memory architecture
In an NCC-NUMA machine (a typical example is the Cray T3E), besides the local memory, each node has a set of node-level registers called E-registers. Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
Distributed-memory architecture
In a COMA machine, all local memories are structured as caches (called COMA caches). A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node.
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
NCC-NUMA versus CC-NUMA and COMA
An NCC-NUMA system does not have hardware support for cache coherency, while CC-NUMA and COMA provide cache coherency support in hardware.
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
CC-NUMA versus COMA
In CC-NUMA, main memory consists of all the local memories.
In COMA, main memory consists of all the COMA caches.
This added complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of five distributed-memory architectures
LOGO
wwwthemegallerycom Company Logo
Dependency analysis
One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the
dependencies in a program is call dependency analysis
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
wwwthemegallerycom Company Logo
Example
Ex1 forall(i=0ilt5i++) a[i] = 0
All instances can be executed simultaneously
Ex2 forall (I = 2 ilt6 i++)
x = I ndash 2I + iI a[i] = a[x]
In this case it is not at all obvious whether
different instances of the body can be executed simultaneously
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
LOGO Bernsteinrsquos condition
-I(n) is the set of memory locations read by process P(n)
-O(m) is the set of memory locations altered by process P(m)
If the three conditions are all satisfiedthe two processes can be
executed concurrently
I1 O2 = I2 O1 =
O1 O2 =
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
wwwthemegallerycom Company Logo
Dependency analysis
Example 1 Suppose the two statements are (in C) a = x + y
b = x + z------------------------------------------
I1(xy) I2(xz) O1(a) O2(b) -----------------------
I1 O2 = I2 O1 = O1 O2 = - two statements can be executed
simultaneously
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
wwwthemegallerycom Company Logo
Dependency analysis
Example 2 Suppose the two statements are (in C) a = x + y
b = a + b------------------------------------------
I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed
simultaneously
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct
par s1 s2 hellip sn
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
Language Contructs for Parallelism
Shared dataIn a parallelism programming language surportingshared memory variable might be declared as
shared int xWith C++
Int gobal x
par Contruct
Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

par {
    s1;
    s2;
    …
    sn;
}

The keyword par indicates that the statements in the body are to be executed concurrently. Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

par {
    proc1();
    proc2();
    …
    procn();
}
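Mainstream languages have no par statement, but its effect can be obtained by starting one thread per routine and joining them all. A hedged Python sketch (proc1 and proc2 are stand-in routines with made-up bodies, not from the slides):

```python
import threading

results = {}

def proc1():
    results["proc1"] = sum(range(100))   # stand-in body

def proc2():
    results["proc2"] = max(range(100))   # stand-in body

# Rough equivalent of: par { proc1(); proc2(); }
threads = [threading.Thread(target=p) for p in (proc1, proc2)]
for t in threads:
    t.start()
for t in threads:
    t.join()   # par completes only when every branch has finished

print(results["proc1"], results["proc2"])  # 4950 99
```

The joins mirror the implicit barrier at the end of a par body: execution continues only after all branches terminate.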
The forall construct
Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) {
    s1;
    s2;
    …
    sm;
}

which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, …, sm. Each process uses a different value of i. For example,

forall (i = 0; i < 5; i++)
    a[i] = 0;

clears a[0], a[1], a[2], a[3], and a[4] to zero concurrently.
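The forall example above can be mimicked with a thread pool that runs one task per index value. A sketch assuming Python's concurrent.futures (the array contents are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

n = 5
a = [1] * n   # start from non-zero contents so the effect is visible

def body(i):
    a[i] = 0  # each task receives its own value of i, like each forall process

# Rough equivalent of: forall (i = 0; i < 5; i++) a[i] = 0;
with ThreadPoolExecutor(max_workers=n) as pool:
    list(pool.map(body, range(n)))   # leaving the with-block waits for all tasks

print(a)  # [0, 0, 0, 0, 0]
```

Because every iteration writes a different element, no locking is needed here; iterations that touched shared state would need the same protection as any shared variable.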
Shared Data in Systems with Caches
Cache coherence protocols
In the update policy, copies of data in all caches are updated at the time one copy is altered. In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; those copies are updated only when the associated processor next references the data.
False sharing
The key characteristic involved is that caches are organized in blocks of contiguous locations. False sharing occurs when different processors need different parts of the same block, but not the same bytes.
A solution for false sharing is for the compiler to alter the layout of the data stored in main memory, placing data altered by only one processor into different blocks. The only way to avoid false sharing completely would be to place each element in a different block, which would waste a significant amount of storage for a large array.
Different types of memory architecture
Central memory versus distributed memory
A parallel computer has either a central-memory or a distributed-memory architecture. Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures. Central-memory systems are also known as UMA (uniform memory access) systems.
In a UMA system, all memory locations are an equal distance from any processor, and all memory accesses take roughly the same amount of time. UMA systems come in two types: PVP, the parallel vector processor (also called a vector supercomputer), and SMP, the symmetric multiprocessor.
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
The keyword par indicates that statements in body areto be executed concurrently
Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently
par proc1 proc2 hellip procn
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
forall Contruct
Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)
forall (i = 0 I lt n i++) s1 s2 hellip sm
Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i
forall (i = 0 I lt 5 i++) a[i] = 0
Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
LOGO
Share DATA in systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Cache coherence protocolsIn the update policy copies of data in all
caches are updated at the time one copy is altered
In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
False sharingThe key characteristic used is that
caches are organized in blocks of contiguous locations
Different parts of a block required by different processors but not the same bytes
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
wwwthemegallerycom Company Logo
Share Data in Systems with Caches
Solution for false sharingCompiler to alter the layout of the data
stored in the main memory separating data only altered by one processor into different blocks
The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA Main memory consists of all the
local memoriesCOMA
Main memory consists of all the COMA caches
All the complexity make a COMA system more expensive to implement than a NUMA machine
wwwthemegallerycom Company Logo
Characteristics of five Distributed-Memory
architecture
LOGO
wwwthemegallerycom Company Logo
Different types of memory architecture
wwwthemegallerycom Company Logo
Central memory versus distributed memory
A parallel computer has either a central memory or distributed memory architecture
Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture
Central memory systems are also known as UMA (uniform memory access) systems
wwwthemegallerycom Company Logo
Central memory versus distributed memory
In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time
UMA systems have two types PVP the parallel vector processor or
also called vector supercomputer SMP the symmetric multiprocessor
wwwthemegallerycom Company Logo
Distributed-Memory architecture
A distributed memory computer contains multiple nodes each having one or more processors and a local memory
Memories in other nodes are called remote memories
Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA
wwwthemegallerycom Company Logo
Distributed-Memory architecture
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NORMA machine The node memories have separate
address spaces A node cant directly access remote
memory The only way to access remote data
is by passing messages
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each
node has a set of node-level register called E-registers
Other NCC-NUMA systems may allow loading a remote value directly into a processor register
wwwthemegallerycom Company Logo
Distributed-Memory architecture
In a COMA machine All local memories are structured as
caches (called COMA caches) A cache has much larger capacity than the
level-2 cache or the remote cache of a node
COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory
wwwthemegallerycom Company Logo
NCC-NUMA versus CC-NUMA COMA
An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware
It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system
wwwthemegallerycom Company Logo
CC-NUMA versus COMA
CC-NUMA: main memory consists of all the local memories.
COMA: main memory consists of all the COMA caches.
All this complexity makes a COMA system more expensive to implement than a NUMA machine.
Characteristics of five Distributed-Memory architectures