Seminar: Shared Memory Programming

Week 7 - 2

Nguyễn Thị Xuân Mai 50501620
Phạm Thị Trúc Linh 50501491
Vũ Thị Mai Phương 50502194
Nguyễn Minh Quân 50502771
Phí Văn Tuấn 50503347


TRANSCRIPT

Page 1: Seminar Shared memory Programming


Programming with shared memory

1. Shared memory multiprocessor
2. Constructs for specifying parallelism: Creating Concurrent Processes; Threads
3. Sharing data: Creating Shared Data; Accessing Shared Data (Locks, Deadlock, Semaphores, Monitor, Condition Variables); Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches

Shared Memory Multiprocessors

• In a shared memory system, any memory location is accessible to any of the processors.

• A single address space exists, meaning that each memory location is given a unique address within a single range of addresses.

• Shared-memory behavior is determined by both program order and memory access order.

Shared memory multiprocessor

Shared Memory Multiprocessors

• For a small number of processors, a common architecture is the single-bus architecture, in which all processors and memory modules attach to the same set of wires (the bus).

Shared memory multiprocessor

Shared Memory Multiprocessors

[Figure: (a) A uniprocessor system: a single program order (PO) of memory operations applied to memory. (b) A multiprocessor system: per-processor program orders PO1, PO2, ..., POn, interleaved by a switch into a single global memory order in the shared memory.]

Constructs for Specifying Parallelism

1. Creating Concurrent Processes
2. Threads

Constructs for specifying Parallelism

Creating Concurrent Processes

• A structure for specifying concurrent processes is the FORK-JOIN group of statements.

• A FORK statement generates one new path for a concurrent process, and the concurrent processes use JOIN statements at their ends.

• When a JOIN statement has been reached, processing continues in a sequential fashion.

Constructs for specifying Parallelism

Creating Concurrent Processes (cont.)

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• Operating systems such as UNIX are based upon the notion of a process.

• On a single-processor system, the processor has to be time-shared between processes, switching from one process to another.

• Time sharing also offers the opportunity to deschedule processes that are blocked from proceeding for some reason, such as waiting for an I/O operation to complete.

• On a multiprocessor, there is an opportunity to execute processes truly concurrently.

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• The UNIX system call fork() creates a new process. The new process (the child process) is an exact copy of the calling process, except that it has a unique process ID.

• On success, fork() returns 0 to the child process and returns the process ID of the child process to the parent process.

• Processes are "joined" with the system calls wait() and exit(), defined as:

wait(statusp)   delays the caller until a signal is received or one of its child processes terminates or stops
exit(status)    terminates a process

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• Hence, a single child process can be created by:

pid = fork();                             /* fork */
/* ... code to be executed by both child and parent ... */
if (pid == 0) exit(0); else wait(0);      /* join */

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• If the child is to execute different code, we could use:

pid = fork();
if (pid == 0) {
    /* ... code to be executed by slave ... */
} else {
    /* ... code to be executed by parent ... */
}
if (pid == 0) exit(0); else wait(0);

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• All variables in the original program are duplicated in each process, becoming local variables for that process. They are initially assigned the same values as the original variables.

• The parent will wait for the slave to finish if it reaches the "join" point first; if the slave reaches the "join" point first, it will terminate.
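A minimal, self-contained sketch of this fork-join pattern (the function name and the child's exit status are our own choices, not from the slides):

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child; both processes run the shared section, then the
 * child exits and the parent waits ("join"). Returns the child's
 * exit status as observed by the parent. */
int fork_join_demo(void) {
    pid_t pid = fork();                 /* fork */
    if (pid < 0)
        return -1;                      /* fork failed */

    /* ... code executed by both child and parent ... */

    if (pid == 0)
        exit(42);                       /* child terminates */
    int status;
    wait(&status);                      /* parent joins */
    return WEXITSTATUS(status);
}
```

On a POSIX system, fork_join_demo() returns 42, the value the child passed to exit().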

Constructs for specifying Parallelism

Threads

Thread: a thread of execution is a fork of a computer program into two or more concurrently running tasks.

Thread mechanism: allows tasks to share the same memory space and global variables.

Constructs for specifying Parallelism

Processes versus threads:

• Dependence: processes are typically independent, while threads exist as subsets of a process.

• State information: processes carry considerable state information, whereas multiple threads within a process share state as well as memory and other resources.

• Address space: processes have separate address spaces, whereas threads share their address space.

• Interaction: processes interact only through system-provided inter-process communication mechanisms, while threads interact through shared variables or by message passing.

• Context switching: context switching between threads in the same process is typically faster than context switching between processes.

Processes & Threads

[Figure: a single-threaded process holds code, heap, files, interrupt routines, one IP, and one stack; in a multithreaded process the code, heap, files, and interrupt routines are shared, while each thread has its own IP and stack.]

Constructs for specifying Parallelism

www.themegallery.com Company Logo

Multithreaded Processor Model

Analyzing the performance of such a system:

• Latency (L): the communication latency experienced with a remote memory access: network delay, cache-miss penalty, and delays caused by contention in split transactions.

• Number of threads (N): the number of threads that can be interleaved in a processor. The context of a thread consists of the PC, the register set, the required context status word, etc.

• Context switch overhead (C): the time lost in performing a context switch in a processor. The switch mechanism determines the number of processor states needed to maintain active threads.

• Interval between context switches: the run length, in cycles, between context switches triggered by remote references.

Multithreaded Computation

[Figure: the concept of multithreading in an MPP system: threads of a parallel computation, with initial scheduling overhead and thread synchronization overhead between computation phases.]

Processor efficiency:

• Busy: doing useful work.
• Context switch: suspending the current context and switching to another.
• Idle: all available contexts are suspended (blocked).

Efficiency = busy / (busy + switching + idle)
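The efficiency formula above is a simple ratio; the cycle counts in the usage note are illustrative values, not measurements:

```c
/* efficiency = busy / (busy + switching + idle), all in cycles */
double efficiency(double busy, double switching, double idle) {
    return busy / (busy + switching + idle);
}
```

For example, 80 cycles of useful work, 15 of context switching, and 5 idle give an efficiency of 0.80.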

Abstract Processor Model

[Figure: multiple-context processor model with one thread per context: N contexts, each with its own PC, PSW, and register file (one thread context), sharing an ALU; local memory references and remote memory references leave the processor.]

Context-switching policies

• Switch on cache miss: switch when encountering a cache miss.

• Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.

• Switch on every instruction: switch on every instruction, independent of whether or not it is a load.

• Switch on block of instructions: improves the cache-hit ratio due to preservation of some locality, and also benefits single-context performance.

Pthreads

History: SUN Solaris, Windows NT, etc. are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.

Standard: the IEEE POSIX 1003.1c standard (1995).

Constructs for specifying Parallelism

Executing a Pthread Thread

int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);

Return value:
• success: a new thread is created, 0 is returned, and *thread contains the new thread's ID.
• failure: a nonzero error code is returned.

Arguments:
• thread: a pointer of type pthread_t *; receives the ID of the new thread.
• attr: contains the initial attributes for the thread; if attr is NULL, the attributes are initialized to default values.
• start_routine: a reference to a function defined by the user, containing the code to be executed by the new thread.
• arg: a single argument passed to start_routine.

pthread_t thread: handle of the special Pthread datatype.

Executing a Pthread Thread (cont.)

pthread_exit(void *status): terminates and destroys a thread.

pthread_cancel(): a thread is destroyed by another thread.

int pthread_join(pthread_t th, void **thread_return);
pthread_join() forces the calling thread to suspend its execution and wait until the thread with the given thread ID terminates. *thread_return contains the return value (the value of the return statement or of the pthread_exit() call).
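A minimal create/join sketch using the routines above (the worker function and its argument are our own illustration):

```c
#include <pthread.h>

/* Worker: doubles the integer it is handed and returns the pointer,
 * which pthread_join() delivers back to the caller. */
static void *worker(void *arg) {
    int *n = arg;
    *n *= 2;
    return arg;
}

int run_thread(int start) {
    pthread_t tid;
    int value = start;
    void *result;
    pthread_create(&tid, NULL, worker, &value);  /* start the thread */
    pthread_join(tid, &result);          /* suspend until it terminates */
    return *(int *)result;
}
```

run_thread(21) creates the thread, waits for it, and returns 42.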

Detached Threads

There are cases in which threads can be terminated without the need for pthread_join.

When detached threads terminate, they are destroyed and their resources released, which is more efficient.
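Detached threads are created through a thread attribute; this is a sketch of the standard attribute calls (the flag variable, helper name, and one-second wait are our own simplifications):

```c
#include <pthread.h>
#include <unistd.h>

static volatile int done = 0;

static void *detached_worker(void *arg) {
    (void)arg;
    done = 1;       /* resources are reclaimed automatically at exit */
    return NULL;
}

/* Start a detached thread: no pthread_join is needed (or allowed). */
int run_detached(void) {
    pthread_attr_t attr;
    pthread_t tid;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    pthread_create(&tid, &attr, detached_worker, NULL);
    pthread_attr_destroy(&attr);
    sleep(1);       /* crude: give the detached thread time to run */
    return done;
}
```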

[Figure: the main program calls pthread_create() three times; each created thread runs to its own termination, with no join.]

Constructs for specifying Parallelism

Thread Pools

A master thread can control a collection of slave threads. A work pool of threads can be formed. Threads can communicate through shared locations or, as we shall see, using signals.

Constructs for specifying Parallelism

Thread-safe routines

System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.

Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.

Constructs for specifying Parallelism

Thread-safe routines (cont.)

Suppose that your application creates several threads, each of which makes a call to the same library routine:

• This library routine accesses/modifies a global structure or location in memory.

• As each thread calls this routine, it is possible that they may try to modify this global structure or memory location at the same time.

• If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.
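A routine can be made thread-safe exactly as described above, by guarding its global state with a mutex; the counter and all names here are our own illustration:

```c
#include <pthread.h>

static int counter = 0;
static pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Safe to call from any number of threads simultaneously. */
void increment(void) {
    pthread_mutex_lock(&counter_mutex);
    counter++;                          /* protected global update */
    pthread_mutex_unlock(&counter_mutex);
}

static void *thread_body(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++)
        increment();
    return NULL;
}

/* Two threads, 100000 increments each; with the mutex the result is
 * always 200000, while without it increments could be lost. */
int run_threads(void) {
    counter = 0;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread_body, NULL);
    pthread_create(&t2, NULL, thread_body, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;
}
```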

Constructs for specifying Parallelism

Sharing Data

1. Creating Shared Data
2. Accessing Shared Data: Locks, Deadlock, Semaphores, Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches

Sharing Data


Conditions for Deadlock

1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.

2. Hold-and-wait: a thread holds one resource while it waits for another.

3. No preemption: resources are released only voluntarily, after completion.

4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.

ALL four conditions MUST hold.
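One standard way to defeat condition 4 is a global lock ordering: if every thread acquires the locks in the same order, no cycle can form in the waits-for graph. A sketch with illustrative names:

```c
#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;
static int shared = 0;

/* Both tasks take lock_a before lock_b, so circular wait cannot occur. */
static void *task(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);
    shared++;
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return NULL;
}

int run_tasks(void) {
    shared = 0;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, task, NULL);
    pthread_create(&t2, NULL, task, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return shared;      /* 2: both tasks completed, no deadlock */
}
```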

Handling Deadlock

• Deadlock prevention
• Deadlock avoidance
• Deadlock detection and recovery
• Ignore the problem

Deadlock

[Figure: (a) two-process deadlock; (b) n-process deadlock. R1, R2, ..., Rn are resources; P1, P2, ..., Pn are processes.]

Example

Semaphore

A semaphore is a positive integer operated upon by two operations, P and V.

The value is the number of units of the resource which are free.

A binary semaphore has value 0 or 1; a general semaphore can take on positive values other than 0 and 1.

P and V operations are performed indivisibly:

P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.

V(s): increments s by 1 to release one of the waiting processes (if any).

The first process to reach its P(s) operation, or to be accepted, will set the semaphore to 0.

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.

When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes is allowed to proceed into the critical section.
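The P and V definitions above can be sketched with a mutex and a condition variable (our own emulation, not the POSIX semaphore API):

```c
#include <pthread.h>

typedef struct {
    int value;
    pthread_mutex_t m;
    pthread_cond_t c;
} sema;

void sema_init(sema *s, int value) {
    s->value = value;
    pthread_mutex_init(&s->m, NULL);
    pthread_cond_init(&s->c, NULL);
}

/* P(s): wait until s > 0, then decrement s by 1. */
void P(sema *s) {
    pthread_mutex_lock(&s->m);
    while (s->value == 0)
        pthread_cond_wait(&s->c, &s->m);
    s->value--;
    pthread_mutex_unlock(&s->m);
}

/* V(s): increment s by 1 and release one waiting process, if any. */
void V(sema *s) {
    pthread_mutex_lock(&s->m);
    s->value++;
    pthread_cond_signal(&s->c);
    pthread_mutex_unlock(&s->m);
}

/* Exercise the semaphore from one thread: value goes 2 -> 0 -> 1. */
int sema_demo(void) {
    sema s;
    sema_init(&s, 2);
    P(&s);
    P(&s);
    int drained = (s.value == 0);
    V(&s);
    return drained && s.value == 1;
}
```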

Monitor

Disadvantages of semaphores:

• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (e.g. deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.

• Incorrect use may be caused by an honest programming error or an uncooperative programmer.

Monitor

Example of wrong semaphore use (1):

Right code:
    ...
    wait(mutex);
    /* critical section */
    signal(mutex);
    ...

Wrong code:
    ...
    signal(mutex);
    /* critical section */
    wait(mutex);
    ...

This wrong code violates mutual exclusion.

Monitor

Example of wrong semaphore use (2):

Wrong code:
    ...
    wait(mutex);
    /* critical section */
    wait(mutex);
    ...

This wrong code causes deadlock.

Khoa Khoa Học & Kĩ thuật Máy tính - Đại học Bách Khoa TP.HCM (Faculty of Computer Science & Engineering, HCMUT)

Monitor

• If the programmer omits the wait() or the signal() in a critical section, or both, then either mutual exclusion is violated or a deadlock will occur.

• With the two processes below, a deadlock can occur when both are simultaneously active, because each acquires the semaphores in the opposite order:

Process P1:               Process P2:
...                       ...
wait(S);                  wait(Q);
wait(Q);                  wait(S);
/* critical section */    /* critical section */
signal(Q);                signal(S);
signal(S);                signal(Q);


Monitor

• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.

• A monitor is an approach to synchronizing operations on a shared resource. A monitor includes:
  - a suite of procedures that provides the only method of access to the shared resource;
  - mutual exclusion among those procedures;
  - the variables associated with the shared resource;
  - invariants assumed in order to avoid conflicting events.


Monitor

• A type, or abstract data type, encapsulates private data with public methods to operate on that data.

• A monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.

• The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of procedures or functions that operate on those variables.


Monitor

• The structure of a monitor type:

monitor monitor_name {
    /* shared variable declarations */
    procedure P1(...) { ... }
    procedure P2(...) { ... }
    ...
    procedure Pn(...) { ... }
    initialization_code(...) { ... }
}
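C has no monitor keyword, but the structure above can be emulated: the shared variables, the procedures, and one mutex that keeps at most one caller active "inside the monitor". An illustrative sketch with names of our own:

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;   /* gives the one-active-process guarantee */
    int balance;            /* shared variable declarations */
} account_monitor;

void monitor_init(account_monitor *m) {         /* initialization_code() */
    pthread_mutex_init(&m->lock, NULL);
    m->balance = 0;
}

void deposit(account_monitor *m, int amount) {  /* procedure P1 */
    pthread_mutex_lock(&m->lock);
    m->balance += amount;
    pthread_mutex_unlock(&m->lock);
}

int get_balance(account_monitor *m) {           /* procedure P2 */
    pthread_mutex_lock(&m->lock);
    int b = m->balance;
    pthread_mutex_unlock(&m->lock);
    return b;
}

int monitor_demo(void) {
    account_monitor m;
    monitor_init(&m);
    deposit(&m, 10);
    deposit(&m, 5);
    return get_balance(&m);
}
```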


Structure Monitor


Usage Monitor

• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.

• The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; additional "tailor-made" synchronization mechanisms need to be defined.

• Such additional "tailor-made" synchronization uses the condition construct.


Conditional type

bull Declare conditional xybull Only operations that can be invoke on a

conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until

another process invokesxsignal()the process invoking this operation resumes exatly one

suspended processbull if no process is suspended xsignal() has

no effect


Monitor structure with condition type

Condition variables

Condition variables allow threads to synchronize based upon the actual value of data. Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be very time-consuming and unproductive, since the thread would be continuously busy in this activity.

A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

pthread_cond_t cond;   /* declare a condition variable */

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
Initializes the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialization, the state of the condition variable becomes initialized.

int pthread_cond_signal(pthread_cond_t *cond);
Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

Sharing Data

Pthread Condition Variables (cont.)

int pthread_cond_broadcast(pthread_cond_t *cond);
Unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
Destroys the given condition variable; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
Block on a condition variable; the second one allows a timeout to be specified.

Sharing Data

Sequence for using a condition variable: example

This simple example demonstrates the use of several Pthread condition variable routines. The main routine creates three threads. Two of the threads perform work and update a "count" variable. The third thread waits until the count variable reaches a specified value.

Sharing Data

Sequence for using a condition variable: example (cont.)

main:

#include <pthread.h>
int count = 0;                              /* global variable; DECLARE */
pthread_mutex_t count_mutex;
pthread_cond_t count_threshold_cv;
...
pthread_mutex_init(&count_mutex, NULL);     /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
...
pthread_create(...);                        /* CREATE THREADS TO DO WORK */

Threads 2 and 3:

void *inc_count(void *t) {
    ...
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(...);
        ...
        pthread_mutex_unlock(&count_mutex);
        /* do some work so threads can alternate on the mutex lock */
        sleep(1);
    }
    pthread_exit(NULL);
}

Thread 1:

void *watch_count(void *t) {
    pthread_mutex_lock(&count_mutex);
    if (count < COUNT_LIMIT)
        pthread_cond_wait(...);
    count += 125;
    pthread_mutex_unlock(...);
    pthread_exit(NULL);
}

Sharing Data

The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget(): creates a shared memory segment; the return value is the shared memory ID.

shmat(): attaches the shared segment to the data segment of the calling process; returns the starting address of the segment.
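A hedged sketch of the two calls above (System V shared memory; error handling omitted, names our own):

```c
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Create a private 4 KB segment, attach it, write through it, then
 * detach and remove it. Returns 1 if the write was visible. */
int shm_demo(void) {
    int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600); /* create */
    char *mem = (char *)shmat(id, NULL, 0);   /* attach: get address */
    strcpy(mem, "shared");                    /* ordinary memory access */
    int ok = (strcmp(mem, "shared") == 0);
    shmdt(mem);                               /* detach */
    shmctl(id, IPC_RMID, NULL);               /* remove the segment */
    return ok;
}
```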

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: conflict.

• Reading the variable by different processes does not cause a conflict.
• But writing a new value can cause a conflict.

Example: consider two processes, each of which is to add 1 to a shared data item x.

Instruction    Process 1        Process 2
x = x + 1      read x           read x
               compute x + 1    compute x + 1
               write to x       write to x

(time runs downward)

[Figure: conflict in accessing shared variable x: both processes read x, each adds 1, and both write back; one increment is lost.]

The problem of accessing shared data can be generalized by considering shared resources.

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.

This mechanism is called mutual exclusion.

Locks

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:
• lock = 1: a process has entered the critical section.
• lock = 0: no process is in the critical section.

The lock operates much like a door lock: suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section. It then has to wait until it is allowed to enter:

while (lock == 1) ;   /* do nothing: no operation in the while loop */
lock = 1;             /* enter critical section */
/* ... critical section ... */
lock = 0;             /* leave critical section */

Such a lock is called a spin lock, and the mechanism is called busy waiting.
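The pseudocode above has a race: another process can slip in between the test `lock == 1` and the store `lock = 1`. A real spin lock needs an atomic test-and-set; C11's atomic_flag provides one (this sketch assumes a C11 compiler):

```c
#include <stdatomic.h>

static atomic_flag spin = ATOMIC_FLAG_INIT;
static int counter = 0;

void spin_lock(void) {
    while (atomic_flag_test_and_set(&spin))
        ;                       /* busy wait until the flag was clear */
}

void spin_unlock(void) {
    atomic_flag_clear(&spin);   /* lock = 0: leave critical section */
}

int spin_demo(void) {
    counter = 0;
    spin_lock();
    counter++;                  /* critical section */
    spin_unlock();
    spin_lock();
    counter++;
    spin_unlock();
    return counter;
}
```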

Busy waiting

In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead. This carries overhead in saving and restoring process information, and it is necessary to choose the best or highest-priority process to enter the critical section.

Process 1                         Process 2
while (lock == 1) ;  /* spin */
lock = 1;
/* critical section */            while (lock == 1) ;  /* spin */
lock = 0;
                                  lock = 1;
                                  /* critical section */
                                  lock = 0;

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes):

• They resolve synchronization problems between threads.
• A mutex is used to share resources among threads in an orderly way.
• It provides mutual exclusion between threads.

Note: a mutex is used only to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.

#include <pthread.h>    /* header containing the mutex functions */

Declaring a variable: pthread_mutex_t mutex;

Static initialization:

mutex = PTHREAD_MUTEX_INITIALIZER;
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function:

int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);

The important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);

Dependency analysis

One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.

Example

Ex1:

forall (i = 0; i < 5; i++)
    a[i] = 0;

All instances can be executed simultaneously.

Ex2:

forall (i = 2; i < 6; i++) {
    x = i - 2*i + i*i;
    a[i] = a[x];
}

In this case, it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

• I(n) is the set of memory locations read by process P(n).
• O(m) is the set of memory locations altered by process P(m).

If the three conditions

I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅

are all satisfied, the two processes can be executed concurrently.

Dependency analysis

Example 1: suppose the two statements are (in C):

a = x + y;
b = x + z;

Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, so

I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅:

the two statements can be executed simultaneously.

Dependency analysis

Example 2: suppose the two statements are (in C):

a = x + y;
b = a + b;

Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, so

I2 ∩ O1 = {a} ≠ ∅:

the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data: in a parallel programming language supporting shared memory, a variable might be declared as:

shared int x;

(With C/C++ threads, a global variable such as int x; is already shared among the threads.)

par construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

par {
    S1;
    S2;
    ...
    Sn;
}

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

par {
    proc1();
    proc2();
    ...
    procn();
}

forall construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) {
    S1;
    S2;
    ...
    Sm;
}

which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, ..., Sm. Each process uses a different value of i. For example:

forall (i = 0; i < 5; i++)
    a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
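forall is not standard C; OpenMP's parallel for is a close real-world analogue. Compiled with -fopenmp the iterations may run concurrently; without it the pragma is ignored and the loop simply runs sequentially, with the same result:

```c
static int a[5] = {1, 2, 3, 4, 5};

/* Clear a[0]..a[4]; each iteration may be executed by a different
 * thread, which is safe because the iterations are independent. */
void clear_a(void) {
    #pragma omp parallel for
    for (int i = 0; i < 5; i++)
        a[i] = 0;
}
```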

Shared Data in Systems with Caches

Cache coherence protocols:

• In the update policy, copies of data in all caches are updated at the time one copy is altered.

• In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor makes a reference to them.

Shared Data in Systems with Caches

False sharing:

• The key characteristic used is that caches are organized in blocks of contiguous locations.

• False sharing occurs when different parts of a block are required by different processors, but not the same bytes.

[Figure: false sharing: different processors access different parts of the same cache block.]

Solutions for false sharing:

• The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

• The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.
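Between those two extremes, per-processor data can be padded so each element fills its own cache block. The 64-byte line size below is an assumption (a common value), not a portable constant:

```c
#define CACHE_LINE 64

/* One counter per processor; the padding keeps neighbouring counters
 * out of the same cache block, so updates do not falsely share. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};
```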


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central memory or a distributed memory architecture.

Distributed memory systems include the NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.

Central memory systems are also known as UMA (uniform memory access) systems.

In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.

UMA systems have two types:
• PVP: the parallel vector processor, also called a vector supercomputer.
• SMP: the symmetric multiprocessor.

Distributed-Memory architecture

A distributed memory computer contains multiple nodes, each having one or more processors and a local memory. Memories in other nodes are called remote memories.

Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.


Distributed-Memory architecture


Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

Distributed-Memory architecture

In an NCC-NUMA machine:
• A typical example is the Cray T3E.
• Besides the local memory, each node has a set of node-level registers called E-registers.
• Other NCC-NUMA systems may allow loading a remote value directly into a processor register.

Distributed-Memory architecture

In a COMA machine:
• All local memories are structured as caches (called COMA caches).
• Such a cache has much larger capacity than the level-2 cache or the remote cache of a node.
• COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.

NCC-NUMA versus CC-NUMA and COMA

• An NCC-NUMA system does not have hardware support for cache coherency, but CC-NUMA and COMA provide cache coherency support in hardware.

• It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.

CC-NUMA versus COMA

• In CC-NUMA, main memory consists of all the local memories.

• In COMA, main memory consists of all the COMA caches.

• All the extra complexity makes a COMA system more expensive to implement than a NUMA machine.

Characteristics of five Distributed-Memory architectures

LOGO

Page 2: Seminar Shared memory Programming

Programming with shared memory

Shared memory multiprocessor1

Constructs for specifying parallelism2

Creating Concurrent Processes

Threads2

Sharing data3

Programming with shared memory

Creating Shared Data

Accessing Shared Data

Locks Deadlock

Semaphores Monitor Condition Variables

Language Constructs for Parallelism

Dependency Analysis4

Shared Data in system with caches

Shared Memory Multiprocessors

bull In a shared memory system any memory location can be accessible by any of the processors

bull A single address space exists meaning that each memory location is given a unique address within a single range of addresses

bull Shared-memory behavior is determined by both program order and memory access order

Shared memory multiprocessor

Shared Memory Multiprocessors

bull For a small number of processors a common architecture is the single bus architecture in which all processors and memory modules attach to the same set of wires(the bus)

Shared memory multiprocessor

Shared Memory Multiprocessors

Shared memory multiprocessor

ln

l5l4l3l2l1

Memory

l5 l5l4l3l2l1

l6J5

J4

J3

J2

J1

K5

K4

K3

K2

K1

K6

Programorder(PO)

PO1 PO2POn

Share Memory(A global memory oder)

Switch

(a) A uniprocessor system (b) A multiprocessor system

Constructs for Specifying Parallelism

Creating Concurrent Processes1

Threads2

Constructs for specifying Parallelism

Creating Concurrent Proceses

bull A structure to specifying concurrent processes is the FORK ndash JOIN group statements

bull A FORK statement generates one new path for a concurrent process and the concurrent process use JOIN statements at their ends

bull When JOIN statements have been reached processing continues in a sequential fashion

Constructs for specifying Parallelism

Creating Concurrent Proceses(cont)

Constructs for specifying Parallelism

UNIX Heavyweight Processes

bull Operating systems such as UNIX are based upon the notion of a process

bull On a single processor system the processor has to be time shared between processes switching from one process to another

bull Time sharing also offer the opportunity to deschedule processes that are blocked from proceeding for some reason such as waiting for an IO operation to complete

bull On a multiprocessors there is an opportunity to execute process truly concurrently

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• The UNIX system call fork() creates a new process. The new process (child process) is an exact copy of the calling process, except that it has a unique process ID

• On success, fork() returns 0 to the child process and returns the process ID of the child process to the parent process

• Processes are "joined" with the system calls wait() and exit(), defined as:

wait(statusp) — delays the caller until a signal is received or one of its child processes terminates or stops

exit(status) — terminates a process

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• Hence, a single child process can be created by:

pid = fork();                          /* fork */
/* ... code to be executed by both child and parent ... */
if (pid == 0) exit(0); else wait(0);   /* join */

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• If the child is to execute different code, we could use:

pid = fork();
if (pid == 0) {
    /* ... code to be executed by slave ... */
} else {
    /* ... code to be executed by parent ... */
}
if (pid == 0) exit(0); else wait(0);

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• All variables in the original program are duplicated in each process, becoming local variables for the process. They are initially assigned the same values as the original variables

• The parent will wait for the slave to finish if it reaches the "join" point first; if the slave reaches the "join" point first, it will terminate
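The fork()/wait() pattern above can be assembled into a complete, compilable sketch (the exit status 42 and the helper name fork_join_demo are illustrative, not from the slides):

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child; both processes run the common code, then "join":
   the child exits and the parent waits for it.  Returns the child's
   exit status as seen by the parent (42 in this sketch). */
int fork_join_demo(void) {
    pid_t pid = fork();                  /* fork */
    if (pid < 0)
        return -1;                       /* fork failed */

    /* ... code executed by both child and parent would go here ... */

    if (pid == 0)
        exit(42);                        /* child terminates ("join") */

    int status = 0;
    wait(&status);                       /* parent waits for the child */
    return WEXITSTATUS(status);
}
```

Note that all variables duplicated at the fork are independent afterwards: a value written by the child is not visible to the parent except through the exit status or other IPC.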

Constructs for specifying Parallelism

Threads

Thread: a thread of execution is a fork of a computer program into two or more concurrently running tasks

Thread mechanism: allows tasks to share the same memory space & global variables

Constructs for specifying Parallelism

Processes vs. threads:

• Dependence: processes are typically independent, while threads exist as subsets of a process

• State information: processes carry considerable state information, whereas multiple threads within a process share state as well as memory and other resources

• Address space: processes have separate address spaces, whereas threads share their address space

• Interaction: processes interact only through system-provided inter-process communication mechanisms, while threads interact through shared variables or by message passing

• Context switching: context switching between threads in the same process is typically faster than context switching between processes

Processes & Threads

[Figure: a single-threaded process (code, heap, stack, files, interrupt routines, instruction pointer (IP)) compared with a multithreaded process, in which each thread has its own stack and IP while sharing the rest]

Constructs for specifying Parallelism

wwwthemegallerycom Company Logo

Multithreaded Processor Model

Analyze the performance of the system:

• Latency (L): communication latency experienced with remote memory access — network delay, cache-miss penalty, delays caused by contention in split transactions

• Number of threads (N): the number of threads that can be interleaved in a processor; the context of a thread = PC, register set, required context status word, …

• Context switch overhead (C): time lost in performing a context switch in a processor; the switch mechanism determines the number of processor states needed to maintain active threads

• Interval between context switches (R): run length (cycles between context switches triggered by remote references)

Multithreaded Computation

[Figure: threads of a parallel computation operating on shared variables, with initial scheduling overhead and thread-synchronization overhead between computation phases]

The concept of multithreading in an MPP system

Processor efficiency — a processor is in one of three states:
• Busy: doing useful work
• Context switching: suspending the current context & switching to another
• Idle: when all available contexts are suspended (blocked)

Efficiency = busy / (busy + switching + idle)

Abstract Processor Model

Multiple-context processor model with one thread per context

[Figure: N contexts, each holding one thread's PC and PSW, sharing register files and an ALU; memory references are satisfied locally or forwarded as remote memory references]

Context-switching policies

• Switch on cache miss: switch only when encountering a cache miss

• Switch on every load: switch on every load operation, independent of whether it will cause a miss or not

• Switch on every instruction: switch on every instruction, independent of whether or not it is a load

• Switch on block of instructions: improves the cache-hit ratio due to preservation of some locality & also benefits single-context performance

Pthread Thread

History: SUN Solaris, Windows NT, … are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different

Standard: IEEE POSIX 1003.1c standard (1995) — Pthreads

Constructs for specifying Parallelism

Executing a Pthread Thread

int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);

Return value:
• success: a new thread is created, 0 is returned, and *thread contains the new thread's ID
• failure: a nonzero error code is returned

Arguments:
• thread: a pointer of type pthread_t *; receives the ID of the new thread
• attr: initial attributes for the thread; if attr = NULL, attributes are initialized to default values
• start_routine: a reference to a function defined by the user; this function contains the code executed by the new thread
• arg: a single argument that is passed to start_routine

pthread_t thread — handle of the special Pthread datatype

Executing a Pthread Thread(cont)

pthread_exit(void *status) — terminates & destroys the calling thread

pthread_cancel() — the thread is destroyed at the request of another thread

int pthread_join(pthread_t th, void **thread_return) — forces the calling thread to suspend its execution & wait until the thread with the given thread ID terminates; *thread_return contains the return value (the value of a return statement or of a pthread_exit(…) statement)

Detached threads: there are cases in which threads can be terminated without the need for pthread_join

Detached Thread

When detached threads terminate, they are destroyed & their resources released ⇒ more efficient

[Figure: a main program creating detached threads with pthread_create(); each thread runs to termination independently, with no pthread_join]
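The create/join sequence can be sketched as a self-contained C fragment; the routine square, its argument, and the result passing through pthread_exit() are illustrative:

```c
#include <pthread.h>

/* Thread routine: squares the integer it is passed and returns the
   result through pthread_exit(). */
static void *square(void *arg) {
    long n = (long)arg;
    pthread_exit((void *)(n * n));       /* terminate, passing a result */
}

/* Create one thread, then join: suspend until it terminates and
   collect its return value. */
long run_one_thread(long n) {
    pthread_t tid;
    void *result = NULL;
    if (pthread_create(&tid, NULL, square, (void *)n) != 0)
        return -1;                       /* nonzero return code = failure */
    pthread_join(tid, &result);          /* wait until the thread ends */
    return (long)result;
}
```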

Constructs for specifying Parallelism

Thread Pools

A master thread could control a collection of slave threads. A work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals

Constructs for specifying Parallelism

Thread ndash safe routines

System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results

Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions

Constructs for specifying Parallelism

Thread ndash safe routines (cont)

Suppose that your application creates several threads, each of which makes a call to the same library routine:

• This library routine accesses/modifies a global structure or location in memory

• As each thread calls this routine, it is possible that they may try to modify this global structure/memory location at the same time

• If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe

Constructs for specifying Parallelism

Sharing Data

1. Creating Shared Data

2. Accessing Shared Data
   Locks • Deadlock • Semaphores • Monitors • Condition Variables

3. Language Constructs for Parallelism

4. Dependency Analysis

5. Shared Data in systems with caches

Sharing Data


Conditions for Deadlock

1. Mutual exclusion • the resource cannot be shared • requests are delayed until the resource is released

2. Hold-and-wait • a thread holds one resource while it waits for another

3. No preemption • resources are released only voluntarily, after completion

4. Circular wait • circular dependencies exist in the "waits-for" or "resource-allocation" graph

ALL four conditions MUST hold

Handling Deadlock

• Deadlock prevention
• Deadlock avoidance
• Deadlock detection and recovery
• Ignore the problem

Deadlock

Example:

[Figure: (a) two-process deadlock; (b) n-process deadlock — R1, R2, …, Rn are resources; P1, P2, …, Pn are processes]

Semaphore

• A positive integer operated upon by two operations, P & V

• The value is the number of units of the resource which are free

• A binary semaphore has value 0 or 1; a general semaphore can take on positive values other than 0 and 1

Semaphore

• P & V operations are performed indivisibly

• P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue

• V(s): increments s by 1 to release one of the waiting processes (if any)

Semaphore

• The first process to reach its P(s) operation (or the one accepted) will set the semaphore to 0

• Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore

Semaphore

• When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes is allowed to proceed into the critical section
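The P and V operations map directly onto POSIX semaphores — P(s) is sem_wait() and V(s) is sem_post(). A minimal sketch (the function name semaphore_demo and the counter are illustrative):

```c
#include <semaphore.h>

/* P(s) = sem_wait(): wait until s > 0, then decrement.
   V(s) = sem_post(): increment s, releasing a waiter (if any). */
static sem_t s;
static int shared_counter = 0;

int semaphore_demo(void) {
    sem_init(&s, 0, 1);     /* binary semaphore, initially 1 (free) */

    sem_wait(&s);           /* P(s): enter the critical section */
    shared_counter++;       /* ... critical section ... */
    sem_post(&s);           /* V(s): leave the critical section */

    sem_destroy(&s);
    return shared_counter;
}
```

With the initial value 1 this behaves as a mutual-exclusion lock; initializing with a larger value gives a general (counting) semaphore.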

Monitor

Disadvantage of semaphores:
• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors (such as deadlock) that are difficult to detect, since these errors happen only if some particular execution sequences take place, and these sequences do not always occur

• Incorrect use may be caused by an honest programming error or an uncooperative programmer


Monitor

Example of wrong semaphore use:

Right code:
…
Wait(mutex);
critical section
Signal(mutex);
…

Wrong code:
…
Signal(mutex);
critical section
Wait(mutex);
…

This wrong code causes a violation of mutual exclusion


Monitor

Example of wrong semaphore use:

Right code:
…
Wait(mutex);
critical section
Signal(mutex);
…

Wrong code:
…
Wait(mutex);
critical section
Wait(mutex);
…

This wrong code causes deadlock

Faculty of Computer Science & Engineering – Ho Chi Minh City University of Technology

Monitor

• If the programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock will occur

• With both processes simultaneously active, acquiring the semaphores in opposite orders causes a deadlock:

Process P1:
…
Wait(S); Wait(Q);
critical section
Signal(S); Signal(Q);

Process P2:
…
Wait(Q); Wait(S);
critical section
Signal(Q); Signal(S);


Monitor

• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type

• A monitor is an approach to synchronizing operations on a computer when a shared resource is used. A monitor includes:
  – a suite of procedures that provides the only method to access the shared resource
  – mutual exclusion among those procedures
  – the variables associated with the shared resource
  – invariants that are assumed to hold, to avoid conflicting events


Monitor

• A type, or abstract data type, encapsulates private data with public methods to operate on that data

• The monitor type presents a set of programmer-defined operations that are provided with mutual exclusion within the monitor

• The monitor type also contains the declarations of variables whose values define the state of an instance of that type, along with the bodies of procedures or functions that operate on those variables


Monitor

• The structure of a monitor type:

monitor monitor_name {
    /* shared variable declarations */
    procedure P1(...) { ... }
    procedure P2(...) { ... }
    ...
    procedure Pn(...) { ... }
    initialization_code(...) { ... }
}


Structure Monitor


Usage Monitor

• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly

• The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; additional "tailor-made" synchronization mechanisms need to be defined

• Some additional "tailor-made" synchronization uses the condition construct


The condition type

• Declaration: condition x, y;

• The only operations that can be invoked on a condition variable are wait() and signal():

x.wait() — the process invoking this operation is suspended until another process invokes x.signal()

x.signal() — the process invoking this operation resumes exactly one suspended process

• If no process is suspended, x.signal() has no effect


Structure of a monitor with condition variables

Condition variables

Condition variables:

• Allow threads to synchronize based upon the actual value of data

• Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met. This can be a very time-consuming & unproductive exercise, since the thread would be continuously busy in this activity

• Always used in conjunction with a mutex lock

Sharing Data

Pthread Condition Variables

pthread_cond_t cond; — declare a condition variable

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
Initializes the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialization, the state of the condition variable becomes initialized

int pthread_cond_signal(pthread_cond_t *cond);
Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond)

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t *cond);
Unblocks all threads currently blocked on the specified condition variable cond

int pthread_cond_destroy(pthread_cond_t *cond);
Destroys the given condition variable; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
Block on a condition variable; the second form allows a timeout to be specified

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

main:

#include <pthread.h>
int count = 0;                          /* global variable */
pthread_mutex_t count_mutex;            /* DECLARE */
pthread_cond_t count_threshold_cv;
…
pthread_mutex_init(&count_mutex, NULL); /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
…
pthread_create(…);                      /* CREATE THREADS TO DO WORK */

Threads 2 & 3:

void *inc_count(void *t) {
    …
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(…);
        …
        pthread_mutex_unlock(&count_mutex);
        /* do some work so threads can alternate on the mutex lock */
        sleep(1);
    }
    pthread_exit(NULL);
}

Thread 1:

void *watch_count(void *t) {
    pthread_mutex_lock(&count_mutex);
    if (count < COUNT_LIMIT)
        pthread_cond_wait(…);
    count += 125;
    pthread_mutex_unlock(…);
    pthread_exit(NULL);
}
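A self-contained, compilable version of the example above might look as follows. The thread counts and limits are illustrative, the sleep() is dropped, and the wait is wrapped in a while loop (rather than the slide's if) to guard against spurious wakeups and against the signal arriving before the watcher starts waiting:

```c
#include <pthread.h>

#define TCOUNT      5     /* increments per worker thread */
#define COUNT_LIMIT 7     /* threshold that wakes the watcher */

static int count = 0;
static pthread_mutex_t count_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  count_threshold_cv = PTHREAD_COND_INITIALIZER;

static void *inc_count(void *t) {
    (void)t;
    for (int i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(&count_threshold_cv);  /* wake the watcher */
        pthread_mutex_unlock(&count_mutex);
    }
    pthread_exit(NULL);
}

static void *watch_count(void *t) {
    (void)t;
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)                /* loop guards spurious wakeups */
        pthread_cond_wait(&count_threshold_cv, &count_mutex);
    count += 125;
    pthread_mutex_unlock(&count_mutex);
    pthread_exit(NULL);
}

/* Two workers add 2*TCOUNT = 10; the watcher adds 125 => 135. */
int run_condvar_demo(void) {
    pthread_t t[3];
    pthread_create(&t[0], NULL, watch_count, NULL);
    pthread_create(&t[1], NULL, inc_count, NULL);
    pthread_create(&t[2], NULL, inc_count, NULL);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    return count;
}
```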

Sharing Data

The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages, as in a message-passing environment

Creating Shared Data

• Each process has its own virtual address space within the virtual memory management system

• Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space

Creating Shared Data (cont.)

shmget() — creates a shared memory segment; the return value is the shared memory ID

shmat() — attaches the shared segment to the data segment of the calling process; returns the starting address of the segment
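A minimal sketch of the shmget()/shmat() sequence (the segment size, permissions, and helper name shm_demo are illustrative). After a fork(), both parent and child could access the attached segment directly as ordinary memory:

```c
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Create a private shared segment, attach it, write through the
   returned address, then detach and remove the segment.
   Returns 1 on success, -1 on failure. */
int shm_demo(void) {
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (shmid < 0)
        return -1;

    char *addr = (char *)shmat(shmid, NULL, 0);   /* attach */
    if (addr == (char *)-1)
        return -1;

    strcpy(addr, "shared");          /* ordinary memory access */
    int ok = (strcmp(addr, "shared") == 0);

    shmdt(addr);                     /* detach */
    shmctl(shmid, IPC_RMID, NULL);   /* remove the segment */
    return ok;
}
```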

Accessing Shared Data

• Accessing shared data needs careful control if the data is ever altered by a process

• Problem: CONFLICT
  – reading the variable by different processes does not cause conflict
  – but writing new values does

Example: consider two processes, each of which is to add 1 to a shared data item x:

Instruction    Process 1        Process 2
x = x + 1;     read x
               compute x + 1    read x
               write to x       compute x + 1
                                write to x
(time runs downward)

[Figure: conflict in accessing shared variable x — both processes read x, add 1, and write back, so one of the two updates is lost]

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time

This mechanism is called mutual exclusion

Locks

• The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock

• A lock is a variable containing the value 0 or 1:
  lock = 1 — a process has entered the critical section
  lock = 0 — no process is in the critical section

• The lock operates much like a door lock: suppose a process reaches a lock that is set, indicating that it is excluded from the critical section; it has to wait until it is allowed to enter

while (lock == 1) do_nothing;   /* no operation in while loop */
lock = 1;                       /* enter critical section */
 … critical section …
lock = 0;                       /* leave critical section */

• Such a lock is called a spin lock, and the mechanism is busy waiting

• In some cases it may be possible to deschedule the process from the processor and schedule another process; this incurs overhead in saving and restoring process information

• It is then necessary to choose the best or highest-priority process to enter the critical section

[Figure: two processes contending for a spin lock — Process 1 passes the while test, sets lock = 1, and enters its critical section; Process 2 spins in its while loop until Process 1 resets lock = 0, then sets lock = 1 and enters its own critical section]
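Note that the plain read-then-set sequence above is itself racy: two processes can both pass the while test before either sets the lock. A correct spin lock needs an atomic test-and-set; a C11 sketch (the names are illustrative):

```c
#include <stdatomic.h>

/* atomic_flag_test_and_set() atomically reads the old value and sets
   the flag, so only one thread can win the race into the critical
   section. */
static atomic_flag lock = ATOMIC_FLAG_INIT;
static int counter = 0;

static void spin_lock(void) {
    while (atomic_flag_test_and_set(&lock))
        ;                      /* busy-wait (spin) until the flag clears */
}

static void spin_unlock(void) {
    atomic_flag_clear(&lock);  /* lock = 0: leave critical section */
}

int locked_increment(void) {
    spin_lock();               /* enter critical section */
    counter++;
    spin_unlock();             /* leave critical section */
    return counter;
}
```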

Pthread Lock Routines

• Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes)

• They resolve synchronization problems among threads

• A mutex is used to grant threads access to shared resources in turn

• It provides mutual exclusion between threads

Note: a mutex can only be used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes


#include <pthread.h>   /* header containing the mutex functions */

Declaration: pthread_mutex_t mutex;

Static initialization:

mutex = PTHREAD_MUTEX_INITIALIZER;
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function:

int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);


Important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);

int pthread_mutex_unlock(pthread_mutex_t *mutex);

int pthread_mutex_trylock(pthread_mutex_t *mutex);

int pthread_mutex_destroy(pthread_mutex_t *mutex);
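Putting the routines together: a sketch of a mutex protecting a shared counter across several threads (the thread and iteration counts are illustrative). Without the lock/unlock pair, the concurrent count++ updates would conflict exactly as in the shared-x example earlier:

```c
#include <pthread.h>

#define NTHREADS 4
#define NITERS   10000

static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&count_lock);    /* enter critical section */
        counter++;                          /* shared update, now safe */
        pthread_mutex_unlock(&count_lock);  /* leave critical section */
    }
    return NULL;
}

/* Returns NTHREADS * NITERS: no increments are lost. */
long run_counter_demo(void) {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return counter;
}
```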


Dependency analysis

One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis


Example

Ex1:

forall (i = 0; i < 5; i++)
    a[i] = 0;

All instances can be executed simultaneously.

Ex2:

forall (i = 2; i < 6; i++) {
    x = i - 2*i + i*i;
    a[i] = a[x];
}

In this case it is not at all obvious whether different instances of the body can be executed simultaneously

Bernstein's Conditions

• I(n) is the set of memory locations read by process P(n)
• O(m) is the set of memory locations altered by process P(m)

If the three conditions below are all satisfied, the two processes can be executed concurrently:

I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅


Dependency analysis

Example 1: suppose the two statements are (in C):

a = x + y;
b = x + z;
------------------------------------------
I1 = (x, y)   I2 = (x, z)   O1 = (a)   O2 = (b)
------------------------------------------
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅
→ the two statements can be executed simultaneously


Dependency analysis

Example 2: suppose the two statements are (in C):

a = x + y;
b = a + b;
------------------------------------------
I1 = (x, y)   I2 = (a, b)   O1 = (a)   O2 = (b)
------------------------------------------
I2 ∩ O1 = (a) ≠ ∅
→ the two statements cannot be executed simultaneously
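Bernstein's conditions can be checked mechanically by testing the three intersections for emptiness. A small illustrative C helper over sets of variable names (the function names are assumptions, not a standard API):

```c
#include <string.h>

/* Returns 1 if the two sets of variable names share no element. */
static int disjoint(const char **a, int na, const char **b, int nb) {
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++)
            if (strcmp(a[i], b[j]) == 0)
                return 0;
    return 1;
}

/* Bernstein's conditions: P1 and P2 may execute concurrently iff
   I1 ∩ O2, I2 ∩ O1, and O1 ∩ O2 are all empty. */
int bernstein_ok(const char **i1, int n_i1, const char **o1, int n_o1,
                 const char **i2, int n_i2, const char **o2, int n_o2) {
    return disjoint(i1, n_i1, o2, n_o2) &&   /* I1 ∩ O2 = ∅ */
           disjoint(i2, n_i2, o1, n_o1) &&   /* I2 ∩ O1 = ∅ */
           disjoint(o1, n_o1, o2, n_o2);     /* O1 ∩ O2 = ∅ */
}
```

Applied to the two examples above: `a = x + y; b = x + z;` passes all three tests, while `a = x + y; b = a + b;` fails because I2 ∩ O1 = (a).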

Language Constructs for Parallelism

Shared data: in a parallel programming language supporting shared memory, a variable might be declared as:

shared int x;

In C/C++, a variable declared globally is shared among the threads of a process:

int x;   /* global — shared */

The par construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

par {
    S1;
    S2;
    …
    Sn;
}

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

par {
    proc1();
    proc2();
    …
    procn();
}

The forall construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) {
    S1;
    S2;
    …
    Sm;
}

which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, …, Sm. Each process uses a different value of i.

Example:

forall (i = 0; i < 5; i++)
    a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
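A forall has no direct C equivalent, but its effect can be sketched with one Pthread per iteration (the helper names and the array prefill are illustrative; joining all the threads plays the role of the implicit barrier at the end of the forall):

```c
#include <pthread.h>

#define N 5

static int a[N];

/* Body of the forall loop: each instance receives a different i. */
static void *body(void *arg) {
    long i = (long)arg;
    a[i] = 0;
    return NULL;
}

/* forall (i = 0; i < N; i++) a[i] = 0; — one thread per iteration.
   Prefills a[] with nonzero values so the clearing is observable,
   then returns the sum afterwards (0 if every element was cleared). */
int forall_clear(void) {
    pthread_t tid[N];
    int sum = 0;

    for (long i = 0; i < N; i++)
        a[i] = (int)i + 1;

    for (long i = 0; i < N; i++)
        pthread_create(&tid[i], NULL, body, (void *)i);
    for (long i = 0; i < N; i++)
        pthread_join(tid[i], NULL);      /* implicit barrier at loop end */

    for (int i = 0; i < N; i++)
        sum += a[i];
    return sum;
}
```

The iterations here are independent (they satisfy Bernstein's conditions pairwise), which is exactly what makes the forall legal.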

Shared Data in Systems with Caches


Shared Data in Systems with Caches

Cache coherence protocols:

• In the update policy, copies of data in all caches are updated at the time one copy is altered

• In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next references them


Shared Data in Systems with Caches

False sharing:

• The key characteristic involved is that caches are organized in blocks of contiguous locations

• Different parts of a block are required by different processors, but not the same bytes


Shared Data in Systems with Caches


Shared Data in Systems with Caches

Solution for false sharing:

• The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks

• The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array


Different types of memory architecture


Central memory versus distributed memory

• A parallel computer has either a central memory or a distributed memory architecture

• Distributed memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote memory access) architectures

• Central memory systems are also known as UMA (uniform memory access) systems


Central memory versus distributed memory

• In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time

• UMA systems come in two types:
  – PVP: the parallel vector processor, also called a vector supercomputer
  – SMP: the symmetric multiprocessor


Distributed-Memory architecture

• A distributed memory computer contains multiple nodes, each having one or more processors and a local memory

• Memories in other nodes are called remote memories

• Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA


Distributed-Memory architecture


Distributed-Memory architecture

In a NORMA machine:

• The node memories have separate address spaces

• A node can't directly access remote memory

• The only way to access remote data is by passing messages


Distributed-Memory architecture

In an NCC-NUMA machine:

• A typical example is the Cray T3E

• Besides the local memory, each node has a set of node-level registers called E-registers

• Other NCC-NUMA systems may allow loading a remote value directly into a processor register


Distributed-Memory architecture

In a COMA machine:

• All local memories are structured as caches (called COMA caches)

• Such a cache has a much larger capacity than the level-2 cache or the remote cache of a node

• COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories


NCC-NUMA versus CC-NUMA COMA

• An NCC-NUMA system doesn't have hardware support for cache coherency, whereas CC-NUMA and COMA provide cache coherency support in hardware

• It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system


CC-NUMA versus COMA

• CC-NUMA: main memory consists of all the local memories

• COMA: main memory consists of all the COMA caches

• All this complexity makes a COMA system more expensive to implement than a NUMA machine


Characteristics of the five distributed-memory architectures

LOGO

Page 3: Seminar Shared memory Programming

Sharing data3

Programming with shared memory

Creating Shared Data

Accessing Shared Data

Locks Deadlock

Semaphores Monitor Condition Variables

Language Constructs for Parallelism

Dependency Analysis4

Shared Data in system with caches

Shared Memory Multiprocessors

bull In a shared memory system any memory location can be accessible by any of the processors

bull A single address space exists meaning that each memory location is given a unique address within a single range of addresses

bull Shared-memory behavior is determined by both program order and memory access order

Shared memory multiprocessor

Shared Memory Multiprocessors

bull For a small number of processors a common architecture is the single bus architecture in which all processors and memory modules attach to the same set of wires(the bus)

Shared memory multiprocessor

Shared Memory Multiprocessors

Shared memory multiprocessor

ln

l5l4l3l2l1

Memory

l5 l5l4l3l2l1

l6J5

J4

J3

J2

J1

K5

K4

K3

K2

K1

K6

Programorder(PO)

PO1 PO2POn

Share Memory(A global memory oder)

Switch

(a) A uniprocessor system (b) A multiprocessor system

Constructs for Specifying Parallelism

Creating Concurrent Processes1

Threads2

Constructs for specifying Parallelism

Creating Concurrent Proceses

bull A structure to specifying concurrent processes is the FORK ndash JOIN group statements

bull A FORK statement generates one new path for a concurrent process and the concurrent process use JOIN statements at their ends

bull When JOIN statements have been reached processing continues in a sequential fashion

Constructs for specifying Parallelism

Creating Concurrent Proceses(cont)

Constructs for specifying Parallelism

UNIX Heavyweight Processes

bull Operating systems such as UNIX are based upon the notion of a process

bull On a single processor system the processor has to be time shared between processes switching from one process to another

bull Time sharing also offer the opportunity to deschedule processes that are blocked from proceeding for some reason such as waiting for an IO operation to complete

bull On a multiprocessors there is an opportunity to execute process truly concurrently

Constructs for specifying Parallelism

UNIX Heavyweight Processes

bull The UNIX system call fork() creates a new process The new process (child process) is an exact copy of the calling process except that it has a unique process ID

bull On success fork() returns 0 to the child process ang returns the process ID of the child process to the parent process

bull Process are ldquojoinedrdquo with the system calls wait() and exit() defined as

wait(statusp)delays caller until signal received or one of

itshellipchild process terminates or stophellip

exit(status)terminates a process

Constructs for specifying Parallelism

UNIX Heavyweight Processes

bull Hence a single child process can be created by

pid = fork() fork

hellip Code to be excuted by both child and parenthellip

if (pid == 0) exit(0)else wait(0) join

Constructs for specifying Parallelism

UNIX Heavyweight Processes

bull If the child is to execute different code we could use

pid = fork() if (pid == 0)

hellip code to be executed by slave hellip else

hellip code to be executed by parent hellipif (pid == 0) exit (0) else wait (0)

Constructs for specifying Parallelism

UNIX Heavyweight Processes

bull All variables in the original program are duplicated in each process becoming local variables for the process They are assigned the same values as the original variables initially

bull The parent will wait for the slave to finish if it reaches the ldquojoinrdquo point first if the slave reaches the ldquojoinrdquo point first it will terminate

Constructs for specifying Parallelism

Threads

Thread a thread of execution is a fork of a

computer program into two or more concurrently running tasks

Thread mechanism Allow to share the same memory space amp

global variables

Constructs for specifying Parallelism

Context Switching

Interaction

Address Space

State Infomation

Dependence bull processes are typically independent while threads exist as subsets of a process

bull processes carry considerable state information where multiple threads within a process share state as well as memory and other resources

bull processes have separate address spaces where threads share their address space

bullprocesses interact only through system-provided inter-process communication mechanisms while threads interact through shared variable or by message passing

bullContext switching between threads in the same process is typically faster than context switching between processes

Processes amp Threads

Interupt Routines

File

IP

Code Heap

Stack

IPStackInterupt Routines

File

IP

Code Heap

Stack Thread

Process

Constructs for specifying Parallelism

wwwthemegallerycom Company Logo

Multithreaded Processor Model

Analyze performance of system Latency L communication latency

experienced with remote memory access network delay cache-miss penalty delays caused by

contentions in split transactions

Number of threads N Number of thread that can be interleaved in a processor

Context of a thread =PCregister set required context status word hellip

Context switch overhead C time lost in performing context switch in a processor

Switch mechanism number of processor states needed to maintain active threads

Interval between context switches run length (cycles between context switch triggered by remote reference)

Multithreaded Computation

Initial Scheduling overhead Thread Synchronization overhead

Thread of Parallel Computation

Variable

Computation

The concept of multithreading in MPP system

Processor efficiency Busy do useful work Context switch suspend current

context amp switch to another Idle when all availble context

suspended (blocked)

Efficient = Busy (busy + switching + idle)

Abstract Processor Model

[Figure: multiple-context processor model with one thread per context: N contexts, each holding its own PC, PSW, and register file, feeding a single ALU; memory references are split into local and remote memory references.]

Context-switching policies

• Switch on cache miss: switch when encountering a cache miss.

• Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.

• Switch on every instruction: switch on every instruction, independent of whether or not it is a load.

• Switch on block of instructions: improves the cache-hit ratio due to preservation of some locality, and also benefits single-context performance.

Pthread Threads

History: SUN Solaris, Windows NT, … are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.

Standard: IEEE POSIX 1003.1c standard (1995).

Constructs for specifying Parallelism

Executing a Pthread Thread

    int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                       void *(*start_routine)(void *), void *arg);

Return value:
• success: a new thread is created and 0 is returned; *thread contains the new thread's ID;
• failure: a nonzero error code is returned.

Arguments:
• thread: a pointer of type pthread_t that receives the ID of the new thread;
• attr: initial attributes for the thread; if attr is NULL, the attributes are initialized to default values;
• start_routine: a reference to a function defined by the user, containing the code the new thread executes;
• arg: a single argument passed to start_routine.

pthread_t thread; declares a handle of the special Pthread datatype.

Executing a Pthread Thread (cont.)

pthread_exit(void *status); terminates and destroys the calling thread.

pthread_cancel(); requests that a thread be terminated by another thread.

    int pthread_join(pthread_t th, void **thread_return);

pthread_join() forces the calling thread to suspend its execution and wait until the thread with the given thread ID terminates. *thread_return receives the return value (the value of the return statement or of the pthread_exit(…) statement).

Detached Threads

There are cases in which threads can be terminated without the need for a pthread_join(); such threads are created as detached threads.

When detached threads terminate, they are destroyed and their resources are released immediately.

=> More efficient.

[Figure: the main program calls pthread_create() three times to spawn detached threads, each of which runs to termination without being joined.]

Constructs for specifying Parallelism

Thread Pools

A master thread could control a collection of slave threads. A work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.

Constructs for specifying Parallelism

Thread-safe routines

System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results.

Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.

Constructs for specifying Parallelism

Thread-safe routines (cont.)

Suppose that your application creates several threads, each of which makes a call to the same library routine:

• This library routine accesses/modifies a global structure or location in memory.

• As each thread calls this routine, it is possible that they may try to modify this global structure/memory location at the same time.

• If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.


Sharing Data

1. Creating Shared Data
2. Accessing Shared Data: locks, deadlock, semaphores, monitors, condition variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches

Sharing Data


Conditions for Deadlock

1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.

2. Hold-and-wait: a thread holds one resource while it waits for another.

3. No preemption: resources are released only voluntarily, after completion.

4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.

ALL four conditions MUST hold.

Handling Deadlock

• Deadlock prevention
• Deadlock avoidance
• Deadlock detection and recovery
• Ignore

Deadlock

Example: (a) deadlock between 2 processes; (b) n-process deadlock. R1, R2, …, Rn are resources; P1, P2, …, Pn are processes.

Semaphore

A semaphore is a positive integer (including zero) operated upon by two operations, P and V. Its value is the number of units of the resource which are free. A binary semaphore has value 0 or 1; a general semaphore can take on positive values other than 0 and 1.

P and V operations are performed indivisibly:

• P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.

• V(s): increments s by 1 to release one of the waiting processes (if any).

For a binary semaphore guarding a critical section: the first process to reach its P(s) operation sets the semaphore to 0 and enters. Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore. When a process reaches its V(s) operation, it sets s back to 1, and one of the waiting processes (if any) is allowed to proceed into the critical section.

Monitor

Disadvantages of semaphores:

• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors (such as deadlock) that are difficult to detect, since these errors happen only if particular execution sequences take place, and these sequences do not always occur.

• Incorrect use may be caused by an honest programming error or by an uncooperative programmer.

Monitor

Example of incorrect semaphore use. The right code is:

    ...
    Wait(mutex);
    /* critical section */
    Signal(mutex);
    ...

Wrong code that violates mutual exclusion:

    ...
    Signal(mutex);
    /* critical section */
    Wait(mutex);
    ...

Wrong code that causes deadlock:

    ...
    Wait(mutex);
    /* critical section */
    Wait(mutex);
    ...

Faculty of Computer Science & Engineering, HCMC University of Technology

Monitor

• If the programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock will occur.

• Acquiring two semaphores in opposite orders can also deadlock when both processes are simultaneously active, each holding one semaphore while waiting for the other:

    Process P1:              Process P2:
        ...                      ...
        Wait(S);                 Wait(Q);
        Wait(Q);                 Wait(S);
        /* critical section */   /* critical section */
        Signal(S);               Signal(Q);
        Signal(Q);               Signal(S);

Monitor

• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.

• A monitor is an approach to synchronizing operations on shared resources. A monitor includes:
  - a suite of procedures that provides the only way to access the shared resource;
  - a mutual-exclusion lock;
  - the variables associated with the shared resource;
  - invariants that must hold, to avoid race conditions.

Monitor

• A monitor is an abstract data type that encapsulates private data with public methods to operate on that data.

• A monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.

• The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of procedures or functions that operate on those variables.

Monitor

bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()

procedure P2()

procedure Pn()

initialization_code ()

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Using Monitors

• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.

• The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; additional "tailor-made" synchronization mechanisms need to be defined.

• These tailor-made synchronization mechanisms use the condition construct.

Condition type

• Declaration: condition x, y;

• The only operations that can be invoked on a condition variable are wait() and signal():
  - x.wait(): the process invoking this operation is suspended until another process invokes x.signal();
  - x.signal(): the process invoking this operation resumes exactly one suspended process; if no process is suspended, x.signal() has no effect.

[Figure: structure of a monitor with condition variables.]

Condition variables

• Condition variables allow threads to synchronize based upon the actual value of data.

• Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be very time-consuming and unproductive, since the thread would be continuously busy in this activity.

• A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

pthread_cond_t cond; declares a condition variable.

    int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);

Initializes the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialization, the state of the condition variable becomes initialized.

    int pthread_cond_signal(pthread_cond_t *cond);

Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

Sharing Data

Pthread Condition Variables (cont.)

    int pthread_cond_broadcast(pthread_cond_t *cond);

Unblocks all threads currently blocked on the specified condition variable cond.

    int pthread_cond_destroy(pthread_cond_t *cond);

Destroys the given condition variable; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

    int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
    int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex,
                               const struct timespec *abstime);

Both block on a condition variable; the second one allows a timeout to be specified.

Sharing Data

Sequence for using condition variables: example

This simple example code demonstrates the use of several Pthread condition variable routines. The main routine creates three threads. Two of the threads perform work and update a count variable. The third thread waits until the count variable reaches a specified value.

main:

    #include <pthread.h>
    int count = 0;                                /* global variable: DECLARE */
    pthread_mutex_t count_mutex;
    pthread_cond_t count_threshold_cv;
    ...
    pthread_mutex_init(&count_mutex, NULL);       /* INITIALIZE */
    pthread_cond_init(&count_threshold_cv, NULL);
    ...
    pthread_create(...);                  /* CREATE THREADS TO DO WORK */

Threads 2 and 3:

    void *inc_count(void *t)
    {
        ...
        for (i = 0; i < TCOUNT; i++) {
            pthread_mutex_lock(&count_mutex);
            count++;
            if (count == COUNT_LIMIT)
                pthread_cond_signal(...);
            pthread_mutex_unlock(&count_mutex);
            /* do some work so threads can alternate on the mutex lock */
            sleep(1);
        }
        pthread_exit(NULL);
    }

Thread 1:

    void *watch_count(void *t)
    {
        pthread_mutex_lock(&count_mutex);
        while (count < COUNT_LIMIT)   /* while, not if: re-check after wakeup */
            pthread_cond_wait(...);
        count += 125;
        pthread_mutex_unlock(...);
        pthread_exit(NULL);
    }

Sharing Data

The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages, as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system. Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget(): creates a shared memory segment; its return value is the shared memory ID.

shmat(): attaches the shared segment to the data segment of the calling process and returns the starting address of the segment.

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

The problem: conflict.

• Reading the variable by different processes does not cause a conflict.

• But writing new values can conflict.

Example: consider two processes, each of which is to add 1 to a shared data item x. Each must read x, compute x + 1, and write the result back:

    Instruction    Process 1        Process 2
    x = x + 1;     read x           read x
                   compute x + 1    compute x + 1
                   write to x       write to x
                                         (time runs downward)

With this interleaving, both processes read the same old value of x, so one of the two increments is lost: x ends up as x + 1 rather than x + 2.

[Figure: conflict in accessing shared variable x: two concurrent read / +1 / write sequences.]

The problem of accessing shared data can be generalized by considering shared resources.

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time. This mechanism is mutual exclusion.

Locks

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock. A lock is a variable containing the value 0 or 1:

• lock = 1: a process has entered the critical section;
• lock = 0: no process is in the critical section.

The lock operates much like a door lock. Suppose a process reaches a lock that is set, indicating that the process is excluded from the critical section: it now has to wait until it is allowed to enter.

    while (lock == 1) ;   /* busy wait: no operation in while loop */
    lock = 1;             /* enter critical section */
      ... critical section ...
    lock = 0;             /* leave critical section */

A lock that relies on such busy waiting is called a spin lock.

In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead, but this carries overhead in saving and restoring process information, and it is then necessary to choose the best or highest-priority process to enter the critical section.

[Figure: two processes contending for the lock: Process 1 finds lock == 0, sets lock = 1, executes its critical section, then sets lock = 0; meanwhile Process 2 spins in while (lock == 1), and only afterwards sets lock = 1 and enters its own critical section.]

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes). A mutex:

• resolves synchronization problems among threads;
• grants threads access to a shared resource one at a time;
• provides mutual exclusion among threads.

Note: a mutex can only synchronize threads within the same process; it cannot synchronize threads belonging to different processes.

#include <pthread.h> is the header declaring the mutex functions.

Declaration:

    pthread_mutex_t mutex;

Static initialization:

    mutex = PTHREAD_MUTEX_INITIALIZER;
    mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function:

    int pthread_mutex_init(pthread_mutex_t *mutex,
                           const pthread_mutexattr_t *mutexattr);

Important functions:

    int pthread_mutex_lock(pthread_mutex_t *mutex);
    int pthread_mutex_unlock(pthread_mutex_t *mutex);
    int pthread_mutex_trylock(pthread_mutex_t *mutex);
    int pthread_mutex_destroy(pthread_mutex_t *mutex);

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.

Example

Ex1:

    forall (i = 0; i < 5; i++)
        a[i] = 0;

All instances can be executed simultaneously.

Ex2:

    forall (i = 2; i < 6; i++) {
        x = i - 2*i + i*i;
        a[i] = a[x];
    }

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

• I(n) is the set of memory locations read by process P(n).
• O(m) is the set of memory locations altered by process P(m).

If the three conditions

    I1 ∩ O2 = ∅
    I2 ∩ O1 = ∅
    O1 ∩ O2 = ∅

are all satisfied, the two processes can be executed concurrently.

Dependency analysis

Example 1: suppose the two statements are (in C):

    a = x + y;
    b = x + z;

Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, so

    I1 ∩ O2 = I2 ∩ O1 = O1 ∩ O2 = ∅

and the two statements can be executed simultaneously.

Dependency analysis

Example 2: suppose the two statements are (in C):

    a = x + y;
    b = a + b;

Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, so

    I2 ∩ O1 = {a} ≠ ∅

and the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data: in a parallel programming language supporting shared memory, a variable might be declared as

    shared int x;

In C/C++ with threads, an ordinary global variable plays this role, since all threads of a process share global data.

par construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

    par {
        S1;
        S2;
        ...
        Sn;
    }

The keyword par indicates that the statements in the body are to be executed concurrently. Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

    par {
        proc1();
        proc2();
        ...
        procn();
    }

forall construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

    forall (i = 0; i < n; i++) {
        S1;
        S2;
        ...
        Sm;
    }

which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, …, Sm. Each process uses a different value of i. For example,

    forall (i = 0; i < 5; i++)
        a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.

Shared Data in Systems with Caches

Cache coherence protocols:

• In the update policy, copies of data in all caches are updated at the time one copy is altered.

• In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated. These copies are updated only when the associated processor next makes reference to the data.

Shared Data in Systems with Caches

False sharing:

• The key characteristic used is that caches are organized in blocks of contiguous locations.

• False sharing occurs when different processors require different parts of the same block, but not the same bytes: the block bounces between their caches even though no datum is actually shared.

[Figure: false sharing: two processors repeatedly writing to different words of the same cache block.]

Shared Data in Systems with Caches

Solution for false sharing:

• The compiler can alter the layout of the data stored in main memory, separating data altered by only one processor into different blocks.

• The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.

Different types of memory architecture

Central memory versus distributed memory

A parallel computer has either a central memory or a distributed memory architecture. Distributed memory systems include NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures. Central memory systems are also known as UMA (uniform memory access) systems.

Central memory versus distributed memory

In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time. UMA systems come in two types:

• PVP: the parallel vector processor, also called a vector supercomputer;
• SMP: the symmetric multiprocessor.

Distributed-Memory architecture

A distributed memory computer contains multiple nodes, each having one or more processors and a local memory. Memories in other nodes are called remote memories. Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.

[Figure: distributed-memory architecture.]

Distributed-Memory architecture

In a NORMA machine:

• the node memories have separate address spaces;
• a node can't directly access remote memory;
• the only way to access remote data is by passing messages.

Distributed-Memory architecture

In an NCC-NUMA machine (a typical example is the Cray T3E):

• besides the local memory, each node has a set of node-level registers called E-registers;
• other NCC-NUMA systems may allow loading a remote value directly into a processor register.

Distributed-Memory architecture

In a COMA machine:

• all local memories are structured as caches (called COMA caches);
• such a cache has much larger capacity than the level-2 cache or the remote cache of a node;
• COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.

NCC-NUMA versus CC-NUMA and COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache coherency support in hardware. It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.

CC-NUMA versus COMA

• CC-NUMA: main memory consists of all the local memories.
• COMA: main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.

[Table: characteristics of five distributed-memory architectures.]


Thread mechanism Allow to share the same memory space amp

global variables

Constructs for specifying Parallelism

Context Switching

Interaction

Address Space

State Infomation

Dependence bull processes are typically independent while threads exist as subsets of a process

bull processes carry considerable state information where multiple threads within a process share state as well as memory and other resources

bull processes have separate address spaces where threads share their address space

bullprocesses interact only through system-provided inter-process communication mechanisms while threads interact through shared variable or by message passing

bullContext switching between threads in the same process is typically faster than context switching between processes

Processes amp Threads

Interupt Routines

File

IP

Code Heap

Stack

IPStackInterupt Routines

File

IP

Code Heap

Stack Thread

Process

Constructs for specifying Parallelism

wwwthemegallerycom Company Logo

Multithreaded Processor Model

Analyze performance of system Latency L communication latency

experienced with remote memory access network delay cache-miss penalty delays caused by

contentions in split transactions

Number of threads N Number of thread that can be interleaved in a processor

Context of a thread =PCregister set required context status word hellip

Context switch overhead C time lost in performing context switch in a processor

Switch mechanism number of processor states needed to maintain active threads

Interval between context switches run length (cycles between context switch triggered by remote reference)

Multithreaded Computation

Initial Scheduling overhead Thread Synchronization overhead

Thread of Parallel Computation

Variable

Computation

The concept of multithreading in MPP system

Processor efficiency Busy do useful work Context switch suspend current

context amp switch to another Idle when all availble context

suspended (blocked)

Efficient = Busy (busy + switching + idle)

Abtract Processor Model

wwwthemegallerycom Company Logo

Multiple-context processor model with one thread per context

PC

PSW

PC

PSW

PC

PSW

ALU Local memory reference

Remote memory reference

Register Files

N Contexts

1 Thread context

Context-switching policies

wwwthemegallerycom Company Logo

Switch on cache miss when encoutering a cache miss

Switch on every load switching on every load operation independent of whether it will cause a miss or not

Switch on every instruction switching on every instruction insependent of whether or not it is a load

Switch on block of instruction will improve the cache-hit ratio due to preservation of some locality amp also benefit singe-context performance

Pthread Thread

History SUN Solaris Window NT hellip are examples the multithreaded operating systems allow users to employ threads in their programsbut each system is different

StandardIEEE POSIX 10031c

standard (1995)

Constructs for specifying Parallelism

Executing a Pthread Thread

int pthread_create(pthread_t thread pthread_attr_t attr

void (start_routine)(void ) void arg) Return value of function1048698success create a new thread 0 arg contain thread ID1048698fail ltgt 0 (error signature is contain in the variable of errno) Argumentsthread a pointer of type pthread_t contain ID of new thread1048698attr contain initial attr for thread if attr = NULL Attr are init by default value1048698start_routine la reference to a function defined by user This function contain

execute code for new thread1048698arg a single argument is passed for start_routine

Pthread_t thread Hanndle of specia Pthread datatype

Executing a Pthread Thread(cont)

pthread_exit(void status) Terminate amp destroy a thread

pthread_cancel() Thread is destroyed by another process

int pthread_join(pthread_t th void thread_return) pthread_join() force call thread suspend its execution amp wait until thread having

thread ID terminates Thread_return contain the return value (value of return statement or pthread_exit(hellip) statement)

Detached ThreadThere are cases in which threads can

be terminated without needed of pthread_join

Detached Thread

When Detached Thread teminate they are destroyed amp their resource released

=gt More efficient

Main program

Pthread_create()

Termination

Thread

Pthread_create()

Pthread_create() Termination

Termination

Thread

Thread

Constructs for specifying Parallelism

Thread Pools

A master thread could control a collection of slave threads A work pool of threads could be formed Threads can communicate through shared location or as we shall see = using signals

Constructs for specifying Parallelism

Thread ndash safe routines

System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results

Thread ndash safeness an applications ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions

Constructs for specifying Parallelism

Thread ndash safe routines (cont)

Suppose that your application creates several threads each of which makes a call to the same library routine

This library routine accessesmodifies a global structure or location in memory

As each thread calls this routine it is possible that they may try to modify this global structurememory location at the same time

If the routine does not employ some sort of synchronization constructs to prevent data corruption then it is not thread-safe

fe

Constructs for specifying Parallelism

Sharing DataCreating Shared Data1

Accessing Shared Data2

Locks

Deadlock

Semaphores

Deadlock

Condition Variables

Language Constructs for Parallelism3

Dependency Analysis4

Shared Data in system with caches5

Sharing Data


Conditions for Deadlock

1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.

2. Hold-and-wait: a thread holds one resource while it waits for another.

3. No preemption: resources are released only voluntarily, after completion.

4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graphs.

ALL four conditions MUST hold.


Handling Deadlock

Deadlock prevention

Deadlock avoidance

Deadlock detection and recovery

Ignore


Deadlock

Figure 8.8a: two-process deadlock. Figure 8.8b: n-process deadlock.

R1, R2, ..., Rn are resources; P1, P2, ..., Pn are processes.

Example


Semaphore

Semaphore

A semaphore is a positive integer operated upon by two operations, P & V.

Its value is the number of units of the resource that are free.

A binary semaphore has a value of 0 or 1. A general semaphore can take on positive values other than 0 and 1.

Semaphore

P & V operations are performed indivisibly:

P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.

V(s): increments s by 1 to release one of the waiting processes (if any).

Semaphore

The first process to reach its P(s) operation and be accepted sets the semaphore to 0.

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.

Semaphore

When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section

Monitor

Disadvantage of semaphores:

- Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (e.g. deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.

- Incorrect use may be caused by an honest programming error or by an uncooperative programmer.


Monitor

Example: wrong use of a semaphore

Right code:
...
Wait(mutex);
  critical section
Signal(mutex);
...

Wrong code:
...
Signal(mutex);
  critical section
Wait(mutex);
...

This wrong code violates mutual exclusion.


Monitor

Example: wrong use of a semaphore

Right code:
...
Wait(mutex);
  critical section
Signal(mutex);
...

Wrong code:
...
Wait(mutex);
  critical section
Wait(mutex);
...

This wrong code causes deadlock.

Faculty of Computer Science & Engineering - Ho Chi Minh City University of Technology

Monitor

- If a programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock will occur.

- If both processes below are simultaneously active, a deadlock can occur:

Process P1: ... Wait(S); Wait(Q); critical section; Signal(S); Signal(Q)

Process P2: ... Wait(Q); Wait(S); critical section; Signal(Q); Signal(S)


Monitor

- To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.

- A monitor is an approach to synchronizing operations on shared resources. A monitor includes:
  - a set of procedures that provide the only way to access a shared resource;
  - mutual exclusion among those procedures;
  - variables associated with the shared resource;
  - invariants assumed in order to avoid conflicts.


Monitor

- A type, or abstract data type, encapsulates private data with public methods to operate on that data.

- The monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.

- The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of procedures or functions that operate on those variables.


Monitor

- A structure of the monitor type:

monitor monitor_name {
    /* shared variable declarations */
    procedure P1(...) { ... }
    procedure P2(...) { ... }
    ...
    procedure Pn(...) { ... }
    initialization_code(...) { ... }
}


Structure of a monitor (figure)


Monitor usage

- The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.

- The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; some additional "tailor-made" synchronization mechanisms need to be defined.

- Such additional "tailor-made" synchronization uses the condition construct.


Condition type

- Declaration: condition x, y;

- The only operations that can be invoked on a condition variable are wait() and signal():
  x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
  x.signal(): the process invoking this operation resumes exactly one suspended process.

- If no process is suspended, x.signal() has no effect.


Structure of a monitor with condition variables (figure)

Condition variables

Condition variables allow threads to synchronize based upon the actual value of data.

Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be a very time-consuming and unproductive exercise, since the thread would be continuously busy in this activity.

A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified

condition variable cond (if any threads are blocked on cond)

int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)

initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised

pthread_cond_t cond;  /* declare a condition variable */

Sharing Data

Pthread Condition Variables (cont.)

int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable

cond

int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in

effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined

int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)

int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)

Both block on a condition variable; the second additionally allows specifying a timeout (abstime).

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines. The main routine creates three threads. Two of the threads perform work and update a count variable. The third thread waits until the count variable reaches a specified value.

Sharing Data

Sequence for using condition variable - example (cont)

main:

    #include <pthread.h>
    int count = 0;                            /* global variable: DECLARE */
    pthread_mutex_t count_mutex;
    pthread_cond_t count_threshold_cv;
    ...
    pthread_mutex_init(&count_mutex, NULL);   /* INITIALIZE */
    pthread_cond_init(&count_threshold_cv, NULL);
    ...
    pthread_create(...);                      /* CREATE THREADS TO DO WORK */

Threads 2 & 3:

    void *inc_count(void *t) {
        ...
        for (i = 0; i < TCOUNT; i++) {
            pthread_mutex_lock(&count_mutex);
            count++;
            if (count == COUNT_LIMIT)
                pthread_cond_signal( ... );
            ...
            pthread_mutex_unlock(&count_mutex);
            sleep(1);   /* do some work so the threads can alternate on the mutex lock */
        }
        pthread_exit(NULL);
    }

Thread 1:

    void *watch_count(void *t) {
        pthread_mutex_lock(&count_mutex);
        while (count < COUNT_LIMIT)     /* 'while', not 'if': re-check after wakeup */
            pthread_cond_wait( ... );
        count += 125;
        pthread_mutex_unlock( ... );
        pthread_exit(NULL);
    }

Sharing Data

The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget(): creates a shared memory segment; the return value is the shared memory ID.

shmat(): attaches the shared segment to the data segment of the calling process; returns the starting address of the segment.

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: conflict.

o Reading the variable by different processes does not cause a conflict.

o But writing new values does cause a conflict.

Ex: Consider two processes, each of which is to add 1 to a shared data item x:

Instruction    Process 1        Process 2
x = x + 1      read x           read x
               compute x + 1    compute x + 1
               write to x       write to x
(time increases downward)

(Figure: conflict in accessing shared data - both processes read the shared variable x, add 1, and write the result back, so one of the two increments is lost.)

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This mechanism is called mutual exclusion.

Lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:

lock = 1: a process has entered the critical section.

lock = 0: no process is in the critical section.

The lock operates much like a door lock.

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing;   /* no operation in while loop */
lock = 1;                       /* enter critical section */
...
  critical section
...
lock = 0;                       /* leave critical section */

A lock of this kind is called a spin lock.

Mechanism: busy waiting.

In some cases it may be possible to deschedule the process from the processor and schedule another process instead. This incurs overhead in saving and restoring process information, and it is necessary to choose the best or highest-priority process to enter the critical section.

(Figure: Process 1 finds lock == 0, sets lock = 1, and enters its critical section while Process 2 spins in the while loop; when Process 1 sets lock = 0 on leaving, Process 2 sets lock = 1 and enters its critical section.)

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes), which resolve thread synchronization problems.

A mutex is used to share resources among threads in an orderly fashion.

It provides mutual exclusion among threads.

Note:

A mutex can only synchronize threads within the same process; it cannot synchronize threads belonging to different processes.


#include <pthread.h>   /* the header containing the mutex functions */

Declare a variable: pthread_mutex_t mutex;

Static initialization:

mutex = PTHREAD_MUTEX_INITIALIZER;

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function:

int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);


Important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);

int pthread_mutex_unlock(pthread_mutex_t *mutex);

int pthread_mutex_trylock(pthread_mutex_t *mutex);

int pthread_mutex_destroy(pthread_mutex_t *mutex);


Dependency analysis

One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.


Example

Ex1:

forall (i = 0; i < 5; i++)
    a[i] = 0;

All instances can be executed simultaneously.

Ex2:

forall (i = 2; i < 6; i++) {
    x = i - 2*i + i*i;
    a[i] = a[x];
}

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

- I(n) is the set of memory locations read by process P(n).

- O(m) is the set of memory locations altered by process P(m).

If the three conditions

I1 ∩ O2 = ∅,  I2 ∩ O1 = ∅,  O1 ∩ O2 = ∅

are all satisfied, the two processes can be executed concurrently.


Dependency analysis

Example 1: Suppose the two statements are (in C):

a = x + y;
b = x + z;
------------------------------------------
I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}
-----------------------
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅

=> the two statements can be executed simultaneously.


Dependency analysis

Example 2: Suppose the two statements are (in C):

a = x + y;
b = a + b;
------------------------------------------
I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}
-----------------------
I2 ∩ O1 ≠ ∅

=> the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data

In a parallel programming language supporting shared memory, a variable might be declared as:

shared int x;

rather than as a plain int x as in C/C++.

par construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

par {
    S1;
    S2;
    ...
    Sn;
}

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

par {
    proc1();
    proc2();
    ...
    procn();
}

forall construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) {
    S1;
    S2;
    ...
    Sm;
}

which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, ..., Sm. Each process uses a different value of i.

Example:

forall (i = 0; i < 5; i++)
    a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.


Shared Data in Systems with Caches


Share Data in Systems with Caches

Cache coherence protocols:

- In the update policy, copies of data in all caches are updated at the time one copy is altered.

- In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next makes reference to the data.


Share Data in Systems with Caches

False sharing:

The key characteristic here is that caches are organized in blocks of contiguous locations. False sharing occurs when different processors require different parts of the same block, but not the same bytes.


(Figure: false sharing - different processors update different parts of the same cache block.)

Share Data in Systems with Caches

Solutions for false sharing:

- The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

- The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems


Central memory versus distributed memory

In a UMA system, all memory locations are an equal distance from any processor, and all memory accesses take roughly the same amount of time.

UMA systems have two types:

PVP: the parallel vector processor, also called a vector supercomputer.

SMP: the symmetric multiprocessor.


Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.


Distributed-Memory architecture


Distributed-Memory architecture

In a NORMA machine:

The node memories have separate address spaces.

A node can't directly access remote memory.

The only way to access remote data is by passing messages.


Distributed-Memory architecture

In an NCC-NUMA machine:

A typical example is the Cray T3E.

Besides the local memory, each node has a set of node-level registers called E-registers.

Other NCC-NUMA systems may allow loading a remote value directly into a processor register.


Distributed-Memory architecture

In a COMA machine:

All local memories are structured as caches (called COMA caches).

Such a cache has much larger capacity than the level-2 cache or the remote cache of a node.

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.


CC-NUMA versus COMA

CC-NUMA: main memory consists of all the local memories.

COMA: main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.


Characteristics of five distributed-memory architectures (table)


pthread_cancel() Thread is destroyed by another process

int pthread_join(pthread_t th void thread_return) pthread_join() force call thread suspend its execution amp wait until thread having

thread ID terminates Thread_return contain the return value (value of return statement or pthread_exit(hellip) statement)

Detached ThreadThere are cases in which threads can

be terminated without needed of pthread_join

Detached Thread

When Detached Thread teminate they are destroyed amp their resource released

=gt More efficient

Main program

Pthread_create()

Termination

Thread

Pthread_create()

Pthread_create() Termination

Termination

Thread

Thread

Constructs for specifying Parallelism

Thread Pools

A master thread could control a collection of slave threads A work pool of threads could be formed Threads can communicate through shared location or as we shall see = using signals

Constructs for specifying Parallelism

Thread ndash safe routines

System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results

Thread ndash safeness an applications ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions

Constructs for specifying Parallelism

Thread ndash safe routines (cont)

Suppose that your application creates several threads each of which makes a call to the same library routine

This library routine accessesmodifies a global structure or location in memory

As each thread calls this routine it is possible that they may try to modify this global structurememory location at the same time

If the routine does not employ some sort of synchronization constructs to prevent data corruption then it is not thread-safe

fe

Constructs for specifying Parallelism

Sharing DataCreating Shared Data1

Accessing Shared Data2

Locks

Deadlock

Semaphores

Deadlock

Condition Variables

Language Constructs for Parallelism3

Dependency Analysis4

Shared Data in system with caches5

Sharing Data

Text in here

Text in here

Conditions for Deadlock

1 Mutual exclusion bull Resource can not be shared bull Requests are delayed until resource is released

2 Hold-and-wait bull Thread holds one resource while waits for another

3 No preemption bull Resources are released voluntarily after completion

4 Circular wait bull Circular dependencies exist in ldquowaits-forrdquo or ldquoresource-allocationrdquo graphs

ALL four conditions MUST hold

wwwthemegallerycom Company Logo

Handing Deadlock

Text

Text

Text

Txt

Deadlock prevention Deadlock avoidance

Deadlock detection and recovery

Ignore

wwwthemegallerycom Company Logo

Deadlock

88b n-processes deadlock 88a 2 processes dealock

R1R2hellipRn lagrave tagravei nguyecircn P1P2hellipPn lagrave quaacute trigravenh

Example

LOGO

Semaphore

wwwthemegallerycom Company Logo

sEMaPHoRE

A positive integer operated upon by 2 operations P amp V

The value is the number of the units of the resource which are free

A binary semaphore has value 0 or 1A general semaphore can take on

positive values other than 0 and 1

wwwthemegallerycom Company Logo

sEMaPHoRE

P amp V operations are performed indivisibly

P(s) waits until s is greater than 0 then decrements s by 1 allows the process to continue

V(s) increments s by 1 to release one of the waiting processes (if any)

wwwthemegallerycom Company Logo

sEMaPHoRE

The first process reach its P(s) operation or to be accepted will set the semaphore to 0

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore

wwwthemegallerycom Company Logo

sEMaPHoRE

When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section

Monitor

Disadvange of Semaphorebull Although semaphore provide a

convenient and effective mechanism for process synchronization using them incorrectly can result in timing error that are difficult to detect (deadlock) since these errors happen only if some particular execution sequences take place and these sequences do not always occur

bull Using incorrectly may be caused by an honest programming error or an uncooperative programmer

Shared memory multiprocessor

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur

bull Both processes are simultaneously active will cause a deadlock

Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type

bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only

method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events

conflict

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

• A type, or abstract data type, encapsulates private data with public methods to operate on that data.

• The monitor type presents a set of programmer-defined operations that are provided with mutual exclusion within the monitor.

• The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.

Monitor

• The structure of a monitor type:

    monitor monitor_name
    {
        /* shared variable declarations */
        procedure P1 (...) { ... }
        procedure P2 (...) { ... }
        ...
        procedure Pn (...) { ... }
        initialization_code (...) { ... }
    }

Structure Monitor


Usage Monitor

• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.

• The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; some additional, tailor-made synchronization mechanisms need to be defined.

• Such additional tailor-made synchronization uses the condition construct.

Condition type

• Declaration: condition x, y;
• The only operations that can be invoked on a condition variable are wait() and signal():
    x.wait(): the process invoking this operation is suspended until another process invokes x.signal()
    x.signal(): the process invoking this operation resumes exactly one suspended process
• If no process is suspended, x.signal() has no effect.

Structure Monitor conditional type
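Since the structure figure is not reproduced in this transcript, a hedged C sketch can stand in for it: a Pthreads mutex supplies the monitor's implicit mutual exclusion, and a condition variable plays the role of a declared condition inside it. The `counter_monitor` type and its procedure names are invented for illustration, not taken from the slides.

```c
#include <pthread.h>

/* A hand-built "monitor": the mutex gives the monitor's implicit
   mutual exclusion; the condition variable acts as `condition nonzero`. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  nonzero;   /* condition: value > 0 */
    int value;
} counter_monitor;

void cm_init(counter_monitor *m) {
    pthread_mutex_init(&m->lock, NULL);
    pthread_cond_init(&m->nonzero, NULL);
    m->value = 0;
}

void cm_increment(counter_monitor *m) {            /* monitor procedure */
    pthread_mutex_lock(&m->lock);
    m->value++;
    pthread_cond_signal(&m->nonzero);              /* nonzero.signal() */
    pthread_mutex_unlock(&m->lock);
}

int cm_decrement_when_positive(counter_monitor *m) /* monitor procedure */
{
    pthread_mutex_lock(&m->lock);
    while (m->value == 0)
        pthread_cond_wait(&m->nonzero, &m->lock);  /* nonzero.wait() */
    int v = --m->value;
    pthread_mutex_unlock(&m->lock);
    return v;
}
```

Every "monitor procedure" takes the same mutex on entry and releases it on exit, which is exactly the one-active-process-at-a-time guarantee described above.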

Condition variables

Condition variables:
• Allow threads to synchronize based upon the actual value of data.
• Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be a very time-consuming & unproductive exercise, since the thread would be continuously busy in this activity.
• Always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

pthread_cond_t cond;   /* declare a condition variable */

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
    Initialises the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.

int pthread_cond_signal(pthread_cond_t *cond);
    Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t *cond);
    Unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
    Destroys the given condition variable specified by cond; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
    Block on a condition variable; the second form additionally allows an absolute timeout (abstime) to be specified.

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

main:

    #include <pthread.h>
    int count = 0;                            /* global var: DECLARE */
    pthread_mutex_t count_mutex;
    pthread_cond_t count_threshold_cv;
    ...
    pthread_mutex_init(&count_mutex, NULL);   /* INITIALIZE */
    pthread_cond_init(&count_threshold_cv, NULL);
    ...
    pthread_create(...);                      /* CREATE THREADS TO DO WORK */

Threads 2, 3:

    void *inc_count(void *t)
    {
        ...
        for (i = 0; i < TCOUNT; i++) {
            pthread_mutex_lock(&count_mutex);
            count++;
            if (count == COUNT_LIMIT)
                pthread_cond_signal( ... );
            ...
            pthread_mutex_unlock(&count_mutex);
            /* Do some work so threads can alternate on the mutex lock */
            sleep(1);
        }
        pthread_exit(NULL);
    }

Thread 1:

    void *watch_count(void *t)
    {
        pthread_mutex_lock(&count_mutex);
        while (count < COUNT_LIMIT)           /* re-test the condition: use while, not if */
            pthread_cond_wait( ... );
        count += 125;
        pthread_mutex_unlock( ... );
        pthread_exit(NULL);
    }

Sharing Data

The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont)

shmget(): creates a shared memory segment; the return value is the shared memory ID.

shmat(): attaches the shared segment to the data segment of the calling process and returns the starting address of the segment.

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: CONFLICT

o Reading the variable by different processes does not cause a conflict.

o But writing a new value does.

Ex: consider two processes, each of which is to add 1 to a shared data item x.

    Instruction     Process 1          Process 2
    x = x + 1;      read x             read x
                    compute x + 1      compute x + 1
                    write to x         write to x
                                       (time increases downward)

Conflict in accessing shared data

[Figure: both processes read the shared variable x, each computes x + 1, and both write back; one of the two +1 updates is lost]

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This mechanism is called mutual exclusion.

lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:

lock = 1: a process has entered the critical section

lock = 0: no process is in the critical section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

    while (lock == 1) ;     /* do nothing: no operation in while loop */
    lock = 1;               /* enter critical section */
    ... critical section ...
    lock = 0;               /* leave critical section */

A lock that relies on busy waiting like this is called a spin lock.

Mechanism: busy waiting

• In some cases it may be possible to deschedule the process from the processor and schedule another process instead.

• There is overhead in saving and restoring process information.

• It is necessary to choose the best or highest-priority process to enter the critical section.

    Process 1                      Process 2

    while (lock == 1) ;
    lock = 1;                      while (lock == 1) ;   /* spins */
    /* critical section */
    lock = 0;                      lock = 1;
                                   /* critical section */
                                   lock = 0;
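Note that the read-then-set sequence in the sketch above is not atomic: two processes can both observe lock == 0 and enter the critical section together. A working spin lock needs an atomic test-and-set; this is a hedged C11 sketch (not the slides' code) using atomic_flag.

```c
#include <stdatomic.h>
#include <pthread.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long counter = 0;

/* test_and_set atomically reads the old value and sets the flag,
   so only one thread can win the race into the critical section */
static void spin_lock(void)   { while (atomic_flag_test_and_set(&lock)) ; }
static void spin_unlock(void) { atomic_flag_clear(&lock); }

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_lock();
        counter++;              /* critical section */
        spin_unlock();
    }
    return NULL;
}

long run_spinlock_demo(void) {
    pthread_t t1, t2;
    counter = 0;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;             /* 200000 with the lock; typically less without it */
}
```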

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (MUTEX):

• They resolve synchronization problems between threads.

• A mutex is used to share resources among threads in an ordered way.

• It provides mutual exclusion among threads.

Note:

A mutex is only used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.


#include <pthread.h>   /* header containing the mutex functions */

Declare the variable: pthread_mutex_t mutex;

Static initialization:

    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

    pthread_mutex_t mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER_NP;   /* GNU extension */

Initialization by function:

    int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);


Important functions:

    int pthread_mutex_lock(pthread_mutex_t *mutex);

    int pthread_mutex_unlock(pthread_mutex_t *mutex);

    int pthread_mutex_trylock(pthread_mutex_t *mutex);

    int pthread_mutex_destroy(pthread_mutex_t *mutex);
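A minimal usage sketch of these routines: two threads increment a shared counter under the mutex. The function names `add_many` and `run_mutex_demo` are invented for illustration.

```c
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static int shared = 0;

static void *add_many(void *arg) {
    int n = *(int *)arg;
    for (int i = 0; i < n; i++) {
        pthread_mutex_lock(&m);     /* enter critical section */
        shared++;
        pthread_mutex_unlock(&m);   /* leave critical section */
    }
    return NULL;
}

int run_mutex_demo(void) {
    pthread_t t1, t2;
    int n = 50000;
    shared = 0;
    pthread_create(&t1, NULL, add_many, &n);
    pthread_create(&t2, NULL, add_many, &n);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_mutex_destroy(&m);      /* release the mutex once it is no longer needed */
    return shared;                  /* 100000: no increments lost */
}
```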


Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together. Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.

Example

Ex1:
    forall (i = 0; i < 5; i++)
        a[i] = 0;

All instances can be executed simultaneously.

Ex2:
    forall (i = 2; i < 6; i++) {
        x = i - 2*i + i*i;
        a[i] = a[x];
    }

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

LOGO Bernsteinrsquos condition

- I(n) is the set of memory locations read by process P(n).

- O(m) is the set of memory locations altered by process P(m).

If the following three conditions are all satisfied, the two processes can be executed concurrently:

    I1 ∩ O2 = ∅
    I2 ∩ O1 = ∅
    O1 ∩ O2 = ∅

Dependency analysis

Example 1: suppose the two statements are (in C):

    a = x + y;
    b = x + z;

Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}.

I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅

→ the two statements can be executed simultaneously.

Dependency analysis

Example 2: suppose the two statements are (in C):

    a = x + y;
    b = a + b;

Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}.

I2 ∩ O1 = {a} ≠ ∅

→ the two statements cannot be executed simultaneously.

Language Contructs for Parallelism

Shared data: in a parallel programming language supporting shared memory, a variable might be declared as:

    shared int x;

in contrast to an ordinary declaration such as int x; in C/C++.

par Construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

    par {
        S1;
        S2;
        ...
        Sn;
    }

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

    par {
        proc1();
        proc2();
        ...
        procn();
    }

forall Construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

    forall (i = 0; i < n; i++) {
        S1;
        S2;
        ...
        Sm;
    }

which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, …, Sm. Each process uses a different value of i. For example:

    forall (i = 0; i < 5; i++)
        a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.

LOGO

Share DATA in systems with Caches


Share Data in Systems with Caches

Cache coherence protocols:

• In the update policy, copies of data in all caches are updated at the time one copy is altered.

• In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor references them.


Share Data in Systems with Caches

False sharing:

• The key characteristic here is that caches are organized in blocks of contiguous locations.

• False sharing occurs when different parts of a block are required by different processors, but not the same bytes.

[Figure: false sharing - two processors repeatedly update different words of the same cache block, causing the block to bounce between their caches]

Share Data in Systems with Caches

Solution for false sharing:

• The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

• The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems


Central memory versus distributed memory

In a UMA, all memory locations are equally distant from any processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types:

PVP: the parallel vector processor, also called a vector supercomputer

SMP: the symmetric multiprocessor


Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

[Figure: a distributed-memory computer - multiple nodes, each with one or more processors and a local memory, connected by an interconnection network]

Distributed-Memory architecture

In a NORMA machine:

• The node memories have separate address spaces.

• A node can't directly access remote memory.

• The only way to access remote data is by passing messages.


Distributed-Memory architecture

In an NCC-NUMA machine:

• A typical example is the Cray T3E.

• Besides the local memory, each node has a set of node-level registers called E-registers.

• Other NCC-NUMA systems may allow loading a remote value directly into a processor register.


Distributed-Memory architecture

In a COMA machine:

• All local memories are structured as caches (called COMA caches).

• A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node.

• COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.


CC-NUMA versus COMA

CC-NUMA: main memory consists of all the local memories.

COMA: main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.


Characteristics of five distributed-memory architectures [comparison table not reproduced in this transcript]

LOGO

Page 6: Seminar Shared memory Programming

Shared Memory Multiprocessors

Shared memory multiprocessor

[Figure: (a) a uniprocessor system, where a single program order determines the memory order; (b) a multiprocessor system, where the program orders PO1 … POn pass through a switch to form a single global memory order in shared memory]

Constructs for Specifying Parallelism

1. Creating Concurrent Processes

2. Threads

Constructs for specifying Parallelism

Creating Concurrent Proceses

• A structure for specifying concurrent processes is the FORK-JOIN group of statements.

• A FORK statement generates one new path for a concurrent process, and the concurrent processes use JOIN statements at their ends.

• When both JOIN statements have been reached, processing continues in a sequential fashion.

Constructs for specifying Parallelism

Creating Concurrent Processes (cont)

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• Operating systems such as UNIX are based upon the notion of a process.

• On a single-processor system, the processor has to be time-shared between processes, switching from one process to another.

• Time sharing also offers the opportunity to deschedule processes that are blocked from proceeding for some reason, such as waiting for an I/O operation to complete.

• On a multiprocessor, there is an opportunity to execute processes truly concurrently.

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• The UNIX system call fork() creates a new process. The new process (the child process) is an exact copy of the calling process, except that it has a unique process ID.

• On success, fork() returns 0 to the child process and returns the process ID of the child process to the parent process.

• Processes are "joined" with the system calls wait() and exit(), defined as:

    wait(statusp)   delays the caller until a signal is received or one of its child processes terminates or stops

    exit(status)    terminates a process

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• Hence, a single child process can be created by:

    pid = fork();                           /* fork */
    ... code to be executed by both child and parent ...
    if (pid == 0) exit(0); else wait(0);    /* join */

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• If the child is to execute different code, we could use:

    pid = fork();
    if (pid == 0) {
        ... code to be executed by child (slave) ...
    } else {
        ... code to be executed by parent ...
    }
    if (pid == 0) exit(0); else wait(0);

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• All variables in the original program are duplicated in each process, becoming local variables for the process. They are initially assigned the same values as the original variables.

• The parent will wait for the slave to finish if it reaches the "join" point first; if the slave reaches the "join" point first, it will terminate.

Constructs for specifying Parallelism

Threads

Thread: a thread of execution is a fork of a computer program into two or more concurrently running tasks.

Thread mechanism: threads share the same memory space & global variables.

Constructs for specifying Parallelism

• Dependence: processes are typically independent, while threads exist as subsets of a process.

• State information: processes carry considerable state information, whereas multiple threads within a process share state as well as memory and other resources.

• Address space: processes have separate address spaces, whereas threads share their address space.

• Interaction: processes interact only through system-provided inter-process communication mechanisms, while threads interact through shared variables or by message passing.

• Context switching: context switching between threads in the same process is typically faster than context switching between processes.

Processes & Threads

[Figure: a heavyweight process - code, heap, stack, files, interrupt routines, and instruction pointer (IP) - compared with a process containing multiple threads, where each thread has its own stack and IP but shares the code, heap, files, and interrupt routines]

Constructs for specifying Parallelism


Multithreaded Processor Model

Analyzing the performance of the system:

• Latency L: the communication latency experienced with a remote memory access - network delay, cache-miss penalty, delays caused by contention in split transactions.

• Number of threads N: the number of threads that can be interleaved in a processor. The context of a thread = PC, register set, required context status word, …

• Context switch overhead C: the time lost in performing a context switch in a processor. The switch mechanism depends on the number of processor states needed to maintain active threads.

• Interval between context switches R: the run length (cycles between context switches triggered by remote references).

Multithreaded Computation

[Figure: threads of a parallel computation, with initial scheduling overhead and thread synchronization overhead - the concept of multithreading in an MPP system]

Processor efficiency:

• Busy: doing useful work

• Context switching: suspending the current context & switching to another

• Idle: when all available contexts are suspended (blocked)

Efficiency = busy / (busy + switching + idle)

Abstract Processor Model


Multiple-context processor model with one thread per context

[Figure: N contexts, each holding one thread context (PC, PSW), sharing the register files and the ALU; memory references are served either locally or remotely]

Context-switching policies


• Switch on cache miss: switch when encountering a cache miss.

• Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.

• Switch on every instruction: switch on every instruction, independent of whether or not it is a load.

• Switch on a block of instructions: improves the cache-hit ratio due to preservation of some locality & also benefits single-context performance.

Pthread Thread

History: SUN Solaris, Windows NT, … are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.

Standard: the IEEE POSIX 1003.1c standard (1995).

Constructs for specifying Parallelism

Executing a Pthread Thread

int pthread_create(pthread_t *thread, pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);

Return value of the function:
    success: a new thread is created and 0 is returned; *thread contains the thread ID
    failure: a nonzero error code is returned

Arguments:
    thread: a pointer of type pthread_t * that receives the ID of the new thread
    attr: initial attributes for the thread; if attr is NULL, default attribute values are used
    start_routine: a reference to a function defined by the user; this function contains the code executed by the new thread
    arg: a single argument passed to start_routine

pthread_t thread;   /* handle of the special Pthread datatype */

Executing a Pthread Thread(cont)

pthread_exit(void *status): terminates & destroys a thread.

pthread_cancel(): a thread is destroyed at the request of another thread.

int pthread_join(pthread_t th, void **thread_return): pthread_join() forces the calling thread to suspend its execution & wait until the thread with ID th terminates. *thread_return contains the return value (the value of the return statement or of the pthread_exit(…) statement).

Detached threads: there are cases in which threads can be terminated without the need for pthread_join().

When detached threads terminate, they are destroyed & their resources released

=> more efficient.

[Figure: the main program calls pthread_create() to start several threads; each thread runs until its termination, independently of the others]

Constructs for specifying Parallelism

Thread Pools

A master thread could control a collection of slave threads. A work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.

Constructs for specifying Parallelism

Thread ndash safe routines

System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.

Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.

Constructs for specifying Parallelism

Thread ndash safe routines (cont)

• Suppose that your application creates several threads, each of which makes a call to the same library routine.

• This library routine accesses/modifies a global structure or location in memory.

• As each thread calls this routine, it is possible that they may try to modify this global structure/memory location at the same time.

• If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.

Constructs for specifying Parallelism

Sharing Data

1. Creating Shared Data

2. Accessing Shared Data: locks, deadlock, semaphores, condition variables

3. Language Constructs for Parallelism

4. Dependency Analysis

5. Shared Data in systems with caches

Sharing Data


Conditions for Deadlock

1. Mutual exclusion
   • The resource cannot be shared
   • Requests are delayed until the resource is released

2. Hold-and-wait
   • A thread holds one resource while it waits for another

3. No preemption
   • Resources are released only voluntarily, after completion

4. Circular wait
   • Circular dependencies exist in the "waits-for" or "resource-allocation" graphs

ALL four conditions MUST hold.


Handling Deadlock

• Deadlock prevention

• Deadlock avoidance

• Deadlock detection and recovery

• Ignore it
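One common prevention strategy breaks the circular-wait condition by acquiring locks in one fixed global order. A hedged sketch using two Pthreads mutexes named after the earlier S/Q example (`run_ordered_locking` is an invented name):

```c
#include <pthread.h>

static pthread_mutex_t S = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t Q = PTHREAD_MUTEX_INITIALIZER;
static int transfers = 0;

/* Both threads take S before Q: with a fixed acquisition order a
   circular wait is impossible, so this pair of locks cannot deadlock. */
static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        pthread_mutex_lock(&S);
        pthread_mutex_lock(&Q);
        transfers++;                  /* critical section using both resources */
        pthread_mutex_unlock(&Q);
        pthread_mutex_unlock(&S);
    }
    return NULL;
}

int run_ordered_locking(void) {
    pthread_t t1, t2;
    transfers = 0;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return transfers;                 /* 20000: both workers ran to completion */
}
```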


Deadlock

Example: (a) deadlock between two processes; (b) n-process deadlock.

R1, R2, …, Rn are resources; P1, P2, …, Pn are processes.

LOGO

Semaphore


Semaphore

• A positive integer operated upon by two operations, P & V.

• Its value is the number of units of the resource that are free.

• A binary semaphore has the value 0 or 1; a general semaphore can take on positive values other than 0 and 1.


Semaphore

• The P & V operations are performed indivisibly.

• P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.

• V(s): increments s by 1 to release one of the waiting processes (if any).


Semaphore

• The first process to reach its P(s) operation, or to be accepted, will set the semaphore to 0.

• Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.


Semaphore

• When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes is allowed to proceed into the critical section.
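POSIX spells P and V as sem_wait() and sem_post(). A hedged sketch of a binary semaphore guarding a critical section, assuming a Linux-like system where sem_init() is available (macOS deprecates it); `run_semaphore_demo` is an invented name.

```c
#include <semaphore.h>
#include <pthread.h>

static sem_t s;
static int in_cs = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        sem_wait(&s);           /* P(s): decrement, blocking while s == 0 */
        in_cs++;                /* critical section */
        sem_post(&s);           /* V(s): increment, releasing a waiter */
    }
    return NULL;
}

int run_semaphore_demo(void) {
    pthread_t t1, t2;
    in_cs = 0;
    sem_init(&s, 0, 1);         /* binary semaphore, initial value 1 */
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    sem_destroy(&s);
    return in_cs;               /* 20000: no updates lost */
}
```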

Monitor

Disadvantage of semaphores:

• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.

• Incorrect use may be caused by an honest programming error or an uncooperative programmer.

Shared memory multiprocessor

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur

bull Both processes are simultaneously active will cause a deadlock

Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type

bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only

method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events

conflict

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A type or abstract data type encapsulates private data with public method to operate on that data

bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor

bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()

procedure P2()

procedure Pn()

initialization_code ()

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Usage Monitor

bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly

bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization

bull Some addiontional synchronization ldquotailor moderdquo use conditional construct

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Conditional type

bull Declare conditional xybull Only operations that can be invoke on a

conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until

another process invokesxsignal()the process invoking this operation resumes exatly one

suspended processbull if no process is suspended xsignal() has

no effect

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor conditional type

Condition variables

Condition variablesAllow threads to synchronize based

upon the actual value of data Without condition variables the

programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity

Always used in conjunction with a mutex lock

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified

condition variable cond (if any threads are blocked on cond)

int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)

initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised

Pthread_cond_t cond Declare condition variable

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable

cond

int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in

effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined

int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)

int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)

block on a condition variable the second one alow to appoint timeout

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

main:

    #include <pthread.h>
    int count = 0;                          /* global shared variable */
    pthread_mutex_t count_mutex;            /* DECLARE */
    pthread_cond_t count_threshold_cv;
    ...
    pthread_mutex_init(&count_mutex, NULL);         /* INITIALIZE */
    pthread_cond_init(&count_threshold_cv, NULL);
    ...
    pthread_create(...);                    /* CREATE THREADS TO DO WORK */

Threads 2 and 3:

    void *inc_count(void *t) {
        ...
        for (i = 0; i < TCOUNT; i++) {
            pthread_mutex_lock(&count_mutex);
            count++;
            if (count == COUNT_LIMIT)
                pthread_cond_signal( ... );
            pthread_mutex_unlock(&count_mutex);
            sleep(1);   /* do some work so threads can alternate on the mutex lock */
        }
        pthread_exit(NULL);
    }

Thread 1:

    void *watch_count(void *t) {
        pthread_mutex_lock(&count_mutex);
        while (count < COUNT_LIMIT)         /* re-check the condition on wakeup */
            pthread_cond_wait( ... );
        count += 125;
        pthread_mutex_unlock( ... );
        pthread_exit(NULL);
    }

Sharing Data

The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget(): creates a shared memory segment; the return value is the shared memory ID.

shmat(): attaches the shared segment to the data segment of the calling process; returns the starting address of the attached segment.

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: CONFLICT

- Reading the variable by different processes does not cause a conflict,
- but writing a new value does.

Example: consider two processes, each of which is to add 1 to a shared data item x.

Instruction: x = x + 1

    Time    Process 1        Process 2
     1      read x           read x
     2      compute x + 1    compute x + 1
     3      write to x       write to x

Figure: conflict in accessing shared variable x. Both processes read the same value of x and each writes back x + 1, so one of the two increments is lost.

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This mechanism is called mutual exclusion.

lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:

lock = 1: a process has entered the critical section

lock = 0: no process is in the critical section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section. It then has to wait until it is allowed to enter the critical section.

    while (lock == 1) do_nothing;   /* no operation in while loop */
    lock = 1;                       /* enter critical section */
    ...critical section...
    lock = 0;                       /* leave critical section */

A lock of this kind is called a spin lock.

This mechanism is called busy waiting.

In some cases it may be possible to deschedule the process from the processor and schedule another process instead, but this brings:

- overhead in saving and restoring process information;
- the need to choose the best or highest-priority process to enter the critical section.

Process 1:

    while (lock == 1) do_nothing;
    lock = 1;
    /* critical section */
    lock = 0;

Process 2:

    while (lock == 1) do_nothing;   /* spins while process 1 holds the lock */
    lock = 1;
    /* critical section */
    lock = 0;

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).

They resolve synchronisation problems between threads.

A mutex is used to share resources among threads in an orderly way.

It provides mutual exclusion between threads.

Note:

A mutex (with default attributes) can only be used to synchronise threads within the same process; it cannot synchronise threads belonging to different processes.

www.themegallery.com

#include <pthread.h>   /* header containing the mutex functions */

Declaration: pthread_mutex_t mutex;

Static initialisation:

mutex = PTHREAD_MUTEX_INITIALIZER;

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialisation by function:

int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);


Important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);

int pthread_mutex_unlock(pthread_mutex_t *mutex);

int pthread_mutex_trylock(pthread_mutex_t *mutex);

int pthread_mutex_destroy(pthread_mutex_t *mutex);


Dependency analysis

One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.


Example

Ex1:

    forall (i = 0; i < 5; i++)
        a[i] = 0;

All instances can be executed simultaneously.

Ex2:

    forall (i = 2; i < 6; i++) {
        x = i - 2*i + i*i;
        a[i] = a[x];
    }

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered by process P(m).

If the three conditions below are all satisfied, the two processes can be executed concurrently:

    I1 ∩ O2 = ∅
    I2 ∩ O1 = ∅
    O1 ∩ O2 = ∅


Dependency analysis

Example 1: suppose the two statements are (in C):

    a = x + y;
    b = x + z;

I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}

I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅, so the two statements can be executed simultaneously.


Dependency analysis

Example 2: suppose the two statements are (in C):

    a = x + y;
    b = a + b;

I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}

I2 ∩ O1 = {a} ≠ ∅, so the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared as:

    shared int x;

(In C/C++ with threads, an ordinary global int x is already shared between the threads of a process.)

par construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

    par {
        s1; s2; ...; sn;
    }

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

    par {
        proc1(); proc2(); ...; procn();
    }

forall construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

    forall (i = 0; i < n; i++) {
        s1; s2; ...; sm;
    }

which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, ..., sm. Each process uses a different value of i. For example:

    forall (i = 0; i < 5; i++)
        a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.


Shared Data in Systems with Caches


Shared Data in Systems with Caches

Cache coherence protocols:

- In the update policy, copies of data in all caches are updated at the time one copy is altered.
- In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next makes a reference to the data.


Shared Data in Systems with Caches

False sharing: the key characteristic here is that caches are organized in blocks of contiguous locations. False sharing arises when different processors require different parts of the same block, but not the same bytes.


Shared Data in Systems with Caches

Shared Data in Systems with Caches

Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems


Central memory versus distributed memory

In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types: PVP (parallel vector processor, also called a vector supercomputer) and SMP (symmetric multiprocessor).


Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.


Distributed-Memory architecture


Distributed-Memory architecture

In a NORMA machine:

- The node memories have separate address spaces.
- A node cannot directly access remote memory.
- The only way to access remote data is by passing messages.


Distributed-Memory architecture

In an NCC-NUMA machine (a typical example is the Cray T3E), besides the local memory each node has a set of node-level registers called E-registers.

Other NCC-NUMA systems may allow loading a remote value directly into a processor register.


Distributed-Memory architecture

In a COMA machine, all local memories are structured as caches (called COMA caches). Such a cache has much larger capacity than the level-2 cache or the remote cache of a node.

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, whereas CC-NUMA and COMA provide cache coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.


CC-NUMA versus COMA

CC-NUMA: main memory consists of all the local memories.

COMA: main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a CC-NUMA machine.


Characteristics of five distributed-memory architectures


Constructs for Specifying Parallelism

Creating Concurrent Processes

Threads

Constructs for specifying Parallelism

Creating Concurrent Processes

• A structure for specifying concurrent processes is the FORK-JOIN group of statements.

• A FORK statement generates one new path for a concurrent process, and the concurrent processes use JOIN statements at their ends.

• When a JOIN statement has been reached, processing continues in a sequential fashion.

Constructs for specifying Parallelism

Creating Concurrent Processes (cont.)

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• Operating systems such as UNIX are based upon the notion of a process.

• On a single-processor system, the processor has to be time-shared between processes, switching from one process to another.

• Time sharing also offers the opportunity to deschedule processes that are blocked from proceeding for some reason, such as waiting for an I/O operation to complete.

• On a multiprocessor, there is an opportunity to execute processes truly concurrently.

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• The UNIX system call fork() creates a new process. The new process (child process) is an exact copy of the calling process, except that it has a unique process ID.

• On success, fork() returns 0 to the child process and returns the process ID of the child process to the parent process.

• Processes are "joined" with the system calls wait() and exit(), defined as:

wait(statusp): delays the caller until a signal is received or one of its child processes terminates or stops.

exit(status): terminates a process.

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• Hence a single child process can be created by:

    pid = fork();                  /* fork */
    /* code to be executed by both child and parent */
    if (pid == 0) exit(0);         /* join */
    else wait(0);

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• If the child is to execute different code, we could use:

    pid = fork();
    if (pid == 0) {
        /* code to be executed by child (slave) */
    } else {
        /* code to be executed by parent */
    }
    if (pid == 0) exit(0);
    else wait(0);

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• All variables in the original program are duplicated in each process, becoming local variables for the process. They are initially assigned the same values as the original variables.

• The parent will wait for the slave to finish if it reaches the "join" point first; if the slave reaches the "join" point first, it will terminate.

Constructs for specifying Parallelism

Threads

Thread: a thread of execution is a fork of a computer program into two or more concurrently running tasks.

Thread mechanism: threads share the same memory space and global variables.

Constructs for specifying Parallelism

Processes versus threads:

• Dependence: processes are typically independent, while threads exist as subsets of a process.

• State information: processes carry considerable state information, whereas multiple threads within a process share state as well as memory and other resources.

• Address space: processes have separate address spaces, whereas threads share their address space.

• Interaction: processes interact only through system-provided inter-process communication mechanisms, while threads interact through shared variables or by message passing.

• Context switching: switching between threads in the same process is typically faster than switching between processes.

Processes & Threads

Figure: a process owns its code, heap, stack, files, interrupt routines, and instruction pointer (IP); each thread within a process has only its own stack and IP, sharing the rest with its sibling threads.

Constructs for specifying Parallelism


Multithreaded Processor Model

Parameters for analyzing the performance of such a system:

• Latency (L): the communication latency experienced on a remote memory access, including network delay, cache-miss penalty, and delays caused by contention in split transactions.

• Number of threads (N): the number of threads that can be interleaved in a processor. The context of a thread is its PC, register set, required context status words, etc.

• Context switch overhead (C): the time lost in performing a context switch in a processor. The switch mechanism determines the number of processor states needed to maintain active threads.

• Interval between context switches (R): the run length, i.e. the number of cycles between context switches triggered by remote references.

Multithreaded Computation

Figure: threads of a parallel computation, showing the initial scheduling overhead and the thread synchronization overhead incurred between threads.

The concept of multithreading in an MPP system

Processor states:

• Busy: doing useful work.
• Context switching: suspending the current context and switching to another.
• Idle: when all available contexts are suspended (blocked).

Efficiency = busy / (busy + switching + idle)

Abstract Processor Model


Figure: multiple-context processor model with one thread per context. N contexts, each holding one thread's PC and PSW with its register file, feed a single ALU, which issues local and remote memory references.

Context-switching policies


• Switch on cache miss: switch when encountering a cache miss.

• Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.

• Switch on every instruction: switch on every instruction, independent of whether or not it is a load.

• Switch on block of instructions: switching only after a block of instructions improves the cache-hit ratio due to preservation of some locality, and also benefits single-context performance.

Pthreads

History: SUN Solaris, Windows NT, and others are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.

Standard: IEEE POSIX 1003.1c standard (1995).

Constructs for specifying Parallelism

Executing a Pthread Thread

int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg);

Return value:
- success: a new thread is created and 0 is returned; *thread contains the new thread's ID.
- failure: a nonzero error code is returned.

Arguments:
- thread: a pointer of type pthread_t *, which receives the ID of the new thread.
- attr: the initial attributes for the thread; if attr is NULL, the attributes are initialised to default values.
- start_routine: a reference to a function defined by the user, containing the code the new thread executes.
- arg: a single argument passed to start_routine.

pthread_t thread;   /* handle: an opaque Pthread data type */

Executing a Pthread Thread (cont.)

pthread_exit(void *status): terminates the calling thread.

pthread_cancel(): a thread is cancelled (destroyed) by another thread.

int pthread_join(pthread_t th, void **thread_return): forces the calling thread to suspend its execution and wait until the thread with ID th terminates; *thread_return receives the return value (the value of the return statement or of pthread_exit()).

Detached threads: there are cases in which threads can terminate without the need for pthread_join.

When detached threads terminate, they are destroyed and their resources released immediately.

=> More efficient.

Figure: the main program issues pthread_create() for each thread; each detached thread releases its resources on termination without being joined.

Constructs for specifying Parallelism

Thread Pools

A master thread can control a collection of slave threads: a work pool of threads. Threads can communicate through shared locations or, as we shall see, by using signals.

Constructs for specifying Parallelism

Thread-safe routines

System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results

Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.

Constructs for specifying Parallelism

Thread-safe routines (cont.)

Suppose that your application creates several threads each of which makes a call to the same library routine

This library routine accesses/modifies a global structure or location in memory.

As each thread calls this routine it is possible that they may try to modify this global structurememory location at the same time

If the routine does not employ some sort of synchronization constructs to prevent data corruption then it is not thread-safe


Constructs for specifying Parallelism

Sharing Data

1. Creating Shared Data
2. Accessing Shared Data: Locks, Deadlock, Semaphores, Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches

Sharing Data


Conditions for Deadlock

1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.

2. Hold-and-wait: a thread holds one resource while it waits for another.

3. No preemption: resources are released only voluntarily, after completion.

4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.

ALL four conditions MUST hold.


Handling Deadlock

- Deadlock prevention
- Deadlock avoidance
- Deadlock detection and recovery
- Ignore the problem


Deadlock

Example:

Figure 8.8(a): deadlock between two processes. Figure 8.8(b): n-process deadlock. R1, R2, ..., Rn are resources; P1, P2, ..., Pn are processes.


Semaphore


Semaphore

A semaphore is a positive integer operated upon by two operations, P and V.

Its value is the number of units of the resource that are free.

A binary semaphore has the value 0 or 1; a general semaphore can take on positive values other than 0 and 1.


Semaphore

P and V operations are performed indivisibly:

P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.

V(s): increments s by 1, releasing one of the waiting processes (if any).


Semaphore

The first process to reach its P(s) operation (or to be accepted) will set the semaphore to 0.

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.


Semaphore

When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes is allowed to proceed into the critical section.

Monitor

Disadvantage of semaphores:

• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (e.g. deadlock), since these errors happen only if particular execution sequences take place, and those sequences do not always occur.

• Incorrect use may be caused by an honest programming error or an uncooperative programmer.



Monitor

Example of incorrect semaphore use:

Correct code:

    ...
    wait(mutex);
    /* critical section */
    signal(mutex);
    ...

Wrong code:

    ...
    signal(mutex);
    /* critical section */
    wait(mutex);
    ...

This incorrect code violates the mutual-exclusion requirement.


Monitor

Correct code:

    ...
    wait(mutex);
    /* critical section */
    signal(mutex);
    ...

Wrong code:

    ...
    wait(mutex);
    /* critical section */
    wait(mutex);
    ...

This incorrect code causes deadlock.

Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology

Monitor

• If the programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock will occur.

• Two processes acquiring two semaphores in opposite orders can deadlock when both are simultaneously active:

Process P1:
    ...
    wait(S);
    wait(Q);
    /* critical section */
    signal(S);
    signal(Q);

Process P2:
    ...
    wait(Q);
    wait(S);
    /* critical section */
    signal(Q);
    signal(S);


Monitor

• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.

• A monitor is an approach to synchronizing operations on a computer when shared resources are used. A monitor includes:
  - a suite of procedures that provides the only method to access a shared resource;
  - mutual exclusion among those procedures;
  - the variables associated with the shared resource;
  - invariants that are assumed in order to avoid conflicts.


Monitor

• A type, or abstract data type, encapsulates private data with public methods to operate on that data.

• The monitor type presents a set of programmer-defined operations that are provided with mutual exclusion within the monitor.

• The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.


Monitor

• The structure of a monitor type:

    monitor monitor_name {
        /* shared variable declarations */
        procedure P1(...) { ... }
        procedure P2(...) { ... }
        ...
        procedure Pn(...) { ... }
        initialization_code(...) { ... }
    }


Structure Monitor


Usage Monitor

• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.

• The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; some additional "tailor-made" synchronization mechanisms need to be defined.

• These additional "tailor-made" synchronization mechanisms use the condition construct.


Condition type

• Declaration: condition x, y;

• The only operations that can be invoked on a condition variable are wait() and signal():

x.wait(): the process invoking this operation is suspended until another process invokes x.signal().

x.signal(): the process invoking this operation resumes exactly one suspended process.

• If no process is suspended, x.signal() has no effect.


Structure Monitor conditional type

Condition variables

Condition variables allow threads to synchronize based upon the actual value of data.

Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be a very time-consuming and unproductive exercise, since the thread would be continuously busy with this activity.

A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t *cond): unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr):

initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised

Pthread_cond_t cond Declare condition variable

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable

cond

int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in

effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined

int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)

int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)

block on a condition variable the second one alow to appoint timeout

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK

main

Thread 23

void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)

void watch_count(void t) pthread_mutex_lock(ampcount_mutex)

if (countltCOUNT_LIMIT)

pthread_cond_wait (hellip)

count += 125

pthread_mutex_unlock(hellip) pthread_exit(NULL)

Thread 1

Sharing Data

The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment

Creating Shared Data

Each process has its own virtual address space within the virtualmemory management system

Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space

Creating Shared Data (tt)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach shared segment to the data segmentof the calling process

Return the starting address of the data segment

Accessing Shared Data

Accessing Shared Data needs careful control if the data is everaltered by a process

Problem CONFLICT

o Reading the variable by different process does not cause CONFLICT

o But writing new value CONFLICT

EX Consider two processes each of which is to add 1 two a shareddata item x

Instruction Process 1 Process 2

x = x + 1 read x read x

Compute x + 1 Compute x + 1

write to x write to xtime

Conflict in accessing shared data

Shared variable x

+1 +1

read read

write write

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This Mechanism Mutual exclusion

lock

The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock

lock is variale contain value 0 1

lock = 1 process entered Critical Section

lock = 0 no process is in Critical Section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section

hellipcritical sectionhellip leave critical section

lock = 0

lock spin lock

Mechanism Busy waiting

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

Bernstein's Conditions

I(n) is the set of memory locations read by process P(n).
O(m) is the set of memory locations altered by process P(m).

If the following three conditions are all satisfied, the two processes can be executed concurrently:

I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
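The three conditions can be checked mechanically. A minimal sketch (the function name bernstein_concurrent and the bitmask encoding are assumptions for illustration): each read set I and write set O is encoded as a bitmask, one bit per variable.

```c
/* Bernstein's conditions with read (I) and write (O) sets encoded as
   bitmasks: bit k set means the process reads/alters variable k.
   The two processes may run concurrently iff
   I1 & O2 == 0, I2 & O1 == 0, and O1 & O2 == 0
   (empty intersections). */
int bernstein_concurrent(unsigned i1, unsigned o1,
                         unsigned i2, unsigned o2)
{
    return (i1 & o2) == 0 && (i2 & o1) == 0 && (o1 & o2) == 0;
}
```

With bits a=1, b=2, x=4, y=8, z=16, the first example that follows gives bernstein_concurrent(12, 1, 20, 2) == 1, while the second gives 0 because I2 ∩ O1 = {a}.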

Dependency analysis

Example 1. Suppose the two statements are (in C):

a = x + y;
b = x + z;

Then I1 = {x, y}, O1 = {a}, I2 = {x, z}, O2 = {b}, and

I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅

so the two statements can be executed simultaneously.

Dependency analysis

Example 2. Suppose the two statements are (in C):

a = x + y;
b = a + b;

Then I1 = {x, y}, O1 = {a}, I2 = {a, b}, O2 = {b}. Since I2 ∩ O1 = {a} ≠ ∅, the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data: In a parallel programming language supporting shared memory, a variable might be declared as:

shared int x;

In C/C++ there is no shared keyword; a global variable, int x;, is shared among the threads of a process.

par Construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

par { s1; s2; ... sn; }

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

par { proc1(); proc2(); ... procn(); }

forall Construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) { s1; s2; ... sm; }

which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, ..., sm. Each process uses a different value of i, e.g.:

forall (i = 0; i < 5; i++) a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
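Standard C has no forall, but OpenMP's parallel for pragma plays the same role (this sketch assumes an OpenMP-capable compiler, e.g. GCC with -fopenmp; without it the pragma is ignored and the loop simply runs sequentially, with the same result):

```c
#define N 5

/* Each iteration may be executed by a different thread, each with its
   own private copy of i -- the C analogue of the forall construct. */
void clear_array(int a[])
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 0;        /* all N assignments may proceed concurrently */
}
```

This works precisely because the iterations are independent in the sense of the dependency analysis above: no iteration reads a location another iteration writes.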

Shared Data in Systems with Caches

Shared Data in Systems with Caches

Cache coherence protocols:

In the update policy, copies of data in all caches are updated at the time one copy is altered.

In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next makes reference to the data.

Shared Data in Systems with Caches

False sharing: the key characteristic here is that caches are organized in blocks of contiguous locations. False sharing occurs when different parts of a block are required by different processors, but not the same bytes.

Shared Data in Systems with Caches

(Figure: false sharing, with different processors updating different words of the same cache block.)

Shared Data in Systems with Caches

Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

The only way to avoid false sharing entirely would be to place each element in a different block, which would waste significant storage for a large array.
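One programmer-side version of this fix can be sketched with padding (the 64-byte block size and the struct name are assumptions, not from the slides): each per-processor counter is padded out to a full cache block, so two processors never update the same block.

```c
#include <stddef.h>

#define CACHE_BLOCK 64            /* assumed cache block size in bytes */

/* Padding each element to a full block prevents false sharing, at the
   cost of wasted storage per element -- exactly the trade-off
   described above. */
struct padded_counter {
    long value;
    char pad[CACHE_BLOCK - sizeof(long)];
};

_Static_assert(sizeof(struct padded_counter) == CACHE_BLOCK,
               "one counter per cache block");
```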

Different Types of Memory Architecture

Central Memory versus Distributed Memory

A parallel computer has either a central-memory or a distributed-memory architecture.

Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote-memory-access) architectures.

Central-memory systems are also known as UMA (uniform memory access) systems.

Central Memory versus Distributed Memory

In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types: PVP, the parallel vector processor (also called a vector supercomputer), and SMP, the symmetric multiprocessor.

Distributed-Memory Architecture

A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory. Memories in other nodes are called remote memories.

Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.


Distributed-Memory Architecture

In a NORMA machine, the node memories have separate address spaces; a node cannot directly access remote memory; the only way to access remote data is by passing messages.

Distributed-Memory Architecture

In an NCC-NUMA machine (a typical example is the Cray T3E), each node has, besides its local memory, a set of node-level registers called E-registers. Other NCC-NUMA systems may allow loading a remote value directly into a processor register.

Distributed-Memory Architecture

In a COMA machine, all local memories are structured as caches (called COMA caches). Such a cache has much larger capacity than the level-2 cache or the remote cache of a node. COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.

NCC-NUMA versus CC-NUMA and COMA

An NCC-NUMA system does not have hardware support for cache coherency, whereas CC-NUMA and COMA provide cache-coherency support in hardware. It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.

CC-NUMA versus COMA

In CC-NUMA, main memory consists of all the local memories. In COMA, main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.

Characteristics of Five Distributed-Memory Architectures

Creating Concurrent Processes

• A structure for specifying concurrent processes is the FORK-JOIN group of statements.

• A FORK statement generates one new path for a concurrent process, and the concurrent processes use JOIN statements at their ends.

• When the JOIN statements have been reached, processing continues in a sequential fashion.

Constructs for specifying Parallelism

Creating Concurrent Processes (cont.)

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• Operating systems such as UNIX are based upon the notion of a process.

• On a single-processor system, the processor has to be time-shared between processes, switching from one process to another.

• Time sharing also offers the opportunity to deschedule processes that are blocked from proceeding for some reason, such as waiting for an I/O operation to complete.

• On a multiprocessor, there is an opportunity to execute processes truly concurrently.

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• The UNIX system call fork() creates a new process. The new process (child process) is an exact copy of the calling process, except that it has a unique process ID.

• On success, fork() returns 0 to the child process and returns the process ID of the child process to the parent process.

• Processes are "joined" with the system calls wait() and exit(), defined as:

wait(statusp) — delays the caller until a signal is received or one of its child processes terminates or stops

exit(status) — terminates a process

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• Hence, a single child process can be created by:

pid = fork();                        /* fork */
/* ... code to be executed by both child and parent ... */
if (pid == 0) exit(0); else wait(0); /* join */

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• If the child is to execute different code, we could use:

pid = fork();
if (pid == 0) {
    /* ... code to be executed by slave ... */
} else {
    /* ... code to be executed by parent ... */
}
if (pid == 0) exit(0); else wait(0);

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• All variables in the original program are duplicated in each process, becoming local variables for the process. They are initially assigned the same values as the original variables.

• The parent will wait for the slave to finish if it reaches the "join" point first; if the slave reaches the "join" point first, it will terminate.
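The fragments above can be assembled into a complete, runnable sketch (the helper name fork_join_demo is illustrative, not from the slides):

```c
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* FORK: create a child; JOIN: the parent waits for the child to exit.
   Returns the child's exit status (0 here), or -1 on failure. */
int fork_join_demo(void)
{
    pid_t pid = fork();                 /* fork */
    if (pid < 0)
        return -1;                      /* fork failed */
    if (pid == 0) {                     /* child: fork() returned 0 */
        /* ... code to be executed by the child (slave) ... */
        exit(0);
    }
    /* parent: fork() returned the child's process ID */
    int status;
    wait(&status);                      /* join */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```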

Constructs for specifying Parallelism

Threads

Thread: a thread of execution is a fork of a computer program into two or more concurrently running tasks.

Thread mechanism: threads are allowed to share the same memory space and global variables.

Constructs for specifying Parallelism

Processes versus threads:

• Dependence: processes are typically independent, while threads exist as subsets of a process.

• State information: processes carry considerable state information, whereas multiple threads within a process share state as well as memory and other resources.

• Address space: processes have separate address spaces, whereas threads share their address space.

• Interaction: processes interact only through system-provided inter-process communication mechanisms, while threads interact through shared variables or by message passing.

• Context switching: context switching between threads in the same process is typically faster than context switching between processes.

Processes & Threads

(Figure: a process has its own code, heap, files, interrupt routines, instruction pointer, and stack; each thread within the process has its own instruction pointer and stack but shares the code, heap, files, and interrupt routines.)

Constructs for specifying Parallelism

Multithreaded Processor Model

To analyze the performance of such a system:

• Latency L: the communication latency experienced with a remote memory access, including network delay, cache-miss penalty, and delays caused by contention in split transactions.

• Number of threads N: the number of threads that can be interleaved in a processor. The context of a thread is the PC, the register set, and the required context status words.

• Context-switch overhead C: the time lost in performing a context switch in a processor. The switch mechanism determines the number of processor states needed to maintain active threads.

• Interval between context switches R: the run length, in cycles, between context switches triggered by remote references.

Multithreaded Computation

(Figure: a multithreaded computation, showing the initial scheduling overhead, the thread synchronization overhead, and the threads of the parallel computation.)

The concept of multithreading in an MPP system

Processor efficiency: a processor is busy when doing useful work; it is context switching when suspending the current context and switching to another; it is idle when all available contexts are suspended (blocked).

Efficiency = busy / (busy + switching + idle)
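As a worked instance of the formula (the numbers are made up for illustration): 80 busy cycles, 10 switching cycles, and 10 idle cycles give an efficiency of 0.8.

```c
/* efficiency = busy / (busy + switching + idle), all in cycles */
double processor_efficiency(double busy, double switching, double idle)
{
    return busy / (busy + switching + idle);
}
```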

Abstract Processor Model

(Figure: a multiple-context processor model with one thread per context: N contexts, each with its own PC and PSW in the register files, feed a single ALU, which issues local and remote memory references.)

Context-switching policies

• Switch on cache miss: switch when encountering a cache miss.

• Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.

• Switch on every instruction: switch on every instruction, independent of whether or not it is a load.

• Switch on block of instructions: switching on blocks of instructions improves the cache-hit ratio due to preservation of some locality, and also benefits single-context performance.

Pthreads

History: Sun Solaris, Windows NT, etc. are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.

Standard: the IEEE POSIX 1003.1c standard (1995).

Constructs for specifying Parallelism

Executing a Pthread Thread

int pthread_create(pthread_t *thread, pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);

Return value: on success, a new thread is created and 0 is returned. On failure, a nonzero error code is returned (pthread functions report errors through the return value, not through errno).

Arguments:

• thread: a pointer of type pthread_t; receives the ID of the new thread.
• attr: contains initial attributes for the thread; if attr is NULL, the attributes are initialized to default values.
• start_routine: a reference to a function defined by the user; this function contains the code executed by the new thread.
• arg: a single argument passed to start_routine.

pthread_t thread — a handle of the special pthread datatype.

Executing a Pthread Thread (cont.)

pthread_exit(void *status) — terminates the calling thread.

pthread_cancel() — a thread is destroyed by another thread.

int pthread_join(pthread_t th, void **thread_return) — pthread_join() forces the calling thread to suspend its execution and wait until the thread with the given thread ID terminates; thread_return receives the return value (the value of the return statement or of pthread_exit()).
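A runnable sketch of create/exit/join (square and run_square_in_thread are illustrative names; the result is passed back through the void * return value, which assumes long fits in a pointer, as on common ABIs):

```c
#include <pthread.h>

/* The thread computes n*n and returns it via pthread_exit(). */
static void *square(void *arg)
{
    long n = (long)arg;
    pthread_exit((void *)(n * n));
}

/* Creates the thread, then joins it and retrieves the result. */
long run_square_in_thread(long n)
{
    pthread_t tid;
    void *result;
    if (pthread_create(&tid, NULL, square, (void *)n) != 0)
        return -1;                 /* nonzero return code means failure */
    pthread_join(tid, &result);    /* suspend until the thread exits */
    return (long)result;
}
```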

Detached Threads

There are cases in which threads can be terminated without the need for pthread_join. When detached threads terminate, they are destroyed and their resources are released immediately, which is more efficient.

(Figure: the main program creates several threads with pthread_create(); each detached thread terminates on its own, without the main program joining it.)

Constructs for specifying Parallelism

Thread Pools

A master thread could control a collection of slave threads: a work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.

Constructs for specifying Parallelism

Thread-safe routines

System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.

Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.

Constructs for specifying Parallelism

Thread-safe routines (cont.)

Suppose that your application creates several threads, each of which makes a call to the same library routine.

This library routine accesses/modifies a global structure or location in memory.

As each thread calls this routine, it is possible that they may try to modify this global structure or memory location at the same time.

If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.

Constructs for specifying Parallelism

Sharing Data

1. Creating Shared Data
2. Accessing Shared Data: Locks, Deadlock, Semaphores, Monitors, Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in Systems with Caches

Sharing Data


Conditions for Deadlock

1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.

2. Hold-and-wait: a thread holds one resource while it waits for another.

3. No preemption: resources are released only voluntarily, after completion.

4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.

ALL four conditions MUST hold.

Handling Deadlock

• Deadlock prevention
• Deadlock avoidance
• Deadlock detection and recovery
• Ignore the problem

Deadlock

Example:

(Figure 8.8a: two-process deadlock. Figure 8.8b: n-process deadlock. R1, R2, ..., Rn are resources; P1, P2, ..., Pn are processes.)

Semaphore

A semaphore is a positive integer (including zero) operated upon by two operations, P and V.

The value of the semaphore is the number of units of the resource that are free.

A binary semaphore has value 0 or 1. A general semaphore can take on positive values other than 0 and 1.

Semaphore

P and V operations are performed indivisibly:

P(s) waits until s is greater than 0, then decrements s by 1 and allows the process to continue.

V(s) increments s by 1 to release one of the waiting processes (if any).

Semaphore

The first process to reach its P(s) operation, and be accepted, will set the semaphore to 0. Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.

Semaphore

When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes (if any) is allowed to proceed into the critical section.
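P and V map directly onto POSIX semaphores: sem_wait is P and sem_post is V. A minimal sketch (assumes a platform where unnamed semaphores created by sem_init are supported, e.g. Linux; the function name is illustrative):

```c
#include <semaphore.h>

/* A binary semaphore initialised to 1 guards a critical section.
   Returns the semaphore's value while the section is held (0),
   or -1 if initialisation fails. */
int binary_semaphore_demo(void)
{
    sem_t s;
    int value = -1;
    if (sem_init(&s, 0, 1) != 0)    /* 0: shared between threads only */
        return -1;
    sem_wait(&s);                   /* P(s): decrement s from 1 to 0 */
    /* ... critical section ... */
    sem_getvalue(&s, &value);       /* value is 0: the unit is held */
    sem_post(&s);                   /* V(s): increment s back to 1 */
    sem_destroy(&s);
    return value;
}
```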

Monitor

Disadvantages of semaphores:

• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (such as deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.

• Incorrect use may be caused by an honest programming error or by an uncooperative programmer.


Monitor

Example of incorrect semaphore use:

Right code:
...
wait(mutex);
/* critical section */
signal(mutex);
...

Wrong code:
...
signal(mutex);
/* critical section */
wait(mutex);
...

This incorrect code violates the mutual-exclusion requirement.

Monitor

Example of incorrect semaphore use:

Right code:
...
wait(mutex);
/* critical section */
signal(mutex);
...

Wrong code:
...
wait(mutex);
/* critical section */
wait(mutex);
...

This incorrect code causes a deadlock.

Faculty of Computer Science & Engineering - Ho Chi Minh City University of Technology

Monitor

• If the programmer omits the wait() or the signal() in a critical section, or both, then either mutual exclusion is violated or a deadlock will occur.

• Two processes acquiring two semaphores in opposite orders can deadlock when both are simultaneously active:

Process P1:          Process P2:
...                  ...
Wait(S);             Wait(Q);
Wait(Q);             Wait(S);
/* critical section */
Signal(S);           Signal(Q);
Signal(Q);           Signal(S);


Monitor

• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.

• A monitor is an approach to synchronizing operations on shared resources. A monitor includes: a suite of procedures that provides the only method of accessing the shared resource; mutual exclusion among those procedures; the variables associated with the shared resource; and invariants that are assumed in order to avoid conflicting events.


Monitor

• An abstract data type encapsulates private data with public methods to operate on that data.

• The monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.

• The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of procedures or functions that operate on those variables.


Monitor

• The structure of a monitor type:

monitor monitor_name {
    /* shared variable declarations */
    procedure P1 (...) { ... }
    procedure P2 (...) { ... }
    ...
    procedure Pn (...) { ... }
    initialization_code (...) { ... }
}


Structure of a Monitor

(Figure: schematic of a monitor, with an entry queue, shared data, and the operations.)

Usage of Monitors

• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.

• A monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; additional "tailor-made" synchronization mechanisms need to be defined.

• Some of these tailor-made synchronization mechanisms use the condition construct.


Condition Type

• Declaration: condition x, y;

• The only operations that can be invoked on a condition variable are wait() and signal():

x.wait(): the process invoking this operation is suspended until another process invokes x.signal().

x.signal(): the process invoking this operation resumes exactly one suspended process.

• If no process is suspended, x.signal() has no effect.


Structure of a Monitor with Condition Variables

Condition Variables

Condition variables allow threads to synchronize based upon the actual value of data.

Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be very time-consuming and unproductive, since the thread would be continuously busy in this activity.

A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

pthread_cond_t cond; — declares a condition variable.

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
initializes the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition-variable attributes are used; the effect is the same as passing the address of a default condition-variable attributes object. Upon successful initialization, the state of the condition variable becomes initialized.

int pthread_cond_signal(pthread_cond_t *cond);
unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

Sharing Data

Pthread Condition Variables (cont.)

int pthread_cond_broadcast(pthread_cond_t *cond);
unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
destroys the given condition variable; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition-variable object can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
block on a condition variable; the second allows a timeout to be specified.

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variables - example (cont.)

/* main */
#include <pthread.h>
int count = 0;                                 /* global shared variable */
pthread_mutex_t count_mutex;                   /* DECLARE */
pthread_cond_t count_threshold_cv;
...
pthread_mutex_init(&count_mutex, NULL);        /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
...
pthread_create(...);                           /* CREATE THREADS TO DO WORK */

/* Threads 2 and 3 */
void *inc_count(void *t) {
    ...
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(...);
        pthread_mutex_unlock(&count_mutex);
        /* do some work so the threads can alternate on the mutex lock */
        sleep(1);
    }
    pthread_exit(NULL);
}

/* Thread 1 */
void *watch_count(void *t) {
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)    /* re-check: wakeups may be spurious */
        pthread_cond_wait(...);
    count += 125;
    pthread_mutex_unlock(...);
    pthread_exit(NULL);
}

Sharing Data

The key aspect of shared-memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages, as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual-memory management system.

Shared-memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget() — creates a shared memory segment; its return value is the shared memory ID.

shmat() — attaches the shared segment to the data segment of the calling process; it returns the starting address at which the segment is attached.

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

The problem: conflict.

• Reading the variable by different processes does not cause a conflict.
• But writing a new value does.

Example: consider two processes, each of which is to add 1 to a shared data item x.

Instruction: x = x + 1;

            Process 1         Process 2
time        read x
  |                           read x
  |         compute x + 1
  |                           compute x + 1
  v         write to x
                              write to x

Conflict in accessing the shared variable x: both processes read the same value of x, so the two +1 updates overlap, and x ends up incremented only once instead of twice.

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish the sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time. This mechanism is known as mutual exclusion.

Lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:
lock = 1: a process has entered the critical section;
lock = 0: no process is in the critical section.

The lock operates much like a door lock. Suppose a process reaches a lock that is set, indicating that the process is excluded from the critical section; it then has to wait until it is allowed to enter.

while (lock == 1) do_nothing;  /* no operation in the while loop */
lock = 1;                      /* enter critical section */
/* ... critical section ... */
lock = 0;                      /* leave critical section */

Such a lock is called a spin lock, and the mechanism is called busy waiting.
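Note that the while/assign sequence above is not safe as plain C: two processes can both see lock == 0 and both enter. A real spin lock needs the test and the set to be one indivisible operation, which C11's atomic_flag provides (lock_flag and locked_increment are illustrative names):

```c
#include <stdatomic.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT; /* the `lock` variable */
static long shared_x = 0;

/* Spin until test-and-set finds the flag clear (atomically setting it),
   do the critical section, then clear the flag to leave (lock = 0). */
void locked_increment(void)
{
    while (atomic_flag_test_and_set(&lock_flag))
        ;                            /* busy wait: "do nothing" */
    shared_x = shared_x + 1;         /* critical section */
    atomic_flag_clear(&lock_flag);   /* leave critical section */
}
```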

Busy waiting wastes processor cycles. In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead, though this carries overhead in saving and restoring process information.


Page 9: Seminar Shared memory Programming

Creating Concurrent Processes (cont.)

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• Operating systems such as UNIX are based upon the notion of a process.

• On a single-processor system, the processor has to be time-shared between processes, switching from one process to another.

• Time sharing also offers the opportunity to deschedule processes that are blocked from proceeding for some reason, such as waiting for an I/O operation to complete.

• On a multiprocessor, there is an opportunity to execute processes truly concurrently.

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• The UNIX system call fork() creates a new process. The new process (child process) is an exact copy of the calling process, except that it has a unique process ID.

• On success, fork() returns 0 to the child process and returns the process ID of the child process to the parent process.

• Processes are "joined" with the system calls wait() and exit(), defined as:

wait(&status) — delays the caller until a signal is received or one of its child processes terminates or stops

exit(status) — terminates the process

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• Hence, a single child process can be created by:

  pid = fork();                          /* fork */
  /* ... code to be executed by both child and parent ... */
  if (pid == 0) exit(0); else wait(0);   /* join */

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• If the child is to execute different code, we could use:

  pid = fork();
  if (pid == 0) {
      /* ... code to be executed by child (slave) ... */
  } else {
      /* ... code to be executed by parent ... */
  }
  if (pid == 0) exit(0); else wait(0);

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• All variables in the original program are duplicated in each process, becoming local variables for the process. They are initially assigned the same values as the original variables.

• The parent will wait for the slave to finish if it reaches the "join" point first; if the slave reaches the "join" point first, it will terminate.
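The fork/join pattern above can be fleshed out into a complete runnable sketch (ours, not from the slides; the function name run_fork_demo and the child's exit value 42 are illustrative). The parent collects the child's exit status through wait():

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs the fork/join pattern from the text and returns the child's exit code. */
int run_fork_demo(void)
{
    pid_t pid = fork();                 /* fork */
    if (pid < 0)
        return -1;                      /* fork failed */
    if (pid == 0) {
        /* code executed only by the child (slave) */
        exit(42);                       /* terminate the child */
    }
    /* code executed only by the parent */
    int status = 0;
    wait(&status);                      /* join: wait for the child */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because fork() duplicates all variables, anything the child changes is invisible to the parent; the exit status is the only value passed back here.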

Constructs for specifying Parallelism

Threads

Thread: a thread of execution is a fork of a computer program into two or more concurrently running tasks.

Thread mechanism: allows tasks to share the same memory space & global variables.

Constructs for specifying Parallelism

Processes versus threads:

Dependence: processes are typically independent, while threads exist as subsets of a process.

State information: processes carry considerable state information, whereas multiple threads within a process share state as well as memory and other resources.

Address space: processes have separate address spaces, whereas threads share their address space.

Interaction: processes interact only through system-provided inter-process communication mechanisms, while threads interact through shared variables or by message passing.

Context switching: switching between threads in the same process is typically faster than switching between processes.

Processes & Threads

[Figure: a process (code, heap, stack, interrupt routines, files, IP) versus a thread, which has its own stack and IP within a process.]

Constructs for specifying Parallelism


Multithreaded Processor Model

To analyze the performance of such a system:

Latency (L): the communication latency experienced with a remote memory access, including network delay, cache-miss penalty, and delays caused by contentions in split transactions.

Number of threads (N): the number of threads that can be interleaved in a processor. The context of a thread = PC, register set, required context status words, …

Context-switch overhead (C): the time lost in performing a context switch in a processor.

Switch mechanism: the number of processor states needed to maintain active threads.

Interval between context switches (R): the run length (cycles between context switches triggered by remote references).

Multithreaded Computation

[Figure: threads of a parallel computation operating on shared variables, showing the initial scheduling overhead and the thread synchronization overhead.]

The concept of multithreading in MPP systems

Processor states:
  Busy: doing useful work.
  Context switch: suspending the current context & switching to another.
  Idle: when all available contexts are suspended (blocked).

Efficiency = busy / (busy + switching + idle)
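The efficiency formula can be made concrete with a small helper (a sketch of ours; the function name is illustrative):

```c
/* Efficiency = busy / (busy + switching + idle); cycle counts as doubles. */
double processor_efficiency(double busy, double switching, double idle)
{
    double total = busy + switching + idle;   /* total cycles observed */
    return total > 0.0 ? busy / total : 0.0;  /* guard against empty sample */
}
```

For example, a processor that spends 80 cycles busy, 10 switching, and 10 idle runs at efficiency 0.8.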

Abstract Processor Model

Multiple-context processor model with one thread per context

[Figure: N contexts, each holding one thread's context (PC, PSW) with its own register file, sharing an ALU that issues local and remote memory references.]

Context-switching policies

Switch on cache miss: switch when encountering a cache miss.

Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.

Switch on every instruction: switch on every instruction, independent of whether or not it is a load.

Switch on a block of instructions: improves the cache-hit ratio due to the preservation of some locality & also benefits single-context performance.

Pthread Threads

History: SUN Solaris, Windows NT, … are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.

Standard: the IEEE POSIX 1003.1c standard (1995).

Constructs for specifying Parallelism

Executing a Pthread Thread

int pthread_create(pthread_t *thread, pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);

Return value:
  success: a new thread is created and 0 is returned; *thread contains the new thread's ID.
  failure: a nonzero error number is returned (pthread_create() does not set errno).

Arguments:
  thread: a pointer of type pthread_t that receives the ID of the new thread.
  attr: contains the initial attributes for the thread; if attr = NULL, the attributes are initialized to default values.
  start_routine: a reference to a function defined by the user; this function contains the code to be executed by the new thread.
  arg: a single argument that is passed to start_routine.

pthread_t thread: handle of a special Pthread datatype.
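A minimal runnable sketch of pthread_create() in use (ours; NTHREADS, ITERS, and the worker/run_pthread_demo names are illustrative): four workers each add to a mutex-protected sum, and the creating thread waits for them with pthread_join(), described below.

```c
#include <pthread.h>

#define NTHREADS 4
#define ITERS    1000

static pthread_mutex_t sum_mutex = PTHREAD_MUTEX_INITIALIZER;
static long sum = 0;

static void *worker(void *arg)
{
    long add = *(long *)arg;            /* the single argument passed via arg */
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&sum_mutex); /* protect the shared variable */
        sum += add;
        pthread_mutex_unlock(&sum_mutex);
    }
    return NULL;
}

/* Creates NTHREADS threads, joins them, and returns the shared sum. */
long run_pthread_demo(void)
{
    pthread_t tid[NTHREADS];
    long one = 1;
    sum = 0;
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, &one);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);     /* suspend until thread i terminates */
    return sum;
}
```

Compile with -pthread. Without the mutex the final sum would be unpredictable, which previews the shared-data discussion later in these slides.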

Executing a Pthread Thread (cont.)

pthread_exit(void *status): terminates & destroys the calling thread.

pthread_cancel(): a thread is destroyed at the request of another thread.

int pthread_join(pthread_t th, void **thread_return):
pthread_join() forces the calling thread to suspend its execution & wait until the thread with the given thread ID terminates. *thread_return contains the return value (the value of a return statement or of a pthread_exit(…) statement).

Detached threads: there are cases in which threads can be terminated without the need for pthread_join().

Detached Threads

When detached threads terminate, they are destroyed & their resources released => more efficient.

[Figure: the main program calls pthread_create() several times; each detached thread runs to its own termination, with no join.]

Constructs for specifying Parallelism

Thread Pools

A master thread could control a collection of slave threads; a work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.

Constructs for specifying Parallelism

Thread-safe routines

System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.

Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.

Constructs for specifying Parallelism

Thread-safe routines (cont.)

Suppose that your application creates several threads, each of which makes a call to the same library routine:

This library routine accesses/modifies a global structure or location in memory.

As each thread calls this routine, it is possible that they may try to modify this global structure/memory location at the same time.

If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.

Constructs for specifying Parallelism

Sharing Data

1. Creating Shared Data
2. Accessing Shared Data
   - Locks
   - Deadlock
   - Semaphores
   - Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in Systems with Caches

Conditions for Deadlock

1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.
2. Hold-and-wait: a thread holds one resource while it waits for another.
3. No preemption: resources are released only voluntarily, after completion.
4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.

ALL four conditions MUST hold.
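Since all four conditions must hold, breaking any one of them prevents deadlock. A common trick is to break circular wait by imposing a global lock order; a sketch of ours (function and variable names are illustrative):

```c
#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;
static int shared = 0;

static void *ordered_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        pthread_mutex_lock(&lock_a);    /* always A first ...           */
        pthread_mutex_lock(&lock_b);    /* ... then B: no cycle can form */
        shared++;
        pthread_mutex_unlock(&lock_b);
        pthread_mutex_unlock(&lock_a);
    }
    return NULL;
}

/* Two threads that would risk deadlock if they locked in opposite orders. */
int run_ordered_locking(void)
{
    pthread_t t1, t2;
    shared = 0;
    pthread_create(&t1, NULL, ordered_worker, NULL);
    pthread_create(&t2, NULL, ordered_worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return shared;
}
```

Because both threads acquire lock_a before lock_b, the "waits-for" graph can never contain a cycle, so the fourth condition can never hold.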


Handling Deadlock

  Deadlock prevention
  Deadlock avoidance
  Deadlock detection and recovery
  Ignore the problem


Deadlock

Example: [Figures 8.8(a) and 8.8(b): deadlock between two processes, and n-process deadlock. R1, R2, …, Rn are resources; P1, P2, …, Pn are processes.]

Semaphore

A semaphore is a non-negative integer operated upon by two operations, P & V.
Its value is the number of units of the resource which are free.
A binary semaphore has value 0 or 1; a general semaphore can take on positive values other than 0 and 1.

P & V operations are performed indivisibly:
P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
V(s): increments s by 1 to release one of the waiting processes (if any).

The first process to reach its P(s) operation, or to be accepted, will set the semaphore to 0. Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore. When a process reaches its V(s) operation, it sets the semaphore s back to 1, and one of the waiting processes is allowed to proceed into the critical section.
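With POSIX semaphores (assuming <semaphore.h> and unnamed semaphores are available, as on Linux), sem_wait() plays the role of P and sem_post() the role of V. A binary-semaphore sketch of ours guarding a critical section:

```c
#include <pthread.h>
#include <semaphore.h>

static sem_t s;                 /* binary semaphore guarding the counter */
static int counter = 0;

static void *sem_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        sem_wait(&s);           /* P(s): wait until s > 0, then decrement */
        counter++;              /* critical section */
        sem_post(&s);           /* V(s): increment, releasing one waiter  */
    }
    return NULL;
}

/* Two threads share the counter; the semaphore serializes their updates. */
int run_semaphore_demo(void)
{
    pthread_t t1, t2;
    counter = 0;
    sem_init(&s, 0, 1);         /* shared between threads, initial value 1 */
    pthread_create(&t1, NULL, sem_worker, NULL);
    pthread_create(&t2, NULL, sem_worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    sem_destroy(&s);
    return counter;
}
```

Initializing the semaphore to a value greater than 1 would instead allow that many threads into the section at once, which is the general-semaphore behavior described above.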

Monitor

Disadvantages of semaphores:
• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors (deadlock) that are difficult to detect, since these errors happen only if some particular execution sequences take place, and these sequences do not always occur.
• Incorrect use may be caused by an honest programming error or by an uncooperative programmer.

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau

wwwthemegallerycom Company Logo

Monitor

Right code:                 Wrong code:
  …                           …
  wait(mutex);                wait(mutex);
  /* critical section */      /* critical section */
  signal(mutex);              wait(mutex);
  …                           …

This wrong code (wait instead of signal) causes deadlock.

Faculty of Computer Science & Engineering - HCMC University of Technology

Monitor

• If the programmer omits the wait() or the signal() in a critical section, or both, then either mutual exclusion is violated or a deadlock will occur.
• Two processes that are simultaneously active can also deadlock when they take two semaphores in opposite orders:

Process P1:              Process P2:
  wait(S);                 wait(Q);
  wait(Q);                 wait(S);
  /* critical section */   /* critical section */
  signal(S);               signal(Q);
  signal(Q);               signal(S);


Monitor

• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
• A monitor is an approach to synchronizing operations on a computer when a shared resource is used. A monitor includes:
  a suite of procedures that provides the only method to access the shared resource;
  mutual exclusion among those procedures;
  the variables associated with the shared resource;
  invariants assumed to hold, to avoid conflicts.


Monitor

• A type, or abstract data type, encapsulates private data with public methods to operate on that data.
• A monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.
• The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of procedures or functions that operate on those variables.


Monitor

• Structure of a monitor type:

  monitor monitor_name
  {
      /* shared variable declarations */
      procedure P1 (…) { … }
      procedure P2 (…) { … }
      …
      procedure Pn (…) { … }
      initialization_code (…) { … }
  }
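C has no monitor construct, but the discipline can be emulated: bundle the shared state with a mutex and let every public operation take the lock on entry and release it on exit. A sketch of ours (the counter_monitor type and function names are illustrative, not from the slides):

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;       /* enforces the monitor's mutual exclusion */
    int value;                  /* shared state, never touched directly    */
} counter_monitor;

void monitor_init(counter_monitor *m)       /* initialization_code() */
{
    pthread_mutex_init(&m->lock, NULL);
    m->value = 0;
}

void monitor_increment(counter_monitor *m)  /* procedure P1 */
{
    pthread_mutex_lock(&m->lock);           /* enter the monitor */
    m->value++;
    pthread_mutex_unlock(&m->lock);         /* leave the monitor */
}

int monitor_read(counter_monitor *m)        /* procedure P2 */
{
    pthread_mutex_lock(&m->lock);
    int v = m->value;
    pthread_mutex_unlock(&m->lock);
    return v;
}
```

Callers never touch value directly, so at most one thread is "inside the monitor" at a time, which is exactly the guarantee the construct is meant to provide.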


Structure of a monitor

[Figure: schematic of a monitor — an entry queue of waiting processes, the shared data, the operations, and the initialization code.]

Usage of monitors

• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
• Monitors as defined so far are not sufficiently powerful for modeling some synchronization schemes; additional "tailor-made" synchronization mechanisms need to be defined.
• These additional "tailor-made" synchronization mechanisms use the condition construct.


Condition type

• Declaration: condition x, y;
• The only operations that can be invoked on a condition variable are wait() and signal():
  x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
  x.signal(): the process invoking this operation resumes exactly one suspended process.
• If no process is suspended, x.signal() has no effect.


Structure of a monitor with condition variables

[Figure: as above, with queues associated with the condition variables x and y.]

Condition Variables

Condition variables allow threads to synchronize based upon the actual value of data.

Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met. This can be a very time-consuming & unproductive exercise, since the thread would be continuously busy in this activity.

A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

pthread_cond_t cond; — declares a condition variable.

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
Initializes the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialization, the state of the condition variable becomes initialized.

int pthread_cond_signal(pthread_cond_t *cond);
Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

Sharing Data

Pthread Condition Variables (cont.)

int pthread_cond_broadcast(pthread_cond_t *cond);
Unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
Destroys the given condition variable specified by cond; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
Block on a condition variable; the second one allows a timeout to be specified.

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont.)

#include <pthread.h>

int count = 0;                          /* global variable */
pthread_mutex_t count_mutex;            /* DECLARE */
pthread_cond_t  count_threshold_cv;
…
pthread_mutex_init(&count_mutex, NULL);            /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
…
pthread_create(…);                      /* CREATE THREADS TO DO WORK */

/* Threads 2 and 3 */
void *inc_count(void *t)
{
    …
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(&count_threshold_cv);
        …
        pthread_mutex_unlock(&count_mutex);
        sleep(1);   /* do some work so threads can alternate on the mutex lock */
    }
    pthread_exit(NULL);
}

/* Thread 1 */
void *watch_count(void *t)
{
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)                      /* re-check: wakeups can be spurious */
        pthread_cond_wait(&count_threshold_cv, &count_mutex);
    count += 125;
    pthread_mutex_unlock(&count_mutex);
    pthread_exit(NULL);
}

Sharing Data

The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget(): creates a shared memory segment; its return value is the shared memory ID.

shmat(): attaches the shared segment to the data segment of the calling process; it returns the starting address of the segment.

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: conflict.
o Reading the variable by different processes does not cause conflict.
o But writing new values may conflict.

Example: consider two processes, each of which is to add 1 to a shared data item x:

Instruction     Process 1        Process 2
x = x + 1;      read x           read x
                compute x + 1    compute x + 1
                write to x       write to x
(time runs downward)

Conflict in accessing shared data

[Figure: two processes each read the shared variable x, add 1, and write back; one of the updates is lost.]

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.

This mechanism: mutual exclusion.

Lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:
  lock = 1: a process has entered the critical section;
  lock = 0: no process is in the critical section.
The lock operates much like a door lock.

Suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section. It now has to wait until it is allowed to enter:

  while (lock == 1) do_nothing;  /* no operation in while loop */
  lock = 1;                      /* enter critical section */
  /* ... critical section ... */
  lock = 0;                      /* leave critical section */

Such a lock is called a spin lock, and the mechanism is busy waiting.

In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead, but this incurs overhead in saving and restoring process information, and it is then necessary to choose the best or highest-priority process to enter the critical section.
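Note that the plain-variable lock shown above is not actually safe as written: the test (lock == 1) and the set (lock = 1) are separate steps, so two processes can both find the lock clear and enter together. Real spin locks rely on an atomic test-and-set; a C11 sketch of ours using atomic_flag:

```c
#include <stdatomic.h>
#include <pthread.h>

static atomic_flag spin = ATOMIC_FLAG_INIT;
static int total = 0;

static void spin_lock(void)
{
    /* atomic_flag_test_and_set reads the old value and sets the flag in one
     * indivisible step, closing the window in the while/lock=1 version. */
    while (atomic_flag_test_and_set(&spin))
        ;                        /* busy-wait until the old value was clear */
}

static void spin_unlock(void)
{
    atomic_flag_clear(&spin);    /* lock = 0 */
}

static void *spin_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        spin_lock();             /* enter critical section */
        total++;
        spin_unlock();           /* leave critical section */
    }
    return NULL;
}

int run_spinlock_demo(void)
{
    pthread_t t1, t2;
    total = 0;
    pthread_create(&t1, NULL, spin_worker, NULL);
    pthread_create(&t2, NULL, spin_worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return total;
}
```

The Pthreads mutex routines described later package this (plus descheduling of waiters) behind a portable interface.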

Process 1:                        Process 2:

while (lock == 1) do_nothing;     while (lock == 1) do_nothing;  /* spins while Process 1 holds the lock */
lock = 1;                         lock = 1;
/* critical section */            /* critical section */
lock = 0;                         lock = 0;

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).
• They resolve synchronization problems between threads.
• A mutex grants threads access to a shared resource one at a time.
• They provide mutual exclusion between threads.

Note: a mutex can only synchronize threads within the same process; it cannot synchronize threads belonging to different processes.

#include <pthread.h>   /* header declaring the mutex functions */

Declaration: pthread_mutex_t mutex;

Static initialization:
  mutex = PTHREAD_MUTEX_INITIALIZER;
  mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function:
  int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);

Important functions:

  int pthread_mutex_lock(pthread_mutex_t *mutex);
  int pthread_mutex_unlock(pthread_mutex_t *mutex);
  int pthread_mutex_trylock(pthread_mutex_t *mutex);
  int pthread_mutex_destroy(pthread_mutex_t *mutex);

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together. Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.

Example

Ex1:  forall (i = 0; i < 5; i++)
          a[i] = 0;
All instances can be executed simultaneously.

Ex2:  forall (i = 2; i < 6; i++) {
          x = i - 2*i + i*i;
          a[i] = a[x];
      }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's Conditions

- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered by process P(m).

If the three conditions

  I1 ∩ O2 = ∅
  I2 ∩ O1 = ∅
  O1 ∩ O2 = ∅

are all satisfied, the two processes can be executed concurrently.

Dependency analysis

Example 1: suppose the two statements are (in C):
  a = x + y;
  b = x + z;
------------------------------------------
  I1 = (x, y)   I2 = (x, z)   O1 = (a)   O2 = (b)
------------------------------------------
  I1 ∩ O2 = ∅,  I2 ∩ O1 = ∅,  O1 ∩ O2 = ∅
=> the two statements can be executed simultaneously.

Dependency analysis

Example 2: suppose the two statements are (in C):
  a = x + y;
  b = a + b;
------------------------------------------
  I1 = (x, y)   I2 = (a, b)   O1 = (a)   O2 = (b)
------------------------------------------
  I2 ∩ O1 = (a) ≠ ∅
=> the two statements cannot be executed simultaneously.
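Bernstein's conditions are easy to check mechanically when the variable sets are encoded as bitmasks (a sketch of ours; the VA…VZ constants are illustrative):

```c
enum { VA = 1, VB = 2, VX = 4, VY = 8, VZ = 16 };  /* one bit per variable */

/* Returns 1 iff the two statements may execute concurrently. */
int can_run_concurrently(unsigned i1, unsigned o1, unsigned i2, unsigned o2)
{
    return (i1 & o2) == 0 &&    /* I1 ∩ O2 = ∅ */
           (i2 & o1) == 0 &&    /* I2 ∩ O1 = ∅ */
           (o1 & o2) == 0;      /* O1 ∩ O2 = ∅ */
}
```

With these encodings, the pair a = x + y; b = x + z; passes all three tests, while a = x + y; b = a + b; fails I2 ∩ O1 = ∅, matching the two worked examples above.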

Language Constructs for Parallelism

Shared data
In a parallel programming language supporting shared memory, a shared variable might be declared as:
  shared int x;
With C++: int global x;

par construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

  par {
      S1;
      S2;
      …
      Sn;
  }

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads could be specified by listing the routines that are to be executed concurrently:

  par {
      proc1();
      proc2();
      …
      procn();
  }

forall construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

  forall (i = 0; i < n; i++) {
      S1;
      S2;
      …
      Sm;
  }

which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, …, Sm. Each process uses a different value of i.

Example:
  forall (i = 0; i < 5; i++)
      a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.

Shared Data in Systems with Caches

Shared Data in Systems with Caches

Cache coherence protocols:
In the update policy, copies of data in all caches are updated at the time one copy is altered.
In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; such copies are only updated when the associated processor makes a reference to the data.

Shared Data in Systems with Caches

False sharing:
The key characteristic here is that caches are organized in blocks of contiguous locations.
False sharing occurs when different parts of a block are required by different processors, but not the same bytes.


Shared Data in Systems with Caches

Solution for false sharing:
The compiler can alter the layout of the data stored in the main memory, separating data altered only by one processor into different blocks.
The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.
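The compiler-layout idea can also be applied by hand: pad or align each per-processor element so it occupies its own cache block. A C11 sketch of ours (the 64-byte block size is an assumption; real block sizes are machine-specific):

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_BLOCK 64          /* assumed cache block (line) size in bytes */

struct padded_counter {
    alignas(CACHE_BLOCK) long value;   /* each counter starts a new block   */
};                                     /* struct size rounds up to 64 bytes */

/* An array of these places every element in a different block, trading
 * memory for the elimination of false sharing. */
struct padded_counter counters[4];
```

This trades memory (each long now occupies a full block) for the elimination of false sharing, which is exactly the storage wastage the text warns about for large arrays.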


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central memory or a distributed memory architecture.

Distributed memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote memory access) architectures.

Central memory systems are also known as UMA (uniform memory access) systems.


Central memory versus distributed memory

In a UMA system, all memory locations are at an equal distance from any processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types:
  PVP: the parallel vector processor, also called a vector supercomputer;
  SMP: the symmetric multiprocessor.


Distributed-Memory architecture

A distributed memory computer contains multiple nodes, each having one or more processors and a local memory.

Memories in other nodes are called remote memories.

Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.


Distributed-Memory architecture

In a NORMA machine:
  The node memories have separate address spaces.
  A node can't directly access remote memory.
  The only way to access remote data is by passing messages.


Distributed-Memory architecture

In an NCC-NUMA machine (a typical example is the Cray T3E): besides the local memory, each node has a set of node-level registers called E-registers.

Other NCC-NUMA systems may allow loading a remote value directly into a processor register.


Distributed-Memory architecture

In a COMA machine:
  All local memories are structured as caches (called COMA caches).
  Such a cache has a much larger capacity than the level-2 cache or the remote cache of a node.
  COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, whereas CC-NUMA and COMA provide cache coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.


CC-NUMA versus COMA

CC-NUMA: main memory consists of all the local memories.
COMA: main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.


Characteristics of the five distributed-memory architectures

[Table comparing the distributed-memory architectures — not reproduced in this transcript.]

LOGO

Page 10: Seminar Shared memory Programming

UNIX Heavyweight Processes

bull Operating systems such as UNIX are based upon the notion of a process

bull On a single processor system the processor has to be time shared between processes switching from one process to another

bull Time sharing also offer the opportunity to deschedule processes that are blocked from proceeding for some reason such as waiting for an IO operation to complete

bull On a multiprocessors there is an opportunity to execute process truly concurrently

Constructs for specifying Parallelism

UNIX Heavyweight Processes

bull The UNIX system call fork() creates a new process The new process (child process) is an exact copy of the calling process except that it has a unique process ID

bull On success fork() returns 0 to the child process ang returns the process ID of the child process to the parent process

bull Process are ldquojoinedrdquo with the system calls wait() and exit() defined as

wait(statusp)delays caller until signal received or one of

itshellipchild process terminates or stophellip

exit(status)terminates a process

Constructs for specifying Parallelism

UNIX Heavyweight Processes

bull Hence a single child process can be created by

pid = fork() fork

hellip Code to be excuted by both child and parenthellip

if (pid == 0) exit(0)else wait(0) join

Constructs for specifying Parallelism

UNIX Heavyweight Processes

bull If the child is to execute different code we could use

pid = fork() if (pid == 0)

hellip code to be executed by slave hellip else

hellip code to be executed by parent hellipif (pid == 0) exit (0) else wait (0)

Constructs for specifying Parallelism

UNIX Heavyweight Processes

bull All variables in the original program are duplicated in each process becoming local variables for the process They are assigned the same values as the original variables initially

bull The parent will wait for the slave to finish if it reaches the ldquojoinrdquo point first if the slave reaches the ldquojoinrdquo point first it will terminate

Constructs for specifying Parallelism

Threads

Thread a thread of execution is a fork of a

computer program into two or more concurrently running tasks

Thread mechanism Allow to share the same memory space amp

global variables

Constructs for specifying Parallelism

Context Switching

Interaction

Address Space

State Infomation

Dependence bull processes are typically independent while threads exist as subsets of a process

bull processes carry considerable state information where multiple threads within a process share state as well as memory and other resources

bull processes have separate address spaces where threads share their address space

bullprocesses interact only through system-provided inter-process communication mechanisms while threads interact through shared variable or by message passing

bullContext switching between threads in the same process is typically faster than context switching between processes

Processes amp Threads

Interupt Routines

File

IP

Code Heap

Stack

IPStackInterupt Routines

File

IP

Code Heap

Stack Thread

Process

Constructs for specifying Parallelism

wwwthemegallerycom Company Logo

Multithreaded Processor Model

Analyze performance of system Latency L communication latency

experienced with remote memory access network delay cache-miss penalty delays caused by

contentions in split transactions

Number of threads N Number of thread that can be interleaved in a processor

Context of a thread =PCregister set required context status word hellip

Context switch overhead C time lost in performing context switch in a processor

Switch mechanism number of processor states needed to maintain active threads

Interval between context switches run length (cycles between context switch triggered by remote reference)

Multithreaded Computation

Initial Scheduling overhead Thread Synchronization overhead

Thread of Parallel Computation

Variable

Computation

The concept of multithreading in MPP system

Processor efficiency Busy do useful work Context switch suspend current

context amp switch to another Idle when all availble context

suspended (blocked)

Efficient = Busy (busy + switching + idle)

Abtract Processor Model

wwwthemegallerycom Company Logo

Multiple-context processor model with one thread per context

[Figure: N contexts, each holding one thread's context (PC, PSW) with its register file, sharing an ALU; local memory references complete directly, while remote memory references trigger a context switch.]

Context-switching policies


• Switch on cache miss: switch when encountering a cache miss.

• Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.

• Switch on every instruction: switch on every instruction, independent of whether or not it is a load.

• Switch on a block of instructions: improves the cache-hit ratio due to preservation of some locality, and also benefits single-context performance.

Pthread Thread

History: SUN Solaris, Windows NT, … are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.

Standard: IEEE POSIX 1003.1c (1995), known as Pthreads.

Constructs for specifying Parallelism

Executing a Pthread Thread

int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg);

Return value:
• success: a new thread is created and 0 is returned; *thread contains the new thread's ID.
• failure: a nonzero error code is returned (Pthreads functions report errors through the return value, not through errno).

Arguments:
• thread: a pointer of type pthread_t that receives the ID of the new thread.
• attr: initial attributes for the thread; if attr is NULL, default attribute values are used.
• start_routine: a reference to a user-defined function; this function contains the code to be executed by the new thread.
• arg: a single argument passed to start_routine.

pthread_t thread; /* handle: an opaque Pthread data type */

Executing a Pthread Thread (cont.)

void pthread_exit(void *status); — terminates (and destroys) the calling thread.

int pthread_cancel(pthread_t thread); — requests that the given thread be terminated by another thread.

int pthread_join(pthread_t th, void **thread_return); — pthread_join() forces the calling thread to suspend its execution and wait until the thread with the given ID terminates. *thread_return receives that thread's return value (the value of its return statement or of pthread_exit(…)).

Detached threads: there are cases in which threads can be terminated without the need for pthread_join(). When detached threads terminate, they are destroyed and their resources are released immediately => more efficient.

[Figure: the main program starts several threads with pthread_create(); each thread runs independently to its own termination.]

Constructs for specifying Parallelism

Thread Pools

A master thread could control a collection of slave threads. A work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.

Constructs for specifying Parallelism

Thread-safe routines

System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.

Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.

Constructs for specifying Parallelism

Thread-safe routines (cont.)

Suppose that your application creates several threads each of which makes a call to the same library routine

This library routine accesses/modifies a global structure or location in memory.

As each thread calls this routine it is possible that they may try to modify this global structurememory location at the same time

If the routine does not employ some sort of synchronization constructs to prevent data corruption then it is not thread-safe


Constructs for specifying Parallelism

Sharing Data
1. Creating Shared Data
2. Accessing Shared Data
   • Locks
   • Deadlock
   • Semaphores
   • Monitor
   • Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches

Sharing Data


Conditions for Deadlock

1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.

2. Hold-and-wait: a thread holds one resource while it waits for another.

3. No preemption: resources are released only voluntarily, after completion.

4. Circular wait: circular dependencies exist in the “waits-for” or “resource-allocation” graph.

ALL four conditions MUST hold


Handling Deadlock

• Deadlock prevention
• Deadlock avoidance
• Deadlock detection and recovery
• Ignore


Deadlock

[Figure 8.8a: two-process deadlock. Figure 8.8b: n-process deadlock. R1, R2, …, Rn are resources; P1, P2, …, Pn are processes.]

Example


Semaphore

Semaphore

A positive integer operated upon by two operations, P & V.

The value is the number of the units of the resource which are free

A binary semaphore has value 0 or 1. A general (counting) semaphore can take on positive values other than 0 and 1.

Semaphore

P & V operations are performed indivisibly.

P(s) waits until s is greater than 0, then decrements s by 1 and allows the process to continue.

V(s) increments s by 1 to release one of the waiting processes (if any)

Semaphore

The first process to reach its P(s) operation, or to be accepted, will set the semaphore to 0.

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore

Semaphore

When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section

Monitor

Disadvantages of semaphores:

• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (e.g. deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.

• Incorrect use may be caused by an honest programming error or an uncooperative programmer.



Monitor

Example: wrong semaphore use (violating mutual exclusion)

Right code:
…
wait(mutex);
  /* critical section */
signal(mutex);
…

Wrong code:
…
signal(mutex);
  /* critical section */
wait(mutex);
…

This wrong code violates the mutual-exclusion condition.


Monitor

Example: wrong semaphore use (causing deadlock)

Right code:
…
wait(mutex);
  /* critical section */
signal(mutex);
…

Wrong code:
…
wait(mutex);
  /* critical section */
wait(mutex);
…

This wrong code causes deadlock.

Faculty of Computer Science & Engineering - Ho Chi Minh City University of Technology

Monitor

• If the programmer omits the wait() or the signal() in a critical section, or both, then either mutual exclusion is violated or a deadlock will occur.

• When both processes are simultaneously active, a deadlock can occur:

Process P1:              Process P2:
…                        …
wait(S);                 wait(Q);
wait(Q);                 wait(S);
/* critical section */   /* critical section */
signal(S);               signal(Q);
signal(Q);               signal(S);


Monitor

• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.

• A monitor is an approach to synchronizing operations on a computer that uses a shared resource. A monitor includes:
  - a set of procedures that provide the only way to access the shared resource;
  - mutual exclusion among those procedures;
  - the variables associated with the shared resource;
  - invariants that must hold, to avoid conflicts.


Monitor

• A type, or abstract data type, encapsulates private data with public methods to operate on that data.

• The monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.

• The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.


Monitor

• The structure of a monitor type:

monitor monitor_name {
    /* shared variable declarations */

    procedure P1(…) { … }
    procedure P2(…) { … }
    …
    procedure Pn(…) { … }

    initialization_code(…) { … }
}


Structure Monitor


Usage Monitor

• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.

• The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; additional, tailor-made synchronization mechanisms need to be defined.

• These additional tailor-made synchronization mechanisms use the condition construct.


Conditional type

• Declaration: condition x, y;

• The only operations that can be invoked on a condition variable are wait() and signal():
  - x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
  - x.signal(): resumes exactly one suspended process; if no process is suspended, x.signal() has no effect.


Structure Monitor conditional type

Condition variables

Condition variables allow threads to synchronize based upon the actual value of data. Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met; this can be a very time-consuming and unproductive exercise, since the thread would be continuously busy in this activity.

A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

pthread_cond_t cond; /* declare a condition variable */

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr); — initializes the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default attributes object. Upon successful initialization, the state of the condition variable becomes initialized.

int pthread_cond_signal(pthread_cond_t *cond); — unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t *cond); — unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond); — destroys the given condition variable; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
— block on a condition variable; the second one additionally allows a timeout to be specified.

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

main:

#include <pthread.h>
int count = 0;                                /* global variable: DECLARE */
pthread_mutex_t count_mutex;
pthread_cond_t count_threshold_cv;
…
pthread_mutex_init(&count_mutex, NULL);       /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
…
pthread_create(…);                            /* CREATE THREADS TO DO WORK */

Threads 2, 3:

void *inc_count(void *t) {
    …
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(…);
        …
        pthread_mutex_unlock(&count_mutex);
        /* Do some work so threads can alternate on the mutex lock */
        sleep(1);
    }
    pthread_exit(NULL);
}

Thread 1:

void *watch_count(void *t) {
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)
        pthread_cond_wait(…);
    count += 125;
    pthread_mutex_unlock(…);
    pthread_exit(NULL);
}

(The wait is placed in a while loop rather than an if, so the condition is rechecked after each wakeup.)

Sharing Data

The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget(): creates a shared memory segment; the return value is the shared memory ID.

shmat(): attaches the shared segment to the data segment of the calling process; returns the starting address of the attached segment.

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: CONFLICT.

• Reading the variable by different processes does not cause a conflict.
• But writing new values can conflict.

Example: consider two processes, each of which is to add 1 to a shared data item x.

Instruction      Process 1          Process 2
x = x + 1        read x             read x
                 compute x + 1      compute x + 1
                 write to x         write to x
(time runs downward)

[Figure: conflict in accessing shared data — both processes read the shared variable x, each computes x + 1, and both write back, so one of the two increments is lost.]

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This mechanism is called mutual exclusion.

lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:
• lock = 1: a process has entered the critical section.
• lock = 0: no process is in the critical section.

The lock operates much like a door lock.

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing();    /* no operation in while loop */
lock = 1;                          /* enter critical section */
  … critical section …
lock = 0;                          /* leave critical section */

Such a lock is called a spin lock.

Mechanism: busy waiting

In some cases it may be possible to deschedule the process from the processor and schedule another process.

This incurs overhead in saving and reading process information.

It is then necessary to choose the best or highest-priority process to enter the critical section.

Process 1:                         Process 2:
while (lock == 1) do_nothing();
lock = 1;                          while (lock == 1) do_nothing();
/* critical section */
lock = 0;                          lock = 1;
                                   /* critical section */
                                   lock = 0;

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).

• Mutexes resolve synchronization problems between threads.

• A mutex is used to share resources among threads in an orderly way.

• It provides mutual exclusion between threads.

Note: a mutex is only used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.


#include <pthread.h>   /* header containing the mutex functions */

Declaration: pthread_mutex_t mutex;

Static initialization:
mutex = PTHREAD_MUTEX_INITIALIZER;
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function:
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);


Important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);


Dependency analysis

One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.


Example

Ex1: forall (i = 0; i < 5; i++)
         a[i] = 0;

All instances can be executed simultaneously.

Ex2: forall (i = 2; i < 6; i++) {
         x = i - 2*i + i*i;
         a[i] = a[x];
     }

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered by process P(m).

If the following three conditions are all satisfied, the two processes can be executed concurrently:

I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅


Dependency analysis

Example 1: Suppose the two statements are (in C):
a = x + y;
b = x + z;
------------------------------------------
I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}
-----------------------
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅
→ the two statements can be executed simultaneously.


Dependency analysis

Example 2: Suppose the two statements are (in C):
a = x + y;
b = a + b;
------------------------------------------
I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}
-----------------------
I2 ∩ O1 = {a} ≠ ∅
→ the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared (in a C/C++-like syntax) as:

shared int x;

par Construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

par {
    S1; S2; …; Sn;
}

The keyword par indicates that the statements in the body are to be executed concurrently. Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

par {
    proc1(); proc2(); …; procn();
}

forall Construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) {
    S1; S2; …; Sm;
}

which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, …, Sm. Each process uses a different value of i. For example:

forall (i = 0; i < 5; i++)
    a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.

Shared Data in Systems with Caches

Shared Data in Systems with Caches

Cache coherence protocols:

In the update policy, copies of data in all caches are updated at the time one copy is altered.

In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are only updated when the associated processor makes a reference to the data.


Shared Data in Systems with Caches

False sharing: the key characteristic used here is that caches are organized in blocks of contiguous locations. False sharing occurs when different processors require different parts of the same block, but not the same bytes.


Shared Data in Systems with Caches

Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central-memory or a distributed-memory architecture.

Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote-memory access) architectures.

Central-memory systems are also known as UMA (uniform memory access) systems.


Central memory versus distributed memory

In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types:
• PVP: the parallel vector processor, also called a vector supercomputer.
• SMP: the symmetric multiprocessor.


Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.


Distributed-Memory architecture


Distributed-Memory architecture

In a NORMA machine:
• The node memories have separate address spaces.
• A node cannot directly access remote memory.
• The only way to access remote data is by passing messages.


Distributed-Memory architecture

In an NCC-NUMA machine:
• A typical example is the Cray T3E.
• Besides the local memory, each node has a set of node-level registers called E-registers.
• Other NCC-NUMA systems may allow loading a remote value directly into a processor register.


Distributed-Memory architecture

In a COMA machine:
• All local memories are structured as caches (called COMA caches).
• A COMA cache has much larger capacity than the level-2 cache or the remote cache of a node.
• COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache-coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.


CC-NUMA versus COMA

CC-NUMA: main memory consists of all the local memories.

COMA: main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.


Characteristics of five distributed-memory architectures


Page 11: Seminar Shared memory Programming

UNIX Heavyweight Processes

• The UNIX system call fork() creates a new process. The new process (the child process) is an exact copy of the calling process, except that it has a unique process ID.

• On success, fork() returns 0 to the child process and returns the process ID of the child process to the parent process.

• Processes are "joined" with the system calls wait() and exit(), defined as:

wait(statusp); — delays the caller until a signal is received or one of its child processes terminates or stops.

exit(status); — terminates a process.

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• Hence, a single child process can be created by:

pid = fork();                /* fork */
… code to be executed by both child and parent …
if (pid == 0) exit(0);       /* join */
else wait(0);

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• If the child is to execute different code, we could use:

pid = fork();
if (pid == 0) {
    … code to be executed by slave …
} else {
    … code to be executed by parent …
}
if (pid == 0) exit(0);
else wait(0);

Constructs for specifying Parallelism

UNIX Heavyweight Processes

• All variables in the original program are duplicated in each process, becoming local variables for that process. They are initially assigned the same values as the original variables.

• The parent will wait for the slave to finish if it reaches the "join" point first; if the slave reaches the "join" point first, it will terminate.

Constructs for specifying Parallelism

Threads

Thread: a thread of execution is a fork of a computer program into two or more concurrently running tasks.

The thread mechanism allows these tasks to share the same memory space and global variables.

Constructs for specifying Parallelism

Context Switching

Interaction

Address Space

State Infomation

Dependence bull processes are typically independent while threads exist as subsets of a process

bull processes carry considerable state information where multiple threads within a process share state as well as memory and other resources

bull processes have separate address spaces where threads share their address space

bullprocesses interact only through system-provided inter-process communication mechanisms while threads interact through shared variable or by message passing

bullContext switching between threads in the same process is typically faster than context switching between processes

Processes amp Threads

Interupt Routines

File

IP

Code Heap

Stack

IPStackInterupt Routines

File

IP

Code Heap

Stack Thread

Process

Constructs for specifying Parallelism

wwwthemegallerycom Company Logo

Multithreaded Processor Model

Analyze performance of system Latency L communication latency

experienced with remote memory access network delay cache-miss penalty delays caused by

contentions in split transactions

Number of threads N Number of thread that can be interleaved in a processor

Context of a thread =PCregister set required context status word hellip

Context switch overhead C time lost in performing context switch in a processor

Switch mechanism number of processor states needed to maintain active threads

Interval between context switches run length (cycles between context switch triggered by remote reference)

Multithreaded Computation

Initial Scheduling overhead Thread Synchronization overhead

Thread of Parallel Computation

Variable

Computation

The concept of multithreading in MPP system

Processor efficiency Busy do useful work Context switch suspend current

context amp switch to another Idle when all availble context

suspended (blocked)

Efficient = Busy (busy + switching + idle)

Abtract Processor Model

wwwthemegallerycom Company Logo

Multiple-context processor model with one thread per context

PC

PSW

PC

PSW

PC

PSW

ALU Local memory reference

Remote memory reference

Register Files

N Contexts

1 Thread context

Context-switching policies

wwwthemegallerycom Company Logo

Switch on cache miss when encoutering a cache miss

Switch on every load switching on every load operation independent of whether it will cause a miss or not

Switch on every instruction switching on every instruction insependent of whether or not it is a load

Switch on block of instruction will improve the cache-hit ratio due to preservation of some locality amp also benefit singe-context performance

Pthread Thread

History SUN Solaris Window NT hellip are examples the multithreaded operating systems allow users to employ threads in their programsbut each system is different

StandardIEEE POSIX 10031c

standard (1995)

Constructs for specifying Parallelism

Executing a Pthread Thread

int pthread_create(pthread_t thread pthread_attr_t attr

void (start_routine)(void ) void arg) Return value of function1048698success create a new thread 0 arg contain thread ID1048698fail ltgt 0 (error signature is contain in the variable of errno) Argumentsthread a pointer of type pthread_t contain ID of new thread1048698attr contain initial attr for thread if attr = NULL Attr are init by default value1048698start_routine la reference to a function defined by user This function contain

execute code for new thread1048698arg a single argument is passed for start_routine

Pthread_t thread Hanndle of specia Pthread datatype

Executing a Pthread Thread(cont)

pthread_exit(void status) Terminate amp destroy a thread

pthread_cancel() Thread is destroyed by another process

int pthread_join(pthread_t th void thread_return) pthread_join() force call thread suspend its execution amp wait until thread having

thread ID terminates Thread_return contain the return value (value of return statement or pthread_exit(hellip) statement)

Detached ThreadThere are cases in which threads can

be terminated without needed of pthread_join

Detached Thread

When Detached Thread teminate they are destroyed amp their resource released

=gt More efficient

Main program

Pthread_create()

Termination

Thread

Pthread_create()

Pthread_create() Termination

Termination

Thread

Thread

Constructs for specifying Parallelism

Thread Pools

A master thread could control a collection of slave threads A work pool of threads could be formed Threads can communicate through shared location or as we shall see = using signals

Constructs for specifying Parallelism

Thread ndash safe routines

System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results

Thread ndash safeness an applications ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions

Constructs for specifying Parallelism

Thread ndash safe routines (cont)

Suppose that your application creates several threads each of which makes a call to the same library routine

This library routine accessesmodifies a global structure or location in memory

As each thread calls this routine it is possible that they may try to modify this global structurememory location at the same time

If the routine does not employ some sort of synchronization constructs to prevent data corruption then it is not thread-safe

fe

Constructs for specifying Parallelism

Sharing DataCreating Shared Data1

Accessing Shared Data2

Locks

Deadlock

Semaphores

Deadlock

Condition Variables

Language Constructs for Parallelism3

Dependency Analysis4

Shared Data in system with caches5

Sharing Data

Text in here

Text in here

Conditions for Deadlock

1 Mutual exclusion bull Resource can not be shared bull Requests are delayed until resource is released

2 Hold-and-wait bull Thread holds one resource while waits for another

3 No preemption bull Resources are released voluntarily after completion

4 Circular wait bull Circular dependencies exist in ldquowaits-forrdquo or ldquoresource-allocationrdquo graphs

ALL four conditions MUST hold

wwwthemegallerycom Company Logo

Handing Deadlock

Text

Text

Text

Txt

Deadlock prevention Deadlock avoidance

Deadlock detection and recovery

Ignore

wwwthemegallerycom Company Logo

Deadlock

88b n-processes deadlock 88a 2 processes dealock

R1R2hellipRn lagrave tagravei nguyecircn P1P2hellipPn lagrave quaacute trigravenh

Example

LOGO

Semaphore

wwwthemegallerycom Company Logo

sEMaPHoRE

A positive integer operated upon by 2 operations P amp V

The value is the number of the units of the resource which are free

A binary semaphore has value 0 or 1A general semaphore can take on

positive values other than 0 and 1

wwwthemegallerycom Company Logo

sEMaPHoRE

P amp V operations are performed indivisibly

P(s) waits until s is greater than 0 then decrements s by 1 allows the process to continue

V(s) increments s by 1 to release one of the waiting processes (if any)

wwwthemegallerycom Company Logo

sEMaPHoRE

The first process reach its P(s) operation or to be accepted will set the semaphore to 0

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore

wwwthemegallerycom Company Logo

sEMaPHoRE

When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section

Monitor

Disadvantages of semaphores:

Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors (such as deadlock) that are difficult to detect, since these errors happen only if particular execution sequences take place, and these sequences do not always occur.

Incorrect use may be caused by an honest programming error or an uncooperative programmer.



Monitor

Example of incorrect semaphore use:

Right code:
  ...
  Wait(mutex)
  critical section
  Signal(mutex)
  ...

Wrong code:
  ...
  Signal(mutex)
  critical section
  Wait(mutex)
  ...

This wrong code violates the mutual exclusion requirement.


Monitor

Example of incorrect semaphore use:

Right code:
  ...
  Wait(mutex)
  critical section
  Signal(mutex)
  ...

Wrong code:
  ...
  Wait(mutex)
  critical section
  Wait(mutex)
  ...

This wrong code causes deadlock.

Faculty of Computer Science and Engineering - Ho Chi Minh City University of Technology

Monitor

If the programmer omits the wait() or the signal() around the critical section, or both, then either mutual exclusion is violated or a deadlock will occur.

When both of the processes below are active simultaneously, a deadlock can occur: each acquires one semaphore and then waits for the other (note the reversed order of the Wait operations in P2):

Process P1:
  ...
  Wait(S)
  Wait(Q)
  critical section
  Signal(S)
  Signal(Q)

Process P2:
  ...
  Wait(Q)
  Wait(S)
  critical section
  Signal(Q)
  Signal(S)


Monitor

To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.

A monitor is an approach to synchronizing operations on a shared resource. A monitor includes:

a suite of procedures that provides the only method of accessing the shared resource;

mutual exclusion among those procedures;

the variables associated with the shared resource;

invariants that are assumed to hold, to avoid conflicts.


Monitor

A type, or abstract data type, encapsulates private data with public methods that operate on that data.

The monitor type presents a set of programmer-defined operations that are provided with mutual exclusion within the monitor.

The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.


Monitor

A structure of the monitor type:

monitor monitor_name
{
    /* shared variable declarations */

    procedure P1 (...) { ... }

    procedure P2 (...) { ... }

    ...

    procedure Pn (...) { ... }

    initialization_code (...) { ... }
}


Structure Monitor


Usage of the Monitor

The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.

The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; some additional "tailor-made" synchronization mechanisms need to be defined.

These additional tailor-made synchronization mechanisms are provided by the condition construct.


Condition type

Declaration: condition x, y;

The only operations that can be invoked on a condition variable are wait() and signal():

x.wait(): the process invoking this operation is suspended until another process invokes x.signal().

x.signal(): the process invoking this operation resumes exactly one suspended process.

If no process is suspended, x.signal() has no effect.


Structure of a monitor with condition variables

Condition variables

Condition variables allow threads to synchronize based upon the actual value of data.

Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be a very time-consuming and unproductive exercise, since the thread would be continuously busy in this activity.

A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

pthread_cond_t cond;
Declares a condition variable.

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
Initialises the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.

int pthread_cond_signal(pthread_cond_t *cond);
Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t *cond);
Unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
Destroys the given condition variable; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
Block on a condition variable; the second one allows a timeout (abstime) to be specified.

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

#include <pthread.h>

int count = 0;                                  /* global variable: DECLARE */
pthread_mutex_t count_mutex;
pthread_cond_t count_threshold_cv;
...
pthread_mutex_init(&count_mutex, NULL);         /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
...
pthread_create(...);                            /* CREATE THREADS TO DO WORK */

(main)

Threads 2, 3:

void *inc_count(void *t)
{
    ...
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(...);
        ...
        pthread_mutex_unlock(&count_mutex);
        /* do some work so the threads can alternate on the mutex lock */
        sleep(1);
    }
    pthread_exit(NULL);
}

Thread 1:

void *watch_count(void *t)
{
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)   /* while, not if: re-check the condition after waking */
        pthread_cond_wait(...);
    count += 125;
    pthread_mutex_unlock(...);
    pthread_exit(NULL);
}

Sharing Data

The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages, as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget()

Creates a shared memory segment.

The return value is the shared memory ID.

shmat()

Attaches the shared segment to the data segment of the calling process.

Returns the starting address of the data segment.

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: CONFLICT

Reading the variable by different processes does not cause conflict.

But writing new values can conflict.

Example: consider two processes, each of which is to add 1 to a shared data item x:

Instruction    Process 1        Process 2

x = x + 1;     read x           read x
               compute x + 1    compute x + 1
               write to x       write to x

(time runs downwards)

Figure: conflict in accessing shared data — both processes read the shared variable x, add 1, and write the result back, so one of the two increments is lost.

The problem of accessing shared data can be generalized by considering shared resources.

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.

This mechanism is called mutual exclusion.

Lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:

lock = 1: a process has entered the critical section.

lock = 0: no process is in the critical section.

The lock operates much like a door lock. Suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section. It now has to wait until it is allowed to enter the critical section:

while (lock == 1) ;   /* do nothing: no operation in while loop */
lock = 1;             /* enter critical section */
... critical section ...
lock = 0;             /* leave critical section */

A lock that a process spins on in this way is called a spin lock, and the mechanism is called busy waiting.

In some cases it may be possible to deschedule the waiting process from the processor and schedule another process in its place, but this brings:

overhead in saving and restoring process information;

the need to choose the best or highest-priority process to enter the critical section.

Figure: two processes contending for a spin lock — while Process 1 is in its critical section (lock = 1), Process 2 spins in its while loop; when Process 1 resets the lock to 0, Process 2 can set lock = 1 and enter its critical section.

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).

They resolve synchronization problems between threads.

A mutex is used to grant threads access to shared resources in turn.

It provides mutual exclusion between threads.

Note:

A mutex is used only to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.


#include <pthread.h>   /* the header containing the mutex functions */

Declare a variable: pthread_mutex_t mutex;

Static initialization:

mutex = PTHREAD_MUTEX_INITIALIZER;

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function call:

int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);


The important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);

int pthread_mutex_unlock(pthread_mutex_t *mutex);

int pthread_mutex_trylock(pthread_mutex_t *mutex);

int pthread_mutex_destroy(pthread_mutex_t *mutex);


Dependency analysis

One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.


Example

Ex1:

forall (i = 0; i < 5; i++)
    a[i] = 0;

All instances can be executed simultaneously.

Ex2:

forall (i = 2; i < 6; i++) {
    x = i - 2*i + i*i;
    a[i] = a[x];
}

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

I(n) is the set of memory locations read by process P(n).

O(m) is the set of memory locations altered by process P(m).

If the following three conditions are all satisfied, the two processes can be executed concurrently:

I1 ∩ O2 = ∅

I2 ∩ O1 = ∅

O1 ∩ O2 = ∅


Dependency analysis

Example 1: suppose the two statements are (in C):

a = x + y;
b = x + z;

I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}

I1 ∩ O2 = I2 ∩ O1 = O1 ∩ O2 = ∅, so the two statements can be executed simultaneously.


Dependency analysis

Example 2: suppose the two statements are (in C):

a = x + y;
b = a + b;

I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}

I2 ∩ O1 = {a} ≠ ∅, so the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data

In a parallel programming language supporting shared memory, shared variables might be declared as:

shared int x;

(compare an ordinary global declaration in C/C++: int x;)

par construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

par {
    S1;
    S2;
    ...
    Sn;
}

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

par {
    proc1();
    proc2();
    ...
    procn();
}

forall construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) {
    S1;
    S2;
    ...
    Sm;
}

which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, ..., Sm. Each process uses a different value of i.

Example:

forall (i = 0; i < 5; i++)
    a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.

Shared Data in Systems with Caches


Shared Data in Systems with Caches

Cache coherence protocols:

In the update policy, copies of data in all caches are updated at the time one copy is altered.

In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor makes a reference to the data.


Shared Data in Systems with Caches

False sharing:

The key characteristic involved is that caches are organized in blocks of contiguous locations.

False sharing occurs when different processors require different parts of the same block, but not the same bytes.



Shared Data in Systems with Caches

Solution for false sharing:

The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

The only way to avoid false sharing completely would be to place each element in a different block, which would create significant wastage of storage for a large array.


Different types of memory architecture

Central memory versus distributed memory

A parallel computer has either a central memory or a distributed memory architecture.

Distributed memory systems include the NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.

Central memory systems are also known as UMA (uniform memory access) systems.


Central memory versus distributed memory

In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.

UMA systems are of two types:

PVP: the parallel vector processor, also called a vector supercomputer.

SMP: the symmetric multiprocessor.


Distributed-Memory architecture

A distributed memory computer contains multiple nodes, each having one or more processors and a local memory.

Memories in other nodes are called remote memories.

Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.



Distributed-Memory architecture

In a NORMA machine:

The node memories have separate address spaces.

A node can't directly access remote memory.

The only way to access remote data is by passing messages.


Distributed-Memory architecture

In an NCC-NUMA machine:

A typical example is the Cray T3E.

Besides the local memory, each node has a set of node-level registers called E-registers.

Other NCC-NUMA systems may allow loading a remote value directly into a processor register.


Distributed-Memory architecture

In a COMA machine:

All local memories are structured as caches (called COMA caches).

Such a cache has a much larger capacity than the level-2 cache or the remote cache of a node.

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.


CC-NUMA versus COMA

CC-NUMA: main memory consists of all the local memories.

COMA: main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.


Characteristics of the five distributed-memory architectures (table)



UNIX Heavyweight Processes

Hence, a single child process can be created by:

pid = fork();                /* fork */
/* code to be executed by both child and parent */
if (pid == 0) exit(0);       /* child terminates */
else wait(0);                /* join */

Constructs for specifying Parallelism

UNIX Heavyweight Processes

If the child is to execute different code, we could use:

pid = fork();
if (pid == 0) {
    /* code to be executed by child (slave) */
} else {
    /* code to be executed by parent */
}
if (pid == 0) exit(0);
else wait(0);

Constructs for specifying Parallelism

UNIX Heavyweight Processes

All variables in the original program are duplicated in each process, becoming local variables for the process. Initially they are assigned the same values as the original variables.

The parent will wait for the slave to finish if it reaches the "join" point first; if the slave reaches the "join" point first, it will terminate there.

Constructs for specifying Parallelism

Threads

Thread: a thread of execution is a fork of a computer program into two or more concurrently running tasks.

Thread mechanism: the threads of a process share the same memory space and global variables.

Constructs for specifying Parallelism

Processes versus threads:

Dependence: processes are typically independent, while threads exist as subsets of a process.

State information: processes carry considerable state information, whereas multiple threads within a process share state as well as memory and other resources.

Address space: processes have separate address spaces, whereas threads share their address space.

Interaction: processes interact only through system-provided inter-process communication mechanisms, while threads interact through shared variables or by message passing.

Context switching: context switching between threads in the same process is typically faster than context switching between processes.

Processes & Threads

Figure: a process (code, heap, files, interrupt routines) compared with a thread — each thread has its own instruction pointer (IP) and stack, while the threads of one process share the code, heap, files, and interrupt routines.

Constructs for specifying Parallelism


Multithreaded Processor Model

Parameters for analyzing the performance of such a system:

Latency (L): the communication latency experienced on a remote memory access, including network delay, cache-miss penalty, and delays caused by contention in split transactions.

Number of threads (N): the number of threads that can be interleaved in a processor; the context of a thread is its PC, register set, required context status words, etc.

Context switch overhead (C): the time lost in performing a context switch in a processor; it depends on the switch mechanism and the number of processor states needed to maintain active threads.

Interval between context switches (R): the run length, i.e. the number of cycles between context switches triggered by remote references.

Multithreaded Computation

Figure: threads of a parallel computation, showing the initial scheduling overhead and the thread synchronization overhead between them.

The concept of multithreading in MPP systems:

Processor efficiency — a processor is:

busy while doing useful work;

context switching while suspending the current context and switching to another;

idle when all available contexts are suspended (blocked).

Efficiency = busy / (busy + switching + idle)

For example, if out of every 100 cycles a processor spends 80 busy, 10 switching, and 10 idle, its efficiency is 80/100 = 0.8.

Abstract Processor Model


Multiple-context processor model with one thread per context

Figure: N contexts, each holding one thread's PC, PSW, and register file, share a single ALU; the memory references issued are either local or remote.

Context-switching policies

Switch on cache miss: switch contexts when encountering a cache miss.

Switch on every load: switch on every load operation, independent of whether or not it will cause a miss.

Switch on every instruction: switch on every instruction, independent of whether or not it is a load.

Switch on a block of instructions: switching on blocks of instructions improves the cache-hit ratio, due to the preservation of some locality, and also benefits single-context performance.

Pthreads

History: SUN Solaris, Windows NT, etc. are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.

Standard: the IEEE POSIX 1003.1c standard (1995).

Constructs for specifying Parallelism

Executing a Pthread Thread

int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);

Return value:

on success, a new thread is created and 0 is returned; *thread contains the new thread's ID;

on failure, a nonzero error code is returned.

Arguments:

thread: a pointer of type pthread_t that receives the ID of the new thread;

attr: the initial attributes for the thread; if attr is NULL, the attributes are initialized to their default values;

start_routine: a reference to a function defined by the user, containing the code the new thread executes;

arg: a single argument passed to start_routine.

pthread_t thread;   /* handle of the special Pthread datatype */

Executing a Pthread Thread(cont)

pthread_exit(void *status): terminates the calling thread.

pthread_cancel(): a thread is terminated at the request of another thread.

int pthread_join(pthread_t th, void **thread_return);

pthread_join() forces the calling thread to suspend its execution and wait until the thread with ID th terminates; *thread_return contains the return value (the value of the return statement or of the pthread_exit() call).

Detached threads

There are cases in which threads can be terminated without the need for a pthread_join().

When detached threads terminate, they are destroyed and their resources released immediately => more efficient.

Figure: the main program launches threads with pthread_create(); each detached thread runs to its own termination, and no join is performed.

Constructs for specifying Parallelism

Thread Pools

A master thread could control a collection of slave threads. A work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.

Constructs for specifying Parallelism

Thread-safe routines

System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.

Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.

Constructs for specifying Parallelism

Thread-safe routines (cont.)

Suppose that your application creates several threads, each of which makes a call to the same library routine:

This library routine accesses/modifies a global structure or location in memory.

As each thread calls this routine, it is possible that they may try to modify this global structure or memory location at the same time.

If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.

Constructs for specifying Parallelism

Sharing Data

1. Creating Shared Data

2. Accessing Shared Data

   Locks; Deadlock; Semaphores; Condition Variables

3. Language Constructs for Parallelism

4. Dependency Analysis

5. Shared Data in systems with caches


Conditions for Deadlock

1 Mutual exclusion bull Resource can not be shared bull Requests are delayed until resource is released

2 Hold-and-wait bull Thread holds one resource while waits for another

3 No preemption bull Resources are released voluntarily after completion

4 Circular wait bull Circular dependencies exist in ldquowaits-forrdquo or ldquoresource-allocationrdquo graphs

ALL four conditions MUST hold

wwwthemegallerycom Company Logo

Handing Deadlock

Text

Text

Text

Txt

Deadlock prevention Deadlock avoidance

Deadlock detection and recovery

Ignore

wwwthemegallerycom Company Logo

Deadlock

88b n-processes deadlock 88a 2 processes dealock

R1R2hellipRn lagrave tagravei nguyecircn P1P2hellipPn lagrave quaacute trigravenh

Example

LOGO

Semaphore

wwwthemegallerycom Company Logo

sEMaPHoRE

A positive integer operated upon by 2 operations P amp V

The value is the number of the units of the resource which are free

A binary semaphore has value 0 or 1A general semaphore can take on

positive values other than 0 and 1

wwwthemegallerycom Company Logo

sEMaPHoRE

P amp V operations are performed indivisibly

P(s) waits until s is greater than 0 then decrements s by 1 allows the process to continue

V(s) increments s by 1 to release one of the waiting processes (if any)

wwwthemegallerycom Company Logo

sEMaPHoRE

The first process reach its P(s) operation or to be accepted will set the semaphore to 0

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore

wwwthemegallerycom Company Logo

sEMaPHoRE

When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section

Monitor

Disadvange of Semaphorebull Although semaphore provide a

convenient and effective mechanism for process synchronization using them incorrectly can result in timing error that are difficult to detect (deadlock) since these errors happen only if some particular execution sequences take place and these sequences do not always occur

bull Using incorrectly may be caused by an honest programming error or an uncooperative programmer

Shared memory multiprocessor

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur

bull Both processes are simultaneously active will cause a deadlock

Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type

bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only

method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events

conflict

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A type or abstract data type encapsulates private data with public method to operate on that data

bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor

bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()

procedure P2()

procedure Pn()

initialization_code ()

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Usage Monitor

bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly

bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization

bull Some addiontional synchronization ldquotailor moderdquo use conditional construct

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Conditional type

bull Declare conditional xybull Only operations that can be invoke on a

conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until

another process invokesxsignal()the process invoking this operation resumes exatly one

suspended processbull if no process is suspended xsignal() has

no effect

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor conditional type

Condition variables

Condition variablesAllow threads to synchronize based

upon the actual value of data Without condition variables the

programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity

Always used in conjunction with a mutex lock

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified

condition variable cond (if any threads are blocked on cond)

int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)

initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised

Pthread_cond_t cond Declare condition variable

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable

cond

int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in

effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined

int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)

int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)

block on a condition variable the second one alow to appoint timeout

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

#include <pthread.h>

int count = 0;                        /* global variable     DECLARE    */
pthread_mutex_t count_mutex;
pthread_cond_t  count_threshold_cv;
...
/* main: */
pthread_mutex_init(&count_mutex, NULL);            /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
...
pthread_create(...);                  /* CREATE THREADS TO DO WORK */

/* Threads 2 and 3: */
void *inc_count(void *t)
{
    ...
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(&count_threshold_cv);
        ...
        pthread_mutex_unlock(&count_mutex);
        /* Do some work so threads can alternate on the mutex lock */
        sleep(1);
    }
    pthread_exit(NULL);
}

/* Thread 1: */
void *watch_count(void *t)
{
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)
        pthread_cond_wait(&count_threshold_cv, &count_mutex);
    count += 125;
    pthread_mutex_unlock(&count_mutex);
    pthread_exit(NULL);
}

Sharing Data

The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages, as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach the shared segment to the data segment of the calling process

Return the starting address of the data segment

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: CONFLICT

- Reading the variable by different processes does not cause a conflict,
- but writing a new value does.

Ex: Consider two processes, each of which is to add 1 to a shared data item x.

    Instruction    Process 1        Process 2
    x = x + 1;     read x           read x
                   compute x + 1    compute x + 1
                   write to x       write to x
                                               (time)

[Figure: conflict in accessing shared data. Both processes read the shared variable x, each adds 1, and each writes the result back; one of the two increments is lost.]

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.

This mechanism is called mutual exclusion.

lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

- A lock is a variable holding the value 0 or 1:
  - lock = 1: a process has entered the critical section;
  - lock = 0: no process is in the critical section.

The lock operates much like a door lock. Suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section; it then has to wait until it is allowed to enter the critical section.

while (lock == 1) do_nothing;   /* no operation in while loop */
lock = 1;                       /* enter critical section */
/* ... critical section ... */
lock = 0;                       /* leave critical section */

A lock of this kind is called a spin lock, and the mechanism is known as busy waiting.

In some cases it may be possible to deschedule the process from the processor and schedule another process instead. This incurs overhead in saving and restoring process information, and it may be necessary to choose the best or highest-priority process to enter the critical section.

Process 1:                      Process 2:

while (lock == 1) do_nothing;
lock = 1;
/* critical section */          while (lock == 1) do_nothing;
lock = 0;
                                lock = 1;
                                /* critical section */
                                lock = 0;

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes), which resolve synchronization problems between threads.

- A mutex is used to share resources among threads in an orderly way.
- It provides mutual exclusion between threads.

Note: a mutex can only be used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.


#include <pthread.h>   /* the header containing the mutex functions */

Declaration: pthread_mutex_t mutex;

Static initialization:
    mutex = PTHREAD_MUTEX_INITIALIZER;
    mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function:
    int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);

Important functions:
    int pthread_mutex_lock(pthread_mutex_t *mutex);
    int pthread_mutex_unlock(pthread_mutex_t *mutex);
    int pthread_mutex_trylock(pthread_mutex_t *mutex);
    int pthread_mutex_destroy(pthread_mutex_t *mutex);


Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.


Example

Ex1:
    forall (i = 0; i < 5; i++)
        a[i] = 0;
All instances can be executed simultaneously.

Ex2:
    forall (i = 2; i < 6; i++) {
        x = i - 2*i + i*i;
        a[i] = a[x];
    }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered by process P(m).

If the three conditions

    I1 ∩ O2 = ∅
    I2 ∩ O1 = ∅
    O1 ∩ O2 = ∅

are all satisfied, the two processes can be executed concurrently.


Dependency analysis

Example 1: Suppose the two statements are (in C):

    a = x + y;
    b = x + z;

Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, and

    I1 ∩ O2 = ∅,  I2 ∩ O1 = ∅,  O1 ∩ O2 = ∅,

so the two statements can be executed simultaneously.


Dependency analysis

Example 2: Suppose the two statements are (in C):

    a = x + y;
    b = a + b;

Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}. Here

    I2 ∩ O1 = {a} ≠ ∅,

so the two statements cannot be executed simultaneously.

Language Contructs for Parallelism

Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared as

    shared int x;

In C/C++, a variable declared global (e.g. int x; at file scope) is shared among the threads of a process.

par Construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

    par {
        s1;
        s2;
        ...
        sn;
    }

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads could be specified by listing the routines that are to be executed concurrently:

    par {
        proc1();
        proc2();
        ...
        procn();
    }

forall Construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

    forall (i = 0; i < n; i++) {
        s1;
        s2;
        ...
        sm;
    }

which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, ..., sm. Each process uses a different value of i. For example,

    forall (i = 0; i < 5; i++)
        a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.

Shared Data in Systems with Caches

Cache coherence protocols:
- In the update policy, copies of the data in all caches are updated at the time one copy is altered.
- In the invalidate policy, when one copy of the data is altered, the same data in any other cache is invalidated; these copies are only updated when the associated processor makes a reference to the data.


Shared Data in Systems with Caches

False sharing: the key characteristic here is that caches are organized in blocks of contiguous locations. Different parts of a block may be required by different processors, even though the processors do not touch the same bytes.

Shared Data in Systems with Caches

Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

The only way to avoid false sharing completely would be to place each element in a different block, which would create significant wastage of storage for a large array.


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central memory or a distributed memory architecture.

- Distributed memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote-memory-access) architectures.
- Central memory systems are also known as UMA (uniform memory access) systems.


Central memory versus distributed memory

In a UMA system, all memory locations are at an equal distance from any processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types: PVP, the parallel vector processor (also called a vector supercomputer), and SMP, the symmetric multiprocessor.


Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.

Distributed-Memory architecture

In a NORMA machine:
- The node memories have separate address spaces.
- A node cannot directly access remote memory.
- The only way to access remote data is by passing messages.


Distributed-Memory architecture

In an NCC-NUMA machine:
- A typical example is the Cray T3E.
- Besides the local memory, each node has a set of node-level registers called E-registers.
- Other NCC-NUMA systems may allow loading a remote value directly into a processor register.


Distributed-Memory architecture

In a COMA machine:
- All local memories are structured as caches (called COMA caches).
- A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node.
- COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA and COMA

- An NCC-NUMA system does not have hardware support for cache coherency, whereas CC-NUMA and COMA provide cache coherency support in hardware.
- It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.


CC-NUMA versus COMA

- CC-NUMA: main memory consists of all the local memories.
- COMA: main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.


Characteristics of five Distributed-Memory architectures


UNIX Heavyweight Processes

- If the child is to execute different code, we could use:

    pid = fork();
    if (pid == 0) {
        /* ... code to be executed by slave ... */
    } else {
        /* ... code to be executed by parent ... */
    }
    if (pid == 0) exit(0); else wait(0);

Constructs for specifying Parallelism

UNIX Heavyweight Processes

- All variables in the original program are duplicated in each process, becoming local variables for the process. They are initially assigned the same values as the original variables.
- The parent will wait for the slave to finish if it reaches the "join" point first; if the slave reaches the "join" point first, it will terminate.

Constructs for specifying Parallelism

Threads

- Thread: a thread of execution is a fork of a computer program into two or more concurrently running tasks.
- Thread mechanism: threads share the same memory space and global variables.

Constructs for specifying Parallelism

Processes vs. threads:
- Dependence: processes are typically independent, while threads exist as subsets of a process.
- State information: processes carry considerable state information, whereas multiple threads within a process share state as well as memory and other resources.
- Address space: processes have separate address spaces, whereas threads share their address space.
- Interaction: processes interact only through system-provided inter-process communication mechanisms, while threads interact through shared variables or by message passing.
- Context switching: context switching between threads in the same process is typically faster than context switching between processes.

Processes & Threads

[Diagram: a heavyweight process, with code, heap, stack, files, interrupt routines, and one instruction pointer (IP), compared with a multithreaded process, in which the threads share the code, heap, files, and interrupt routines but each thread has its own stack and instruction pointer.]

Constructs for specifying Parallelism

Multithreaded Processor Model

Parameters for analyzing the performance of the system:
- Latency L: the communication latency experienced on a remote memory access, including network delay, cache-miss penalty, and delays caused by contention in split transactions.
- Number of threads N: the number of threads that can be interleaved in a processor. The context of a thread = PC, register set, required context status words, ...
- Context switch overhead C: the time lost in performing a context switch in a processor.
- Switch mechanism: the number of processor states needed to maintain active threads.
- Interval between context switches: the run length (the cycles between context switches triggered by remote references).

[Diagram: a multithreaded computation, showing threads of the parallel computation with initial scheduling overhead and thread synchronization overhead between them.]

The concept of multithreading in an MPP system: processor efficiency.
- Busy: the processor is doing useful work.
- Context switch: suspending the current context and switching to another.
- Idle: all available contexts are suspended (blocked).

Efficiency = busy / (busy + switching + idle)

Abstract Processor Model

[Diagram: a multiple-context processor model with one thread per context. N contexts, each holding its own PC and PSW in the register files, share a single ALU; memory references are divided into local memory references and remote memory references.]

Context-switching policies


- Switch on cache miss: switch when encountering a cache miss.
- Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.
- Switch on every instruction: switch on every instruction, independent of whether or not it is a load.
- Switch on a block of instructions: improves the cache-hit ratio due to the preservation of some locality, and also benefits single-context performance.

Pthread Thread

- History: SUN Solaris, Windows NT, and others are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.
- Standard: IEEE POSIX 1003.1c (1995).

Constructs for specifying Parallelism

Executing a Pthread Thread

int pthread_create(pthread_t *thread, pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);

Return value:
- on success, a new thread is created and 0 is returned; *thread contains the new thread's ID;
- on failure, a nonzero error number is returned.

Arguments:
- thread: a pointer of type pthread_t that receives the ID of the new thread;
- attr: the initial attributes for the thread; if attr is NULL, the attributes are initialized to default values;
- start_routine: a reference to a function defined by the user, containing the code the new thread executes;
- arg: a single argument passed to start_routine.

pthread_t thread;  /* handle of the special Pthread datatype */

Executing a Pthread Thread(cont)

- pthread_exit(void *status): terminates and destroys the calling thread.
- pthread_cancel(): a thread is destroyed by another thread.
- int pthread_join(pthread_t th, void **thread_return): forces the calling thread to suspend its execution and wait until the thread with ID th terminates. *thread_return contains the return value (the value of the return statement or of the pthread_exit() call).

Detached threads: there are cases in which threads can be terminated without the need for pthread_join(). When detached threads terminate, they are destroyed and their resources released immediately, which is more efficient.

[Diagram: the main program issues pthread_create() calls; each detached thread runs and terminates on its own, with no join back into the main program.]

Constructs for specifying Parallelism

Thread Pools

A master thread could control a collection of slave threads, forming a work pool of threads. Threads can communicate through shared locations or, as we shall see, by using signals.

Constructs for specifying Parallelism

Thread-safe routines

System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.

Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.

Constructs for specifying Parallelism

Thread-safe routines (cont)

Suppose that your application creates several threads, each of which makes a call to the same library routine:
- This library routine accesses/modifies a global structure or location in memory.
- As each thread calls the routine, it is possible that they may try to modify this global structure/memory location at the same time.
- If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.

Constructs for specifying Parallelism

Sharing Data
1. Creating Shared Data
2. Accessing Shared Data
   - Locks
   - Deadlock
   - Semaphores
   - Monitors
   - Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches

Sharing Data


Conditions for Deadlock

1. Mutual exclusion: the resource cannot be shared, and requests are delayed until the resource is released.
2. Hold-and-wait: a thread holds one resource while it waits for another.
3. No preemption: resources are released only voluntarily, after completion.
4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graphs.

ALL four conditions MUST hold.

Handling Deadlock:
- deadlock prevention,
- deadlock avoidance,
- deadlock detection and recovery,
- or simply ignoring the problem.

Deadlock

Example:

[Figures: (a) deadlock between two processes; (b) deadlock among n processes. R1, R2, ..., Rn are resources; P1, P2, ..., Pn are processes.]

Semaphore

- A semaphore is a positive integer operated upon by two operations, P and V.
- Its value is the number of units of the resource that are free.
- A binary semaphore has the value 0 or 1; a general semaphore can take on positive values other than 0 and 1.

P and V operations are performed indivisibly:
- P(s) waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
- V(s) increments s by 1 to release one of the waiting processes (if any).

Used for mutual exclusion:
- The first process to reach its P(s) operation, or to be accepted, sets the semaphore to 0 and enters the critical section.
- Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
- When a process reaches its V(s) operation, it sets the semaphore s back to 1, and one of the waiting processes is allowed to proceed into the critical section.

Monitor

Disadvantage of semaphores:
- Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors (such as deadlock) that are difficult to detect, since these errors happen only if particular execution sequences take place, and these sequences do not always occur.
- Incorrect use may be caused by an honest programming error or by an uncooperative programmer.


Monitor

Example of incorrect semaphore use:

    Right code:              Wrong code:
    ...                      ...
    wait(mutex);             signal(mutex);
    /* critical section */   /* critical section */
    signal(mutex);           wait(mutex);
    ...                      ...

This wrong code violates mutual exclusion.


Monitor

Example of incorrect semaphore use:

    Right code:              Wrong code:
    ...                      ...
    wait(mutex);             wait(mutex);
    /* critical section */   /* critical section */
    signal(mutex);           wait(mutex);
    ...                      ...

This wrong code causes deadlock.

Faculty of Computer Science & Engineering, HCMC University of Technology

Monitor

- If the programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock will occur.
- When both processes below are active simultaneously, a deadlock can occur: each process holds one semaphore while waiting for the other.

    Process P1:              Process P2:
    ...                      ...
    wait(S);                 wait(Q);
    wait(Q);                 wait(S);
    /* critical section */   /* critical section */
    signal(S);               signal(Q);
    signal(Q);               signal(S);

Monitor

- To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
- A monitor is an approach to synchronizing operations on shared resources. A monitor includes:
  - a suite of procedures that provides the only way to access the shared resource,
  - mutual exclusion among the threads calling those procedures,
  - the variables associated with the shared resource,
  - invariants that must hold to avoid conflicting accesses.

Monitor

- A type, or abstract data type, encapsulates private data with public methods to operate on that data.
- A monitor type presents a set of programmer-defined operations that are provided with mutual exclusion within the monitor.
- The monitor type also contains the declaration of the variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.

Monitor

- The structure of a monitor type:

    monitor monitor_name {
        /* shared variable declarations */
        procedure P1(...) { ... }
        procedure P2(...) { ... }
        ...
        procedure Pn(...) { ... }
        initialization_code(...) { ... }
    }

Structure of a Monitor

Using a Monitor

- The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
- The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; some additional, tailor-made synchronization mechanisms need to be defined.
- Such tailor-made synchronization uses the condition construct.

Condition type

- Declaration: condition x, y;
- The only operations that can be invoked on a condition variable are wait() and signal():
  - x.wait(): the process invoking this operation is suspended until another process invokes x.signal();
  - x.signal(): the process invoking this operation resumes exactly one suspended process;
  - if no process is suspended, x.signal() has no effect.

Structure of a monitor with condition variables

Condition variables

- Condition variables allow threads to synchronize based upon the actual value of data.
- Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be very time-consuming and unproductive, since the thread would be continuously busy in this activity.
- A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t *cond): this call unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr): initialises the condition variable cond with the attributes referenced by attr (if attr is NULL, default attributes are used).


wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 14: Seminar Shared memory Programming

UNIX Heavyweight Processes

• All variables in the original program are duplicated in each process, becoming local variables for the process. They are initially assigned the same values as the original variables.

• The parent will wait for the slave to finish if it reaches the "join" point first; if the slave reaches the "join" point first, it will terminate.

Constructs for specifying Parallelism

Threads

Thread: a thread of execution is a fork of a computer program into two or more concurrently running tasks.

Thread mechanism: allows tasks to share the same memory space & global variables.

Constructs for specifying Parallelism

• Dependence: processes are typically independent, while threads exist as subsets of a process.

• State information: processes carry considerable state information, whereas multiple threads within a process share state as well as memory and other resources.

• Address space: processes have separate address spaces, whereas threads share their address space.

• Interaction: processes interact only through system-provided inter-process communication mechanisms, while threads interact through shared variables or by message passing.

• Context switching: context switching between threads in the same process is typically faster than context switching between processes.

Processes & Threads

[Figure: a process (code, heap, files, interrupt routines, one IP and one stack) compared with a process containing a thread, where each thread has its own IP and stack while sharing code, heap, files and interrupt routines.]

Constructs for specifying Parallelism


Multithreaded Processor Model

Analyzing the performance of the system:

• Latency L: communication latency experienced with remote memory access (network delay, cache-miss penalty, delays caused by contention in split transactions).

• Number of threads N: the number of threads that can be interleaved in a processor. The context of a thread = PC, register set, required context status word, ...

• Context switch overhead C: time lost in performing a context switch in a processor. The switch mechanism determines the number of processor states needed to maintain active threads.

• Interval between context switches R: run length (cycles between context switches triggered by remote references).

Multithreaded Computation

[Figure: threads of a parallel computation, with initial scheduling overhead, thread synchronization overhead, and the variables shared between threads marked.]

The concept of multithreading in MPP systems. A processor is:

• Busy: doing useful work.
• Context switching: suspending the current context & switching to another.
• Idle: when all available contexts are suspended (blocked).

Efficiency = busy / (busy + switching + idle)

Abstract Processor Model

Multiple-context processor model with one thread per context

[Figure: N contexts, each holding one thread context (PC and PSW) in the register files, feeding a single ALU; memory references are either local or remote.]

Context-switching policies


• Switch on cache miss: switch when encountering a cache miss.

• Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.

• Switch on every instruction: switch on every instruction, independent of whether or not it is a load.

• Switch on a block of instructions: improves the cache-hit ratio due to preservation of some locality & also benefits single-context performance.

Pthreads

History: SUN Solaris, Windows NT, ... are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.

Standard: IEEE POSIX 1003.1c (1995).

Constructs for specifying Parallelism

Executing a Pthread Thread

pthread_t thread;   /* handle, a special Pthread data type */

int pthread_create(pthread_t *thread, pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);

Return value:
• success: a new thread is created and 0 is returned; *thread contains the thread ID.
• failure: a nonzero error code is returned.

Arguments:
• thread: a pointer of type pthread_t; receives the ID of the new thread.
• attr: initial attributes for the thread; if attr = NULL, the attributes are initialized to default values.
• start_routine: a reference to a user-defined function containing the code the new thread executes.
• arg: a single argument passed to start_routine.

Executing a Pthread Thread(cont)

pthread_exit(void *status): terminates & destroys the calling thread.
pthread_cancel(): a thread is destroyed by another thread.

int pthread_join(pthread_t th, void **thread_return);
pthread_join() forces the calling thread to suspend its execution & wait until the thread with ID th terminates. *thread_return contains the return value (the value of the return statement or of pthread_exit(...)).
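As a minimal sketch of these calls taken together (the worker function, the helper name create_and_join, and the value 41 are illustrative assumptions, not from the slides), one thread can be created, joined, and its return value read back:

```c
#include <pthread.h>

/* Worker: increments the int it is handed and returns the same pointer. */
static void *start_routine(void *arg)
{
    int *n = (int *)arg;
    *n += 1;
    return arg;
}

/* Create one thread, suspend on pthread_join, then read its result. */
int create_and_join(void)
{
    pthread_t thread;
    void *thread_return;
    int n = 41;

    if (pthread_create(&thread, NULL, start_routine, &n) != 0)
        return -1;                        /* nonzero return = creation failed */
    pthread_join(thread, &thread_return); /* wait until the thread terminates */
    return *(int *)thread_return;         /* value passed back by the thread */
}
```

Here the joined return value is the pointer handed to pthread_create, so the caller sees the increment done by the worker.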

Detached threads: there are cases in which threads can be terminated without the need for pthread_join(). When detached threads terminate, they are destroyed & their resources released immediately => more efficient.

[Figure: a main program issuing several pthread_create() calls for detached threads; each thread runs and terminates on its own, with no join.]

Constructs for specifying Parallelism

Thread Pools

A master thread could control a collection of slave threads; a work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.

Constructs for specifying Parallelism

Thread-safe routines

System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.

Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.

Constructs for specifying Parallelism

Thread-safe routines (cont)

Suppose that your application creates several threads, each of which makes a call to the same library routine:

• This library routine accesses/modifies a global structure or location in memory.

• As each thread calls this routine, it is possible that they may try to modify this global structure/memory location at the same time.

• If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.

Constructs for specifying Parallelism

Sharing Data

1. Creating Shared Data
2. Accessing Shared Data: locks, deadlock, semaphores, monitors, condition variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches

Sharing Data


Conditions for Deadlock

1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.

2. Hold-and-wait: a thread holds one resource while it waits for another.

3. No preemption: resources are released only voluntarily, after completion.

4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.

ALL four conditions MUST hold.

Handling Deadlock

• Deadlock prevention
• Deadlock avoidance
• Deadlock detection and recovery
• Ignore

Deadlock — Example

[Figure 8.8(a): two-process deadlock; Figure 8.8(b): n-process deadlock. R1, R2, ..., Rn are resources; P1, P2, ..., Pn are processes.]

Semaphore

Semaphore

A positive integer operated upon by two operations, P & V. Its value is the number of units of the resource that are free. A binary semaphore has value 0 or 1; a general semaphore can take on positive values other than 0 and 1.

P & V operations are performed indivisibly:

• P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.

• V(s): increments s by 1 to release one of the waiting processes (if any).

The first process to reach its P(s) operation and be accepted sets the semaphore to 0. Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.

When that process reaches its V(s) operation, it sets the semaphore s back to 1, and one of the waiting processes is allowed to proceed into the critical section.
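The P/V pair maps directly onto POSIX semaphores: sem_wait is P (decrement, block at 0) and sem_post is V (increment, wake a waiter). A minimal sketch, assuming a POSIX system where unnamed semaphores are supported (the helper name semaphore_demo is an illustrative assumption):

```c
#include <semaphore.h>

/* P(s) = sem_wait, V(s) = sem_post on a binary semaphore. */
int semaphore_demo(void)
{
    sem_t s;
    int value = -1;

    sem_init(&s, 0, 1);   /* binary semaphore: 1 = resource free */
    sem_wait(&s);         /* P(s): s becomes 0, we hold the resource */
    /* ... critical section ... */
    sem_post(&s);         /* V(s): s goes back to 1, releasing a waiter */
    sem_getvalue(&s, &value);
    sem_destroy(&s);
    return value;
}
```

After one balanced P/V pair the semaphore's value is back at its initial 1, which is what makes it reusable as a mutual-exclusion lock.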

Monitor

Disadvantages of semaphores:

• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (deadlock), since these errors happen only if particular execution sequences take place, and those sequences do not always occur.

• Incorrect use may be caused by an honest programming error or an uncooperative programmer.

Monitor

Example of wrong semaphore use:

Right code:
    ...
    wait(mutex);
    /* critical section */
    signal(mutex);
    ...

Wrong code:
    ...
    signal(mutex);
    /* critical section */
    wait(mutex);
    ...
This wrong code violates mutual exclusion.

Monitor

Wrong code:
    ...
    wait(mutex);
    /* critical section */
    wait(mutex);
    ...
This wrong code causes deadlock.

Faculty of Computer Science & Engineering, HCMC University of Technology

Monitor

• If the programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock will occur.

• With both processes simultaneously active, acquiring two semaphores in opposite orders causes a deadlock:

Process P1: ... wait(S); wait(Q); /* critical section */ signal(S); signal(Q); ...
Process P2: ... wait(Q); wait(S); /* critical section */ signal(Q); signal(S); ...

Monitor

• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.

• A monitor is an approach to synchronizing operations on shared resources. A monitor includes:
  - a suite of procedures that provide the only way to access a shared resource;
  - mutual exclusion between the callers of those procedures;
  - the variables associated with the shared resource;
  - invariants assumed to hold, to avoid conflicting events.

Monitor

• A type, or abstract data type, encapsulates private data with public methods to operate on that data.

• A monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.

• The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of procedures or functions that operate on those variables.

Monitor

• The structure of a monitor type:

    monitor monitor_name {
        /* shared variable declarations */
        procedure P1 (...) { ... }
        procedure P2 (...) { ... }
        ...
        procedure Pn (...) { ... }
        initialization_code (...) { ... }
    }

Monitor structure [figure]

Using monitors

• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.

• The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; additional "tailor-made" synchronization mechanisms must be defined.

• Such tailor-made synchronization uses the condition construct.

Condition type

• Declaration: condition x, y;

• The only operations that can be invoked on a condition variable are wait() and signal():
  - x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
  - x.signal(): the process invoking this operation resumes exactly one suspended process.

• If no process is suspended, x.signal() has no effect.

Monitor structure with condition type [figure]

Condition variables

• Allow threads to synchronize based upon the actual value of data.

• Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be very time-consuming & unproductive, since the thread would be continuously busy in this activity.

• A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

pthread_cond_t cond;   /* declare a condition variable */

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
Initializes the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used (the effect is the same as passing the address of a default attributes object). Upon successful initialization, the condition variable becomes initialized.

int pthread_cond_signal(pthread_cond_t *cond);
Unblocks at least one of the threads blocked on the specified condition variable cond (if any threads are blocked on cond).

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t *cond);
Unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
Destroys the given condition variable; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
Block on a condition variable; the second form allows a timeout (abstime) to be specified.

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

main:

    #include <pthread.h>
    int count = 0;                                 /* global shared variable */
    pthread_mutex_t count_mutex;                   /* DECLARE */
    pthread_cond_t  count_threshold_cv;
    ...
    pthread_mutex_init(&count_mutex, NULL);        /* INITIALIZE */
    pthread_cond_init(&count_threshold_cv, NULL);
    ...
    pthread_create(...);                           /* CREATE THREADS TO DO WORK */

Threads 2 and 3:

    void *inc_count(void *t) {
        ...
        for (i = 0; i < TCOUNT; i++) {
            pthread_mutex_lock(&count_mutex);
            count++;
            if (count == COUNT_LIMIT)
                pthread_cond_signal(&count_threshold_cv);
            ...
            pthread_mutex_unlock(&count_mutex);
            sleep(1);   /* do some work so the threads alternate on the mutex */
        }
        pthread_exit(NULL);
    }

Thread 1:

    void *watch_count(void *t) {
        pthread_mutex_lock(&count_mutex);
        while (count < COUNT_LIMIT)
            pthread_cond_wait(&count_threshold_cv, &count_mutex);
        count += 125;
        pthread_mutex_unlock(&count_mutex);
        pthread_exit(NULL);
    }

Sharing Data

The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont)

shmget(): creates a shared memory segment; the return value is the shared memory ID.

shmat(): attaches the shared segment to the data segment of the calling process; returns the starting address of the attached segment.
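A minimal sketch of the two calls on a System V IPC system (the 4096-byte size, the 0600 permissions, and the helper name shared_segment_demo are illustrative assumptions):

```c
#include <sys/ipc.h>
#include <sys/shm.h>
#include <string.h>

/* Create a segment, attach it, use it through the returned address. */
int shared_segment_demo(void)
{
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (shmid < 0)
        return -1;                               /* shmget failed */

    char *addr = (char *)shmat(shmid, NULL, 0);  /* starting address */
    if (addr == (char *)-1)
        return -1;                               /* shmat failed */

    strcpy(addr, "shared");                      /* ordinary memory access */
    int ok = (strcmp(addr, "shared") == 0);

    shmdt(addr);                                 /* detach from this process */
    shmctl(shmid, IPC_RMID, NULL);               /* mark segment for removal */
    return ok;
}
```

A second process that attached the same segment ID with shmat() would see the same bytes, which is the point of the mechanism.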

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: CONFLICT.

o Reading the variable by different processes does not cause conflict,
o but writing new values can.

Example: consider two processes, each of which is to add 1 to a shared data item x.

Instruction      Process 1         Process 2
x = x + 1        read x            read x
                 compute x + 1     compute x + 1
                 write to x        write to x
                                       (time →)

[Figure: conflict in accessing shared variable x — both processes read x, both add 1, both write back, so one increment is lost.]

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time. This mechanism is called mutual exclusion.

Lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock: a variable containing the value 0 or 1.

- lock = 1: a process has entered the critical section;
- lock = 0: no process is in the critical section.

The lock operates much like a door lock.

Suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section. It then has to wait until it is allowed to enter:

    while (lock == 1);   /* no operation in the while loop: spin */
    lock = 1;            /* enter critical section */
    ... critical section ...
    lock = 0;            /* leave critical section */

Such a lock is called a spin lock, and the mechanism busy waiting.

In some cases it may be possible to deschedule the process from the processor and schedule another process instead. This brings:

- overhead in saving and restoring process information;
- the need to choose the best or highest-priority process to enter the critical section.

Process 1:                     Process 2:
  while (lock == 1);             while (lock == 1);   /* spins until P1 releases */
  lock = 1;                      lock = 1;
  /* critical section */         /* critical section */
  lock = 0;                      lock = 0;

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes). A mutex:

- resolves synchronization problems between threads;
- shares resources among threads in an orderly fashion;
- provides mutual exclusion between threads.

Note: a mutex can only synchronize threads within the same process; it cannot synchronize threads belonging to different processes.

#include <pthread.h>   /* header declaring the mutex functions */

Declaration: pthread_mutex_t mutex;

Static initialization:
    mutex = PTHREAD_MUTEX_INITIALIZER;
    mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function:
    int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);

The important functions:

    int pthread_mutex_lock(pthread_mutex_t *mutex);
    int pthread_mutex_unlock(pthread_mutex_t *mutex);
    int pthread_mutex_trylock(pthread_mutex_t *mutex);
    int pthread_mutex_destroy(pthread_mutex_t *mutex);
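Putting the lock/unlock pair around the earlier x = x + 1 conflict gives a sketch like the following (the helper names add_one and run_two_adders are illustrative assumptions, not from the slides):

```c
#include <pthread.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static int x = 0;

/* The read-compute-write sequence is now one indivisible critical
   section, so the lost-update conflict cannot occur. */
static void *add_one(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&mutex);    /* enter critical section */
    x = x + 1;                     /* read x, compute x + 1, write to x */
    pthread_mutex_unlock(&mutex);  /* leave critical section */
    return NULL;
}

int run_two_adders(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, add_one, NULL);
    pthread_create(&t2, NULL, add_one, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return x;                      /* always 2 */
}
```

Without the mutex the result could be 1 or 2 depending on the interleaving; with it, both increments always survive.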

Dependency analysis

One of the key issues in all parallel programming is identifying which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.

Example

Ex. 1:
    forall (i = 0; i < 5; i++)
        a[i] = 0;
All instances can be executed simultaneously.

Ex. 2:
    forall (i = 2; i < 6; i++) {
        x = i - 2*i + i*i;
        a[i] = a[x];
    }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

- I(i) is the set of memory locations read by process P(i).
- O(i) is the set of memory locations altered by process P(i).

If the three conditions

    I1 ∩ O2 = ∅
    I2 ∩ O1 = ∅
    O1 ∩ O2 = ∅

are all satisfied, the two processes can be executed concurrently.

Dependency analysis

Example 1: Suppose the two statements are (in C):
    a = x + y;
    b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, so
    I1 ∩ O2 = I2 ∩ O1 = O1 ∩ O2 = ∅
and the two statements can be executed simultaneously.

Dependency analysis

Example 2: Suppose the two statements are:
    a = x + y;
    b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, so
    I2 ∩ O1 = {a} ≠ ∅
and the two statements cannot be executed simultaneously.
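The three set tests are easy to mechanize. The sketch below encodes each variable as one bit of a mask, so set intersection becomes bitwise AND (the enum names and the helper bernstein_ok are illustrative assumptions):

```c
/* Each bit names one variable: bit set = variable is in the set. */
enum { VA = 1 << 0, VB = 1 << 1, VX = 1 << 2, VY = 1 << 3, VZ = 1 << 4 };

/* Bernstein's conditions: concurrent execution is safe iff
   I1 ∩ O2, I2 ∩ O1 and O1 ∩ O2 are all empty. */
int bernstein_ok(unsigned i1, unsigned o1, unsigned i2, unsigned o2)
{
    return (i1 & o2) == 0 && (i2 & o1) == 0 && (o1 & o2) == 0;
}
```

For Example 1, bernstein_ok(VX | VY, VA, VX | VZ, VB) yields 1; for Example 2, bernstein_ok(VX | VY, VA, VA | VB, VB) yields 0 because I2 ∩ O1 = {a}.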

Language Constructs for Parallelism

Shared data: In a parallel programming language supporting shared memory, a shared variable might be declared as:
    shared int x;
In C/C++ with threads, global variables are shared.

par construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
    par { s1; s2; ...; sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
    par { proc1(); proc2(); ...; procn(); }

forall construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
    forall (i = 0; i < n; i++) { s1; s2; ...; sm; }
which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, ..., sm. Each process uses a different value of i. For example,
    forall (i = 0; i < 5; i++)
        a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
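In plain C with Pthreads, the forall above can be approximated by spawning one thread per loop instance (the helper names body and forall_clear are illustrative assumptions):

```c
#include <pthread.h>
#include <stddef.h>

#define N 5
static int a[N];

/* Body of the loop; each thread receives a different value of i. */
static void *body(void *arg)
{
    size_t i = (size_t)arg;
    a[i] = 0;
    return NULL;
}

/* forall (i = 0; i < N; i++) a[i] = 0;  -- one thread per instance. */
int forall_clear(void)
{
    pthread_t t[N];
    size_t i;
    int sum = 0;

    for (i = 0; i < N; i++)
        a[i] = 1 + (int)i;             /* nonzero so the clearing is visible */
    for (i = 0; i < N; i++)
        pthread_create(&t[i], NULL, body, (void *)i);
    for (i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    for (i = 0; i < N; i++)
        sum += a[i];
    return sum;                        /* 0 once every element is cleared */
}
```

The instances are independent (each writes a different a[i]), which is exactly why this forall is safe; Ex. 2 above would not be.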

Shared Data in Systems with Caches

Cache coherence protocols:

- In the update policy, copies of data in all caches are updated at the time one copy is altered.

- In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next references the data.

False sharing: the key characteristic involved is that caches are organized in blocks of contiguous locations. False sharing occurs when different processors require different parts of the same block, but not the same bytes.

[Figure: false sharing — two processors repeatedly writing different words that lie in the same cache block.]

Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered by only one processor into different blocks. For an array, the only way to avoid false sharing entirely would be to place each element in a different block, which would waste significant storage for a large array.
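The compiler-layout idea can also be applied by hand, padding each per-processor item out to a full block. The 64-byte block size and the struct name below are illustrative assumptions (real block sizes vary by machine):

```c
/* One counter per processor. The pad rounds each counter up to a full
   64-byte block, so two processors never write within the same block. */
#define BLOCK_SIZE 64

struct padded_counter {
    long value;
    char pad[BLOCK_SIZE - sizeof(long)];
};

struct padded_counter counter[4];  /* element i touched only by processor i */
```

This trades storage for speed in exactly the way the slide describes: 4 longs now occupy 4 blocks instead of part of one.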

Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central memory or a distributed memory architecture. Central memory systems are also known as UMA (uniform memory access) systems. Distributed memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote-memory-access) architectures.

In a UMA system, all memory locations are at an equal distance from any processor, and all memory accesses take roughly the same amount of time. UMA systems come in two types:

- PVP: the parallel vector processor, also called a vector supercomputer;
- SMP: the symmetric multiprocessor.

Distributed-Memory architecture

A distributed memory computer contains multiple nodes, each having one or more processors and a local memory. Memories in other nodes are called remote memories.

Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.


Distributed-Memory architecture

In a NORMA machine:

- the node memories have separate address spaces;
- a node can't directly access remote memory;
- the only way to access remote data is by passing messages.

Distributed-Memory architecture

In an NCC-NUMA machine (a typical example is the Cray T3E), each node has, besides the local memory, a set of node-level registers called E-registers. Other NCC-NUMA systems may allow loading a remote value directly into a processor register.

Distributed-Memory architecture

In a COMA machine, all local memories are structured as caches (called COMA caches). A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node. COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.

NCC-NUMA versus CC-NUMA and COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, while CC-NUMA and COMA provide cache coherency support in hardware. It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.

CC-NUMA versus COMA

CC-NUMA: main memory consists of all the local memories.
COMA: main memory consists of all the COMA caches.

All the extra complexity makes a COMA system more expensive to implement than a NUMA machine.

Characteristics of the five distributed-memory architectures [table]

LOGO

Page 15: Seminar Shared memory Programming

Threads

Thread a thread of execution is a fork of a

computer program into two or more concurrently running tasks

Thread mechanism Allow to share the same memory space amp

global variables

Constructs for specifying Parallelism

Context Switching

Interaction

Address Space

State Infomation

Dependence bull processes are typically independent while threads exist as subsets of a process

bull processes carry considerable state information where multiple threads within a process share state as well as memory and other resources

bull processes have separate address spaces where threads share their address space

bullprocesses interact only through system-provided inter-process communication mechanisms while threads interact through shared variable or by message passing

bullContext switching between threads in the same process is typically faster than context switching between processes

Processes amp Threads

Interupt Routines

File

IP

Code Heap

Stack

IPStackInterupt Routines

File

IP

Code Heap

Stack Thread

Process

Constructs for specifying Parallelism

wwwthemegallerycom Company Logo

Multithreaded Processor Model

Analyze performance of system Latency L communication latency

experienced with remote memory access network delay cache-miss penalty delays caused by

contentions in split transactions

Number of threads N Number of thread that can be interleaved in a processor

Context of a thread =PCregister set required context status word hellip

Context switch overhead C time lost in performing context switch in a processor

Switch mechanism number of processor states needed to maintain active threads

Interval between context switches run length (cycles between context switch triggered by remote reference)

Multithreaded Computation

Initial Scheduling overhead Thread Synchronization overhead

Thread of Parallel Computation

Variable

Computation

The concept of multithreading in MPP system

Processor efficiency Busy do useful work Context switch suspend current

context amp switch to another Idle when all availble context

suspended (blocked)

Efficient = Busy (busy + switching + idle)

Abtract Processor Model

wwwthemegallerycom Company Logo

Multiple-context processor model with one thread per context

PC

PSW

PC

PSW

PC

PSW

ALU Local memory reference

Remote memory reference

Register Files

N Contexts

1 Thread context

Context-switching policies

wwwthemegallerycom Company Logo

Switch on cache miss when encoutering a cache miss

Switch on every load switching on every load operation independent of whether it will cause a miss or not

Switch on every instruction switching on every instruction insependent of whether or not it is a load

Switch on block of instruction will improve the cache-hit ratio due to preservation of some locality amp also benefit singe-context performance

Pthread Thread

History SUN Solaris Window NT hellip are examples the multithreaded operating systems allow users to employ threads in their programsbut each system is different

StandardIEEE POSIX 10031c

standard (1995)

Constructs for specifying Parallelism

Executing a Pthread Thread

int pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine)(void *), void *arg);

Return value:
- success: a new thread is created and 0 is returned; *thread contains the new thread's ID.
- failure: a nonzero error code is returned.

Arguments:
- thread: a pointer of type pthread_t; receives the ID of the new thread.
- attr: initial attributes for the thread; if attr is NULL, the attributes are initialized to default values.
- start_routine: a reference to a user-defined function containing the code the new thread executes.
- arg: a single argument passed to start_routine.

pthread_t thread: handle of a special Pthread datatype.
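These calls fit together as follows; a minimal sketch of creating one thread, waiting for it, and collecting its result (the names square_worker and run_square are illustrative):

```c
#include <pthread.h>
#include <stdlib.h>

/* Worker: receives a pointer to an int, returns its square on the heap. */
static void *square_worker(void *arg) {
    int n = *(int *)arg;
    int *result = malloc(sizeof(int));
    *result = n * n;
    pthread_exit(result);        /* same effect as `return result;` */
}

/* Create one thread, join it, and return the value it produced. */
int run_square(int n) {
    pthread_t tid;
    void *ret;
    if (pthread_create(&tid, NULL, square_worker, &n) != 0)
        return -1;               /* creation failed */
    pthread_join(tid, &ret);     /* suspend until the thread terminates */
    int value = *(int *)ret;
    free(ret);
    return value;
}
```

Note that &n stays valid here only because pthread_join is called before run_square returns.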

Executing a Pthread Thread(cont)

pthread_exit(void *status): terminates & destroys a thread.

pthread_cancel(): the thread is destroyed by another thread.

int pthread_join(pthread_t th, void **thread_return);
pthread_join() forces the calling thread to suspend its execution & wait until the thread with the given thread ID terminates. *thread_return contains the return value (the value of a return statement or of a pthread_exit(…) statement).

Detached Thread

There are cases in which threads can be terminated without the need for pthread_join.

When detached threads terminate, they are destroyed & their resources released => more efficient.

[Figure: a main program creating threads with pthread_create(); the main program and each created thread run until their own termination.]

Constructs for specifying Parallelism

Thread Pools

A master thread could control a collection of slave threads. A work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.

Constructs for specifying Parallelism

Thread-safe routines

System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.

Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.

Constructs for specifying Parallelism

Thread-safe routines (cont.)

Suppose that your application creates several threads, each of which makes a call to the same library routine.

This library routine accesses/modifies a global structure or location in memory.

As each thread calls this routine, it is possible that they may try to modify this global structure/memory location at the same time.

If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.
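A routine of this kind is made thread-safe by guarding the shared state with a mutex; a minimal sketch (hits and record_hit are illustrative names):

```c
#include <pthread.h>

static int hits = 0;                               /* shared state */
static pthread_mutex_t hits_lock = PTHREAD_MUTEX_INITIALIZER;

/* Thread-safe: the read-modify-write on `hits` is a critical section,
   so concurrent callers cannot interleave inside it. */
int record_hit(void) {
    pthread_mutex_lock(&hits_lock);
    int now = ++hits;
    pthread_mutex_unlock(&hits_lock);
    return now;
}
```

Without the lock, two threads could both read the same old value of hits and one increment would be lost.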


Constructs for specifying Parallelism

Sharing Data

1. Creating Shared Data
2. Accessing Shared Data: Locks, Deadlock, Semaphores, Monitors, Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches

Sharing Data


Conditions for Deadlock

1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.

2. Hold-and-wait: a thread holds one resource while it waits for another.

3. No preemption: resources are released only voluntarily, after completion.

4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.

ALL four conditions MUST hold.


Handling Deadlock

Deadlock prevention
Deadlock avoidance
Deadlock detection and recovery
Ignore


Deadlock

Example:

[Figure 8.8a: two-process deadlock. Figure 8.8b: n-process deadlock. R1, R2, …, Rn are resources; P1, P2, …, Pn are processes.]


Semaphore


Semaphore

A positive integer operated upon by two operations, P & V.

The value is the number of units of the resource which are free.

A binary semaphore has value 0 or 1. A general semaphore can take on positive values other than 0 and 1.


Semaphore

P & V operations are performed indivisibly.

P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.

V(s): increments s by 1 to release one of the waiting processes (if any).


Semaphore

The first process to reach its P(s) operation will be accepted and will set the semaphore to 0.

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.


Semaphore

When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes is allowed to proceed into the critical section.
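P and V map directly onto POSIX semaphores (sem_wait is P, sem_post is V); a minimal sketch, assuming a POSIX system such as Linux, exercised from a single thread so the calls never block:

```c
#include <semaphore.h>

/* P(s): wait until s > 0, then decrement.  V(s): increment and wake a waiter. */
int semaphore_demo(void) {
    sem_t s;
    sem_init(&s, 0, 1);   /* binary semaphore, initially 1 (resource free) */
    sem_wait(&s);         /* P(s): enter the critical section, s becomes 0 */
    /* ... critical section ... */
    sem_post(&s);         /* V(s): leave, s is back to 1 */
    int value;
    sem_getvalue(&s, &value);
    sem_destroy(&s);
    return value;
}
```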

Monitor

Disadvantage of semaphores:

• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.

• Incorrect use may be caused by an honest programming error or an uncooperative programmer.



Monitor

Example: wrong semaphore use

Right code: … Wait(mutex); critical section; Signal(mutex); …

Wrong code: … Signal(mutex); critical section; Wait(mutex); … This wrong code violates mutual exclusion.


Monitor

Example: wrong semaphore use

Right code: … Wait(mutex); critical section; Signal(mutex); …

Wrong code: … Wait(mutex); critical section; Wait(mutex); … This wrong code causes deadlock.

Faculty of Computer Science & Engineering - Ho Chi Minh City University of Technology

Monitor

• If the programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock will occur.

• If both processes are simultaneously active, the following acquisition orders will cause a deadlock:

Process P1: … Wait(S); Wait(Q); critical section; Signal(S); Signal(Q);

Process P2: … Wait(Q); Wait(S); critical section; Signal(Q); Signal(S);


Monitor

• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.

• A monitor is an approach to synchronizing operations when a shared resource is used. A monitor includes:
- a suite of procedures that provides the only method to access the shared resource;
- mutual exclusion among those procedures;
- variables associated with the shared resource;
- invariants that are assumed to hold in order to avoid conflicts.


Monitor

• A type, or abstract data type, encapsulates private data with public methods to operate on that data.

• The monitor type presents a set of programmer-defined operations that are provided with mutual exclusion within the monitor.

• The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of procedures or functions that operate on those variables.


Monitor

• A structure of the monitor type:

monitor monitor_name {
    /* shared variable declarations */
    procedure P1 (…) { … }
    procedure P2 (…) { … }
    …
    procedure Pn (…) { … }
    initialization_code (…) { … }
}
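C has no built-in monitor construct, but the pattern is commonly emulated with one mutex per monitor instance that every public operation acquires on entry and releases on exit; a sketch (the counter "monitor" and its names are illustrative):

```c
#include <pthread.h>

/* Emulated monitor: private data plus a lock that every entry procedure takes. */
typedef struct {
    pthread_mutex_t lock;   /* ensures one active process inside the monitor */
    int count;              /* monitor-private state */
} counter_monitor;

/* Plays the role of initialization_code(). */
void monitor_init(counter_monitor *m) {
    pthread_mutex_init(&m->lock, NULL);
    m->count = 0;
}

/* Plays the role of procedure P1(): mutual exclusion is automatic
   because the body runs entirely between lock and unlock. */
int monitor_increment(counter_monitor *m) {
    pthread_mutex_lock(&m->lock);
    int v = ++m->count;
    pthread_mutex_unlock(&m->lock);
    return v;
}
```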


Structure Monitor


Using Monitors

• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.

• The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; some additional "tailor-made" synchronization mechanisms need to be defined.

• Such additional "tailor-made" synchronization uses the condition construct.


Condition type

• Declaration: condition x, y;

• The only operations that can be invoked on a condition variable are wait() and signal().

x.wait(): the process invoking this operation is suspended until another process invokes x.signal().

x.signal(): the process invoking this operation resumes exactly one suspended process.

• If no process is suspended, x.signal() has no effect.


Structure of a monitor with condition type

Condition variables

Condition variables allow threads to synchronize based upon the actual value of data.

Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met. This can be a very time-consuming & unproductive exercise, since the thread would be continuously busy in this activity.

A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

pthread_cond_t cond: declares a condition variable.

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
initialises the condition variable referenced by cond with attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.

int pthread_cond_signal(pthread_cond_t *cond);
unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

Sharing Data

Pthread Condition Variables (cont.)

int pthread_cond_broadcast(pthread_cond_t *cond);
unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
destroys the given condition variable; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
block on a condition variable; the second one allows a timeout to be specified.

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variables - example (cont.)

main:
#include <pthread.h>
int count = 0;                          /* global variable: DECLARE */
pthread_mutex_t count_mutex;
pthread_cond_t count_threshold_cv;
…
pthread_mutex_init(&count_mutex, NULL);             /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
…
pthread_create(…);   /* CREATE THREADS TO DO WORK */

Threads 2 & 3:
void *inc_count(void *t) {
    …
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(…);
        pthread_mutex_unlock(&count_mutex);
        sleep(1);   /* do some work so the threads can alternate on the mutex lock */
    }
    pthread_exit(NULL);
}

Thread 1:
void *watch_count(void *t) {
    pthread_mutex_lock(&count_mutex);
    if (count < COUNT_LIMIT)
        pthread_cond_wait(…);
    count += 125;
    pthread_mutex_unlock(…);
    pthread_exit(NULL);
}

Sharing Data

The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages, as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget(): creates a shared memory segment. The return value is the shared memory ID.

shmat(): attaches the shared segment to the data segment of the calling process, and returns the starting address of the segment.

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: CONFLICT
o Reading the variable by different processes does not cause a conflict.
o But writing a new value does.

Example: consider two processes, each of which is to add 1 to a shared data item x.

Instruction      Process 1        Process 2
x = x + 1        read x           read x
                 compute x + 1    compute x + 1
                 write to x       write to x
(time runs downward)

[Figure: conflict in accessing shared data — two processes each read the shared variable x, add 1, and write the result back.]
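The lost update can be replayed deterministically by spelling out the interleaving from the table above in a single thread; a sketch (the local variables r1 and r2 stand for each process's private registers):

```c
/* Deterministic replay of the conflicting interleaving:
   both processes read x before either writes, so one increment is lost. */
int lost_update_demo(void) {
    int x = 0;
    int r1 = x;       /* Process 1: read x   (sees 0) */
    int r2 = x;       /* Process 2: read x   (sees 0) */
    r1 = r1 + 1;      /* Process 1: compute x + 1 */
    r2 = r2 + 1;      /* Process 2: compute x + 1 */
    x = r1;           /* Process 1: write to x */
    x = r2;           /* Process 2: write to x -- overwrites Process 1 */
    return x;         /* 1, not the expected 2 */
}
```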

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This mechanism is called mutual exclusion.

lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:
lock = 1: a process has entered the critical section.
lock = 0: no process is in the critical section.

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) ;   /* do nothing: no operation in the while loop */
lock = 1;             /* enter critical section */
… critical section …
lock = 0;             /* leave critical section */

Such a lock is called a spin lock, and the mechanism is known as busy waiting.

In some cases it may be possible to deschedule the process from the processor and schedule another process instead. This brings:

Overhead in saving and restoring process information.

The need to choose the best or highest-priority process to enter the critical section.

Process 1:
while (lock == 1) do_nothing;
lock = 1;
/* critical section */
lock = 0;

Process 2:
while (lock == 1) do_nothing;
lock = 1;
/* critical section */
lock = 0;

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutex).

They resolve synchronization problems between threads.

A mutex is used to share resources among threads in an orderly way.

It provides mutual exclusion between threads.

Note: a mutex can only be used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.


#include <pthread.h>   /* header containing the mutex functions */

Declare a variable: pthread_mutex_t mutex;

Static initialization:
mutex = PTHREAD_MUTEX_INITIALIZER;
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function:
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);


Important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);

int pthread_mutex_unlock(pthread_mutex_t *mutex);

int pthread_mutex_trylock(pthread_mutex_t *mutex);

int pthread_mutex_destroy(pthread_mutex_t *mutex);


Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together. Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.


Example

Ex1: forall (i = 0; i < 5; i++) a[i] = 0;

All instances can be executed simultaneously.

Ex2: forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered by process P(m).

If the following three conditions are all satisfied, the two processes can be executed concurrently:

I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
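Bernstein's conditions can be checked mechanically by intersecting the read and write sets; a small sketch that represents each set as a bitmask over memory locations (the function name and bit assignments are illustrative):

```c
#include <stdbool.h>

/* Each bit of a mask stands for one memory location.
   I1, I2 = read sets; O1, O2 = write sets. */
bool bernstein_independent(unsigned I1, unsigned O1,
                           unsigned I2, unsigned O2) {
    return (I1 & O2) == 0     /* P1 reads nothing that P2 writes */
        && (I2 & O1) == 0     /* P2 reads nothing that P1 writes */
        && (O1 & O2) == 0;    /* they write disjoint locations   */
}
```

With bits x=0x1, y=0x2, z=0x4, a=0x8, b=0x10, the pair a=x+y; b=x+z passes all three conditions, while a=x+y; b=a+b fails the second one.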


Dependency analysis

Example 1: suppose the two statements are (in C):
a = x + y;
b = x + z;
------------------------------------------
I1 = (x, y), I2 = (x, z), O1 = (a), O2 = (b)
-----------------------
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅
→ the two statements can be executed simultaneously.


Dependency analysis

Example 2: suppose the two statements are (in C):
a = x + y;
b = a + b;
------------------------------------------
I1 = (x, y), I2 = (a, b), O1 = (a), O2 = (b)
-----------------------
I2 ∩ O1 ≠ ∅
→ the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data: in a parallel programming language supporting shared memory, a variable might be declared as:

shared int x;

With C++: int global x;

par Construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

par { s1; s2; …; sn; }

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads could be specified by listing the routines that are to be executed concurrently:

par { proc1(); proc2(); …; procn(); }

forall Construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) { s1; s2; …; sm; }

which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, …, sm. Each process uses a different value of i.

forall (i = 0; i < 5; i++) a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
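In C the closest widely used equivalent of forall is an OpenMP parallel loop; a sketch (compile with -fopenmp; without it the pragma is ignored and the loop runs sequentially with the same result):

```c
/* forall (i = 0; i < 5; i++) a[i] = 0;  expressed with OpenMP */
void clear5(int a[5]) {
    #pragma omp parallel for
    for (int i = 0; i < 5; i++)
        a[i] = 0;        /* each iteration touches a distinct element,
                            so the iterations are independent */
}
```

The loop satisfies Bernstein's conditions between iterations, which is what makes the forall/parallel-for form legal here.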


Shared Data in Systems with Caches


Shared Data in Systems with Caches

Cache coherence protocols:

In the update policy, copies of data in all caches are updated at the time one copy is altered.

In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated. These copies are only updated when the associated processor makes a reference to the data.


Shared Data in Systems with Caches

False sharing:

The key characteristic used is that caches are organized in blocks of contiguous locations.

False sharing occurs when different parts of a block are required by different processors, but not the same bytes.


Shared Data in Systems with Caches

Solution for false sharing:

The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

The only way to avoid false sharing completely would be to place each element in a different block, which would create significant wastage of storage for a large array.
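The separation into different blocks is usually done by padding per-processor data out to the cache-block size; a sketch assuming 64-byte cache blocks and 8-byte longs (both common on x86-64, but platform-dependent):

```c
/* Per-processor counters padded so each lands in its own 64-byte cache
   block; updates by different processors no longer share a block. */
#define CACHE_BLOCK 64

struct padded_counter {
    long value;                            /* the data actually used      */
    char pad[CACHE_BLOCK - sizeof(long)];  /* fills the rest of the block */
};

struct padded_counter counters[4];   /* one per processor, no false sharing */
```

The cost is exactly the storage wastage the text mentions: 56 of every 64 bytes here hold padding.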


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central memory or a distributed memory architecture.

Distributed memory systems include NUMA (non-uniform memory access) and NORMA (no-remote-memory-access) architectures.

Central memory systems are also known as UMA (uniform memory access) systems.


Central memory versus distributed memory

In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types:
PVP: the parallel vector processor, also called a vector supercomputer.
SMP: the symmetric multiprocessor.


Distributed-Memory architecture

A distributed memory computer contains multiple nodes, each having one or more processors and a local memory.

Memories in other nodes are called remote memories.

Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.


Distributed-Memory architecture


Distributed-Memory architecture

In a NORMA machine:
- The node memories have separate address spaces.
- A node can't directly access remote memory.
- The only way to access remote data is by passing messages.


Distributed-Memory architecture

In an NCC-NUMA machine:
- A typical example is the Cray T3E.
- Besides the local memory, each node has a set of node-level registers called E-registers.
- Other NCC-NUMA systems may allow loading a remote value directly into a processor register.


Distributed-Memory architecture

In a COMA machine:
- All local memories are structured as caches (called COMA caches).
- Such a cache has much larger capacity than the level-2 cache or the remote cache of a node.
- COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA and COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.


CC-NUMA versus COMA

CC-NUMA: main memory consists of all the local memories.

COMA: main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.


Characteristics of five Distributed-Memory architectures

LOGO

Page 16: Seminar Shared memory Programming

Context Switching

Interaction

Address Space

State Infomation

Dependence bull processes are typically independent while threads exist as subsets of a process

bull processes carry considerable state information where multiple threads within a process share state as well as memory and other resources

bull processes have separate address spaces where threads share their address space

bullprocesses interact only through system-provided inter-process communication mechanisms while threads interact through shared variable or by message passing

bullContext switching between threads in the same process is typically faster than context switching between processes

Processes amp Threads

Interupt Routines

File

IP

Code Heap

Stack

IPStackInterupt Routines

File

IP

Code Heap

Stack Thread

Process

Constructs for specifying Parallelism

wwwthemegallerycom Company Logo

Multithreaded Processor Model

Analyze performance of system Latency L communication latency

experienced with remote memory access network delay cache-miss penalty delays caused by

contentions in split transactions

Number of threads N Number of thread that can be interleaved in a processor

Context of a thread =PCregister set required context status word hellip

Context switch overhead C time lost in performing context switch in a processor

Switch mechanism number of processor states needed to maintain active threads

Interval between context switches run length (cycles between context switch triggered by remote reference)

Multithreaded Computation

Initial Scheduling overhead Thread Synchronization overhead

Thread of Parallel Computation

Variable

Computation

The concept of multithreading in MPP system

Processor efficiency Busy do useful work Context switch suspend current

context amp switch to another Idle when all availble context

suspended (blocked)

Efficient = Busy (busy + switching + idle)

Abtract Processor Model

wwwthemegallerycom Company Logo

Multiple-context processor model with one thread per context

PC

PSW

PC

PSW

PC

PSW

ALU Local memory reference

Remote memory reference

Register Files

N Contexts

1 Thread context

Context-switching policies

wwwthemegallerycom Company Logo

Switch on cache miss when encoutering a cache miss

Switch on every load switching on every load operation independent of whether it will cause a miss or not

Switch on every instruction switching on every instruction insependent of whether or not it is a load

Switch on block of instruction will improve the cache-hit ratio due to preservation of some locality amp also benefit singe-context performance

Pthread Thread

History SUN Solaris Window NT hellip are examples the multithreaded operating systems allow users to employ threads in their programsbut each system is different

StandardIEEE POSIX 10031c

standard (1995)

Constructs for specifying Parallelism

Executing a Pthread Thread

int pthread_create(pthread_t thread pthread_attr_t attr

void (start_routine)(void ) void arg) Return value of function1048698success create a new thread 0 arg contain thread ID1048698fail ltgt 0 (error signature is contain in the variable of errno) Argumentsthread a pointer of type pthread_t contain ID of new thread1048698attr contain initial attr for thread if attr = NULL Attr are init by default value1048698start_routine la reference to a function defined by user This function contain

execute code for new thread1048698arg a single argument is passed for start_routine

Pthread_t thread Hanndle of specia Pthread datatype

Executing a Pthread Thread(cont)

pthread_exit(void status) Terminate amp destroy a thread

pthread_cancel() Thread is destroyed by another process

int pthread_join(pthread_t th void thread_return) pthread_join() force call thread suspend its execution amp wait until thread having

thread ID terminates Thread_return contain the return value (value of return statement or pthread_exit(hellip) statement)

Detached ThreadThere are cases in which threads can

be terminated without needed of pthread_join

Detached Thread

When Detached Thread teminate they are destroyed amp their resource released

=gt More efficient

Main program

Pthread_create()

Termination

Thread

Pthread_create()

Pthread_create() Termination

Termination

Thread

Thread

Constructs for specifying Parallelism

Thread Pools

A master thread could control a collection of slave threads A work pool of threads could be formed Threads can communicate through shared location or as we shall see = using signals

Constructs for specifying Parallelism

Thread ndash safe routines

System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results

Thread ndash safeness an applications ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions

Constructs for specifying Parallelism

Thread ndash safe routines (cont)

Suppose that your application creates several threads each of which makes a call to the same library routine

This library routine accessesmodifies a global structure or location in memory

As each thread calls this routine it is possible that they may try to modify this global structurememory location at the same time

If the routine does not employ some sort of synchronization constructs to prevent data corruption then it is not thread-safe

fe

Constructs for specifying Parallelism

Sharing DataCreating Shared Data1

Accessing Shared Data2

Locks

Deadlock

Semaphores

Deadlock

Condition Variables

Language Constructs for Parallelism3

Dependency Analysis4

Shared Data in system with caches5

Sharing Data

Text in here

Text in here

Conditions for Deadlock

1 Mutual exclusion bull Resource can not be shared bull Requests are delayed until resource is released

2 Hold-and-wait bull Thread holds one resource while waits for another

3 No preemption bull Resources are released voluntarily after completion

4 Circular wait bull Circular dependencies exist in ldquowaits-forrdquo or ldquoresource-allocationrdquo graphs

ALL four conditions MUST hold

wwwthemegallerycom Company Logo

Handing Deadlock

Text

Text

Text

Txt

Deadlock prevention Deadlock avoidance

Deadlock detection and recovery

Ignore

wwwthemegallerycom Company Logo

Deadlock

88b n-processes deadlock 88a 2 processes dealock

R1R2hellipRn lagrave tagravei nguyecircn P1P2hellipPn lagrave quaacute trigravenh

Example

LOGO

Semaphore

wwwthemegallerycom Company Logo

sEMaPHoRE

A positive integer operated upon by 2 operations P amp V

The value is the number of the units of the resource which are free

A binary semaphore has value 0 or 1A general semaphore can take on

positive values other than 0 and 1

wwwthemegallerycom Company Logo

sEMaPHoRE

P amp V operations are performed indivisibly

P(s) waits until s is greater than 0 then decrements s by 1 allows the process to continue

V(s) increments s by 1 to release one of the waiting processes (if any)

wwwthemegallerycom Company Logo

sEMaPHoRE

The first process reach its P(s) operation or to be accepted will set the semaphore to 0

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore

wwwthemegallerycom Company Logo

sEMaPHoRE

When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section

Monitor

Disadvange of Semaphorebull Although semaphore provide a

convenient and effective mechanism for process synchronization using them incorrectly can result in timing error that are difficult to detect (deadlock) since these errors happen only if some particular execution sequences take place and these sequences do not always occur

bull Using incorrectly may be caused by an honest programming error or an uncooperative programmer

Shared memory multiprocessor

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur

bull Both processes are simultaneously active will cause a deadlock

Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type

bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only

method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events

conflict

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A type or abstract data type encapsulates private data with public method to operate on that data

bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor

bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()

procedure P2()

procedure Pn()

initialization_code ()

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Usage Monitor

• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.

• The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; some additional "tailor-made" synchronization mechanisms need to be defined.

• These additional "tailor-made" synchronization mechanisms use the condition construct.


Condition type

• Declaration: condition x, y;
• The only operations that can be invoked on a condition variable are wait() and signal():
    x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
    x.signal(): the process invoking this operation resumes exactly one suspended process.
• If no process is suspended, x.signal() has no effect.


Monitor structure with condition type

Condition variables

Condition variables allow threads to synchronize based upon the actual value of data.

Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be a very time-consuming and unproductive exercise, since the thread would be continuously busy in this activity.

A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t *cond);
  Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
  Initialises the condition variable referenced by cond with attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.

pthread_cond_t cond;   /* declare a condition variable */

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t *cond);
  Unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
  Destroys the condition variable specified by cond; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
  Both block on a condition variable; the second one allows a timeout (abstime) to be specified.

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

#include <pthread.h>

int count = 0;                         /* global shared variable */
pthread_mutex_t count_mutex;           /* DECLARE */
pthread_cond_t  count_threshold_cv;
...
/* main: INITIALIZE, then CREATE THE THREADS TO DO THE WORK */
pthread_mutex_init(&count_mutex, NULL);
pthread_cond_init(&count_threshold_cv, NULL);
...
pthread_create(...);

/* Threads 2 and 3 */
void *inc_count(void *t)
{
    ...
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(...);
        ...
        pthread_mutex_unlock(&count_mutex);
        sleep(1);  /* do some work so threads can alternate on the mutex lock */
    }
    pthread_exit(NULL);
}

/* Thread 1 */
void *watch_count(void *t)
{
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)        /* re-check the condition after waking */
        pthread_cond_wait(...);
    count += 125;
    pthread_mutex_unlock(...);
    pthread_exit(NULL);
}

Sharing Data

The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass data in messages, as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach the shared segment to the data segment of the calling process

Return the starting address of the data segment

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: CONFLICT

o Reading the variable by different processes does not cause a conflict.

o But writing a new value does: CONFLICT.

Ex: Consider two processes, each of which is to add 1 to a shared data item x.

Instruction    Process 1        Process 2
x = x + 1      read x           read x
               compute x + 1    compute x + 1
               write to x       write to x
                                          (time runs downward)

[Figure: conflict in accessing shared variable x — both processes read x, add 1, and write the result back, so one increment is lost.]

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This mechanism is called mutual exclusion.

lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:

lock = 1: a process has entered the critical section

lock = 0: no process is in the critical section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing;   /* no operation in while loop */
lock = 1;                       /* enter critical section */

  ...critical section...

lock = 0;                       /* leave critical section */

Such a lock is called a spin lock, and the mechanism is known as busy waiting.

In some cases it may be possible to deschedule the process from the processor and schedule another process instead.

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1                          Process 2

while (lock == 1) do_nothing;      while (lock == 1) do_nothing;
lock = 1;                          lock = 1;
  critical section                   critical section
lock = 0;                          lock = 0;

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).

Mutexes resolve synchronization problems between threads.

A mutex is used to share resources among threads in an orderly way.

It provides mutual exclusion between threads.

Note:

A mutex is only used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.


#include <pthread.h>   /* the library containing the mutex functions */

Declare the variable: pthread_mutex_t mutex;

Static initialization:

mutex = PTHREAD_MUTEX_INITIALIZER;

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function:

int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);


The important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);

int pthread_mutex_unlock(pthread_mutex_t *mutex);

int pthread_mutex_trylock(pthread_mutex_t *mutex);

int pthread_mutex_destroy(pthread_mutex_t *mutex);


Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.


Example

Ex1: forall (i = 0; i < 5; i++)
       a[i] = 0;

All instances can be executed simultaneously.

Ex2: forall (i = 2; i < 6; i++) {
       x = i - 2*i + i*i;
       a[i] = a[x];
     }

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's condition

- I(n) is the set of memory locations read by process P(n).

- O(m) is the set of memory locations altered by process P(m).

If the three conditions

I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅

are all satisfied, the two processes can be executed concurrently.


Dependency analysis

Example 1: Suppose the two statements are (in C):
  a = x + y;
  b = x + z;
------------------------------------------
I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}
-----------------------
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅
=> the two statements can be executed simultaneously.


Dependency analysis

Example 2: Suppose the two statements are (in C):
  a = x + y;
  b = a + b;
------------------------------------------
I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}
-----------------------
I2 ∩ O1 = {a} ≠ ∅
=> the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data
In a parallel programming language supporting shared memory, a shared variable might be declared as:

  shared int x;

as opposed to an ordinary (global) declaration such as int x;

par Construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

  par {
    s1;
    s2;
    ...
    sn;
  }

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

  par {
    proc1();
    proc2();
    ...
    procn();
  }

forall Construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

  forall (i = 0; i < n; i++) {
    s1;
    s2;
    ...
    sm;
  }

which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, …, sm. Each process uses a different value of i. For example,

  forall (i = 0; i < 5; i++)
    a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.

Shared Data in Systems with Caches


Shared Data in Systems with Caches

Cache coherence protocols:

In the update policy, copies of data in all caches are updated at the time one copy is altered.

In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated. These copies are updated only when the associated processor makes a reference to the data.


Shared Data in Systems with Caches

False sharing:

The key characteristic used is that caches are organized in blocks of contiguous locations.

False sharing occurs when different parts of a block are required by different processors, but not the same bytes.


Shared Data in Systems with Caches

Solutions for false sharing:

The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant waste of storage for a large array.


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central memory or a distributed memory architecture.

Distributed memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote memory access) architectures.

Central memory systems are also known as UMA (uniform memory access) systems.


Central memory versus distributed memory

In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types:

  PVP: the parallel vector processor, also called a vector supercomputer.

  SMP: the symmetric multiprocessor.


Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architectures: NORMA, NCC-NUMA, CC-NUMA, and COMA.


Distributed-Memory architecture

In a NORMA machine:
- The node memories have separate address spaces.
- A node cannot directly access remote memory.
- The only way to access remote data is by passing messages.


Distributed-Memory architecture

In an NCC-NUMA machine:
- A typical example is the Cray T3E.
- Besides the local memory, each node has a set of node-level registers called E-registers.
- Other NCC-NUMA systems may allow loading a remote value directly into a processor register.


Distributed-Memory architecture

In a COMA machine:
- All local memories are structured as caches (called COMA caches).
- A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node.
- COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.


CC-NUMA versus COMA

CC-NUMA: main memory consists of all the local memories.

COMA: main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.


Characteristics of five distributed-memory architectures


Multithreaded Processor Model

To analyze the performance of such a system:

- Latency (L): the communication latency experienced with a remote memory access — network delay, cache-miss penalty, and delays caused by contention in split transactions.

- Number of threads (N): the number of threads that can be interleaved in a processor.

- Context of a thread: PC, register set, required context status word, …

- Context switch overhead (C): the time lost in performing a context switch in a processor.

- Switch mechanism: the number of processor states needed to maintain active threads.

- Interval between context switches (R): the run length (cycles between context switches triggered by remote references).

Multithreaded Computation

[Figure: a multithreaded computation — initial scheduling overhead, thread synchronization overhead, and the threads of the parallel computation interleaved over time.]

The concept of multithreading in an MPP system

Processor states:
- Busy: doing useful work.
- Context switch: suspending the current context & switching to another.
- Idle: when all available contexts are suspended (blocked).

Efficiency = busy / (busy + switching + idle)
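The parameters R, C and L defined above combine into a commonly used analytical model of this efficiency (a sketch following Agarwal's multithreading analysis; the regime boundary is an assumption of that model, not stated on the slides):

```latex
% Linear regime: too few threads to hide the latency L,
% so the processor idles part of each cycle of work:
E_{\mathrm{lin}} = \frac{N R}{R + C + L},
\qquad N < 1 + \frac{L}{R + C}
% Saturation regime: enough threads to hide L completely;
% only the context-switch cost C remains as overhead:
E_{\mathrm{sat}} = \frac{R}{R + C}
```

Both expressions are instances of busy / (busy + switching + idle): in saturation the idle term vanishes and only the switch overhead C limits efficiency.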

Abstract Processor Model


Multiple-context processor model with one thread per context

[Figure: N contexts, each with its own PC, PSW, and register file, sharing one ALU; the processor issues local memory references directly and switches contexts on remote memory references.]

Context-switching policies


Switch on cache miss: switch when encountering a cache miss.

Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.

Switch on every instruction: switch on every instruction, independent of whether or not it is a load.

Switch on a block of instructions: improves the cache-hit ratio due to the preservation of some locality, and also benefits single-context performance.

Pthread Thread

History: SUN Solaris, Windows NT, … are examples of multithreaded operating systems; they allow users to employ threads in their programs, but each system is different.

Standard: IEEE POSIX 1003.1c (1995).

Constructs for specifying Parallelism

Executing a Pthread Thread

int pthread_create(pthread_t *thread, pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);

Return value:
- success: a new thread is created and 0 is returned; thread contains the new thread's ID.
- failure: a non-zero error code is returned.

Arguments:
- thread: a pointer of type pthread_t that receives the ID of the new thread.
- attr: initial attributes for the thread; if attr is NULL, default attribute values are used.
- start_routine: a reference to a function defined by the user, containing the code the new thread executes.
- arg: a single argument passed to start_routine.

pthread_t thread;   /* handle of the special Pthread datatype */

Executing a Pthread Thread(cont)

void pthread_exit(void *status);
  Terminates the calling thread.

pthread_cancel()
  Requests that a thread be destroyed by another thread.

int pthread_join(pthread_t th, void **thread_return);
  pthread_join() forces the calling thread to suspend its execution and wait until the thread with ID th terminates. thread_return receives the return value (the value of the return statement or of the pthread_exit() call).

Detached Threads

There are cases in which threads can be terminated without the need for a pthread_join.

When detached threads terminate, they are destroyed and their resources released

=> more efficient.

[Figure: the main program calls pthread_create() to spawn detached threads; each thread runs and terminates independently, with no join.]

Constructs for specifying Parallelism

Thread Pools

A master thread could control a collection of slave threads. A work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.

Constructs for specifying Parallelism

Thread-safe routines

System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.

Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.

Constructs for specifying Parallelism

Thread-safe routines (cont.)

Suppose that your application creates several threads each of which makes a call to the same library routine

This library routine accesses/modifies a global structure or location in memory.

As each thread calls this routine, it is possible that they may try to modify this global structure/memory location at the same time.

If the routine does not employ some sort of synchronization constructs to prevent data corruption then it is not thread-safe


Constructs for specifying Parallelism

Sharing Data
1. Creating Shared Data
2. Accessing Shared Data
   - Locks; Deadlock
   - Semaphores; Deadlock
   - Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches

Sharing Data


Conditions for Deadlock

1. Mutual exclusion
   • The resource cannot be shared; requests are delayed until the resource is released.

2. Hold-and-wait
   • A thread holds one resource while it waits for another.

3. No preemption
   • Resources are released only voluntarily, after completion.

4. Circular wait
   • Circular dependencies exist in the "waits-for" or "resource-allocation" graph.

ALL four conditions MUST hold.


Handling Deadlock

- Deadlock prevention
- Deadlock avoidance
- Deadlock detection and recovery
- Ignore


Deadlock

Example

Figure (a): two-process deadlock; figure (b): n-process deadlock.

R1, R2, …, Rn are the resources; P1, P2, …, Pn are the processes.


Semaphore


Semaphore

A non-negative integer operated upon by two operations, P & V.

The value is the number of units of the resource which are free.

A binary semaphore has value 0 or 1; a general semaphore can take on positive values other than 0 and 1.


Semaphore

P & V operations are performed indivisibly:

P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.

V(s): increments s by 1 to release one of the waiting processes (if any).


Semaphore

The first process to reach its P(s) operation, or to be accepted, will set the semaphore to 0.

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.


Semaphore

When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section

Monitor

Disadvantage of semaphores:

• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (e.g. deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.

• Incorrect use may be caused by an honest programming error or an uncooperative programmer.


Monitor

Example: a wrong use of a semaphore

Right code:          Wrong code:
...                  ...
Wait(mutex);         Signal(mutex);
  critical section     critical section
Signal(mutex);       Wait(mutex);
...                  ...

This wrong code violates the mutual-exclusion requirement.


Monitor

Right code:          Wrong code:
...                  ...
Wait(mutex);         Wait(mutex);
  critical section     critical section
Signal(mutex);       Wait(mutex);
...                  ...

This wrong code causes a deadlock: the second Wait(mutex) blocks forever.

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur

bull Both processes are simultaneously active will cause a deadlock

Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type

bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only

method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events

conflict

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A type or abstract data type encapsulates private data with public method to operate on that data

bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor

bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()

procedure P2()

procedure Pn()

initialization_code ()

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Usage Monitor

bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly

bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization

bull Some addiontional synchronization ldquotailor moderdquo use conditional construct

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Conditional type

bull Declare conditional xybull Only operations that can be invoke on a

conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until

another process invokesxsignal()the process invoking this operation resumes exatly one

suspended processbull if no process is suspended xsignal() has

no effect

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor conditional type

Condition variables

Condition variablesAllow threads to synchronize based

upon the actual value of data Without condition variables the

programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity

Always used in conjunction with a mutex lock

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified

condition variable cond (if any threads are blocked on cond)

int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)

initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised

Pthread_cond_t cond Declare condition variable

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable

cond

int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in

effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined

int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)

int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)

block on a condition variable the second one alow to appoint timeout

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK

main

Thread 23

void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)

void watch_count(void t) pthread_mutex_lock(ampcount_mutex)

if (countltCOUNT_LIMIT)

pthread_cond_wait (hellip)

count += 125

pthread_mutex_unlock(hellip) pthread_exit(NULL)

Thread 1

Sharing Data

The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment

Creating Shared Data

Each process has its own virtual address space within the virtualmemory management system

Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space

Creating Shared Data (tt)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach shared segment to the data segmentof the calling process

Return the starting address of the data segment

Accessing Shared Data

Accessing Shared Data needs careful control if the data is everaltered by a process

Problem CONFLICT

o Reading the variable by different process does not cause CONFLICT

o But writing new value CONFLICT

EX Consider two processes each of which is to add 1 two a shareddata item x

Instruction Process 1 Process 2

x = x + 1 read x read x

Compute x + 1 Compute x + 1

write to x write to xtime

Conflict in accessing shared data

Shared variable x

+1 +1

read read

write write

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This Mechanism Mutual exclusion

lock

The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock

lock is variale contain value 0 1

lock = 1 process entered Critical Section

lock = 0 no process is in Critical Section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section

hellipcritical sectionhellip leave critical section

lock = 0

lock spin lock

Mechanism Busy waiting

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

Bernstein's conditions

I(n) is the set of memory locations read by process P(n).

O(m) is the set of memory locations altered by process P(m).

If the three conditions

I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅

are all satisfied, the two processes can be executed concurrently.

Dependency analysis

Example 1: Suppose the two statements are (in C):

a = x + y;
b = x + z;

I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}

I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅, so the two statements can be executed simultaneously.

Example 2: Suppose the two statements are (in C):

a = x + y;
b = a + b;

I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}

I2 ∩ O1 = {a} ≠ ∅, so the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data: In a parallel programming language supporting shared memory, a variable might be declared as:

shared int x;

In C/C++, a global variable is shared by all threads:

int x;   /* global */

par Construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

par { s1; s2; …; sn; }

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

par { proc1(); proc2(); …; procn(); }

forall Construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) { s1; s2; …; sm; }

which generates n processes, each consisting of the statements forming the body of the for loop, s1, s2, …, sm. Each process uses a different value of i. For example:

forall (i = 0; i < 5; i++) a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.

Shared Data in Systems with Caches

Cache coherence protocols:

In the update policy, copies of data in all caches are updated at the time one copy is altered.

In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated. These copies are only updated when the associated processor references them.

False sharing:

The key characteristic used is that caches are organized in blocks of contiguous locations.

False sharing occurs when different processors require different parts of the same block, but not the same bytes.


Solution for false sharing:

The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

The only way to avoid false sharing completely would be to place each element in a different block, which would create significant wastage of storage for a large array.

Different Types of Memory Architecture

Central memory versus distributed memory

A parallel computer has either a central-memory or a distributed-memory architecture.

Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote-memory access) architectures.

Central-memory systems are also known as UMA (uniform memory access) systems.

In a UMA system, all memory locations are an equal distance from any processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types:

PVP: the parallel vector processor, also called a vector supercomputer.

SMP: the symmetric multiprocessor.

Distributed-Memory Architecture

A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory.

Memories in other nodes are called remote memories.

Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.


In a NORMA machine:

The node memories have separate address spaces.

A node can't directly access remote memory.

The only way to access remote data is by passing messages.

In an NCC-NUMA machine:

A typical example is the Cray T3E.

Besides the local memory, each node has a set of node-level registers called E-registers.

Other NCC-NUMA systems may allow loading a remote value directly into a processor register.

In a COMA machine:

All local memories are structured as caches (called COMA caches).

A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node.

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.

NCC-NUMA versus CC-NUMA/COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache-coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.

CC-NUMA versus COMA

CC-NUMA: main memory consists of all the local memories.

COMA: main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.

Characteristics of Five Distributed-Memory Architectures

Page 18: Seminar Shared memory Programming

Multithreaded Computation

(figure: threads of a parallel computation operating on shared variables, showing initial scheduling overhead and thread synchronization overhead)

The concept of multithreading in an MPP system, in terms of processor efficiency:

Busy: doing useful work.

Context switch: suspending the current context and switching to another.

Idle: when all available contexts are suspended (blocked).

Efficiency = busy / (busy + switching + idle)

Abstract Processor Model

(figure: multiple-context processor model with one thread per context; N contexts, each holding a thread's PC and PSW, register files, and an ALU issuing local and remote memory references)

Context-switching policies

Switch on cache miss: switch when encountering a cache miss.

Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.

Switch on every instruction: switch on every instruction, independent of whether or not it is a load.

Switch on block of instructions: improves the cache-hit ratio due to preservation of some locality, and also benefits single-context performance.

Pthread Thread

History: SUN Solaris, Windows NT, … are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.

Standard: IEEE POSIX 1003.1c (1995).

Constructs for specifying Parallelism

Executing a Pthread Thread

int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg);

Return value:
on success, a new thread is created and 0 is returned; *thread contains the new thread's ID;
on failure, a nonzero error code is returned.

Arguments:
thread: a pointer of type pthread_t * that receives the ID of the new thread;
attr: initial attributes for the thread; if attr = NULL, the attributes are initialized to default values;
start_routine: a reference to a function defined by the user, containing the code the new thread executes;
arg: a single argument passed to start_routine.

pthread_t thread;   /* handle of the special Pthread datatype */

Executing a Pthread Thread (cont.)

pthread_exit(void *status): terminates and destroys a thread.

pthread_cancel(): the thread is destroyed by another thread.

int pthread_join(pthread_t th, void **thread_return);

pthread_join() forces the calling thread to suspend its execution and wait until the thread with the given thread ID terminates. *thread_return contains the return value (the value of the return statement or of the pthread_exit(…) call).

Detached Threads

There are cases in which threads can be terminated without the need for pthread_join.

When detached threads terminate, they are destroyed and their resources released.

=> More efficient.

(figure: the main program calls pthread_create() to spawn threads; each thread runs to its own termination independently)

Constructs for specifying Parallelism

Thread Pools

A master thread can control a collection of slave threads. A work pool of threads can be formed. Threads can communicate through shared locations or, as we shall see, by using signals.

Constructs for specifying Parallelism

Thread-safe routines

System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.

Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.

Constructs for specifying Parallelism

Thread-safe routines (cont.)

Suppose that your application creates several threads, each of which makes a call to the same library routine.

This library routine accesses/modifies a global structure or location in memory.

As each thread calls this routine, it is possible that they may try to modify this global structure/memory location at the same time.

If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.

Constructs for specifying Parallelism

Sharing Data

1. Creating Shared Data

2. Accessing Shared Data (Locks, Deadlock, Semaphores, Condition Variables)

3. Language Constructs for Parallelism

4. Dependency Analysis

5. Shared Data in systems with caches

Sharing Data


Conditions for Deadlock

1. Mutual exclusion: resources cannot be shared; requests are delayed until the resource is released.

2. Hold-and-wait: a thread holds one resource while it waits for another.

3. No preemption: resources are released only voluntarily, after completion.

4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.

ALL four conditions MUST hold.

Handling Deadlock

Deadlock prevention

Deadlock avoidance

Deadlock detection and recovery

Ignore

Deadlock

88b n-processes deadlock 88a 2 processes dealock

R1R2hellipRn lagrave tagravei nguyecircn P1P2hellipPn lagrave quaacute trigravenh

Example

Semaphore

A positive integer operated upon by two operations, P and V.

Its value is the number of units of the resource which are free.

A binary semaphore has the value 0 or 1. A general semaphore can take on positive values other than 0 and 1.

P and V operations are performed indivisibly:

P(s) waits until s is greater than 0, then decrements s by 1 and allows the process to continue.

V(s) increments s by 1 to release one of the waiting processes (if any).

The first process to reach its P(s) operation, or to be accepted, will set the semaphore to 0.

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.

When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes is allowed to proceed into the critical section.

Monitor

Disadvantage of semaphores:

Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (such as deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.

Incorrect use may be caused by an honest programming error or an uncooperative programmer.


Monitor

Example of incorrect semaphore use:

Right code:
    wait(mutex);
    /* critical section */
    signal(mutex);

Wrong code:
    signal(mutex);
    /* critical section */
    wait(mutex);

This incorrect code violates mutual exclusion.

Example of incorrect semaphore use:

Wrong code:
    wait(mutex);
    /* critical section */
    wait(mutex);

This incorrect code causes deadlock.

Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology

Monitor

If the programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock will occur.

When both processes are simultaneously active, the following can deadlock:

Process P1:
    wait(S);
    wait(Q);
    /* critical section */
    signal(S);
    signal(Q);

Process P2:
    wait(Q);
    wait(S);
    /* critical section */
    signal(Q);
    signal(S);


Monitor

To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.

A monitor is an approach to synchronizing operations on a shared resource. A monitor includes:
a set of procedures that provide the only way to access the shared resource;
mutual exclusion between the threads that call them;
the variables associated with the shared resource;
invariants that are assumed in order to avoid conflicts.


Monitor

A type, or abstract data type, encapsulates private data with public methods to operate on that data.

The monitor type presents a set of programmer-defined operations that are provided with mutual exclusion within the monitor.

The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.


Monitor

The structure of a monitor type:

    monitor monitor_name
    {
        /* shared variable declarations */

        procedure P1 (...) { ... }
        procedure P2 (...) { ... }
        ...
        procedure Pn (...) { ... }

        initialization_code (...) { ... }
    }

Structure of a monitor (figure)

Usage of Monitors

The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.

The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; additional "tailor-made" synchronization mechanisms need to be defined.

Some additional tailor-made synchronization uses the condition construct.

Condition type

Declare: condition x, y;

The only operations that can be invoked on a condition variable are wait() and signal():

x.wait(): the process invoking this operation is suspended until another process invokes x.signal().

x.signal(): the process invoking this operation resumes exactly one suspended process.

If no process is suspended, x.signal() has no effect.

Structure of a monitor with condition variables (figure)

Condition variables

Condition variables allow threads to synchronize based upon the actual value of data.

Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be a very time-consuming and unproductive exercise, since the thread would be continuously busy in this activity.

A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

pthread_cond_t cond;   /* declare a condition variable */

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);

initializes the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialization, the state of the condition variable becomes initialized.

int pthread_cond_signal(pthread_cond_t *cond);

unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

Sharing Data

Pthread Condition Variables (cont.)

int pthread_cond_broadcast(pthread_cond_t *cond);

unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);

destroys the given condition variable; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);

int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);

block on a condition variable; the second one allows a timeout to be specified.

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

main:

    #include <pthread.h>
    int count = 0;                       /* global variable: DECLARE */
    pthread_mutex_t count_mutex;
    pthread_cond_t count_threshold_cv;
    ...
    pthread_mutex_init(&count_mutex, NULL);          /* INITIALIZE */
    pthread_cond_init(&count_threshold_cv, NULL);
    ...
    pthread_create(...);   /* CREATE THREADS TO DO WORK */

Threads 2 and 3:

    void *inc_count(void *t)
    {
        ...
        for (i = 0; i < TCOUNT; i++) {
            pthread_mutex_lock(&count_mutex);
            count++;
            if (count == COUNT_LIMIT)
                pthread_cond_signal(&count_threshold_cv);
            pthread_mutex_unlock(&count_mutex);
            /* do some work so threads can alternate on the mutex lock */
            sleep(1);
        }
        pthread_exit(NULL);
    }

Thread 1:

    void *watch_count(void *t)
    {
        pthread_mutex_lock(&count_mutex);
        while (count < COUNT_LIMIT)
            pthread_cond_wait(&count_threshold_cv, &count_mutex);
        count += 125;
        pthread_mutex_unlock(&count_mutex);
        pthread_exit(NULL);
    }

(The wait is placed in a loop because pthread_cond_wait() may return spuriously; the condition must be rechecked.)

Sharing Data

The key aspect of shared-memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages, as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared-memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget(): creates a shared memory segment; the return value is the shared memory ID.

shmat(): attaches the shared segment to the data segment of the calling process; returns the starting address of the segment.

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: CONFLICT.

Reading the variable by different processes does not cause conflict, but writing new values does.

Example: consider two processes, each of which is to add 1 to a shared data item x. The instruction x = x + 1 takes three steps, and the two processes' steps can interleave in time:

    Process 1            Process 2
    read x               read x
    compute x + 1        compute x + 1
    write to x           write to x

(figure: conflict in accessing the shared variable x: both processes read x, add 1, and write back, so one increment is lost)

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.

This mechanism is called mutual exclusion.

Lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:

lock = 1: a process has entered the critical section;

lock = 0: no process is in the critical section.

The lock operates much like a door lock:

Suppose that a process reaches a lock that is set, indicating that it is excluded from the critical section. It now has to wait until it is allowed to enter the critical section.

    while (lock == 1) do_nothing;   /* no operation in while loop */
    lock = 1;                       /* enter critical section */
    /* ... critical section ... */
    lock = 0;                       /* leave critical section */

Such a lock is called a spin lock.

Mechanism: busy waiting

In some cases it may be possible to deschedule the process from the processor and schedule another process instead.

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 19: Seminar Shared memory Programming

Abtract Processor Model

wwwthemegallerycom Company Logo

Multiple-context processor model with one thread per context

PC

PSW

PC

PSW

PC

PSW

ALU Local memory reference

Remote memory reference

Register Files

N Contexts

1 Thread context

Context-switching policies

wwwthemegallerycom Company Logo

Switch on cache miss when encoutering a cache miss

Switch on every load switching on every load operation independent of whether it will cause a miss or not

Switch on every instruction switching on every instruction insependent of whether or not it is a load

Switch on block of instruction will improve the cache-hit ratio due to preservation of some locality amp also benefit singe-context performance

Pthread Thread

History SUN Solaris Window NT hellip are examples the multithreaded operating systems allow users to employ threads in their programsbut each system is different

StandardIEEE POSIX 10031c

standard (1995)

Constructs for specifying Parallelism

Executing a Pthread Thread

int pthread_create(pthread_t thread pthread_attr_t attr

void (start_routine)(void ) void arg) Return value of function1048698success create a new thread 0 arg contain thread ID1048698fail ltgt 0 (error signature is contain in the variable of errno) Argumentsthread a pointer of type pthread_t contain ID of new thread1048698attr contain initial attr for thread if attr = NULL Attr are init by default value1048698start_routine la reference to a function defined by user This function contain

execute code for new thread1048698arg a single argument is passed for start_routine

Pthread_t thread Hanndle of specia Pthread datatype

Executing a Pthread Thread(cont)

pthread_exit(void status) Terminate amp destroy a thread

pthread_cancel() Thread is destroyed by another process

int pthread_join(pthread_t th void thread_return) pthread_join() force call thread suspend its execution amp wait until thread having

thread ID terminates Thread_return contain the return value (value of return statement or pthread_exit(hellip) statement)

Detached ThreadThere are cases in which threads can

be terminated without needed of pthread_join

Detached Thread

When Detached Thread teminate they are destroyed amp their resource released

=gt More efficient

Main program

Pthread_create()

Termination

Thread

Pthread_create()

Pthread_create() Termination

Termination

Thread

Thread

Constructs for specifying Parallelism

Thread Pools

A master thread could control a collection of slave threads A work pool of threads could be formed Threads can communicate through shared location or as we shall see = using signals

Constructs for specifying Parallelism

Thread ndash safe routines

System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results

Thread ndash safeness an applications ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions

Constructs for specifying Parallelism

Thread ndash safe routines (cont)

Suppose that your application creates several threads each of which makes a call to the same library routine

This library routine accessesmodifies a global structure or location in memory

As each thread calls this routine it is possible that they may try to modify this global structurememory location at the same time

If the routine does not employ some sort of synchronization constructs to prevent data corruption then it is not thread-safe


Sharing Data

1. Creating Shared Data
2. Accessing Shared Data: Locks, Deadlock, Semaphores, Monitor, Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches

Sharing Data


Conditions for Deadlock

1. Mutual exclusion • the resource cannot be shared • requests are delayed until the resource is released

2. Hold-and-wait • a thread holds one resource while it waits for another

3. No preemption • resources are released only voluntarily, after completion

4. Circular wait • circular dependencies exist in "waits-for" or "resource-allocation" graphs

ALL four conditions MUST hold

wwwthemegallerycom Company Logo

Handling Deadlock

• Deadlock prevention
• Deadlock avoidance
• Deadlock detection and recovery
• Ignore


Deadlock

Example

[Figure 8.8(a): deadlock between 2 processes; Figure 8.8(b): n-process deadlock. R1, R2, …, Rn are resources; P1, P2, …, Pn are processes.]


Semaphore


A positive integer operated upon by two operations, P & V.

The value is the number of units of the resource that are free.

A binary semaphore has value 0 or 1. A general semaphore can take on positive values other than 0 and 1.

Semaphore

P & V operations are performed indivisibly.

P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.

V(s): increments s by 1 to release one of the waiting processes (if any).

Semaphore

The first process to reach its P(s) operation is accepted and sets the semaphore to 0.

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore

Semaphore

When the process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes (if any) is allowed to proceed into the critical section.

Monitor

Disadvantage of Semaphore
• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (e.g. deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.

• Incorrect use may be caused by an honest programming error or an uncooperative programmer.


Monitor

Example: wrong use of a semaphore

Right code:
  …
  Wait(mutex);
  critical section
  Signal(mutex);
  …

Wrong code:
  …
  Signal(mutex);
  critical section
  Wait(mutex);
  …
This incorrect code violates the mutual-exclusion condition.


Monitor

Example: wrong use of a semaphore

Right code:
  …
  Wait(mutex);
  critical section
  Signal(mutex);
  …

Wrong code:
  …
  Wait(mutex);
  critical section
  Wait(mutex);
  …
This incorrect code causes a deadlock.

Khoa Khoa học & Kỹ thuật Máy tính - Đại học Bách Khoa TpHCM (Faculty of Computer Science & Engineering, HCMC University of Technology)

Monitor

• If the programmer omits the wait() or the signal() around a critical section, or both, either mutual exclusion is violated or a deadlock will occur.

• When both processes are simultaneously active, a deadlock can occur: P1 acquires S and waits for Q while P2 acquires Q and waits for S, so each waits forever.

Process P1:            Process P2:
…                      …
Wait(S);               Wait(Q);
Wait(Q);               Wait(S);
critical section       critical section
Signal(S);             Signal(Q);
Signal(Q);             Signal(S);


Monitor

• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.

• A monitor is an approach to synchronizing operations on shared resources. A monitor includes:
  - a suite of procedures that provides the only method of access to a shared resource
  - mutual exclusion among those procedures
  - the variables associated with the shared resource
  - invariants that are assumed to hold, to avoid conflicting events


Monitor

• A type, or abstract data type, encapsulates private data with public methods to operate on that data.

• The monitor type provides a set of programmer-defined operations that are executed with mutual exclusion within the monitor.

• The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.


Monitor

• The structure of a monitor type:

monitor monitor_name {
    /* shared variable declarations */
    procedure P1(...) { ... }
    procedure P2(...) { ... }
    ...
    procedure Pn(...) { ... }
    initialization_code(...) { ... }
}

Structure of a monitor [figure]

Monitor usage

• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.

• A monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; some additional "tailor-made" synchronization mechanisms need to be defined.

• Such additional "tailor-made" synchronization uses the condition construct.


Condition type

• Declaration: condition x, y;
• The only operations that can be invoked on a condition variable are wait() and signal():
  - x.wait(): the process invoking this operation is suspended until another process invokes x.signal()
  - x.signal(): the process invoking this operation resumes exactly one suspended process
• If no process is suspended, x.signal() has no effect.

Structure of a monitor with condition variables [figure]

Condition variables

Condition variables allow threads to synchronize based upon the actual value of data.

Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This is a very time-consuming and unproductive exercise, since the thread would be continuously busy in this activity.

A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t *cond);
Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
Initialises the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.

pthread_cond_t cond;   /* declare a condition variable */

Sharing Data

Pthread Condition Variables (cont.)

int pthread_cond_broadcast(pthread_cond_t *cond);
Unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
Destroys the condition variable specified by cond; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
Both block on a condition variable; the second one allows a timeout to be specified.

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variables - example (cont.)

main:

#include <pthread.h>

int count = 0;                        /* global var: DECLARE */
pthread_mutex_t count_mutex;
pthread_cond_t count_threshold_cv;
…
pthread_mutex_init(&count_mutex, NULL);          /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
…
pthread_create(…);                    /* CREATE THREADS TO DO WORK */

Threads 2 & 3:

void *inc_count(void *t)
{
    …
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(…);
        …
        pthread_mutex_unlock(&count_mutex);
        /* do some work so threads can alternate on the mutex lock */
        sleep(1);
    }
    pthread_exit(NULL);
}

Thread 1:

void *watch_count(void *t)
{
    pthread_mutex_lock(&count_mutex);
    if (count < COUNT_LIMIT)          /* a while loop is more robust against spurious wakeups */
        pthread_cond_wait(…);
    count += 125;
    pthread_mutex_unlock(…);
    pthread_exit(NULL);
}

Sharing Data

The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages, as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach the shared segment to the address space of the calling process.

Returns the starting address of the attached segment.

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: CONFLICT

o Reading the variable by different processes does not cause a conflict.

o But writing new values does: CONFLICT.

EX: Consider two processes, each of which is to add 1 to a shared data item x:

Instruction   Process 1        Process 2
x = x + 1     read x           read x
              compute x + 1    compute x + 1
              write to x       write to x
(time runs downward)

[Figure: conflict in accessing shared variable x — both processes read x, add 1, and write back, so one increment is lost.]

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This mechanism is called mutual exclusion.

lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:

lock = 1: a process has entered the critical section

lock = 0: no process is in the critical section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) ;   /* do nothing: no operation in the while loop */
lock = 1;             /* enter critical section */
…
critical section
…
lock = 0;             /* leave critical section */

Such a lock is called a spin lock.

Mechanism Busy waiting

In some cases it may be possible to deschedule the process from the processor and schedule another process.

Overhead in saving and restoring process information.

It is necessary to choose the best or highest-priority process to enter the critical section.

Process 1                      Process 2
while (lock == 1) ;            .
lock = 1;                      .
critical section               while (lock == 1) ;   /* spins */
lock = 0;                      lock = 1;
                               critical section
                               lock = 0;

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual-exclusion lock variables (MUTEX).

They resolve synchronization problems between threads.

A mutex is used to share resources among threads in an orderly fashion.

It provides mutual exclusion between threads.

Note:

A mutex is only used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.


#include <pthread.h>   /* the library containing the mutex functions */

Declare a variable: pthread_mutex_t mutex;

Static initialization:

mutex = PTHREAD_MUTEX_INITIALIZER;

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function:

int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);


Important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);

int pthread_mutex_unlock(pthread_mutex_t *mutex);

int pthread_mutex_trylock(pthread_mutex_t *mutex);

int pthread_mutex_destroy(pthread_mutex_t *mutex);

Dependency analysis

One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.

Example

Ex1: forall (i = 0; i < 5; i++) a[i] = 0;

All instances can be executed simultaneously.

Ex2: forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

- I(n) is the set of memory locations read by process P(n)
- O(m) is the set of memory locations altered by process P(m)

If the following three conditions are all satisfied, the two processes can be executed concurrently:

I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅

Dependency analysis

Example 1: suppose the two statements are (in C):

a = x + y;
b = x + z;

I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}

I1 ∩ O2 = I2 ∩ O1 = O1 ∩ O2 = ∅ → the two statements can be executed simultaneously.

Dependency analysis

Example 2: suppose the two statements are (in C):

a = x + y;
b = a + b;

I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}

I2 ∩ O1 = {a} ≠ ∅ → the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data: in a parallel programming language supporting shared memory, a variable might be declared as

shared int x;

In C/C++, a global declaration such as int x; plays this role.

par Construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

par { s1; s2; …; sn; }

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads could be specified by listing the routines that are to be executed concurrently:

par { proc1(); proc2(); …; procn(); }

forall Construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) { s1; s2; …; sm; }

which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, …, sm. Each process uses a different value of i.

forall (i = 0; i < 5; i++) a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.

Shared Data in Systems with Caches

Shared Data in Systems with Caches

Cache coherence protocols:

In the update policy, copies of data in all caches are updated at the time one copy is altered.

In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next references them.

Shared Data in Systems with Caches

False sharing:

The key characteristic used is that caches are organized in blocks of contiguous locations.

Different parts of a block may be required by different processors, but not the same bytes.

Shared Data in Systems with Caches

[Figure: false sharing — different processors update different parts of the same cache block.]

Shared Data in Systems with Caches

Solution for false sharing:

The compiler can alter the layout of the data stored in main memory, separating data altered by only one processor into different blocks.

The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non-uniform memory access) and NORMA (no-remote memory access) architectures.

Central memory systems are also known as UMA (uniform memory access) systems


Central memory versus distributed memory

In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types:
- PVP, the parallel vector processor, also called a vector supercomputer
- SMP, the symmetric multiprocessor


Distributed-Memory architecture

A distributed memory computer contains multiple nodes, each having one or more processors and a local memory.

Memories in other nodes are called remote memories.

Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA and COMA.


Distributed-Memory architecture


Distributed-Memory architecture

In a NORMA machine:
- The node memories have separate address spaces.
- A node can't directly access remote memory.
- The only way to access remote data is by passing messages.


Distributed-Memory architecture

In an NCC-NUMA machine:
- A typical example is the Cray T3E.
- Besides the local memory, each node has a set of node-level registers called E-registers.
- Other NCC-NUMA systems may allow loading a remote value directly into a processor register.


Distributed-Memory architecture

In a COMA machine:
- All local memories are structured as caches (called COMA caches).
- Such a cache has a much larger capacity than the level-2 cache or the remote cache of a node.
- COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA and COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache-coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.


CC-NUMA versus COMA

CC-NUMA: main memory consists of all the local memories.

COMA: main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.


Characteristics of five distributed-memory architectures [table]


Context-switching policies

Switch on cache miss: switch when encountering a cache miss.

Switch on every load: switch on every load operation, independent of whether it will cause a miss or not.

Switch on every instruction: switch on every instruction, independent of whether or not it is a load.

Switch on block of instructions: improves the cache-hit ratio due to preservation of some locality, and also benefits single-context performance.


Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 21: Seminar Shared memory Programming

Pthread Thread

History: SUN Solaris, Windows NT, … are examples of multithreaded operating systems that allow users to employ threads in their programs, but each system is different.

Standard: IEEE POSIX 1003.1c standard (1995)

Constructs for specifying Parallelism

Executing a Pthread Thread

int pthread_create(pthread_t *thread, pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);

Return value: on success, a new thread is created and 0 is returned; the ID of the new thread is stored through thread. On failure, a non-zero error number is returned.

Arguments:
 thread: a pointer of type pthread_t that receives the ID of the new thread.
 attr: initial attributes for the thread; if attr is NULL, the attributes are initialized to default values.
 start_routine: a reference to a function defined by the user; this function contains the code the new thread executes.
 arg: a single argument passed to start_routine.

pthread_t is a handle: a special Pthread datatype identifying a thread.

Executing a Pthread Thread(cont)

pthread_exit(void *status): terminates and destroys the calling thread.

pthread_cancel(): the thread is destroyed at the request of another thread.

int pthread_join(pthread_t th, void **thread_return): pthread_join() forces the calling thread to suspend its execution and wait until the thread with ID th terminates. thread_return receives the return value (the value of the return statement or of the pthread_exit(…) statement).

Detached Threads: there are cases in which threads can be terminated without the need for pthread_join.

When detached threads terminate, they are destroyed and their resources are released immediately => more efficient.

[Diagram: the main program calls pthread_create() three times; each created thread runs and terminates independently.]

Constructs for specifying Parallelism

Thread Pools

A master thread could control a collection of slave threads; a work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.

Constructs for specifying Parallelism

Thread-safe routines

System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results

Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.

Constructs for specifying Parallelism

Thread-safe routines (cont.)

Suppose that your application creates several threads each of which makes a call to the same library routine

This library routine accesses/modifies a global structure or location in memory.

As each thread calls this routine, it is possible that they may try to modify this global structure/memory location at the same time.

If the routine does not employ some sort of synchronization constructs to prevent data corruption then it is not thread-safe


Constructs for specifying Parallelism

Sharing Data

1. Creating Shared Data

2. Accessing Shared Data: Locks, Deadlock, Semaphores, Condition Variables

3. Language Constructs for Parallelism

4. Dependency Analysis

5. Shared Data in systems with caches

Sharing Data


Conditions for Deadlock

1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.

2. Hold-and-wait: a thread holds one resource while it waits for another.

3. No preemption: resources are released only voluntarily, after completion.

4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graphs.

ALL four conditions MUST hold


Handling Deadlock

Deadlock prevention, deadlock avoidance, deadlock detection and recovery, or ignore.


Deadlock

Example: Figure 8.8a shows deadlock between two processes; Figure 8.8b shows n-process deadlock. R1, R2, …, Rn are resources; P1, P2, …, Pn are processes.


Semaphore


Semaphore

A positive integer operated upon by two operations, P and V.

The value is the number of units of the resource which are free.

A binary semaphore has value 0 or 1. A general semaphore can take on positive values other than 0 and 1.


Semaphore

P and V operations are performed indivisibly.

P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.

V(s): increments s by 1 to release one of the waiting processes (if any).


Semaphore

The first process to reach its P(s) operation (or to be accepted) sets the semaphore to 0.

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.


Semaphore

When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the processes waiting (if any) is allowed to proceed into the critical section.

Monitor

Disadvantage of semaphores:

Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors (e.g. deadlock) that are difficult to detect, since these errors happen only if particular execution sequences take place, and those sequences do not always occur.

Incorrect use may be caused by an honest programming error or by an uncooperative programmer.


Monitor

Example: wrong use of a semaphore.

Right code:
  wait(mutex);
  /* critical section */
  signal(mutex);

Wrong code:
  signal(mutex);
  /* critical section */
  wait(mutex);

This wrong code violates the mutual exclusion requirement.


Monitor

Example: wrong use of a semaphore.

Right code:
  wait(mutex);
  /* critical section */
  signal(mutex);

Wrong code:
  wait(mutex);
  /* critical section */
  wait(mutex);

This wrong code causes deadlock.

Khoa Khoa học & Kỹ thuật Máy tính - Đại học Bách Khoa TP.HCM (Faculty of Computer Science & Engineering, HCMC University of Technology)

Monitor

If the programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock will occur.

When both processes are simultaneously active, the following can deadlock: P1 holds S while waiting for Q, and P2 holds Q while waiting for S.

Process P1:
  wait(S);
  wait(Q);
  /* critical section */
  signal(S);
  signal(Q);

Process P2:
  wait(Q);
  wait(S);
  /* critical section */
  signal(Q);
  signal(S);


Monitor

To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.

A monitor is an approach to synchronizing operations on shared resources. A monitor includes:
 a suite of procedures that provide the only method of access to a shared resource;
 mutual exclusion between the threads using it;
 variables associated with the shared resource;
 invariants that are assumed in order to avoid conflicts.


Monitor

A type, or abstract data type, encapsulates private data with public methods to operate on that data.

A monitor type presents a set of programmer-defined operations that are provided with mutual exclusion within the monitor.

The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of procedures or functions that operate on those variables.


Monitor

A structure of the monitor type:

monitor monitor_name {
    /* shared variable declarations */

    procedure P1(...) { ... }
    procedure P2(...) { ... }
    ...
    procedure Pn(...) { ... }

    initialization_code(...) { ... }
}


Structure Monitor


Usage of the Monitor

The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.

The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; some additional "tailor-made" synchronization mechanisms need to be defined.

These additional "tailor-made" synchronizations use the condition construct.


Condition type

Declare: condition x, y;

The only operations that can be invoked on a condition variable are wait() and signal():
 x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
 x.signal(): the process invoking this operation resumes exactly one suspended process.
 If no process is suspended, x.signal() has no effect.


Structure of a monitor with condition variables

Condition variables

Condition variables allow threads to synchronize based upon the actual value of data.

Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be a very time-consuming and unproductive exercise, since the thread would be continuously busy in this activity.

A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t *cond);
This call unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
Initialises the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.

pthread_cond_t cond;   /* declare a condition variable */

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t *cond);
This call unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
Destroys the given condition variable specified by cond; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);

int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);

Both block on a condition variable; the second one allows specifying a timeout.

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines. The main routine creates three threads. Two of the threads perform work and update a count variable. The third thread waits until the count variable reaches a specified value.

Sharing Data

Sequence for using condition variable - example (cont)

main:

#include <pthread.h>

int count = 0;                          /* global variable */
pthread_mutex_t count_mutex;            /* DECLARE */
pthread_cond_t  count_threshold_cv;
...
pthread_mutex_init(&count_mutex, NULL);         /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
...
pthread_create(...);                    /* CREATE THREADS TO DO WORK */

Threads 2 and 3:

void *inc_count(void *t) {
    ...
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal( ... );
        ...
        pthread_mutex_unlock(&count_mutex);
        /* do some work so threads can alternate on the mutex lock */
        sleep(1);
    }
    pthread_exit(NULL);
}

Thread 1:

void *watch_count(void *t) {
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)         /* while, not if: guards against spurious wakeups */
        pthread_cond_wait( ... );
    count += 125;
    pthread_mutex_unlock( ... );
    pthread_exit(NULL);
}

Sharing Data

The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass data in messages, as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget():

Creates a shared memory segment.

The return value is the shared memory ID.

shmat():

Attaches the shared segment to the data segment of the calling process.

Returns the starting address of the segment.

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: CONFLICT

Reading the variable by different processes does not cause a conflict.

But writing a new value does: CONFLICT.

Example: consider two processes, each of which is to add 1 to a shared data item x.

Instruction        Process 1          Process 2
x = x + 1          read x             read x
                   compute x + 1      compute x + 1
                   write to x         write to x
                                            (time runs downwards)

Conflict in accessing shared data

[Diagram: both processes read the shared variable x, each computes x + 1, and both write back; one of the two increments is lost.]

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This mechanism is called mutual exclusion.

Lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:

lock = 1: a process has entered the critical section.

lock = 0: no process is in the critical section.

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing;   /* no operation in while loop */
lock = 1;                       /* enter critical section */
   ... critical section ...
lock = 0;                       /* leave critical section */

This form of lock is called a spin lock.

This mechanism is called busy waiting.

In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead. But:

There is overhead in saving and restoring process information.

It is necessary to choose the best or highest-priority process to enter the critical section.

Process 1:                          Process 2:
while (lock == 1) do_nothing;       while (lock == 1) do_nothing;
lock = 1;                           lock = 1;
/* critical section */              /* critical section */
lock = 0;                           lock = 0;

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutex).

They resolve the synchronization problem between threads.

A mutex is used to share resources among threads in an orderly way.

It provides mutual exclusion between threads.

Note: a mutex is only used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.


#include <pthread.h>   /* the library containing the mutex functions */

Declare a variable: pthread_mutex_t mutex;

Static initialization:

mutex = PTHREAD_MUTEX_INITIALIZER;

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function:

int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);


Important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);

int pthread_mutex_unlock(pthread_mutex_t *mutex);

int pthread_mutex_trylock(pthread_mutex_t *mutex);

int pthread_mutex_destroy(pthread_mutex_t *mutex);


Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together. Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.


Example

Ex1: forall (i = 0; i < 5; i++)
         a[i] = 0;

All instances can be executed simultaneously.

Ex2: forall (i = 2; i < 6; i++) {
         x = i - 2*i + i*i;
         a[i] = a[x];
     }

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

I(n) is the set of memory locations read by process P(n).

O(m) is the set of memory locations altered by process P(m).

If the following three conditions are all satisfied, the two processes can be executed concurrently:

I1 ∩ O2 = ∅     I2 ∩ O1 = ∅     O1 ∩ O2 = ∅


Dependency analysis

Example 1: Suppose the two statements are (in C):

a = x + y;
b = x + z;
------------------------------------------
I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}
------------------------------------------
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅  =>  the two statements can be executed simultaneously.


Dependency analysis

Example 2: Suppose the two statements are (in C):

a = x + y;
b = a + b;
------------------------------------------
I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}
------------------------------------------
I2 ∩ O1 = {a} ≠ ∅  =>  the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data: in a parallel programming language supporting shared memory, variables might be declared as:

shared int x;

In C/C++, a global variable plays this role: a globally declared int x is shared by all threads.

par Construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

par {
    s1;
    s2;
    ...
    sn;
}

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads could be specified by listing the routines that are to be executed concurrently:

par {
    proc1();
    proc2();
    ...
    procn();
}

forall Construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) {
    s1;
    s2;
    ...
    sm;
}

which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, …, sm. Each process uses a different value of i. For example:

forall (i = 0; i < 5; i++)
    a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.


Shared Data in Systems with Caches


Shared Data in Systems with Caches

Cache coherence protocols:

In the update policy, copies of data in all caches are updated at the time one copy is altered.

In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are only updated when the associated processor makes reference to them.


Shared Data in Systems with Caches

False sharing: the key characteristic here is that caches are organized in blocks of contiguous locations.

False sharing occurs when different parts of a block are required by different processors, but not the same bytes.


[Diagram: false sharing - different processors write different words of the same cache block, forcing the block to bounce between their caches.]

Shared Data in Systems with Caches

Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote-memory access) architectures.

Central memory systems are also known as UMA (uniform memory access) systems


Central memory versus distributed memory

In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types:
 PVP: the parallel vector processor, also called a vector supercomputer.
 SMP: the symmetric multiprocessor.


Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.


Distributed-Memory architecture

[Diagram: nodes, each with one or more processors and a local memory, connected by an interconnection network.]

Distributed-Memory architecture

In a NORMA machine:

The node memories have separate address spaces.

A node can't directly access remote memory.

The only way to access remote data is by passing messages.


Distributed-Memory architecture

In an NCC-NUMA machine:

A typical example is the Cray T3E. Besides the local memory, each node has a set of node-level registers called E-registers.

Other NCC-NUMA systems may allow loading a remote value directly into a processor register.


Distributed-Memory architecture

In a COMA machine:

All local memories are structured as caches (called COMA caches).

A COMA cache has much larger capacity than the level-2 cache or the remote cache of a node.

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA/COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system


CC-NUMA versus COMA

CC-NUMA: main memory consists of all the local memories.

COMA: main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.


Characteristics of five distributed-memory architectures

[Table omitted in this transcript.]

LOGO

Page 22: Seminar Shared memory Programming

Executing a Pthread Thread

int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);

Return value:
- success (a new thread is created): 0;
- failure: a nonzero error code.

Arguments:
- thread: a pointer of type pthread_t *; receives the ID of the new thread.
- attr: the initial attributes for the thread; if attr is NULL, the attributes are initialized to default values.
- start_routine: a reference to a function defined by the user; this function contains the code executed by the new thread.
- arg: a single argument passed to start_routine.

pthread_t thread;   /* handle of the opaque Pthread thread-ID datatype */

Executing a Pthread Thread (cont.)

pthread_exit(void *status): terminates and destroys the calling thread; status is made available to a joining thread.

pthread_cancel(): requests that a thread be destroyed by another thread.

int pthread_join(pthread_t th, void **thread_return): pthread_join() forces the calling thread to suspend its execution and wait until the thread with the given thread ID terminates; thread_return receives the return value (the value of the return statement or of the pthread_exit(...) statement).

Detached threads: there are cases in which threads can be terminated without the need for pthread_join(). When detached threads terminate, they are destroyed and their resources released immediately => more efficient.

[Diagram: the main program calls pthread_create() several times; each created thread runs until its own termination.]

Constructs for specifying Parallelism

Thread Pools

A master thread could control a collection of slave threads. A work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.

Constructs for specifying Parallelism

Thread-safe routines

System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results

Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.

Constructs for specifying Parallelism

Thread-safe routines (cont.)

Suppose that your application creates several threads each of which makes a call to the same library routine

This library routine accesses/modifies a global structure or location in memory.

As each thread calls this routine, it is possible that they may try to modify this global structure/memory location at the same time.

If the routine does not employ some sort of synchronization constructs to prevent data corruption then it is not thread-safe


Constructs for specifying Parallelism

Sharing Data
1. Creating Shared Data
2. Accessing Shared Data
   - Locks, Deadlock, Semaphores, Monitors, Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches


Conditions for Deadlock

1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.

2. Hold-and-wait: a thread holds one resource while it waits for another.

3. No preemption: resources are released only voluntarily, after completion.

4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.

ALL four conditions MUST hold.

wwwthemegallerycom Company Logo

Handling Deadlock
- Deadlock prevention
- Deadlock avoidance
- Deadlock detection and recovery
- Ignore the problem

Deadlock

[Figures 8.8a and 8.8b: a two-process deadlock and an n-process deadlock. R1, R2, ..., Rn are resources; P1, P2, ..., Pn are processes.]

Semaphore


A semaphore is a positive integer (including zero) operated upon by two operations, P and V.

The value is the number of units of the resource which are free.

A binary semaphore has the value 0 or 1; a general semaphore can take on positive values other than 0 and 1.


P and V operations are performed indivisibly.

P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.

V(s): increments s by 1 to release one of the waiting processes (if any).


The first process to reach its P(s) operation (or to be accepted) will set the semaphore to 0.

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.


When the process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes (if any) is allowed to proceed into the critical section.

Monitor

Disadvantages of semaphores:
- Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (e.g. deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.
- Incorrect use may be caused by an honest programming error or an uncooperative programmer.


Example: incorrect semaphore use

Correct code:
  ...
  wait(mutex);
  /* critical section */
  signal(mutex);
  ...

Wrong code (signal before wait):
  ...
  signal(mutex);
  /* critical section */
  wait(mutex);
  ...
This incorrect code violates the mutual-exclusion condition.

Wrong code (two waits):
  ...
  wait(mutex);
  /* critical section */
  wait(mutex);
  ...
This incorrect code causes deadlock.

Faculty of Computer Science & Engineering, Ho Chi Minh City University of Technology

Monitor

- If the programmer omits the wait() or the signal() around a critical section (or both), either mutual exclusion is violated or a deadlock occurs.
- With the two processes below simultaneously active, a deadlock can occur:

Process P1:              Process P2:
  ...                      ...
  wait(S);                 wait(Q);
  wait(Q);                 wait(S);
  /* critical section */   /* critical section */
  signal(S);               signal(Q);
  signal(Q);               signal(S);


- To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
- A monitor is an approach to synchronizing operations on shared resources. A monitor includes:
  - a suite of procedures that provides the only way to access a shared resource;
  - mutual exclusion among those procedures;
  - the variables associated with the shared resource;
  - invariants that must hold to avoid conflicts.


- A type, or abstract data type, encapsulates private data with public methods to operate on that data.
- A monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.
- The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of procedures or functions that operate on those variables.


- Structure of a monitor type:

  monitor monitor_name {
      /* shared variable declarations */
      procedure P1(...) { ... }
      procedure P2(...) { ... }
      ...
      procedure Pn(...) { ... }
      initialization_code(...) { ... }
  }

[Diagram: structure of a monitor.]

Using monitors

- The monitor construct ensures that only one process at a time can be active within the monitor; consequently, the programmer does not need to code this synchronization constraint explicitly.
- The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; some additional "tailor-made" synchronization mechanisms need to be defined.
- Such additional tailor-made synchronization uses the condition construct.

The condition type

- Declaration: condition x, y;
- The only operations that can be invoked on a condition variable are wait() and signal():
  - x.wait(): the process invoking this operation is suspended until another process invokes x.signal();
  - x.signal(): the process invoking this operation resumes exactly one suspended process.
- If no process is suspended, x.signal() has no effect.

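C has no monitor construct, but the pattern can be emulated with a mutex plus a Pthread condition variable; a hypothetical bounded-counter monitor as a sketch (all names are illustrative):

```c
#include <pthread.h>
#include <assert.h>

/* A "counter monitor" emulated in C: the mutex gives mutual exclusion
   inside the monitor, and `nonzero` plays the role of a condition
   variable x with x.wait() / x.signal(). */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  nonzero;
    int value;
} counter_monitor;

void cm_init(counter_monitor *m) {
    pthread_mutex_init(&m->lock, NULL);
    pthread_cond_init(&m->nonzero, NULL);
    m->value = 0;
}

void cm_increment(counter_monitor *m) {
    pthread_mutex_lock(&m->lock);      /* enter the monitor */
    m->value++;
    pthread_cond_signal(&m->nonzero);  /* x.signal(): resume one waiter */
    pthread_mutex_unlock(&m->lock);    /* leave the monitor */
}

void cm_decrement(counter_monitor *m) {
    pthread_mutex_lock(&m->lock);
    while (m->value == 0)                         /* x.wait(): suspend   */
        pthread_cond_wait(&m->nonzero, &m->lock); /* until signalled     */
    m->value--;
    pthread_mutex_unlock(&m->lock);
}
```

Every operation locks the mutex on entry and unlocks it on exit, which is exactly the "only one process active in the monitor" rule stated above.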

[Diagram: structure of a monitor with condition variables.]

Condition variables

Condition variables:
- allow threads to synchronize based upon the actual value of data;
- without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met; this is very time-consuming and unproductive, since the thread would be continuously busy in this activity;
- a condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t *cond);
  Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
  Initialises the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.

pthread_cond_t cond;   /* declare a condition variable */

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t *cond);
  Unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
  Destroys the given condition variable; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
  Block on a condition variable; the second form allows a timeout to be specified.

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

main:

  #include <pthread.h>
  int count = 0;                         /* global variable */
  pthread_mutex_t count_mutex;           /* DECLARE */
  pthread_cond_t count_threshold_cv;
  ...
  pthread_mutex_init(&count_mutex, NULL);        /* INITIALIZE */
  pthread_cond_init(&count_threshold_cv, NULL);
  ...
  pthread_create(...);        /* CREATE THREADS TO DO WORK */

Threads 2 and 3:

  void *inc_count(void *t) {
      ...
      for (i = 0; i < TCOUNT; i++) {
          pthread_mutex_lock(&count_mutex);
          count++;
          if (count == COUNT_LIMIT)
              pthread_cond_signal(...);
          ...
          pthread_mutex_unlock(&count_mutex);
          /* do some work so the threads can alternate on the mutex lock */
          sleep(1);
      }
      pthread_exit(NULL);
  }

Thread 1:

  void *watch_count(void *t) {
      pthread_mutex_lock(&count_mutex);
      while (count < COUNT_LIMIT)       /* re-check: wakeups can be spurious */
          pthread_cond_wait(...);
      count += 125;
      pthread_mutex_unlock(...);
      pthread_exit(NULL);
  }

Sharing Data

The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared-memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget(): creates a shared memory segment; the return value is the shared memory ID.

shmat(): attaches the shared segment to the data segment of the calling process; returns the starting address of the segment.

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

The problem: CONFLICT.
- Reading the variable by different processes does not cause a conflict.
- But writing a new value does.

Example: consider two processes, each of which is to add 1 to a shared data item x.

  Instruction      Process 1        Process 2
  x = x + 1        read x           read x
                   compute x + 1    compute x + 1
                   write to x       write to x
  (time runs downwards)

Conflict in accessing shared data

[Diagram: both processes read the shared variable x, each computes x + 1, and both write back; one of the two increments is lost.]

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time. This mechanism: mutual exclusion.

Lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.
- A lock is a variable containing the value 0 or 1:
  - lock = 1: a process has entered the critical section;
  - lock = 0: no process is in the critical section.

The lock operates much like a door lock. Suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section: it now has to wait until it is allowed to enter.

  while (lock == 1) ;  /* do nothing: no operation in while loop */
  lock = 1;            /* enter critical section */
  ... critical section ...
  lock = 0;            /* leave critical section */

A lock used this way is called a spin lock, and the mechanism busy waiting.

In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead. Drawbacks:
- overhead in saving and restoring process information;
- the need to choose the best or highest-priority process to enter the critical section.

Process 1:                 Process 2:
  while (lock == 1) ;        while (lock == 1) ;
  lock = 1;                  lock = 1;
  /* critical section */     /* critical section */
  lock = 0;                  lock = 0;
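Note that the naive test-then-set code above is itself racy: both processes can pass the while test before either sets the lock. A correct spin lock needs an atomic test-and-set operation, sketched here with C11 atomics (function names are illustrative):

```c
#include <stdatomic.h>
#include <assert.h>

atomic_flag lock = ATOMIC_FLAG_INIT;   /* clear = unlocked */

void spin_lock(void) {
    /* atomically set the flag and return its previous value;
       spin while it was already set (someone else holds the lock) */
    while (atomic_flag_test_and_set(&lock))
        ;  /* busy wait */
}

void spin_unlock(void) {
    atomic_flag_clear(&lock);
}
```

Because the test and the set happen in one indivisible step, two threads can no longer both observe lock == 0 and enter together.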

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual-exclusion lock variables (mutexes):
- they resolve synchronization problems between threads;
- a mutex is used to grant threads access to shared resources in turn;
- it provides mutual exclusion between threads.

Note: a mutex only synchronizes threads within a single process; it cannot synchronize threads belonging to different processes.


#include <pthread.h>   /* header declaring the mutex functions */

Declare a variable:
  pthread_mutex_t mutex;

Static initialization:
  mutex = PTHREAD_MUTEX_INITIALIZER;
  mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function:
  int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);


Important functions:

  int pthread_mutex_lock(pthread_mutex_t *mutex);
  int pthread_mutex_unlock(pthread_mutex_t *mutex);
  int pthread_mutex_trylock(pthread_mutex_t *mutex);
  int pthread_mutex_destroy(pthread_mutex_t *mutex);


Dependency analysis

One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.


Example

Ex. 1:
  forall (i = 0; i < 5; i++)
      a[i] = 0;
All instances can be executed simultaneously.

Ex. 2:
  forall (i = 2; i < 6; i++) {
      x = i - 2*i + i*i;
      a[i] = a[x];
  }
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered by process P(m).

If the three conditions below are all satisfied, the two processes can be executed concurrently:

  I1 ∩ O2 = ∅
  I2 ∩ O1 = ∅
  O1 ∩ O2 = ∅


Dependency analysis

Example 1. Suppose the two statements are (in C):
  a = x + y;
  b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, so
  I1 ∩ O2 = ∅,  I2 ∩ O1 = ∅,  O1 ∩ O2 = ∅
and the two statements can be executed simultaneously.


Dependency analysis

Example 2. Suppose the two statements are (in C):
  a = x + y;
  b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, so
  I2 ∩ O1 = {a} ≠ ∅
and the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data: in a parallel programming language supporting shared memory, a variable might be declared as

  shared int x;

In C/C++, a global variable is shared between threads:

  int x;   /* global */

par construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

  par {
      S1; S2; ...; Sn;
  }

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

  par {
      proc1(); proc2(); ...; procn();
  }

forall construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

  forall (i = 0; i < n; i++) {
      S1; S2; ...; Sm;
  }

which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, ..., Sm. Each process uses a different value of i. For example:

  forall (i = 0; i < 5; i++)
      a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.

Shared Data in Systems with Caches

Cache coherence protocols:
- In the update policy, copies of data in all caches are updated at the time one copy is altered.
- In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; those copies are updated only when the associated processor next makes a reference to the data.


False sharing:
- The key characteristic involved is that caches are organized in blocks of contiguous locations.
- Different processors require different parts of a block, but not the same bytes; each write still invalidates the other processors' copies of the whole block.


Solution for false sharing:
- The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.
- The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.
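A common hand-applied form of this layout change is to pad per-thread data to the cache-line size (64 bytes here is an assumption; the real line size is machine-dependent):

```c
#include <stddef.h>
#include <assert.h>

#define CACHE_LINE 64   /* assumed cache-line (block) size in bytes */

/* Without padding, counters for different threads could land in the
   same cache block and falsely share it; the pad forces each counter
   onto its own block. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

struct padded_counter counters[4];   /* one per thread */
```

Each thread updates only counters[my_id].value, and no write by one thread ever invalidates another thread's block.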


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central-memory or a distributed-memory architecture.

- Distributed-memory systems include NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.
- Central-memory systems are also known as UMA (uniform memory access) systems.


In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.

UMA systems are of two types:
- PVP: the parallel vector processor, also called a vector supercomputer;
- SMP: the symmetric multiprocessor.


Distributed-Memory architecture

- A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory.
- Memories in other nodes are called remote memories.
- Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA and COMA.


Distributed-Memory architecture

In a NORMA machine:
- The node memories have separate address spaces.
- A node can't directly access remote memory.
- The only way to access remote data is by passing messages.


In an NCC-NUMA machine (a typical example is the Cray T3E): besides the local memory, each node has a set of node-level registers called E-registers. Other NCC-NUMA systems may allow loading a remote value directly into a processor register.


In a COMA machine:
- All local memories are structured as caches (called COMA caches). Such a cache has much larger capacity than the level-2 cache or the remote cache of a node.
- COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.

NCC-NUMA versus CC-NUMA and COMA

- An NCC-NUMA system doesn't have hardware support for cache coherency, whereas CC-NUMA and COMA provide cache-coherency support in hardware.
- It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.


CC-NUMA versus COMA

- CC-NUMA: main memory consists of all the local memories.
- COMA: main memory consists of all the COMA caches.
- All the extra complexity makes a COMA system more expensive to implement than a NUMA machine.


Characteristics of the five distributed-memory architectures


Page 23: Seminar Shared memory Programming

Executing a Pthread Thread(cont)

pthread_exit(void status) Terminate amp destroy a thread

pthread_cancel() Thread is destroyed by another process

int pthread_join(pthread_t th void thread_return) pthread_join() force call thread suspend its execution amp wait until thread having

thread ID terminates Thread_return contain the return value (value of return statement or pthread_exit(hellip) statement)

Detached ThreadThere are cases in which threads can

be terminated without needed of pthread_join

Detached Thread

When Detached Thread teminate they are destroyed amp their resource released

=gt More efficient

Main program

Pthread_create()

Termination

Thread

Pthread_create()

Pthread_create() Termination

Termination

Thread

Thread

Constructs for specifying Parallelism

Thread Pools

A master thread could control a collection of slave threads A work pool of threads could be formed Threads can communicate through shared location or as we shall see = using signals

Constructs for specifying Parallelism

Thread ndash safe routines

System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results

Thread ndash safeness an applications ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions

Constructs for specifying Parallelism

Thread ndash safe routines (cont)

Suppose that your application creates several threads each of which makes a call to the same library routine

This library routine accessesmodifies a global structure or location in memory

As each thread calls this routine it is possible that they may try to modify this global structurememory location at the same time

If the routine does not employ some sort of synchronization constructs to prevent data corruption then it is not thread-safe

fe

Constructs for specifying Parallelism

Sharing DataCreating Shared Data1

Accessing Shared Data2

Locks

Deadlock

Semaphores

Deadlock

Condition Variables

Language Constructs for Parallelism3

Dependency Analysis4

Shared Data in system with caches5

Sharing Data

Text in here

Text in here

Conditions for Deadlock

1 Mutual exclusion bull Resource can not be shared bull Requests are delayed until resource is released

2 Hold-and-wait bull Thread holds one resource while waits for another

3 No preemption bull Resources are released voluntarily after completion

4 Circular wait bull Circular dependencies exist in ldquowaits-forrdquo or ldquoresource-allocationrdquo graphs

ALL four conditions MUST hold

wwwthemegallerycom Company Logo

Handing Deadlock

Text

Text

Text

Txt

Deadlock prevention Deadlock avoidance

Deadlock detection and recovery

Ignore

wwwthemegallerycom Company Logo

Deadlock

88b n-processes deadlock 88a 2 processes dealock

R1R2hellipRn lagrave tagravei nguyecircn P1P2hellipPn lagrave quaacute trigravenh

Example

LOGO

Semaphore

wwwthemegallerycom Company Logo

sEMaPHoRE

A positive integer operated upon by 2 operations P amp V

The value is the number of the units of the resource which are free

A binary semaphore has value 0 or 1A general semaphore can take on

positive values other than 0 and 1

wwwthemegallerycom Company Logo

sEMaPHoRE

P amp V operations are performed indivisibly

P(s) waits until s is greater than 0 then decrements s by 1 allows the process to continue

V(s) increments s by 1 to release one of the waiting processes (if any)

wwwthemegallerycom Company Logo

sEMaPHoRE

The first process reach its P(s) operation or to be accepted will set the semaphore to 0

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore

wwwthemegallerycom Company Logo

sEMaPHoRE

When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section

Monitor

Disadvange of Semaphorebull Although semaphore provide a

convenient and effective mechanism for process synchronization using them incorrectly can result in timing error that are difficult to detect (deadlock) since these errors happen only if some particular execution sequences take place and these sequences do not always occur

bull Using incorrectly may be caused by an honest programming error or an uncooperative programmer

Shared memory multiprocessor

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur

bull Both processes are simultaneously active will cause a deadlock

Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type

bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only

method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events

conflict

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A type or abstract data type encapsulates private data with public method to operate on that data

bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor

bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()

procedure P2()

procedure Pn()

initialization_code ()

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Usage Monitor

bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly

bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization

bull Some addiontional synchronization ldquotailor moderdquo use conditional construct

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Conditional type

bull Declare conditional xybull Only operations that can be invoke on a

conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until

another process invokesxsignal()the process invoking this operation resumes exatly one

suspended processbull if no process is suspended xsignal() has

no effect

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor conditional type

Condition variables

Condition variablesAllow threads to synchronize based

upon the actual value of data Without condition variables the

programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity

Always used in conjunction with a mutex lock

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified

condition variable cond (if any threads are blocked on cond)

int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)

initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised

Pthread_cond_t cond Declare condition variable

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable

cond

int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in

effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined

int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)

int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)

block on a condition variable the second one alow to appoint timeout

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK

main

Thread 23

void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)

void watch_count(void t) pthread_mutex_lock(ampcount_mutex)

if (countltCOUNT_LIMIT)

pthread_cond_wait (hellip)

count += 125

pthread_mutex_unlock(hellip) pthread_exit(NULL)

Thread 1

Sharing Data

The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach the shared segment to the data segment of the calling process

Return the starting address of the data segment
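A minimal sketch of these two calls in action, assuming a system with System V IPC (e.g. Linux); the segment is created with IPC_PRIVATE and shared with a forked child, and the names here are ours:

```c
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>
#include <string.h>

int shm_demo(void)
{
    int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);  /* create a 4 KB segment */
    if (id < 0)
        return -1;
    pid_t pid = fork();
    if (pid == 0) {                        /* child: attach, write, detach */
        char *p = shmat(id, NULL, 0);
        strcpy(p, "hello from child");
        shmdt(p);
        _exit(0);
    }
    waitpid(pid, NULL, 0);                 /* wait until the child has written */
    char *p = shmat(id, NULL, 0);          /* parent: attach at a kernel-chosen address */
    int ok = (strcmp(p, "hello from child") == 0);
    shmdt(p);
    shmctl(id, IPC_RMID, NULL);            /* mark the segment for removal */
    return ok;
}
```

The parent sees the child's write because both virtual address spaces map the same physical segment.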

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: CONFLICT

- Reading the variable by different processes does not cause conflict.
- But writing a new value does: CONFLICT.

Example: consider two processes, each of which is to add 1 to a shared data item, x:

Instruction      Process 1          Process 2
x = x + 1        read x             read x
  (time)         compute x + 1      compute x + 1
                 write to x         write to x

[Figure: conflict in accessing shared variable x. Both processes read the same value of x, each adds 1, and both write back, so one of the two increments is lost.]

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This mechanism is called mutual exclusion.

lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:

lock = 1: a process has entered the critical section;

lock = 0: no process is in the critical section.

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing;   /* no operation in while loop */
lock = 1;                       /* enter critical section */

    ... critical section ...

lock = 0;                       /* leave critical section */

A lock that is examined repeatedly in a loop like this is called a spin lock.
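Note that the code above is not safe as written: the test and the set are separate steps, so two processes can both observe lock == 0 and enter together. A real spin lock needs an atomic read-modify-write instruction; a sketch using C11 atomics (the function names are ours):

```c
#include <stdatomic.h>
#include <pthread.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;
static long counter = 0;

static void spin_lock(void)
{
    while (atomic_flag_test_and_set(&lock_flag))   /* atomically set; busy-wait while already set */
        ;
}

static void spin_unlock(void)
{
    atomic_flag_clear(&lock_flag);                 /* lock = 0: leave critical section */
}

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_lock();
        counter++;                                 /* critical section */
        spin_unlock();
    }
    return NULL;
}

long run_spin_demo(void)
{
    pthread_t a, b;
    counter = 0;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;   /* no increments are lost under the atomic lock */
}
```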

This mechanism is called busy waiting. In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead, but this brings:

- overhead in saving and restoring process information;
- the need to choose the best or highest-priority process to enter the critical section.

Process 1                            Process 2
while (lock == 1) do_nothing;
lock = 1;
/* critical section */               while (lock == 1) do_nothing;
lock = 0;
                                     lock = 1;
                                     /* critical section */
                                     lock = 0;

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).

They resolve synchronization problems between threads.

A mutex is used to grant threads access to a shared resource in an ordered way.

It provides mutual exclusion between threads.

Note:

A mutex is only used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.

www.themegallery.com

#include <pthread.h>   /* header declaring the mutex functions */

Declare a variable:

pthread_mutex_t mutex;

Static initialisation:

mutex = PTHREAD_MUTEX_INITIALIZER;
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialisation by function call:

int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);


The important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
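A small sketch of the lock/trylock/unlock/destroy life cycle (the function name is ours); on a default mutex, trylock on an already-held lock fails immediately with EBUSY instead of blocking:

```c
#include <pthread.h>
#include <errno.h>

int run_mutex_demo(void)
{
    pthread_mutex_t m;
    pthread_mutex_init(&m, NULL);          /* attr == NULL: default mutex attributes */
    pthread_mutex_lock(&m);
    int busy = pthread_mutex_trylock(&m);  /* already held, so this returns EBUSY */
    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return busy;
}
```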


Dependency analysis

One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.


Example

Ex1:

forall (i = 0; i < 5; i++)
    a[i] = 0;

All instances can be executed simultaneously.

Ex2:

forall (i = 2; i < 6; i++) {
    x = i - 2*i + i*i;
    a[i] = a[x];
}

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered by process P(m).

If the following three conditions are all satisfied, the two processes can be executed concurrently:

I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅


Dependency analysis

Example 1: Suppose the two statements are (in C):

a = x + y;
b = x + z;

Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, so

I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅

and the two statements can be executed simultaneously.


Dependency analysis

Example 2: Suppose the two statements are (in C):

a = x + y;
b = a + b;

Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, so

I2 ∩ O1 = {a} ≠ ∅

and the two statements cannot be executed simultaneously.
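The three intersections are easy to check mechanically. A sketch that encodes each variable as one bit of a mask; the encoding and names are ours, covering the two examples above:

```c
#include <stdbool.h>

/* one bit per variable */
enum { A = 1 << 0, B = 1 << 1, X = 1 << 2, Y = 1 << 3, Z = 1 << 4 };

/* Bernstein's conditions: true when the two statements may run concurrently */
bool bernstein(unsigned I1, unsigned O1, unsigned I2, unsigned O2)
{
    return (I1 & O2) == 0 &&   /* I1 ∩ O2 = ∅ */
           (I2 & O1) == 0 &&   /* I2 ∩ O1 = ∅ */
           (O1 & O2) == 0;     /* O1 ∩ O2 = ∅ */
}
```

Usage: bernstein(X | Y, A, X | Z, B) is true (Example 1); bernstein(X | Y, A, A | B, B) is false, because I2 ∩ O1 = {a} (Example 2).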

Language Constructs for Parallelism

Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared as:

shared int x;

With C/C++ and threads, an ordinary global declaration (int x;) serves this purpose.

par Construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

par {
    s1;
    s2;
    ...
    sn;
}

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads could be specified by listing the routines that are to be executed concurrently:

par {
    proc1();
    proc2();
    ...
    procn();
}

forall Construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) {
    s1;
    s2;
    ...
    sm;
}

which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, ..., sm. Each process uses a different value of i. For example,

forall (i = 0; i < 5; i++)
    a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
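C has no forall, but the same effect can be sketched with Pthreads by forking one thread per iteration and joining them all; N and the names are ours:

```c
#include <pthread.h>

#define N 5
static int a[N];

static void *body(void *arg)
{
    long i = (long)arg;   /* each thread receives its own value of i */
    a[i] = 0;
    return NULL;
}

int forall_clear(void)
{
    pthread_t tid[N];
    for (long i = 0; i < N; i++)
        a[i] = -1;                                        /* make the clearing observable */
    for (long i = 0; i < N; i++)
        pthread_create(&tid[i], NULL, body, (void *)i);   /* start all instances */
    for (long i = 0; i < N; i++)
        pthread_join(tid[i], NULL);                       /* wait for all of them */
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += a[i];
    return sum;   /* 0 once every element has been cleared */
}
```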

Shared Data in Systems with Caches

Shared Data in Systems with Caches

Cache coherence protocols:

In the update policy, copies of data in all caches are updated at the time one copy is altered.

In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated. These copies are updated only when the associated processor makes reference to them.

Shared Data in Systems with Caches

False sharing:

The key characteristic exploited here is that caches are organized in blocks of contiguous locations. False sharing occurs when different parts of a block are required by different processors, but not the same bytes.

[Figure: false sharing. Two processors repeatedly update different locations that lie in the same cache block.]

Shared Data in Systems with Caches

Solution for false sharing:

The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.
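A common middle ground is to pad and align per-processor data so that each item starts its own block; the 64-byte block size below is an assumption, not from the slides:

```c
#include <stdalign.h>

/* each counter occupies a full (assumed) 64-byte cache block */
typedef struct {
    alignas(64) long value;
} padded_counter;

static padded_counter per_thread_count[4];   /* no two counters share a block */

int padded_block_size(void)
{
    return (int)sizeof(padded_counter);      /* padded out to the 64-byte alignment */
}
```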


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.

Central memory systems are also known as UMA (uniform memory access) systems


Central memory versus distributed memory

In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types:
- PVP: the parallel vector processor, also called a vector supercomputer
- SMP: the symmetric multiprocessor


Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.



Distributed-Memory architecture

In a NORMA machine:
- The node memories have separate address spaces.
- A node can't directly access remote memory.
- The only way to access remote data is by passing messages.


Distributed-Memory architecture

In an NCC-NUMA machine:
- A typical example is the Cray T3E.
- Besides the local memory, each node has a set of node-level registers called E-registers.
- Other NCC-NUMA systems may allow loading a remote value directly into a processor register.


Distributed-Memory architecture

In a COMA machine:
- All local memories are structured as caches (called COMA caches).
- Such a cache has a much larger capacity than the level-2 cache or the remote cache of a node.
- COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA and COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache-coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system


CC-NUMA versus COMA

CC-NUMA: main memory consists of all the local memories.

COMA: main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.


Characteristics of five distributed-memory architectures


Detached Threads

There are cases in which threads can be terminated without the need for pthread_join().

When detached threads terminate, they are destroyed and their resources released.

=> More efficient.

[Figure: the main program issues several pthread_create() calls; each detached thread terminates on its own, with no join.]
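A sketch of creating a detached thread through a thread attribute; the completion flag and the bounded wait loop are ours, only so the demo can observe that the thread ran (a detached thread cannot be joined):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

static atomic_int done = 0;

static void *task(void *arg)
{
    (void)arg;
    atomic_store(&done, 1);   /* resources are reclaimed automatically at exit */
    return NULL;
}

int run_detached_demo(void)
{
    pthread_attr_t attr;
    pthread_t tid;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    pthread_create(&tid, &attr, task, NULL);   /* no pthread_join() is needed (or allowed) */
    pthread_attr_destroy(&attr);
    for (int i = 0; i < 2000 && !atomic_load(&done); i++)
        usleep(1000);                          /* give the detached thread up to ~2 s */
    return atomic_load(&done);
}
```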

Constructs for specifying Parallelism

Thread Pools

A master thread could control a collection of slave threads. A work pool of threads could be formed. Threads can communicate through shared locations or, as we shall see, by using signals.

Constructs for specifying Parallelism

Thread-safe routines

System calls or library routines are called thread-safe if they can be called from multiple threads simultaneously and always produce correct results.

Thread-safeness: an application's ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions.

Constructs for specifying Parallelism

Thread-safe routines (cont.)

Suppose that your application creates several threads each of which makes a call to the same library routine

This library routine accesses/modifies a global structure or location in memory.

As each thread calls this routine it is possible that they may try to modify this global structurememory location at the same time

If the routine does not employ some sort of synchronization constructs to prevent data corruption then it is not thread-safe


Constructs for specifying Parallelism

Sharing Data
1. Creating Shared Data
2. Accessing Shared Data
   - Locks, Deadlock
   - Semaphores, Deadlock
   - Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches

Sharing Data


Conditions for Deadlock

1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.

2. Hold-and-wait: a thread holds one resource while it waits for another.

3. No preemption: resources are released only voluntarily, after completion.

4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.

ALL four conditions MUST hold


Handling Deadlock

- Deadlock prevention
- Deadlock avoidance
- Deadlock detection and recovery
- Ignore


Deadlock

[Figures: (a) two-process deadlock; (b) n-process deadlock. R1, R2, ..., Rn are resources; P1, P2, ..., Pn are processes.]

Example


Semaphore

Semaphore

A semaphore is a positive integer operated upon by two operations, P and V. Its value is the number of units of the resource that are free.

A binary semaphore has the value 0 or 1. A general semaphore can take on positive values other than 0 and 1.

Semaphore

The P and V operations are performed indivisibly:

P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.

V(s): increments s by 1 to release one of the waiting processes (if any).

Semaphore

The first process to reach its P(s) operation, or to be accepted, will set the semaphore to 0.

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.

Semaphore

When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes is allowed to proceed into the critical section.
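POSIX exposes P and V as sem_wait() and sem_post(). A sketch using an unnamed semaphore with initial value 1 as a binary lock around a shared counter; it assumes a platform with unnamed POSIX semaphores (e.g. Linux), and the names are ours:

```c
#include <semaphore.h>
#include <pthread.h>

static sem_t s;
static int shared_total = 0;

static void *sem_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 50000; i++) {
        sem_wait(&s);      /* P(s): block while s == 0, then decrement */
        shared_total++;    /* critical section */
        sem_post(&s);      /* V(s): increment, releasing one waiter */
    }
    return NULL;
}

int run_sem_demo(void)
{
    pthread_t a, b;
    shared_total = 0;
    sem_init(&s, 0, 1);    /* 0: not shared between processes; initial value 1 */
    pthread_create(&a, NULL, sem_worker, NULL);
    pthread_create(&b, NULL, sem_worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    sem_destroy(&s);
    return shared_total;   /* 100000 when mutual exclusion held */
}
```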

Monitor

Disadvantages of semaphores:

- Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.
- Incorrect use may be caused by an honest programming error or by an uncooperative programmer.



Monitor

Example of incorrect semaphore use:

Right code:              Wrong code:
...                      ...
Wait(mutex);             Signal(mutex);
/* critical section */   /* critical section */
Signal(mutex);           Wait(mutex);
...                      ...

The wrong code violates the mutual exclusion requirement.


Monitor

Example of incorrect semaphore use:

Right code:              Wrong code:
...                      ...
Wait(mutex);             Wait(mutex);
/* critical section */   /* critical section */
Signal(mutex);           Wait(mutex);
...                      ...

The wrong code causes deadlock.

Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology

Monitor

- If the programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock will occur.
- When both processes are simultaneously active, acquiring the two semaphores in opposite orders causes a deadlock:

Process P1:              Process P2:
...                      ...
Wait(S);                 Wait(Q);
Wait(Q);                 Wait(S);
/* critical section */   /* critical section */
Signal(S);               Signal(Q);
Signal(Q);               Signal(S);


Monitor

- To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
- A monitor is an approach to synchronizing operations on a computer when a shared resource is used. A monitor includes:
  - a suite of procedures that provides the only method of accessing the shared resource;
  - mutual exclusion among those procedures;
  - the variables associated with the shared resource;
  - invariants that must hold, to avoid conflicts.


Monitor

- A type, or abstract data type, encapsulates private data with public methods to operate on that data.
- The monitor type presents a set of programmer-defined operations that are provided with mutual exclusion within the monitor.
- The monitor type also contains the declarations of variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.


Monitor

A structure of the monitor type:

monitor monitor_name {
    /* shared variable declarations */

    procedure P1(...) { ... }
    procedure P2(...) { ... }
    ...
    procedure Pn(...) { ... }

    initialization_code(...) { ... }
}
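C has no monitor construct, but the same discipline can be sketched as a struct that pairs the shared state with a mutex, where the procedures are the only way in and each takes the lock first; the counter monitor here is our illustration, not from the slides:

```c
#include <pthread.h>

/* a tiny "monitor": shared state plus its lock, used only via the procedures below */
typedef struct {
    pthread_mutex_t lock;
    int value;
} counter_monitor;

void counter_init(counter_monitor *m)   /* initialization code */
{
    pthread_mutex_init(&m->lock, NULL);
    m->value = 0;
}

void counter_add(counter_monitor *m, int n)
{
    pthread_mutex_lock(&m->lock);       /* only one caller active inside the monitor */
    m->value += n;
    pthread_mutex_unlock(&m->lock);
}

int counter_get(counter_monitor *m)
{
    pthread_mutex_lock(&m->lock);
    int v = m->value;
    pthread_mutex_unlock(&m->lock);
    return v;
}
```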


Usage of Monitors

- The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
- The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; some additional, "tailor-made" synchronization mechanisms need to be defined.
- These additional "tailor-made" synchronizations use the condition construct.


Conditional type

bull Declare conditional xybull Only operations that can be invoke on a

conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until

another process invokesxsignal()the process invoking this operation resumes exatly one

suspended processbull if no process is suspended xsignal() has

no effect


Structure of a monitor with condition variables

Condition variables

Condition variables allow threads to synchronize based upon the actual value of data.

Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be a very time-consuming and unproductive exercise, since the thread would be continuously busy in this activity.

A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t *cond): unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr): initialises the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.

pthread_cond_t cond;   /* declare a condition variable */

Sharing Data

Pthread Condition Variables (cont.)

int pthread_cond_broadcast(pthread_cond_t *cond): unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in

effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined

int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)

int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)

block on a condition variable the second one alow to appoint timeout

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK

main

Thread 23

void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)

void watch_count(void t) pthread_mutex_lock(ampcount_mutex)

if (countltCOUNT_LIMIT)

pthread_cond_wait (hellip)

count += 125

pthread_mutex_unlock(hellip) pthread_exit(NULL)

Thread 1

Sharing Data

The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment

Creating Shared Data

Each process has its own virtual address space within the virtualmemory management system

Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space

Creating Shared Data (tt)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach shared segment to the data segmentof the calling process

Return the starting address of the data segment

Accessing Shared Data

Accessing Shared Data needs careful control if the data is everaltered by a process

Problem CONFLICT

o Reading the variable by different process does not cause CONFLICT

o But writing new value CONFLICT

EX Consider two processes each of which is to add 1 two a shareddata item x

Instruction Process 1 Process 2

x = x + 1 read x read x

Compute x + 1 Compute x + 1

write to x write to xtime

Conflict in accessing shared data

Shared variable x

+1 +1

read read

write write

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This Mechanism Mutual exclusion

lock

The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock

lock is variale contain value 0 1

lock = 1 process entered Critical Section

lock = 0 no process is in Critical Section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section

hellipcritical sectionhellip leave critical section

lock = 0

lock spin lock

Mechanism Busy waiting

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 25: Seminar Shared memory Programming

Thread Pools

A master thread could control a collection of slave threads A work pool of threads could be formed Threads can communicate through shared location or as we shall see = using signals

Constructs for specifying Parallelism

Thread ndash safe routines

System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results

Thread ndash safeness an applications ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions

Constructs for specifying Parallelism

Thread ndash safe routines (cont)

Suppose that your application creates several threads each of which makes a call to the same library routine

This library routine accessesmodifies a global structure or location in memory

As each thread calls this routine it is possible that they may try to modify this global structurememory location at the same time

If the routine does not employ some sort of synchronization constructs to prevent data corruption then it is not thread-safe

fe

Constructs for specifying Parallelism

Sharing DataCreating Shared Data1

Accessing Shared Data2

Locks

Deadlock

Semaphores

Deadlock

Condition Variables

Language Constructs for Parallelism3

Dependency Analysis4

Shared Data in system with caches5

Sharing Data

Text in here

Text in here

Conditions for Deadlock

1. Mutual exclusion
   • The resource cannot be shared.
   • Requests are delayed until the resource is released.
2. Hold-and-wait
   • A thread holds one resource while waiting for another.
3. No preemption
   • Resources are released only voluntarily, after completion.
4. Circular wait
   • Circular dependencies exist in "waits-for" or "resource-allocation" graphs.

ALL four conditions MUST hold.


Handling Deadlock

• Deadlock prevention
• Deadlock avoidance
• Deadlock detection and recovery
• Ignore


Deadlock

Figure 8.8a: two-process deadlock. Figure 8.8b: n-process deadlock. R1, R2, …, Rn are resources; P1, P2, …, Pn are processes.

Example


Semaphore


Semaphore

A positive integer operated upon by two operations, P and V.

Its value is the number of units of the resource which are free.

A binary semaphore has value 0 or 1. A general semaphore can take on positive values other than 0 and 1.


Semaphore

P and V operations are performed indivisibly:

P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.

V(s): increments s by 1 to release one of the waiting processes (if any).


Semaphore

The first process to reach its P(s) operation sets the semaphore to 0 and is allowed to continue.

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.


Semaphore

When a process reaches its V(s) operation, it sets the semaphore s back to 1, and one of the waiting processes is allowed to proceed into the critical section.

Monitor

Disadvantage of semaphores:
• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.
• Incorrect use may be caused by an honest programming error or an uncooperative programmer.


Monitor

Example: incorrect semaphore use

Right code:
…
Wait(mutex)
   critical section
Signal(mutex)
…

Wrong code:
…
Signal(mutex)
   critical section
Wait(mutex)
…

This wrong code violates mutual exclusion.


Monitor

Right code:
…
Wait(mutex)
   critical section
Signal(mutex)
…

Wrong code:
…
Wait(mutex)
   critical section
Wait(mutex)
…

This wrong code causes deadlock.

Khoa Khoa học & Kỹ thuật Máy tính - Đại học Bách Khoa TP.HCM (Faculty of Computer Science & Engineering, HCMC University of Technology)

Monitor

• If the programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock will occur.
• Two processes that acquire two semaphores in opposite orders can deadlock when both are simultaneously active:

Process P1:
…
Wait(S)
Wait(Q)
   critical section
Signal(S)
Signal(Q)

Process P2:
…
Wait(Q)
Wait(S)
   critical section
Signal(Q)
Signal(S)


Monitor

• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
• A monitor is an approach to synchronizing operations on a shared resource. A monitor includes:
   • a suite of procedures that provides the only method of access to the shared resource;
   • mutual exclusion among those procedures;
   • the variables associated with the shared resource;
   • invariants that must hold, to avoid conflicts.


Monitor

• A type, or abstract data type, encapsulates private data with public methods that operate on that data.
• The monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.
• The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of procedures or functions that operate on those variables.


Monitor

• The structure of a monitor type:

monitor monitor_name {
    /* shared variable declarations */
    procedure P1(…) { … }
    procedure P2(…) { … }
    …
    procedure Pn(…) { … }
    initialization_code(…) { … }
}


Structure Monitor


Usage Monitor

• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
• The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; additional, tailored synchronization mechanisms need to be defined.
• Such tailored synchronization uses the condition construct.


Condition type

• Declaration: condition x, y;
• The only operations that can be invoked on a condition variable are wait() and signal():
   • x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
   • x.signal(): resumes exactly one suspended process.
• If no process is suspended, x.signal() has no effect.


Structure of a monitor with a condition type

Condition variables

Condition variables:
• Allow threads to synchronize based upon the actual value of data.
• Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This is a very time-consuming and unproductive exercise, since the thread would be continuously busy in this activity.

Always used in conjunction with a mutex lock

Sharing Data

Pthread Condition Variables

pthread_cond_t cond; declares a condition variable.

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
Initialises the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.

int pthread_cond_signal(pthread_cond_t *cond);
Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t *cond);
Unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
Destroys the given condition variable; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
Block on a condition variable; the second form additionally allows a timeout to be specified.

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

main:

#include <pthread.h>

int count = 0;                                /* global variable, DECLARE */
pthread_mutex_t count_mutex;
pthread_cond_t count_threshold_cv;
…
pthread_mutex_init(&count_mutex, NULL);       /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
…
pthread_create(…);                            /* CREATE THREADS TO DO WORK */

Threads 2 and 3:

void *inc_count(void *t)
{
    …
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(…);
        …
        pthread_mutex_unlock(&count_mutex);
        /* do some work so the threads can alternate on the mutex lock */
        sleep(1);
    }
    pthread_exit(NULL);
}

Thread 1:

void *watch_count(void *t)
{
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)
        pthread_cond_wait(…);
    count += 125;
    pthread_mutex_unlock(…);
    pthread_exit(NULL);
}

(The wait is placed in a loop because pthread_cond_wait() may return spuriously.)

Sharing Data

The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages, as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget(): creates a shared memory segment; the return value is the shared memory ID.

shmat(): attaches the shared segment to the data segment of the calling process and returns the starting address of the segment.

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: conflict.
o Reading the variable by different processes does not cause a conflict.
o But writing new values can conflict.

Ex.: Consider two processes, each of which is to add 1 to a shared data item x:

Instruction    Process 1       Process 2
x = x + 1      read x          read x
               compute x + 1   compute x + 1
               write to x      write to x

(time runs downwards)

(Figure: conflict in accessing shared data. Both processes read the shared variable x, compute x + 1, and write the result back, so one of the two increments is lost.)

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.

This mechanism is called mutual exclusion.

Lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:
lock = 1: a process has entered the critical section;
lock = 0: no process is in the critical section.

The lock operates much like a door lock.

Suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section. It now has to wait until it is allowed to enter:

while (lock == 1) do_nothing;  /* no operation in while loop */
lock = 1;                      /* enter critical section */
   …critical section…
lock = 0;                      /* leave critical section */

A lock that relies on such a waiting loop is called a spin lock, and the mechanism is known as busy waiting.

In some cases it may be possible to deschedule the process from the processor and schedule another process instead, but note:

• there is overhead in saving and restoring process information;
• it is necessary to choose the best or highest-priority process to enter the critical section.

Process 1:                       Process 2:

while (lock == 1) do_nothing;    while (lock == 1) do_nothing;
lock = 1;                        lock = 1;
   critical section                 critical section
lock = 0;                        lock = 0;

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes). They resolve synchronization problems between threads:

• a mutex shares resources among the threads in an orderly fashion;
• it provides mutual exclusion between the threads.

Note: a mutex is only used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.


#include <pthread.h>  /* header containing the mutex functions */

Declare the variable: pthread_mutex_t mutex;

Static initialization:

mutex = PTHREAD_MUTEX_INITIALIZER;
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function:

int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);


Important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);


Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together. Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.


Example

Ex. 1:  forall (i = 0; i < 5; i++)
            a[i] = 0;

All instances can be executed simultaneously.

Ex. 2:  forall (i = 2; i < 6; i++) {
            x = i - 2*i + i*i;
            a[i] = a[x];
        }

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

• I(n) is the set of memory locations read by process P(n).
• O(m) is the set of memory locations altered by process P(m).

If the three conditions

I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅

are all satisfied, the two processes can be executed concurrently.


Dependency analysis

Example 1: Suppose the two statements are (in C):

a = x + y;
b = x + z;

Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, so

I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅

and the two statements can be executed simultaneously.


Dependency analysis

Example 2: Suppose the two statements are (in C):

a = x + y;
b = a + b;

Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, so

I2 ∩ O1 = {a} ≠ ∅

and the two statements cannot be executed simultaneously.

Language Contructs for Parallelism

Shared data: in a parallel programming language supporting shared memory, a variable might be declared as

shared int x;

With C or C++, a global variable, e.g. int x, is shared by all the threads.

par construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

par {
    s1;
    s2;
    …
    sn;
}

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads could be specified by listing the routines that are to be executed concurrently:

par {
    proc1();
    proc2();
    …
    procn();
}

forall construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) {
    s1;
    s2;
    …
    sm;
}

which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, …, sm. Each process uses a different value of i. For example,

forall (i = 0; i < 5; i++)
    a[i] = 0;

clears a[0], a[1], a[2], a[3] and a[4] to zero concurrently.


Shared Data in Systems with Caches


Shared Data in Systems with Caches

Cache coherence protocols:

• In the update policy, copies of the data in all caches are updated at the time one copy is altered.
• In the invalidate policy, when one copy of the data is altered, the same data in any other cache is invalidated; those copies are updated only when the associated processor next makes reference to the data.


Shared Data in Systems with Caches

False sharing:

• The key characteristic used is that caches are organized in blocks of contiguous locations.
• False sharing occurs when different parts of a block are required by different processors, but not the same bytes.


Shared Data in Systems with Caches

Solution for false sharing:

• The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.
• The only way to avoid false sharing completely would be to place each element in a different block, which would create significant wastage of storage for a large array.


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include the NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.

Central memory systems are also known as UMA (uniform memory access) systems


Central memory versus distributed memory

In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types:

• PVP: the parallel vector processor, also called a vector supercomputer;
• SMP: the symmetric multiprocessor.


Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA and COMA.


Distributed-Memory architecture

In a NORMA machine:

• the node memories have separate address spaces;
• a node cannot directly access remote memory;
• the only way to access remote data is by passing messages.


Distributed-Memory architecture

In an NCC-NUMA machine (a typical example is the Cray T3E):

• besides the local memory, each node has a set of node-level registers called E-registers;
• other NCC-NUMA systems may allow loading a remote value directly into a processor register.


Distributed-Memory architecture

In a COMA machine:

• all local memories are structured as caches (called COMA caches);
• a COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node;
• COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA and COMA

An NCC-NUMA system does not have hardware support for cache coherency, but CC-NUMA and COMA provide cache-coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.


CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 26: Seminar Shared memory Programming

Thread ndash safe routines

System calls or library routines are called thread safe if they can be called from multiple threads simultaneously and always produce correct results

Thread ndash safeness an applications ability to execute multiple threads simultaneously without clobbering shared data or creating race conditions

Constructs for specifying Parallelism

Thread ndash safe routines (cont)

Suppose that your application creates several threads each of which makes a call to the same library routine

This library routine accessesmodifies a global structure or location in memory

As each thread calls this routine it is possible that they may try to modify this global structurememory location at the same time

If the routine does not employ some sort of synchronization constructs to prevent data corruption then it is not thread-safe

fe

Constructs for specifying Parallelism

Sharing DataCreating Shared Data1

Accessing Shared Data2

Locks

Deadlock

Semaphores

Deadlock

Condition Variables

Language Constructs for Parallelism3

Dependency Analysis4

Shared Data in system with caches5

Sharing Data

Text in here

Text in here

Conditions for Deadlock

1 Mutual exclusion bull Resource can not be shared bull Requests are delayed until resource is released

2 Hold-and-wait bull Thread holds one resource while waits for another

3 No preemption bull Resources are released voluntarily after completion

4 Circular wait bull Circular dependencies exist in ldquowaits-forrdquo or ldquoresource-allocationrdquo graphs

ALL four conditions MUST hold

wwwthemegallerycom Company Logo

Handing Deadlock

Text

Text

Text

Txt

Deadlock prevention Deadlock avoidance

Deadlock detection and recovery

Ignore

wwwthemegallerycom Company Logo

Deadlock

88b n-processes deadlock 88a 2 processes dealock

R1R2hellipRn lagrave tagravei nguyecircn P1P2hellipPn lagrave quaacute trigravenh

Example

LOGO

Semaphore

wwwthemegallerycom Company Logo

sEMaPHoRE

A positive integer operated upon by 2 operations P amp V

The value is the number of the units of the resource which are free

A binary semaphore has value 0 or 1A general semaphore can take on

positive values other than 0 and 1

wwwthemegallerycom Company Logo

sEMaPHoRE

P amp V operations are performed indivisibly

P(s) waits until s is greater than 0 then decrements s by 1 allows the process to continue

V(s) increments s by 1 to release one of the waiting processes (if any)

wwwthemegallerycom Company Logo

sEMaPHoRE

The first process reach its P(s) operation or to be accepted will set the semaphore to 0

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore

wwwthemegallerycom Company Logo

sEMaPHoRE

When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section

Monitor

Disadvange of Semaphorebull Although semaphore provide a

convenient and effective mechanism for process synchronization using them incorrectly can result in timing error that are difficult to detect (deadlock) since these errors happen only if some particular execution sequences take place and these sequences do not always occur

bull Using incorrectly may be caused by an honest programming error or an uncooperative programmer

Shared memory multiprocessor

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur

bull Both processes are simultaneously active will cause a deadlock

Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type

bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only

method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events

conflict

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A type or abstract data type encapsulates private data with public method to operate on that data

bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor

bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()

procedure P2()

procedure Pn()

initialization_code ()

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Usage Monitor

bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly

bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization

bull Some addiontional synchronization ldquotailor moderdquo use conditional construct

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Conditional type

bull Declare conditional xybull Only operations that can be invoke on a

conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until

another process invokesxsignal()the process invoking this operation resumes exatly one

suspended processbull if no process is suspended xsignal() has

no effect

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor conditional type

Condition variables

Condition variablesAllow threads to synchronize based

upon the actual value of data Without condition variables the

programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity

Always used in conjunction with a mutex lock

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified

condition variable cond (if any threads are blocked on cond)

int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)

initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised

Pthread_cond_t cond Declare condition variable

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable

cond

int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in

effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined

int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)

int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)

block on a condition variable the second one alow to appoint timeout

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK

main

Thread 23

void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)

void watch_count(void t) pthread_mutex_lock(ampcount_mutex)

if (countltCOUNT_LIMIT)

pthread_cond_wait (hellip)

count += 125

pthread_mutex_unlock(hellip) pthread_exit(NULL)

Thread 1

Sharing Data

The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment

Creating Shared Data

Each process has its own virtual address space within the virtualmemory management system

Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space

Creating Shared Data (tt)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach shared segment to the data segmentof the calling process

Return the starting address of the data segment

Accessing Shared Data

Accessing Shared Data needs careful control if the data is everaltered by a process

Problem CONFLICT

o Reading the variable by different process does not cause CONFLICT

o But writing new value CONFLICT

EX Consider two processes each of which is to add 1 two a shareddata item x

Instruction Process 1 Process 2

x = x + 1 read x read x

Compute x + 1 Compute x + 1

write to x write to xtime

Conflict in accessing shared data

Shared variable x

+1 +1

read read

write write

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This Mechanism Mutual exclusion

lock

The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock

lock is variale contain value 0 1

lock = 1 process entered Critical Section

lock = 0 no process is in Critical Section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section

hellipcritical sectionhellip leave critical section

lock = 0

lock spin lock

Mechanism Busy waiting

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

The important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);

int pthread_mutex_unlock(pthread_mutex_t *mutex);

int pthread_mutex_trylock(pthread_mutex_t *mutex);

int pthread_mutex_destroy(pthread_mutex_t *mutex);


Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.


Example

Ex1: forall (i = 0; i < 5; i++) a[i] = 0;

All instances can be executed simultaneously

Ex2: forall (i = 2; i < 6; i++) {
         x = i - 2*i + i*i;
         a[i] = a[x];
     }

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

- I(i) is the set of memory locations read by process P(i).
- O(j) is the set of memory locations altered by process P(j).

If the following three conditions are all satisfied, the two processes can be executed concurrently:

I1 ∩ O2 = ∅,  I2 ∩ O1 = ∅,  O1 ∩ O2 = ∅


Dependency analysis

Example 1: Suppose the two statements are (in C):

    a = x + y;
    b = x + z;

Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}.

I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, and O1 ∩ O2 = ∅, so the two statements can be executed simultaneously.


Dependency analysis

Example 2: Suppose the two statements are (in C):

    a = x + y;
    b = a + b;

Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}.

I2 ∩ O1 = {a} ≠ ∅, so the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data: In a parallel programming language supporting shared memory, a shared variable might be declared as:

    shared int x;

In C/C++ with threads, an ordinary global variable (int x;) is shared by all the threads.

par construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

    par { s1; s2; ...; sn; }

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads could be specified by listing the routines that are to be executed concurrently:

    par { proc1(); proc2(); ...; procn(); }

forall construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

    forall (i = 0; i < n; i++) { s1; s2; ...; sm; }

which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, ..., sm. Each process uses a different value of i.

Example:

    forall (i = 0; i < 5; i++) a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.


Shared Data in Systems with Caches


Shared Data in Systems with Caches

Cache coherence protocols:

In the update policy, copies of the data in all caches are updated at the time one copy is altered.

In the invalidate policy, when one copy of the data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next references the data.


Shared Data in Systems with Caches

False sharing:

The key characteristic here is that caches are organized in blocks of contiguous locations.

False sharing occurs when different processors require different parts of the same block, but not the same bytes.


Shared Data in Systems with Caches

Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

The only way to avoid false sharing entirely would be to place each element in a different block, which would waste significant storage for a large array.


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central-memory or a distributed-memory architecture.

Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote memory access) architectures.

Central-memory systems are also known as UMA (uniform memory access) systems.


Central memory versus distributed memory

In a UMA system, all memory locations are equidistant from every processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types:
- PVP: the parallel vector processor, also called a vector supercomputer
- SMP: the symmetric multiprocessor


Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.


Distributed-Memory architecture

In a NORMA machine:
- The node memories have separate address spaces.
- A node cannot directly access remote memory.
- The only way to access remote data is by passing messages.


Distributed-Memory architecture

In an NCC-NUMA machine:
- A typical example is the Cray T3E.
- Besides the local memory, each node has a set of node-level registers called E-registers.
- Other NCC-NUMA systems may allow loading a remote value directly into a processor register.


Distributed-Memory architecture

In a COMA machine:
- All local memories are structured as caches (called COMA caches).
- A COMA cache has much larger capacity than the level-2 cache or the remote cache of a node.
- COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system does not have hardware support for cache coherency, but CC-NUMA and COMA provide cache-coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system


CC-NUMA versus COMA

CC-NUMA: main memory consists of all the local memories.

COMA: main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.


Characteristics of five distributed-memory architectures


Page 27: Seminar Shared memory Programming

Thread-safe routines (cont.)

Suppose that your application creates several threads, each of which makes a call to the same library routine.

This library routine accesses/modifies a global structure or location in memory.

As each thread calls this routine, it is possible that they may try to modify this global structure or memory location at the same time.

If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.


Constructs for specifying Parallelism

Sharing Data
1. Creating Shared Data
2. Accessing Shared Data
   - Locks
   - Deadlock
   - Semaphores
   - Monitor
   - Condition Variables
3. Language Constructs for Parallelism
4. Dependency Analysis
5. Shared Data in systems with caches

Sharing Data


Conditions for Deadlock

1. Mutual exclusion: the resource cannot be shared; requests are delayed until the resource is released.

2. Hold-and-wait: a thread holds one resource while it waits for another.

3. No preemption: resources are released only voluntarily, after completion.

4. Circular wait: circular dependencies exist in the "waits-for" or "resource-allocation" graph.

ALL four conditions MUST hold for deadlock to occur.


Handling Deadlock

Four strategies:
- Deadlock prevention
- Deadlock avoidance
- Deadlock detection and recovery
- Ignore the problem


Deadlock

Example:

Figure (a): two-process deadlock. Figure (b): n-process deadlock.

R1, R2, ..., Rn are resources; P1, P2, ..., Pn are processes.


Semaphore


Semaphore

A semaphore is a positive integer operated upon by two operations, P and V.

Its value is the number of units of the resource that are free.

A binary semaphore has the value 0 or 1. A general (counting) semaphore can take on positive values other than 0 and 1.


Semaphore

The P and V operations are performed indivisibly:

P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.

V(s): increments s by 1, to release one of the waiting processes (if any).


Semaphore

The first process to reach its P(s) operation, and be accepted, will set the semaphore to 0.

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.


Semaphore

When the process reaches its V(s) operation, it sets the semaphore s back to 1, and one of the waiting processes is allowed to proceed into the critical section.

Monitor

Disadvantages of semaphores:
- Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (such as deadlock), since these errors happen only if particular execution sequences take place, and those sequences do not always occur.
- Incorrect use may be caused by an honest programming error or by an uncooperative programmer.



Monitor

Example of incorrect semaphore use:

Right code:
    wait(mutex);
    /* critical section */
    signal(mutex);

Wrong code:
    signal(mutex);
    /* critical section */
    wait(mutex);

This incorrect code violates mutual exclusion.


Monitor

Right code:
    wait(mutex);
    /* critical section */
    signal(mutex);

Wrong code:
    wait(mutex);
    /* critical section */
    wait(mutex);

This incorrect code causes deadlock.

Faculty of Computer Science & Engineering - Ho Chi Minh City University of Technology

Monitor

- If the programmer omits the wait() or the signal() around a critical section, or both, either mutual exclusion is violated or a deadlock will occur.
- When both processes below are simultaneously active, a deadlock can occur (each holds one semaphore and waits for the other):

Process P1:
    wait(S); wait(Q);
    /* critical section */
    signal(S); signal(Q);

Process P2:
    wait(Q); wait(S);
    /* critical section */
    signal(Q); signal(S);


Monitor

- To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
- A monitor is an approach to synchronizing operations on a computer when a shared resource is used. A monitor includes:
  - a suite of procedures that provides the only way to access the shared resource;
  - mutual exclusion among those procedures;
  - the variables associated with the shared resource;
  - invariants assumed to hold, to avoid conflicting events.


Monitor

- A type, or abstract data type, encapsulates private data with public methods that operate on that data.
- The monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.
- The monitor type also contains the declaration of the variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.


Monitor

- The structure of a monitor type:

    monitor monitor_name
    {
        /* shared variable declarations */

        procedure P1(...) { ... }
        procedure P2(...) { ... }
        ...
        procedure Pn(...) { ... }

        initialization_code(...) { ... }
    }


Structure of a Monitor


Monitor usage

- The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
- A monitor as defined so far is not sufficiently powerful to model some synchronization schemes; some additional, tailor-made synchronization mechanisms need to be defined.
- These additional tailor-made synchronization mechanisms use the condition construct.


Condition type

- Declaration: condition x, y;
- The only operations that can be invoked on a condition variable are wait() and signal():
  - x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
  - x.signal(): the process invoking this operation resumes exactly one suspended process.
- If no process is suspended, x.signal() has no effect.


Structure of a monitor with condition variables

Condition variables

Condition variables allow threads to synchronize based upon the actual value of data.

Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be very time-consuming and unproductive, since the thread would be continuously busy with this activity.

A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

pthread_cond_t cond;   /* declare a condition variable */

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
initialises the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.

int pthread_cond_signal(pthread_cond_t *cond);
unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

Sharing Data

Pthread Condition Variables (cont.)

int pthread_cond_broadcast(pthread_cond_t *cond);
unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
destroys the given condition variable; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
block on a condition variable; the second one allows specifying a timeout.

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using a condition variable - example (cont.)

main:

    #include <pthread.h>

    int count = 0;                          /* global variable    DECLARE */
    pthread_mutex_t count_mutex;
    pthread_cond_t count_threshold_cv;
    ...
    pthread_mutex_init(&count_mutex, NULL);          /* INITIALIZE */
    pthread_cond_init(&count_threshold_cv, NULL);
    ...
    pthread_create(...);                    /* CREATE THREADS TO DO WORK */

Threads 2 and 3:

    void *inc_count(void *t)
    {
        ...
        for (i = 0; i < TCOUNT; i++) {
            pthread_mutex_lock(&count_mutex);
            count++;
            if (count == COUNT_LIMIT)
                pthread_cond_signal( ... );
            ...
            pthread_mutex_unlock(&count_mutex);
            /* do some work so threads can alternate on the mutex lock */
            sleep(1);
        }
        pthread_exit(NULL);
    }

Thread 1:

    void *watch_count(void *t)
    {
        pthread_mutex_lock(&count_mutex);
        while (count < COUNT_LIMIT)
            pthread_cond_wait( ... );
        count += 125;
        pthread_mutex_unlock( ... );
        pthread_exit(NULL);
    }

Sharing Data

The key aspect of shared-memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages, as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

The shared-memory system calls allow processes to attach a segment of physical memory to their virtual address space.

Creating Shared Data (cont.)

shmget():
- creates a shared memory segment
- the return value is the shared memory ID

shmat():
- attaches the shared segment to the data segment of the calling process
- returns the starting address of the segment

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: CONFLICT.

- Reading the variable by different processes does not cause a conflict.
- But writing a new value does.

EX: Consider two processes, each of which is to add 1 to a shared data item x (x = x + 1):

Time    Process 1         Process 2
  1     read x            read x
  2     compute x + 1     compute x + 1
  3     write to x        write to x

Figure: conflict in accessing the shared variable x - both processes read x, add 1, and write the result back, so one of the two increments is lost.

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This Mechanism Mutual exclusion

lock

The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock

lock is variale contain value 0 1

lock = 1 process entered Critical Section

lock = 0 no process is in Critical Section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section

hellipcritical sectionhellip leave critical section

lock = 0

lock spin lock

Mechanism Busy waiting

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 28: Seminar Shared memory Programming

Sharing DataCreating Shared Data1

Accessing Shared Data2

Locks

Deadlock

Semaphores

Deadlock

Condition Variables

Language Constructs for Parallelism3

Dependency Analysis4

Shared Data in system with caches5

Sharing Data

Text in here

Text in here

Conditions for Deadlock

1 Mutual exclusion bull Resource can not be shared bull Requests are delayed until resource is released

2 Hold-and-wait bull Thread holds one resource while waits for another

3 No preemption bull Resources are released voluntarily after completion

4 Circular wait bull Circular dependencies exist in ldquowaits-forrdquo or ldquoresource-allocationrdquo graphs

ALL four conditions MUST hold

wwwthemegallerycom Company Logo

Handing Deadlock

Text

Text

Text

Txt

Deadlock prevention Deadlock avoidance

Deadlock detection and recovery

Ignore

wwwthemegallerycom Company Logo

Deadlock

88b n-processes deadlock 88a 2 processes dealock

R1R2hellipRn lagrave tagravei nguyecircn P1P2hellipPn lagrave quaacute trigravenh

Example

LOGO

Semaphore

wwwthemegallerycom Company Logo

sEMaPHoRE

A positive integer operated upon by 2 operations P amp V

The value is the number of the units of the resource which are free

A binary semaphore has value 0 or 1A general semaphore can take on

positive values other than 0 and 1

wwwthemegallerycom Company Logo

sEMaPHoRE

P amp V operations are performed indivisibly

P(s) waits until s is greater than 0 then decrements s by 1 allows the process to continue

V(s) increments s by 1 to release one of the waiting processes (if any)

wwwthemegallerycom Company Logo

sEMaPHoRE

The first process reach its P(s) operation or to be accepted will set the semaphore to 0

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore

wwwthemegallerycom Company Logo

sEMaPHoRE

When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section

Monitor

Disadvange of Semaphorebull Although semaphore provide a

convenient and effective mechanism for process synchronization using them incorrectly can result in timing error that are difficult to detect (deadlock) since these errors happen only if some particular execution sequences take place and these sequences do not always occur

bull Using incorrectly may be caused by an honest programming error or an uncooperative programmer

Shared memory multiprocessor

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur

bull Both processes are simultaneously active will cause a deadlock

Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type

bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only

method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events

conflict

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A type or abstract data type encapsulates private data with public method to operate on that data

bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor

bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()

procedure P2()

procedure Pn()

initialization_code ()

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Usage Monitor

bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly

bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization

bull Some addiontional synchronization ldquotailor moderdquo use conditional construct

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Conditional type

bull Declare conditional xybull Only operations that can be invoke on a

conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until

another process invokesxsignal()the process invoking this operation resumes exatly one

suspended processbull if no process is suspended xsignal() has

no effect

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor conditional type

Condition variables

• Condition variables allow threads to synchronize based upon the actual value of data.

• Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be a very time-consuming and unproductive exercise, since the thread would be continuously busy in this activity.

• A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t *cond);
  Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
  Initialises the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used: the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.

pthread_cond_t cond;   /* declare a condition variable */

Sharing Data

Pthread Condition Variables (cont.)

int pthread_cond_broadcast(pthread_cond_t *cond);
  Unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
  Destroys the given condition variable specified by cond; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
  Both block on a condition variable; the second additionally allows a timeout to be specified.

Sharing Data

Sequence for using condition variable - example

This simple example demonstrates the use of several Pthread condition variable routines. The main routine creates three threads: two of the threads perform work and update a count variable; the third thread waits until the count variable reaches a specified value.

Sharing Data

Sequence for using condition variable - example (cont)

main:

#include <pthread.h>

int count = 0;                                  /* global variable */
pthread_mutex_t count_mutex;                    /* DECLARE */
pthread_cond_t  count_threshold_cv;
…
pthread_mutex_init(&count_mutex, NULL);         /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
…
pthread_create(…);                              /* CREATE THREADS TO DO WORK */

Threads 2, 3:

void *inc_count(void *t)
{
    …
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT) {
            pthread_cond_signal(…);
            …
        }
        pthread_mutex_unlock(&count_mutex);
        /* do some work so the threads can alternate on the mutex lock */
        sleep(1);
    }
    pthread_exit(NULL);
}

Thread 1:

void *watch_count(void *t)
{
    pthread_mutex_lock(&count_mutex);
    if (count < COUNT_LIMIT)
        pthread_cond_wait(…);
    count += 125;
    pthread_mutex_unlock(…);
    pthread_exit(NULL);
}
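A compilable version of the sketch above, with the elided arguments filled in with assumed values (TCOUNT = 10, COUNT_LIMIT = 12, the += 125 as on the slide), the watcher's if replaced by a while — the robust pattern, since pthread_cond_wait may wake spuriously — and the sleep(1) dropped so the demo runs quickly:

```c
#include <pthread.h>

#define TCOUNT      10
#define COUNT_LIMIT 12

static int count = 0;
static pthread_mutex_t count_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  count_threshold_cv = PTHREAD_COND_INITIALIZER;

static void *inc_count(void *t) {
    (void)t;
    for (int i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)      /* threshold reached: wake the watcher */
            pthread_cond_signal(&count_threshold_cv);
        pthread_mutex_unlock(&count_mutex);
    }
    return NULL;
}

static void *watch_count(void *t) {
    (void)t;
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)        /* while, not if: guards spurious wakeups */
        pthread_cond_wait(&count_threshold_cv, &count_mutex);
    count += 125;                      /* as on the slide */
    pthread_mutex_unlock(&count_mutex);
    return NULL;
}

/* Returns the final count: 2 * TCOUNT + 125 = 145. */
int cond_var_demo(void) {
    pthread_t w, i1, i2;
    pthread_create(&w,  NULL, watch_count, NULL);
    pthread_create(&i1, NULL, inc_count, NULL);
    pthread_create(&i2, NULL, inc_count, NULL);
    pthread_join(i1, NULL);
    pthread_join(i2, NULL);
    pthread_join(w, NULL);
    return count;
}
```

Because the check and the wait are done under the mutex, the signal at count == 12 cannot be missed, so cond_var_demo() always returns 145.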

Sharing Data

The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass data in messages, as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget()

  Creates a shared memory segment.

  The return value is the shared memory ID.

shmat()

  Attaches the shared segment to the data segment of the calling process.

  Returns the starting address of the attached segment.
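A minimal sketch of the two calls (System V shared memory on POSIX systems, error handling pared down): the parent creates and attaches a segment, a forked child writes into it, and the parent reads the value directly through its own mapping — no message passing.

```c
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a segment, let a child process write 42 into it, and return
   what the parent reads back through its own attachment. */
int shm_demo(void) {
    int shmid = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
    if (shmid == -1)
        return -1;
    int *x = (int *)shmat(shmid, NULL, 0);   /* attach to our address space */
    *x = 0;
    pid_t pid = fork();
    if (pid == 0) {            /* child inherits the attachment */
        *x = 42;
        shmdt(x);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    int v = *x;                /* parent sees the child's write directly */
    shmdt(x);                  /* detach */
    shmctl(shmid, IPC_RMID, NULL);   /* mark the segment for removal */
    return v;
}
```

The attachment is inherited across fork(), which is what makes the int visible to both processes.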

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem CONFLICT

o Reading the variable by different processes does not cause a conflict.

o But writing new values does cause a conflict.

Example: consider two processes, each of which is to add 1 to a shared data item x.

Instruction        Process 1          Process 2
x = x + 1          read x             read x
                   compute x + 1      compute x + 1
                   write to x         write to x
                                                    (time ↓)

(Figure: conflict in accessing the shared variable x — both processes read x, compute x + 1, and write back, so one of the two increments is lost.)
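The lost update disappears once read–compute–write becomes one indivisible step. A sketch using C11 atomics (a mutex-based version works equally well; the name atomic_demo is illustrative):

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int x;

static void *add_many(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++)
        atomic_fetch_add(&x, 1);   /* read + add + write as one indivisible step */
    return NULL;
}

/* With a plain `int x` and `x = x + 1`, two threads would usually lose
   updates; with atomic_fetch_add the result is always exactly 200000. */
int atomic_demo(void) {
    atomic_store(&x, 0);
    pthread_t t1, t2;
    pthread_create(&t1, NULL, add_many, NULL);
    pthread_create(&t2, NULL, add_many, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return atomic_load(&x);
}
```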

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish the sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.

This mechanism is called mutual exclusion.

lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:

  lock = 1: a process has entered the critical section.

  lock = 0: no process is in the critical section.

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1);      /* do nothing: no operation in while loop */
lock = 1;               /* enter critical section */
  … critical section …
lock = 0;               /* leave critical section */

A lock that is waited on in this way is called a spin lock, and the mechanism is known as busy waiting.

In some cases it may be possible to deschedule the process from the processor and schedule another process instead, but this has costs:

  Overhead in saving and restoring process information.

  It is necessary to choose the best or highest-priority process to enter the critical section.

Process 1                              Process 2

while (lock == 1);  /* finds lock free */
lock = 1;                              while (lock == 1);  /* spins */
  critical section
lock = 0;                              lock = 1;
                                         critical section
                                       lock = 0;
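The trace above only works if testing and setting the lock happen as one indivisible operation; the plain while/lock = 1 pair is not atomic, so two processes can both see 0 and both enter. A working spin lock needs an atomic test-and-set, for example C11's atomic_flag — a sketch, with illustrative names:

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;
static long shared = 0;

static void spin_lock(void) {
    /* test-and-set returns the previous value: spin while it was 1 */
    while (atomic_flag_test_and_set_explicit(&lock_flag, memory_order_acquire))
        ;                              /* busy waiting */
}

static void spin_unlock(void) {
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);  /* lock = 0 */
}

static void *spin_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_lock();
        shared++;                      /* critical section */
        spin_unlock();
    }
    return NULL;
}

long spinlock_demo(void) {
    shared = 0;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, spin_worker, NULL);
    pthread_create(&t2, NULL, spin_worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return shared;
}
```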

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).

They resolve synchronization problems between threads.

A mutex is used to share resources among threads in an orderly fashion.

It provides mutual exclusion between threads.

Note:

A mutex is only used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.

wwwthemegallerycom

#include <pthread.h>     /* the header containing the mutex functions */

Declaring a variable: pthread_mutex_t mutex;

Static initialization:

  mutex = PTHREAD_MUTEX_INITIALIZER;

  mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function:

  int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);


The important functions:

  int pthread_mutex_lock(pthread_mutex_t *mutex);

  int pthread_mutex_unlock(pthread_mutex_t *mutex);

  int pthread_mutex_trylock(pthread_mutex_t *mutex);

  int pthread_mutex_destroy(pthread_mutex_t *mutex);


Dependency analysis

One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.


Example

Ex. 1:  forall (i = 0; i < 5; i++)
          a[i] = 0;

All instances can be executed simultaneously.

Ex. 2:  forall (i = 2; i < 6; i++) {
          x = i - 2*i + i*i;
          a[i] = a[x];
        }

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

- I(i) is the set of memory locations read by process P(i).

- O(j) is the set of memory locations altered (written) by process P(j).

If the three conditions

  I1 ∩ O2 = ∅
  I2 ∩ O1 = ∅
  O1 ∩ O2 = ∅

are all satisfied, the two processes can be executed concurrently.


Dependency analysis

Example 1. Suppose the two statements are (in C):

  a = x + y;
  b = x + z;

Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, and

  I1 ∩ O2 = I2 ∩ O1 = O1 ∩ O2 = ∅

so the two statements can be executed simultaneously.


Dependency analysis

Example 2. Suppose the two statements are (in C):

  a = x + y;
  b = a + b;

Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, and

  I2 ∩ O1 = {a} ≠ ∅

so the two statements cannot be executed simultaneously.
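The two worked examples can be checked mechanically by encoding each statement's read and write sets as bit masks, one bit per variable; intersection is then bitwise AND. A sketch (the helper name bernstein_ok is illustrative):

```c
/* One bit per variable: a, b, x, y, z. */
enum { A = 1 << 0, B = 1 << 1, X = 1 << 2, Y = 1 << 3, Z = 1 << 4 };

/* Bernstein's conditions: I1 ∩ O2 = I2 ∩ O1 = O1 ∩ O2 = ∅.
   With bit-mask sets, an empty intersection is a zero AND. */
int bernstein_ok(unsigned i1, unsigned o1, unsigned i2, unsigned o2) {
    return (i1 & o2) == 0 && (i2 & o1) == 0 && (o1 & o2) == 0;
}
```

For example 1 (a = x + y; b = x + z;), bernstein_ok(X|Y, A, X|Z, B) yields 1; for example 2 (a = x + y; b = a + b;), bernstein_ok(X|Y, A, A|B, B) yields 0, because I2 ∩ O1 = {a}.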

Language Constructs for Parallelism

Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared as

  shared int x;

In C/C++, a global variable is shared by all threads:

  int global_x;

The par construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

  par {
    s1;
    s2;
    …
    sn;
  }

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

  par {
    proc1();
    proc2();
    …
    procn();
  }

The forall construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

  forall (i = 0; i < n; i++) {
    s1;
    s2;
    …
    sm;
  }

which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, …, sm; each process uses a different value of i. For example,

  forall (i = 0; i < 5; i++)
    a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
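No mainstream C compiler has forall, but OpenMP's parallel for pragma expresses the same idea: every iteration of the annotated loop may run concurrently, each with its own i. A sketch (compile with -fopenmp; without it the pragma is ignored and the loop simply runs sequentially, with the same result):

```c
/* Clear a[0..n-1] concurrently, one independent iteration per element. */
void forall_clear(int *a, int n) {
    #pragma omp parallel for        /* the forall: iterations are independent */
    for (int i = 0; i < n; i++)
        a[i] = 0;
}

/* Returns the sum after clearing, so 0 means every element was cleared. */
int forall_demo(void) {
    int a[5] = {1, 2, 3, 4, 5};
    forall_clear(a, 5);
    int sum = 0;
    for (int i = 0; i < 5; i++)
        sum += a[i];
    return sum;
}
```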

Shared Data in Systems with Caches

Cache coherence protocols:

  In the update policy, copies of data in all caches are updated at the time one copy is altered.

  In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next makes reference to them.

Shared Data in Systems with Caches

False sharing:

  The key characteristic used here is that caches are organized in blocks of contiguous locations (cache lines).

  False sharing occurs when different parts of a block are required by different processors, but not the same bytes.


Shared Data in Systems with Caches

Solution for false sharing:

  The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

  The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.
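The compiler-layout fix can also be applied by hand: pad and align each processor's data to a full cache block so that no two processors ever write the same block. A sketch assuming 64-byte cache lines (a common size, but an assumption) and C11's _Alignas:

```c
#include <pthread.h>

#define LINE 64   /* assumed cache-block size in bytes */

/* Each counter sits alone in its own cache block, so the two threads
   never write the same block and no false sharing occurs. */
struct padded_counter {
    _Alignas(LINE) long v;
    char pad[LINE - sizeof(long)];
};

static struct padded_counter c[2];

static void *bump(void *arg) {
    struct padded_counter *pc = arg;
    for (int i = 0; i < 1000000; i++)
        pc->v++;                   /* private block: no lock needed */
    return NULL;
}

long padding_demo(void) {
    c[0].v = c[1].v = 0;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, &c[0]);
    pthread_create(&t2, NULL, bump, &c[1]);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return c[0].v + c[1].v;        /* correctness is unaffected; only speed is */
}
```

Without the padding the result would still be 2000000 — false sharing costs performance, not correctness, because each thread owns its own counter.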

Different types of memory architecture

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include the NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.

Central memory systems are also known as UMA (uniform memory access) systems


Central memory versus distributed memory

In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types:

  PVP: the parallel vector processor, also called a vector supercomputer.

  SMP: the symmetric multiprocessor.


Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.

Distributed-Memory architecture

In a NORMA machine:

  The node memories have separate address spaces.

  A node can't directly access remote memory.

  The only way to access remote data is by passing messages.


Distributed-Memory architecture

In an NCC-NUMA machine:

  A typical example is the Cray T3E.

  Besides the local memory, each node has a set of node-level registers called E-registers.

  Other NCC-NUMA systems may allow loading a remote value directly into a processor register.


Distributed-Memory architecture

In a COMA machine:

  All local memories are structured as caches (called COMA caches).

  Such a cache has a much larger capacity than the level-2 cache or the remote cache of a node.

  COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system


CC-NUMA versus COMA

CC-NUMA: main memory consists of all the local memories.

COMA: main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.


Characteristics of five distributed-memory architectures (table)


Page 29: Seminar Shared memory Programming

Sharing Data


Conditions for Deadlock

1. Mutual exclusion
  • The resource cannot be shared.
  • Requests are delayed until the resource is released.

2. Hold-and-wait
  • A thread holds one resource while it waits for another.

3. No preemption
  • Resources are released only voluntarily, after completion.

4. Circular wait
  • Circular dependencies exist in the "waits-for" or "resource-allocation" graph.

ALL four conditions MUST hold
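Since all four conditions must hold, breaking any one of them prevents deadlock. The easiest to break in practice is circular wait: impose a single global order on locks and make every thread acquire them in that order. A sketch with two pthread mutexes (the fixed m1-before-m2 order is the whole trick; names are illustrative):

```c
#include <pthread.h>

static pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;
static int moves = 0;

/* Every thread takes m1 then m2 -- never the reverse -- so a cycle in
   the waits-for graph (condition 4) can never form. */
static void *transfer(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        pthread_mutex_lock(&m1);
        pthread_mutex_lock(&m2);
        moves++;                   /* work needing both resources */
        pthread_mutex_unlock(&m2);
        pthread_mutex_unlock(&m1);
    }
    return NULL;
}

/* If one thread took m1-then-m2 and the other m2-then-m1, this could
   deadlock; with a global lock order it always completes. */
int lock_order_demo(void) {
    moves = 0;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, transfer, NULL);
    pthread_create(&t2, NULL, transfer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return moves;
}
```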


Handling Deadlock

Deadlock prevention

Deadlock avoidance

Deadlock detection and recovery

Ignoring the problem


Deadlock

Example: figure 8.8a shows a two-process deadlock; figure 8.8b shows an n-process deadlock. R1, R2, …, Rn are the resources; P1, P2, …, Pn are the processes.


Semaphore


A positive integer operated upon by two operations, P and V.

The value is the number of units of the resource which are free.

A binary semaphore has the value 0 or 1; a general semaphore can also take on positive values other than 0 and 1.

Semaphore

The P and V operations are performed indivisibly:

  P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.

  V(s): increments s by 1 to release one of the waiting processes (if any).

Semaphore

The first process to reach its P(s) operation, i.e. to be accepted, will set the semaphore to 0.

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.

Semaphore

When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the waiting processes is allowed to proceed into the critical section.
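On POSIX systems, P and V correspond to sem_wait and sem_post. A binary semaphore initialised to 1 used as a mutual-exclusion lock — a sketch using unnamed semaphores (note that on some platforms, e.g. macOS, sem_init is unavailable and named semaphores must be used instead):

```c
#include <pthread.h>
#include <semaphore.h>

static sem_t s;          /* binary semaphore: 1 = free, 0 = taken */
static int in_cs = 0;

static void *sem_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        sem_wait(&s);    /* P(s): wait until s > 0, then decrement */
        in_cs++;         /* critical section */
        sem_post(&s);    /* V(s): increment, releasing a waiter if any */
    }
    return NULL;
}

int semaphore_demo(void) {
    sem_init(&s, 0, 1);  /* initial value 1: one process may enter */
    in_cs = 0;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, sem_worker, NULL);
    pthread_create(&t2, NULL, sem_worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    sem_destroy(&s);
    return in_cs;
}
```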

Monitor

Disadvantage of semaphores:

• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (e.g. deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.

• Incorrect use may be caused by an honest programming error or by an uncooperative programmer.


Monitor

Correct code:
  …
  wait(mutex);
    critical section
  signal(mutex);
  …

Example of incorrect semaphore use:

Wrong code:
  …
  signal(mutex);
    critical section
  wait(mutex);
  …

This wrong code violates the mutual exclusion requirement.


Monitor

Correct code:
  …
  wait(mutex);
    critical section
  signal(mutex);
  …

Example of incorrect semaphore use:

Wrong code:
  …
  wait(mutex);
    critical section
  wait(mutex);
  …

This wrong code causes deadlock.


Monitor

• If the programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock will occur.

• Two processes that acquire two semaphores in opposite orders can deadlock when both are simultaneously active:

Process P1:
  …
  wait(S);
  wait(Q);
    critical section
  signal(S);
  signal(Q);

Process P2:
  …
  wait(Q);
  wait(S);
    critical section
  signal(Q);
  signal(S);


Monitor

• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.

• A monitor is an approach to synchronizing operations on a computer when a shared resource is used. A monitor includes:
  - a suite of procedures that provides the only method of accessing the shared resource;
  - mutual exclusion among those procedures;
  - the variables associated with the shared resource;
  - invariants that are assumed in order to avoid conflicting events.

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A type or abstract data type encapsulates private data with public method to operate on that data

bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor

bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()

procedure P2()

procedure Pn()

initialization_code ()

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Usage Monitor

bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly

bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization

bull Some addiontional synchronization ldquotailor moderdquo use conditional construct

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Conditional type

bull Declare conditional xybull Only operations that can be invoke on a

conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until

another process invokesxsignal()the process invoking this operation resumes exatly one

suspended processbull if no process is suspended xsignal() has

no effect

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor conditional type

Condition variables

Condition variablesAllow threads to synchronize based

upon the actual value of data Without condition variables the

programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity

Always used in conjunction with a mutex lock

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified

condition variable cond (if any threads are blocked on cond)

int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)

initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised

Pthread_cond_t cond Declare condition variable

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable

cond

int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in

effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined

int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)

int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)

block on a condition variable the second one alow to appoint timeout

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK

main

Thread 23

void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)

void watch_count(void t) pthread_mutex_lock(ampcount_mutex)

if (countltCOUNT_LIMIT)

pthread_cond_wait (hellip)

count += 125

pthread_mutex_unlock(hellip) pthread_exit(NULL)

Thread 1

Sharing Data

The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment

Creating Shared Data

Each process has its own virtual address space within the virtualmemory management system

Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space

Creating Shared Data (tt)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach shared segment to the data segmentof the calling process

Return the starting address of the data segment

Accessing Shared Data

Accessing Shared Data needs careful control if the data is everaltered by a process

Problem CONFLICT

o Reading the variable by different process does not cause CONFLICT

o But writing new value CONFLICT

EX Consider two processes each of which is to add 1 two a shareddata item x

Instruction Process 1 Process 2

x = x + 1 read x read x

Compute x + 1 Compute x + 1

write to x write to xtime

Conflict in accessing shared data

Shared variable x

+1 +1

read read

write write

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This Mechanism Mutual exclusion

lock

The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock

lock is variale contain value 0 1

lock = 1 process entered Critical Section

lock = 0 no process is in Critical Section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section

hellipcritical sectionhellip leave critical section

lock = 0

lock spin lock

Mechanism Busy waiting

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 30: Seminar Shared memory Programming

wwwthemegallerycom Company Logo

Handling Deadlock

Four general strategies for handling deadlock:
• Deadlock prevention
• Deadlock avoidance
• Deadlock detection and recovery
• Ignoring the problem

Deadlock

Example: (a) deadlock between two processes; (b) n-process deadlock. R1, R2, …, Rn are resources; P1, P2, …, Pn are processes.

Semaphore

Semaphore

A semaphore s is a non-negative integer operated upon by two operations, P and V. Its value is the number of units of the resource that are free. A binary semaphore takes only the values 0 and 1; a general (counting) semaphore can also take positive values greater than 1.

P and V operations are performed indivisibly:

P(s) waits until s is greater than 0, then decrements s by 1 and allows the process to continue.
V(s) increments s by 1 to release one of the waiting processes (if any).

The first process to reach its P(s) operation, or to be accepted, sets the semaphore to 0. Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore. When a process reaches its V(s) operation, it sets the semaphore s back to 1 and one of the waiting processes (if any) is allowed to proceed into the critical section.

Monitor

Disadvantages of semaphores:
• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (e.g. deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.
• Incorrect use may be caused by an honest programming error or by an uncooperative programmer.

Monitor

Example of incorrect semaphore use (1):

Right code:
    …
    Wait(mutex)
    critical section
    Signal(mutex)
    …

Wrong code:
    …
    Signal(mutex)
    critical section
    Wait(mutex)
    …

This wrong code violates the mutual exclusion condition.

Monitor

Example of incorrect semaphore use (2):

Wrong code:
    …
    Wait(mutex)
    critical section
    Wait(mutex)
    …

This wrong code causes deadlock.

Faculty of Computer Science & Engineering, Ho Chi Minh City University of Technology

Monitor

• If a programmer omits the wait() or the signal() around a critical section, or both, then either mutual exclusion is violated or a deadlock occurs.
• If both processes below are simultaneously active, they can deadlock, each holding one semaphore while waiting for the other:

Process P1:              Process P2:
    Wait(S)                  Wait(Q)
    Wait(Q)                  Wait(S)
    critical section         critical section
    Signal(S)                Signal(Q)
    Signal(Q)                Signal(S)

Monitor

• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
• A monitor is an approach to synchronizing operations on a shared resource. A monitor includes:
    - a suite of procedures that provide the only way to access the shared resource;
    - mutual exclusion among those procedures;
    - variables associated with the shared resource;
    - invariants that are assumed to hold, so that conflicting accesses are avoided.

Monitor

• A type, or abstract data type, encapsulates private data with public methods to operate on that data.
• A monitor type presents a set of programmer-defined operations that are provided with mutual exclusion within the monitor.
• The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.

Monitor

• The structure of a monitor type:

    monitor monitor_name
    {
        /* shared variable declarations */

        procedure P1(...) { ... }
        procedure P2(...) { ... }
        ...
        procedure Pn(...) { ... }

        initialization_code(...) { ... }
    }

Structure of a monitor (figure)

Usage Monitor

• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
• The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; some additional "tailor-made" synchronization mechanisms need to be defined.
• These additional "tailor-made" synchronization mechanisms use the condition construct.

Condition type

• Declaration: condition x, y;
• The only operations that can be invoked on a condition variable are wait() and signal():
    x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
    x.signal(): the process invoking this operation resumes exactly one suspended process.
• If no process is suspended, x.signal() has no effect.

Structure of a monitor with condition variables (figure)

Condition variables

Condition variables allow threads to synchronize based upon the actual value of data. Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be very time-consuming and unproductive, since the thread would be continuously busy in this activity. A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread condition variables

Declare a condition variable:

    pthread_cond_t cond;

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
    Initialises the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.

int pthread_cond_signal(pthread_cond_t *cond);
    Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).


Pthread condition variables (cont.)

int pthread_cond_broadcast(pthread_cond_t *cond);
    Unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
    Destroys the given condition variable; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
    Block on a condition variable; the second form allows a timeout to be specified.


Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sequence for using condition variables - example (cont.)

main:

    #include <pthread.h>

    int count = 0;                          /* global variable */
    pthread_mutex_t count_mutex;            /* DECLARE */
    pthread_cond_t count_threshold_cv;
    …
    pthread_mutex_init(&count_mutex, NULL); /* INITIALIZE */
    pthread_cond_init(&count_threshold_cv, NULL);
    …
    pthread_create(…);                      /* CREATE THREADS TO DO WORK */

Threads 2 and 3:

    void *inc_count(void *t)
    {
        …
        for (i = 0; i < TCOUNT; i++) {
            pthread_mutex_lock(&count_mutex);
            count++;
            if (count == COUNT_LIMIT)
                pthread_cond_signal(…);
            …
            pthread_mutex_unlock(&count_mutex);
            /* do some work so the threads can alternate on the mutex lock */
            sleep(1);
        }
        pthread_exit(NULL);
    }

Thread 1:

    void *watch_count(void *t)
    {
        pthread_mutex_lock(&count_mutex);
        while (count < COUNT_LIMIT)         /* re-check: wakeups can be spurious */
            pthread_cond_wait(…);
        count += 125;
        pthread_mutex_unlock(…);
        pthread_exit(NULL);
    }

The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass data in messages, as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget(): creates a shared memory segment; the return value is the shared memory ID.

shmat(): attaches the shared segment to the address space of the calling process; the return value is the starting address of the segment.

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

The problem: conflict.
o Reading the variable by different processes does not cause a conflict.
o But writing new values can conflict.

Example: consider two processes, each of which is to add 1 to a shared data item x.

Instruction x = x + 1:

    time      Process 1          Process 2
      |       read x             read x
      |       compute x + 1      compute x + 1
      v       write to x         write to x

Conflict in accessing shared data: both processes read the same old value of the shared variable x, each computes x + 1, and both write back, so one of the two increments is lost.

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time. This mechanism is known as mutual exclusion.

Lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:
    lock = 1: a process has entered the critical section.
    lock = 0: no process is in the critical section.

The lock operates much like a door lock. Suppose a process reaches a lock that is set, indicating that the process is excluded from the critical section; it then has to wait until it is allowed to enter the critical section.

    while (lock == 1) do_nothing;  /* no operation in while loop */
    lock = 1;                      /* enter critical section */
    …
    /* critical section */
    …
    lock = 0;                      /* leave critical section */

Such a lock is called a spin lock, and the mechanism is busy waiting.

In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead. This has overhead in saving and restoring process information, and it is then necessary to choose the best or highest-priority process to enter the critical section.

Process 1:                          Process 2:

    while (lock == 1) do_nothing;
    lock = 1;
    /* critical section */          while (lock == 1) do_nothing;  /* spins */
    lock = 0;
                                    lock = 1;
                                    /* critical section */
                                    lock = 0;

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes), which resolve synchronization problems between threads.

A mutex is used to share resources among threads in turn, providing mutual exclusion between the threads.

Note: a mutex is only used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.

#include <pthread.h>    /* header containing the mutex functions */

Declare the variable:

    pthread_mutex_t mutex;

Static initialization:

    mutex = PTHREAD_MUTEX_INITIALIZER;
    mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function:

    int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);

Important functions:

    int pthread_mutex_lock(pthread_mutex_t *mutex);
    int pthread_mutex_unlock(pthread_mutex_t *mutex);
    int pthread_mutex_trylock(pthread_mutex_t *mutex);
    int pthread_mutex_destroy(pthread_mutex_t *mutex);

Dependency analysis

One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.

Example

Ex1:

    forall (i = 0; i < 5; i++)
        a[i] = 0;

All instances can be executed simultaneously.

Ex2:

    forall (i = 2; i < 6; i++) {
        x = i - 2*i + i*i;
        a[i] = a[x];
    }

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered by process P(m).

If the following three conditions are all satisfied, the two processes can be executed concurrently:

    I1 ∩ O2 = ∅
    I2 ∩ O1 = ∅
    O1 ∩ O2 = ∅

Dependency analysis

Example 1. Suppose the two statements are (in C):

    a = x + y;
    b = x + z;

Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, so

    I1 ∩ O2 = ∅,  I2 ∩ O1 = ∅,  O1 ∩ O2 = ∅

and the two statements can be executed simultaneously.

Dependency analysis

Example 2. Suppose the two statements are (in C):

    a = x + y;
    b = a + b;

Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, so

    I2 ∩ O1 = {a} ≠ ∅

and the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared as

    shared int x;

(In C/C++, a global variable such as int x; plays this role, being visible to all threads.)

par construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

    par {
        S1;
        S2;
        ...
        Sn;
    }

The keyword par indicates that the statements in the body are to be executed concurrently. Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

    par {
        proc1();
        proc2();
        ...
        procn();
    }

forall construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

    forall (i = 0; i < n; i++) {
        S1;
        S2;
        ...
        Sm;
    }

which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, …, Sm. Each process uses a different value of i. For example,

    forall (i = 0; i < 5; i++)
        a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.

Shared Data in Systems with Caches

Shared Data in Systems with Caches

Cache coherence protocols:

In the update policy, copies of data in all caches are updated at the time one copy is altered.

In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next makes reference to the data.

Shared Data in Systems with Caches

False sharing:

The key characteristic used is that caches are organized in blocks of contiguous locations. False sharing occurs when different processors require different parts of the same block, but not the same bytes.


Shared Data in Systems with Caches

Solution for false sharing:

The compiler can alter the layout of the data stored in main memory, separating data altered by only one processor into different blocks. The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.


Different types of memory architecture

Central memory versus distributed memory

A parallel computer has either a central-memory or a distributed-memory architecture.

Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote-memory access) architectures.

Central memory systems are also known as UMA (uniform memory access) systems

Central memory versus distributed memory

In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time. UMA systems come in two types: PVP, the parallel vector processor (also called a vector supercomputer), and SMP, the symmetric multiprocessor.

Distributed-Memory architecture

A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory. Memories in other nodes are called remote memories.

Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.

Distributed-Memory architecture (figure)

Distributed-Memory architecture

In a NORMA machine:
- The node memories have separate address spaces.
- A node cannot directly access remote memory.
- The only way to access remote data is by passing messages.

Distributed-Memory architecture

In an NCC-NUMA machine (a typical example is the Cray T3E), each node has, besides the local memory, a set of node-level registers called E-registers. Other NCC-NUMA systems may allow loading a remote value directly into a processor register.

Distributed-Memory architecture

In a COMA machine, all local memories are structured as caches (called COMA caches). A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node. COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.

NCC-NUMA versus CC-NUMA and COMA

An NCC-NUMA system does not have hardware support for cache coherency, whereas CC-NUMA and COMA provide cache-coherency support in hardware. It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.

CC-NUMA versus COMA

In CC-NUMA, main memory consists of all the local memories. In COMA, main memory consists of all the COMA caches. All this extra complexity makes a COMA system more expensive to implement than a NUMA machine.

Characteristics of the five distributed-memory architectures (table)

LOGO

Page 31: Seminar Shared memory Programming

wwwthemegallerycom Company Logo

Deadlock

88b n-processes deadlock 88a 2 processes dealock

R1R2hellipRn lagrave tagravei nguyecircn P1P2hellipPn lagrave quaacute trigravenh

Example

LOGO

Semaphore

wwwthemegallerycom Company Logo

sEMaPHoRE

A positive integer operated upon by 2 operations P amp V

The value is the number of the units of the resource which are free

A binary semaphore has value 0 or 1A general semaphore can take on

positive values other than 0 and 1

wwwthemegallerycom Company Logo

sEMaPHoRE

P amp V operations are performed indivisibly

P(s) waits until s is greater than 0 then decrements s by 1 allows the process to continue

V(s) increments s by 1 to release one of the waiting processes (if any)

wwwthemegallerycom Company Logo

sEMaPHoRE

The first process reach its P(s) operation or to be accepted will set the semaphore to 0

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore

wwwthemegallerycom Company Logo

sEMaPHoRE

When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section

Monitor

Disadvange of Semaphorebull Although semaphore provide a

convenient and effective mechanism for process synchronization using them incorrectly can result in timing error that are difficult to detect (deadlock) since these errors happen only if some particular execution sequences take place and these sequences do not always occur

bull Using incorrectly may be caused by an honest programming error or an uncooperative programmer

Shared memory multiprocessor

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur

bull Both processes are simultaneously active will cause a deadlock

Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type

bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only

method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events

conflict

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A type or abstract data type encapsulates private data with public method to operate on that data

bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor

bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()

procedure P2()

procedure Pn()

initialization_code ()

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Usage Monitor

bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly

bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization

bull Some addiontional synchronization ldquotailor moderdquo use conditional construct

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Conditional type

bull Declare conditional xybull Only operations that can be invoke on a

conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until

another process invokesxsignal()the process invoking this operation resumes exatly one

suspended processbull if no process is suspended xsignal() has

no effect

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor conditional type

Condition variables

Condition variablesAllow threads to synchronize based

upon the actual value of data Without condition variables the

programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity

Always used in conjunction with a mutex lock

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified

condition variable cond (if any threads are blocked on cond)

int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)

initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised

Pthread_cond_t cond Declare condition variable

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable

cond

int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in

effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined

int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)

int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)

block on a condition variable the second one alow to appoint timeout

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK

main

Thread 23

void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)

void watch_count(void t) pthread_mutex_lock(ampcount_mutex)

if (countltCOUNT_LIMIT)

pthread_cond_wait (hellip)

count += 125

pthread_mutex_unlock(hellip) pthread_exit(NULL)

Thread 1

Sharing Data

The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment

Creating Shared Data

Each process has its own virtual address space within the virtualmemory management system

Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space

Creating Shared Data (tt)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach shared segment to the data segmentof the calling process

Return the starting address of the data segment

Accessing Shared Data

Accessing Shared Data needs careful control if the data is everaltered by a process

Problem CONFLICT

o Reading the variable by different process does not cause CONFLICT

o But writing new value CONFLICT

EX Consider two processes each of which is to add 1 two a shareddata item x

Instruction Process 1 Process 2

x = x + 1 read x read x

Compute x + 1 Compute x + 1

write to x write to xtime

Conflict in accessing shared data

Shared variable x

+1 +1

read read

write write

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This Mechanism Mutual exclusion

lock

The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock

lock is variale contain value 0 1

lock = 1 process entered Critical Section

lock = 0 no process is in Critical Section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section

hellipcritical sectionhellip leave critical section

lock = 0

lock spin lock

Mechanism Busy waiting

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

par { s1; s2; …; sn; }

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

par { proc1(); proc2(); …; procn(); }

forall construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) { s1; s2; …; sm; }

which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, …, sm. Each process uses a different value of i. For example:

forall (i = 0; i < 5; i++)
    a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.

Shared Data in Systems with Caches


Shared Data in Systems with Caches

Cache coherence protocols:

In the update policy, copies of data in all caches are updated at the time one copy is altered.

In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated. These copies are updated only when the associated processor makes a reference to the data.


Shared Data in Systems with Caches

False sharing:

The key characteristic here is that caches are organized in blocks of contiguous locations. False sharing occurs when different processors require different parts of the same block, but not the same bytes.




Shared Data in Systems with Caches

Solution for false sharing:

The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

The only way to avoid false sharing entirely would be to place each element in a different block, which would waste significant storage for a large array.


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central-memory or a distributed-memory architecture.

Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote-memory access) architectures.

Central-memory systems are also known as UMA (uniform memory access) systems.


Central memory versus distributed memory

In a UMA system, all memory locations are at an equal distance from any processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types: PVP, the parallel vector processor (also called a vector supercomputer), and SMP, the symmetric multiprocessor.


Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.




Distributed-Memory architecture

In a NORMA machine:
- The node memories have separate address spaces.
- A node can't directly access remote memory.
- The only way to access remote data is by passing messages.


Distributed-Memory architecture

In an NCC-NUMA machine:
- A typical example is the Cray T3E.
- Besides the local memory, each node has a set of node-level registers called E-registers.
- Other NCC-NUMA systems may allow loading a remote value directly into a processor register.


Distributed-Memory architecture

In a COMA machine:
- All local memories are structured as caches (called COMA caches).
- A COMA cache has much larger capacity than the level-2 cache or the remote cache of a node.
- COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA and COMA

An NCC-NUMA system doesn't have hardware support for cache coherence, but CC-NUMA and COMA provide cache-coherence support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.


CC-NUMA versus COMA

CC-NUMA: main memory consists of all the local memories.

COMA: main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.


Characteristics of five distributed-memory architectures


Semaphore


Semaphore

A semaphore is a positive integer (including zero) operated upon by two operations, P and V.

The value is the number of units of the resource that are free.

A binary semaphore has the value 0 or 1. A general semaphore can take on positive values other than 0 and 1.


Semaphore

The P and V operations are performed indivisibly.

P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue.

V(s): increments s by 1 to release one of the waiting processes (if any).


Semaphore

The first process to reach its P(s) operation (i.e., to be accepted) will set the semaphore to 0.

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.


Semaphore

When a process reaches its V(s) operation, it sets the semaphore s back to 1, and one of the waiting processes is allowed to proceed into the critical section.

Monitor

Disadvantage of semaphores:
• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (e.g. deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.
• Incorrect use may be caused by an honest programming error or an uncooperative programmer.



Monitor

Right code:
… wait(mutex); /* critical section */ signal(mutex); …

Example of incorrect semaphore use:

Wrong code:
… signal(mutex); /* critical section */ wait(mutex); …

This wrong code violates the mutual-exclusion requirement.


Monitor

Right code:
… wait(mutex); /* critical section */ signal(mutex); …

Example of incorrect semaphore use:

Wrong code:
… wait(mutex); /* critical section */ wait(mutex); …

This wrong code causes deadlock.

Faculty of Computer Science & Engineering - Ho Chi Minh City University of Technology

Monitor

• If the programmer omits the wait() or the signal() in a critical section, or both, then either mutual exclusion is violated or a deadlock will occur.
• If both processes are simultaneously active, the following acquisition orders cause a deadlock:

Process P1: … wait(S); wait(Q); /* critical section */ signal(S); signal(Q);

Process P2: … wait(Q); wait(S); /* critical section */ signal(Q); signal(S);


Monitor

• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
• A monitor is an approach to synchronizing operations on shared resources. A monitor includes:
- a suite of procedures that provides the only method of access to a shared resource;
- mutual exclusion among those procedures;
- the variables associated with the shared resource;
- invariants assumed in order to avoid conflicts.


Monitor

• An abstract data type encapsulates private data with public methods that operate on that data.
• The monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.
• The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.


Monitor

• The structure of a monitor type:

monitor monitor_name {
    /* shared variable declarations */
    procedure P1(...) { ... }
    procedure P2(...) { ... }
    ...
    procedure Pn(...) { ... }
    initialization_code(...) { ... }
}


Monitor structure


Using monitors

• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
• The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; additional "tailor-made" synchronization mechanisms need to be defined.
• Such tailor-made synchronization uses the condition construct.


Condition type

• Declaration: condition x, y;
• The only operations that can be invoked on a condition variable are wait() and signal():
x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
x.signal(): the process invoking this operation resumes exactly one suspended process.
• If no process is suspended, x.signal() has no effect.


Monitor structure with condition variables

Condition variables

Condition variables allow threads to synchronize based upon the actual value of data.

Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be very time-consuming and unproductive, since the thread would be continuously busy in this activity.

A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

pthread_cond_t cond; /* declare a condition variable */

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
Initialises the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition-variable attributes are used; the effect is the same as passing the address of a default condition-variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.

int pthread_cond_signal(pthread_cond_t *cond);
Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

Sharing Data

Pthread Condition Variables (cont.)

int pthread_cond_broadcast(pthread_cond_t *cond);
Unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
Destroys the given condition variable; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition-variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
Block on a condition variable; the second allows a timeout to be specified.

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

#include <pthread.h>
#include <unistd.h>

int count = 0;                         /* global var: DECLARE */
pthread_mutex_t count_mutex;
pthread_cond_t count_threshold_cv;

/* main: INITIALIZE, then CREATE THREADS TO DO WORK */
…
pthread_mutex_init(&count_mutex, NULL);
pthread_cond_init(&count_threshold_cv, NULL);
…
pthread_create(…);

/* Threads 2, 3 */
void *inc_count(void *t) {
    …
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(…);
        …
        pthread_mutex_unlock(&count_mutex);
        /* do some work so threads can alternate on the mutex lock */
        sleep(1);
    }
    pthread_exit(NULL);
}

/* Thread 1 */
void *watch_count(void *t) {
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)        /* re-check the condition after each wakeup */
        pthread_cond_wait(…);
    count += 125;
    pthread_mutex_unlock(…);
    pthread_exit(NULL);
}

Sharing Data

The key aspect of shared-memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages, as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual-memory management system.

Shared-memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget():
- Creates a shared memory segment.
- The return value is the shared memory ID.

shmat():
- Attaches the shared segment to the data segment of the calling process.
- Returns the starting address of the segment.

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: CONFLICT
o Reading the variable by different processes does not cause a conflict.
o But writing a new value does.

Example: Consider two processes, each of which is to add 1 to a shared data item, x.

Instruction: x = x + 1;

Time        Process 1        Process 2
|           read x           read x
|           compute x + 1    compute x + 1
v           write to x       write to x

Figure: conflict in accessing the shared variable x — both processes read x, add 1, and write the result back, so one of the two +1 updates is lost.

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.

This mechanism is called mutual exclusion.

lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:
lock = 1: a process has entered the critical section;
lock = 0: no process is in the critical section.

The lock operates much like a door lock: suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section. It then has to wait until it is allowed to enter:

while (lock == 1) ;  /* do nothing: no operation in while loop */
lock = 1;            /* enter critical section */
/* ... critical section ... */
lock = 0;            /* leave critical section */

A lock that is continually tested in this way is called a spin lock, and the mechanism is called busy waiting.

In some cases it may be possible to deschedule the process from the processor and schedule another process instead, but this incurs overhead in saving and restoring process information, and it may be necessary to choose the best or highest-priority process to enter the critical section.

Process 1:                          Process 2:
while (lock == 1) do_nothing;
lock = 1;
/* critical section */              while (lock == 1) do_nothing;
lock = 0;                           lock = 1;
                                    /* critical section */
                                    lock = 0;

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual-exclusion lock variables (mutexes). They resolve synchronization problems between threads.

A mutex is used to grant threads access to shared resources in turn, providing mutual exclusion between threads.

Note: a mutex can only be used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.


#include <pthread.h>  /* the library containing the mutex functions */

Declare a variable: pthread_mutex_t mutex;

Static initialization:
mutex = PTHREAD_MUTEX_INITIALIZER;
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function:
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);


Important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 33: Seminar Shared memory Programming

wwwthemegallerycom Company Logo

sEMaPHoRE

A positive integer operated upon by 2 operations P amp V

The value is the number of the units of the resource which are free

A binary semaphore has value 0 or 1A general semaphore can take on

positive values other than 0 and 1

wwwthemegallerycom Company Logo

sEMaPHoRE

P amp V operations are performed indivisibly

P(s) waits until s is greater than 0 then decrements s by 1 allows the process to continue

V(s) increments s by 1 to release one of the waiting processes (if any)

wwwthemegallerycom Company Logo

sEMaPHoRE

The first process reach its P(s) operation or to be accepted will set the semaphore to 0

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore

wwwthemegallerycom Company Logo

sEMaPHoRE

When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section

Monitor

Disadvange of Semaphorebull Although semaphore provide a

convenient and effective mechanism for process synchronization using them incorrectly can result in timing error that are difficult to detect (deadlock) since these errors happen only if some particular execution sequences take place and these sequences do not always occur

bull Using incorrectly may be caused by an honest programming error or an uncooperative programmer

Shared memory multiprocessor

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur

bull Both processes are simultaneously active will cause a deadlock

Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type

bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only

method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events

conflict

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A type or abstract data type encapsulates private data with public method to operate on that data

bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor

bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()

procedure P2()

procedure Pn()

initialization_code ()

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Usage Monitor

bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly

bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization

bull Some addiontional synchronization ldquotailor moderdquo use conditional construct

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Conditional type

bull Declare conditional xybull Only operations that can be invoke on a

conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until

another process invokesxsignal()the process invoking this operation resumes exatly one

suspended processbull if no process is suspended xsignal() has

no effect

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor conditional type

Condition variables

Condition variablesAllow threads to synchronize based

upon the actual value of data Without condition variables the

programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity

Always used in conjunction with a mutex lock

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified

condition variable cond (if any threads are blocked on cond)

int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)

initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised

Pthread_cond_t cond Declare condition variable

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable

cond

int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in

effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined

int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)

int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)

block on a condition variable the second one alow to appoint timeout

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK

main

Thread 23

void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)

void watch_count(void t) pthread_mutex_lock(ampcount_mutex)

if (countltCOUNT_LIMIT)

pthread_cond_wait (hellip)

count += 125

pthread_mutex_unlock(hellip) pthread_exit(NULL)

Thread 1

Sharing Data

The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment

Creating Shared Data

Each process has its own virtual address space within the virtualmemory management system

Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space

Creating Shared Data (tt)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach shared segment to the data segmentof the calling process

Return the starting address of the data segment

Accessing Shared Data

Accessing Shared Data needs careful control if the data is everaltered by a process

Problem CONFLICT

o Reading the variable by different process does not cause CONFLICT

o But writing new value CONFLICT

EX Consider two processes each of which is to add 1 two a shareddata item x

Instruction Process 1 Process 2

x = x + 1 read x read x

Compute x + 1 Compute x + 1

write to x write to xtime

Conflict in accessing shared data

Shared variable x

+1 +1

read read

write write

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This mechanism is called mutual exclusion

lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable that contains the value 0 or 1:

lock = 1: a process has entered the critical section

lock = 0: no process is in the critical section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing;   /* no operation in while loop */
lock = 1;                       /* enter critical section */

…critical section…

lock = 0;                       /* leave critical section */

A lock that a process spins on in this way is called a spin lock.
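As written, the test of lock and the assignment lock = 1 are two separate steps, so two processes can both see lock == 0 and enter together; a real spin lock needs an atomic test-and-set. A sketch using C11 atomics (the names spin_lock, spin_unlock and spin_demo are illustrative):

```c
#include <stdatomic.h>
#include <pthread.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long counter = 0;

/* atomic_flag_test_and_set() sets the flag and returns its old value
   in one indivisible step, closing the window in the naive version */
static void spin_lock(void)   { while (atomic_flag_test_and_set(&lock)) ; }
static void spin_unlock(void) { atomic_flag_clear(&lock); }

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_lock();
        counter++;              /* critical section */
        spin_unlock();
    }
    return NULL;
}

/* Run two contending threads; returns the final counter value */
long spin_demo(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;
}
```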

This mechanism is known as busy waiting

In some cases it may be possible to deschedule the process from the processor and schedule another process. However:

There is overhead in saving and restoring process information

It is necessary to choose the best or highest-priority process to enter the critical section

Process 1                          Process 2
while (lock == 1) do_nothing;
lock = 1;                          while (lock == 1) do_nothing;  /* spins */
…critical section…
lock = 0;
                                   lock = 1;
                                   …critical section…
                                   lock = 0;

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).

They resolve synchronization problems between threads.

A mutex is used to grant threads access to shared resources in turn

It provides mutual exclusion between threads

Note:

A mutex can only be used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes


#include <pthread.h>   /* header containing the mutex functions */

Declare a variable: pthread_mutex_t mutex;

Static initialization:

mutex = PTHREAD_MUTEX_INITIALIZER;

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function:

int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);


Important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);

int pthread_mutex_unlock(pthread_mutex_t *mutex);

int pthread_mutex_trylock(pthread_mutex_t *mutex);

int pthread_mutex_destroy(pthread_mutex_t *mutex);
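A small sketch of these routines in action; unlike pthread_mutex_lock(), pthread_mutex_trylock() returns immediately with EBUSY when the mutex is already locked (trylock_demo is an illustrative name):

```c
#include <errno.h>
#include <pthread.h>

/* Returns 1 if trylock correctly reported an already-held mutex as busy */
int trylock_demo(void) {
    pthread_mutex_t mutex;
    pthread_mutex_init(&mutex, NULL);       /* default attributes */

    pthread_mutex_lock(&mutex);             /* acquire the lock */
    int rc = pthread_mutex_trylock(&mutex); /* already locked: returns EBUSY
                                               instead of blocking */
    pthread_mutex_unlock(&mutex);

    pthread_mutex_destroy(&mutex);
    return rc == EBUSY;
}
```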


Dependency analysis

One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.


Example

Ex1: forall (i = 0; i < 5; i++) a[i] = 0;

All instances can be executed simultaneously

Ex2: forall (i = 2; i < 6; i++) {

x = i - 2*i + i*i; a[i] = a[x];

}

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

- I(n) is the set of memory locations read by process P(n)

- O(m) is the set of memory locations altered by process P(m)

If the following three conditions are all satisfied, the two processes can be executed concurrently:

I1 ∩ O2 = ∅    I2 ∩ O1 = ∅    O1 ∩ O2 = ∅


Dependency analysis

Example 1: Suppose the two statements are (in C):
a = x + y;
b = x + z;
------------------------------------------
I1 = (x, y)  I2 = (x, z)  O1 = (a)  O2 = (b)
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅
=> the two statements can be executed simultaneously


Dependency analysis

Example 2: Suppose the two statements are (in C):
a = x + y;
b = a + b;
------------------------------------------
I1 = (x, y)  I2 = (a, b)  O1 = (a)  O2 = (b)
I2 ∩ O1 = (a) ≠ ∅
=> the two statements cannot be executed simultaneously
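The set tests in these examples can be mechanized. A small illustrative helper (not from the slides) that checks Bernstein's conditions, with each read/write set written as a string of one-character variable names:

```c
#include <string.h>

/* 1 if the character sets a and b share no variable */
static int disjoint(const char *a, const char *b) {
    for (const char *p = a; *p; p++)
        if (strchr(b, *p))
            return 0;
    return 1;
}

/* Bernstein's conditions: I1∩O2, I2∩O1 and O1∩O2 must all be empty */
int bernstein_ok(const char *I1, const char *O1,
                 const char *I2, const char *O2) {
    return disjoint(I1, O2) && disjoint(I2, O1) && disjoint(O1, O2);
}
```

For Example 1, bernstein_ok("xy", "a", "xz", "b") is 1; for Example 2, bernstein_ok("xy", "a", "ab", "b") is 0 because I2 and O1 share a.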

Language Constructs for Parallelism

Shared data: In a parallel programming language supporting shared memory, a shared variable might be declared as:

shared int x;

(with C/C++, a global variable int x can serve this role)

par Construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

par { s1; s2; …; sn; }

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

par { proc1(); proc2(); …; procn(); }

forall Construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) { s1; s2; …; sm; }

which generates n processes, each consisting of the statements forming the body of the for loop, s1, s2, …, sm. Each process uses a different value of i. For example:

forall (i = 0; i < 5; i++) a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
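Pthreads has no forall, but its effect can be emulated by creating one thread per index value, each receiving its own copy of i; a sketch (forall_clear and clear_elem are illustrative names):

```c
#include <pthread.h>

#define N 5
static int a[N] = {9, 9, 9, 9, 9};

static void *clear_elem(void *arg) {
    int i = *(int *)arg;    /* each thread gets its own value of i */
    a[i] = 0;
    return NULL;
}

/* Emulate: forall (i = 0; i < N; i++) a[i] = 0; */
int forall_clear(void) {
    pthread_t tid[N];
    int idx[N];
    for (int i = 0; i < N; i++) {   /* start all loop bodies concurrently */
        idx[i] = i;
        pthread_create(&tid[i], NULL, clear_elem, &idx[i]);
    }
    for (int i = 0; i < N; i++)     /* wait for every body to complete */
        pthread_join(tid[i], NULL);
    int sum = 0;
    for (int i = 0; i < N; i++) sum += a[i];
    return sum;                     /* 0 when every element was cleared */
}
```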

Shared Data in Systems with Caches


Shared Data in Systems with Caches

Cache coherence protocols:

In the update policy, copies of data in all caches are updated at the time one copy is altered.

In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor makes a reference to the data.


Shared Data in Systems with Caches

False sharing:

The key characteristic here is that caches are organized in blocks of contiguous locations.

False sharing occurs when different processors need different parts of the same block, but not the same bytes.



Shared Data in Systems with Caches

Solution for false sharing:

The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

The only way to avoid false sharing completely would be to place each element in a different block, which would waste significant storage for a large array.
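Separating data into different blocks can be expressed directly in C11 by aligning each per-processor element to a full cache-line-sized block; the 64-byte line size below is an assumption, and padded_counter is an illustrative name:

```c
#include <stddef.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* Each counter occupies its own cache-line-sized block, so counters
   updated by different processors never share a cache line. */
typedef struct {
    _Alignas(CACHE_LINE) long value;
} padded_counter;

/* sizeof is padded up to the alignment, so each array element of
   padded_counter starts a new 64-byte block */
size_t padded_size(void) { return sizeof(padded_counter); }
```

The trade-off described above is visible here: each 8-byte counter now consumes 64 bytes.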


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote memory access) architectures

Central memory systems are also known as UMA (uniform memory access) systems


Central memory versus distributed memory

In a UMA system, all memory locations are at an equal distance from any processor, and all memory accesses take roughly the same amount of time

UMA systems come in two types:

PVP: the parallel vector processor, also called a vector supercomputer

SMP: the symmetric multiprocessor


Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA and COMA



Distributed-Memory architecture

In a NORMA machine:
- The node memories have separate address spaces
- A node can't directly access remote memory
- The only way to access remote data is by passing messages


Distributed-Memory architecture

In an NCC-NUMA machine:
- A typical example is the Cray T3E
- Besides the local memory, each node has a set of node-level registers called E-registers
- Other NCC-NUMA systems may allow loading a remote value directly into a processor register


Distributed-Memory architecture

In a COMA machine:
- All local memories are structured as caches (called COMA caches)
- Such a cache has a much larger capacity than the level-2 cache or the remote cache of a node
- COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories


NCC-NUMA versus CC-NUMA and COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache coherency support in hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system


CC-NUMA versus COMA

CC-NUMA: main memory consists of all the local memories

COMA: main memory consists of all the COMA caches

All this complexity makes a COMA system more expensive to implement than a NUMA machine


Characteristics of five Distributed-Memory architectures



Semaphore

P & V operations are performed indivisibly:

P(s): waits until s is greater than 0, then decrements s by 1 and allows the process to continue

V(s): increments s by 1 to release one of the waiting processes (if any)


Semaphore

The first process to reach its P(s) operation, and thus to be accepted, will set the semaphore to 0

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore


Semaphore

When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section
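P and V map onto sem_wait() and sem_post() in the POSIX semaphore API. A sketch using a binary semaphore (initial value 1) to protect the critical section; note that unnamed semaphores created with sem_init() are unavailable on macOS (sem_demo and worker are illustrative names):

```c
#include <semaphore.h>
#include <pthread.h>

static sem_t s;
static int count = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        sem_wait(&s);    /* P(s): wait until s > 0, then decrement */
        count++;         /* critical section */
        sem_post(&s);    /* V(s): increment s, releasing a waiter (if any) */
    }
    return NULL;
}

/* Two threads increment count under a binary semaphore */
int sem_demo(void) {
    pthread_t t1, t2;
    sem_init(&s, 0, 1);  /* unnamed semaphore, shared between threads,
                            initial value 1 (binary semaphore) */
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    sem_destroy(&s);
    return count;
}
```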

Monitor

Disadvantages of Semaphores

• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect (deadlock), since these errors happen only if particular execution sequences take place, and these sequences do not always occur.

• Incorrect use may be caused by an honest programming error or an uncooperative programmer.



Monitor

Example of incorrect semaphore use:

Right code:
… wait(mutex); critical section; signal(mutex); …

Wrong code:
… signal(mutex); critical section; wait(mutex); …

This incorrect code violates the mutual exclusion requirement.


Monitor

Right code:
… wait(mutex); critical section; signal(mutex); …

Wrong code:
… wait(mutex); critical section; wait(mutex); …

This incorrect code causes deadlock.

Faculty of Computer Science & Engineering - Ho Chi Minh City University of Technology

Monitor

• If the programmer omits the wait() or the signal() in a critical section, or both, then either mutual exclusion is violated or a deadlock will occur.

• Two processes that are active at the same time can also deadlock, for example by acquiring two semaphores in opposite orders:

Process P1: … wait(S); wait(Q); critical section; signal(S); signal(Q);

Process P2: … wait(Q); wait(S); critical section; signal(Q); signal(S);


Monitor

• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.

• A monitor is an approach to synchronizing operations on a shared resource. A monitor includes:
- a set of procedures that provide the only way to access the shared resource
- mutual exclusion between those procedures
- the variables associated with the shared resource
- invariants that are assumed to hold, in order to avoid race conditions


Monitor

• A type, or abstract data type, encapsulates private data with public methods to operate on that data.

• A monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.

• The monitor type also contains the declaration of the variables whose values define the state of an instance of that type, along with the bodies of the procedures or functions that operate on those variables.


Monitor

• A structure of the monitor type:

monitor monitor_name {
    /* shared variable declarations */
    procedure P1(…) { … }
    procedure P2(…) { … }
    …
    procedure Pn(…) { … }
    initialization_code(…) { … }
}


Structure of a monitor


Monitor usage

• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.

• The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; some additional, tailor-made synchronization mechanisms need to be defined.

• These additional tailor-made synchronization mechanisms are provided by the condition construct.


Condition type

• Declaration: condition x, y;

• The only operations that can be invoked on a condition variable are wait() and signal():

x.wait(): the process invoking this operation is suspended until another process invokes x.signal()

x.signal(): the process invoking this operation resumes exactly one suspended process

• If no process is suspended, x.signal() has no effect


Structure of a monitor with condition variables

Condition variables

Condition variables:

Allow threads to synchronize based upon the actual value of data.

Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be a very time-consuming and unproductive exercise, since the thread would be continuously busy in this activity.

A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t *cond);
unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond)

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
initialises the condition variable referenced by cond with attributes referenced by attr. If attr is NULL, the default condition variable attributes are used: the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.

pthread_cond_t cond;   /* declare a condition variable */

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t *cond);
unblocks all threads currently blocked on the specified condition variable cond

int pthread_cond_destroy(pthread_cond_t *cond);
destroys the given condition variable specified by cond; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
block on a condition variable; the second one allows a timeout to be specified

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

#include <pthread.h>

int count = 0;                            /* global var: DECLARE */
pthread_mutex_t count_mutex;
pthread_cond_t count_threshold_cv;
…
pthread_mutex_init(&count_mutex, NULL);   /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
…
pthread_create(…);   /* CREATE THREADS TO DO WORK */

main

Threads 2 & 3

void *inc_count(void *t) {
    …
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(…);
        …
        pthread_mutex_unlock(&count_mutex);
        /* Do some work so threads can alternate on the mutex lock */
        sleep(1);
    }
    pthread_exit(NULL);
}

void *watch_count(void *t) {
    pthread_mutex_lock(&count_mutex);

    if (count < COUNT_LIMIT)
        pthread_cond_wait(…);

    count += 125;

    pthread_mutex_unlock(…);
    pthread_exit(NULL);
}

Thread 1
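A completed, runnable version of the example, with the elided arguments filled in. The constant values follow the common tutorial setup; three details are changed from the slide so the result is deterministic: the sleep(1) calls are dropped, the watcher uses while rather than if (guarding against spurious wakeups), and the incrementers signal at and past the limit so a late waiter cannot miss the signal.

```c
#include <pthread.h>

#define NUM_THREADS  3
#define TCOUNT       10
#define COUNT_LIMIT  12

static int count = 0;
static pthread_mutex_t count_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  count_threshold_cv = PTHREAD_COND_INITIALIZER;

static void *inc_count(void *t) {
    (void)t;
    for (int i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count >= COUNT_LIMIT)   /* signal at and past the limit so a
                                       late-arriving waiter is not missed */
            pthread_cond_signal(&count_threshold_cv);
        pthread_mutex_unlock(&count_mutex);
    }
    pthread_exit(NULL);
}

static void *watch_count(void *t) {
    (void)t;
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)     /* while, not if: re-check the predicate
                                       after every wakeup */
        pthread_cond_wait(&count_threshold_cv, &count_mutex);
    count += 125;
    pthread_mutex_unlock(&count_mutex);
    pthread_exit(NULL);
}

/* Returns the final count: 2*TCOUNT increments plus the watcher's 125 */
int cond_demo(void) {
    pthread_t threads[NUM_THREADS];
    pthread_create(&threads[0], NULL, watch_count, NULL);
    pthread_create(&threads[1], NULL, inc_count, NULL);
    pthread_create(&threads[2], NULL, inc_count, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    return count;
}
```

The check-and-wait in watch_count happens atomically under count_mutex, which is exactly the coupling of mutex and condition variable that pthread_cond_wait() exists to provide.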

Sharing Data

The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment

Creating Shared Data

Each process has its own virtual address space within the virtualmemory management system

Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space

Creating Shared Data (tt)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach shared segment to the data segmentof the calling process

Return the starting address of the data segment

Accessing Shared Data

Accessing Shared Data needs careful control if the data is everaltered by a process

Problem CONFLICT

o Reading the variable by different process does not cause CONFLICT

o But writing new value CONFLICT

EX Consider two processes each of which is to add 1 two a shareddata item x

Instruction Process 1 Process 2

x = x + 1 read x read x

Compute x + 1 Compute x + 1

write to x write to xtime

Conflict in accessing shared data

Shared variable x

+1 +1

read read

write write

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This Mechanism Mutual exclusion

lock

The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock

lock is variale contain value 0 1

lock = 1 process entered Critical Section

lock = 0 no process is in Critical Section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section

hellipcritical sectionhellip leave critical section

lock = 0

lock spin lock

Mechanism Busy waiting

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 35: Seminar Shared memory Programming

wwwthemegallerycom Company Logo

sEMaPHoRE

The first process reach its P(s) operation or to be accepted will set the semaphore to 0

Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore

wwwthemegallerycom Company Logo

sEMaPHoRE

When the process reaches its V(s) operation it sets the semaphore s to 1 and one of the processes waiting is allowed to proceed into the critical section

Monitor

Disadvange of Semaphorebull Although semaphore provide a

convenient and effective mechanism for process synchronization using them incorrectly can result in timing error that are difficult to detect (deadlock) since these errors happen only if some particular execution sequences take place and these sequences do not always occur

bull Using incorrectly may be caused by an honest programming error or an uncooperative programmer

Shared memory multiprocessor

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur

bull Both processes are simultaneously active will cause a deadlock

Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type

bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only

method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events

conflict

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A type or abstract data type encapsulates private data with public method to operate on that data

bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor

bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()

procedure P2()

procedure Pn()

initialization_code ()

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Usage Monitor

bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly

bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization

bull Some addiontional synchronization ldquotailor moderdquo use conditional construct

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Conditional type

bull Declare conditional xybull Only operations that can be invoke on a

conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until

another process invokesxsignal()the process invoking this operation resumes exatly one

suspended processbull if no process is suspended xsignal() has

no effect

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor conditional type

Condition variables

Condition variablesAllow threads to synchronize based

upon the actual value of data Without condition variables the

programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity

Always used in conjunction with a mutex lock

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified

condition variable cond (if any threads are blocked on cond)

int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)

initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised

Pthread_cond_t cond Declare condition variable

Sharing Data

Pthread Condition Variables (cont.)

int pthread_cond_broadcast(pthread_cond_t *cond);
unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
destroys the given condition variable; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
block on a condition variable; the second form also allows a timeout to be specified.

Sharing Data

Sequence for using condition variables - example

This simple example demonstrates several Pthread condition variable routines. The main routine creates three threads: two of the threads perform work and update a count variable; the third thread waits until the count variable reaches a specified value.

Sharing Data

Sequence for using condition variables - example (cont.)

main:

  #include <pthread.h>

  int count = 0;                                /* global, shared */
  pthread_mutex_t count_mutex;                  /* DECLARE */
  pthread_cond_t  count_threshold_cv;
  ...
  pthread_mutex_init(&count_mutex, NULL);       /* INITIALIZE */
  pthread_cond_init(&count_threshold_cv, NULL);
  ...
  pthread_create(...);                          /* CREATE THREADS TO DO WORK */

Threads 2 and 3:

  void *inc_count(void *t) {
      ...
      for (i = 0; i < TCOUNT; i++) {
          pthread_mutex_lock(&count_mutex);
          count++;
          if (count == COUNT_LIMIT)
              pthread_cond_signal( ... );
          pthread_mutex_unlock(&count_mutex);
          sleep(1);   /* do some work so threads can alternate on the mutex */
      }
      pthread_exit(NULL);
  }

Thread 1:

  void *watch_count(void *t) {
      pthread_mutex_lock(&count_mutex);
      if (count < COUNT_LIMIT)
          pthread_cond_wait( ... );
      count += 125;
      pthread_mutex_unlock( ... );
      pthread_exit(NULL);
  }

Sharing Data

The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass data in messages as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared-memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget()
  Creates a shared memory segment.
  The return value is the shared memory ID.

shmat()
  Attaches the shared segment to the address space of the calling process.
  Returns the starting address at which the segment is attached.

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: CONFLICT
  o Reading a variable from different processes does not cause a conflict.
  o But writing a new value does.

Example: consider two processes, each of which is to add 1 to a shared data item x.

Instruction    Process 1        Process 2
x = x + 1      read x           read x
               compute x + 1    compute x + 1
               write to x       write to x
                                            (time →)

Conflict in accessing shared data: both processes read the old value of the shared variable x, each computes x + 1, and each writes its result back, so x ends up incremented once instead of twice.

The problem of accessing shared data can be generalized by considering shared resources.

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.

This mechanism is called mutual exclusion.

Lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:
  lock = 1: a process has entered the critical section;
  lock = 0: no process is in the critical section.

The lock operates much like a door lock: suppose a process reaches a lock that is set, indicating that the process is excluded from the critical section. It then has to wait until it is allowed to enter the critical section.

while (lock == 1) ;   /* do nothing: spin until the lock is free */
lock = 1;             /* enter critical section */
  ... critical section ...
lock = 0;             /* leave critical section */

A lock that spins in this way is called a spin lock, and the mechanism is known as busy waiting.

In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead, but this has drawbacks:
  overhead in saving and restoring process information;
  it becomes necessary to choose the best or highest-priority process to enter the critical section.

Process 1:
  while (lock == 1) ;   /* do nothing */
  lock = 1;
  /* critical section */
  lock = 0;

Process 2:
  while (lock == 1) ;   /* spins while Process 1 holds the lock */
  lock = 1;
  /* critical section */
  lock = 0;

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes), which resolve synchronization problems between threads:
  a mutex is used to grant threads access to a shared resource in turn;
  it provides mutual exclusion between threads.

Note: a mutex is used to synchronize threads within the same process; it cannot (by default) synchronize threads belonging to different processes.

www.themegallery.com

#include <pthread.h>   /* header declaring the mutex functions */

Declare a variable:
  pthread_mutex_t mutex;

Static initialization:
  mutex = PTHREAD_MUTEX_INITIALIZER;
  mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;   /* recursive variant */

Initialization by function:
  int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);

The important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);

Dependency Analysis

One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.

Example

Ex1:  forall (i = 0; i < 5; i++)
          a[i] = 0;

All instances of the body can be executed simultaneously.

Ex2:  forall (i = 2; i < 6; i++) {
          x = i - 2*i + i*i;
          a[i] = a[x];
      }

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's Conditions

I(n) is the set of memory locations read by process P(n).
O(m) is the set of memory locations altered by process P(m).

If the following three conditions are all satisfied, the two processes can be executed concurrently:

  I1 ∩ O2 = ∅
  I2 ∩ O1 = ∅
  O1 ∩ O2 = ∅

Dependency analysis

Example 1: Suppose the two statements are (in C):
  a = x + y;
  b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, so
  I1 ∩ O2 = ∅,  I2 ∩ O1 = ∅,  O1 ∩ O2 = ∅
and the two statements can be executed simultaneously.

Dependency analysis

Example 2: Suppose the two statements are (in C):
  a = x + y;
  b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, so
  I2 ∩ O1 = {a} ≠ ∅
and the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data: in a parallel programming language supporting shared memory, a variable might be declared as

  shared int x;

(In C/C++, an ordinary global declaration such as int x; gives every thread access to x.)

par Construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

  par {
      S1; S2; ...; Sn;
  }

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

  par {
      proc1(); proc2(); ...; procn();
  }

forall Construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

  forall (i = 0; i < n; i++) {
      S1; S2; ...; Sm;
  }

which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, ..., Sm. Each process uses a different value of i. For example,

  forall (i = 0; i < 5; i++)
      a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.

Shared Data in Systems with Caches

Shared Data in Systems with Caches

Cache coherence protocols:
  In the update policy, copies of data in all caches are updated at the time one copy is altered.
  In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are only updated when the associated processor next references the data.

Shared Data in Systems with Caches

False sharing: the key characteristic here is that caches are organized in blocks of contiguous locations. Different processors may require different parts of the same block — not the same bytes — yet the whole block is treated as shared by the coherence protocol.


Shared Data in Systems with Caches

Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks. For a large array, however, the only way to avoid false sharing entirely would be to place each element in a different block, which would waste significant storage.

Different Types of Memory Architecture

Central memory versus distributed memory

A parallel computer has either a central-memory or a distributed-memory architecture. Central-memory systems are also known as UMA (uniform memory access) systems. Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.

Central memory versus distributed memory

In a UMA system, all memory locations are an equal distance from any processor, and all memory accesses take roughly the same amount of time. UMA systems come in two types:
  PVP — the parallel vector processor, also called a vector supercomputer;
  SMP — the symmetric multiprocessor.

Distributed-Memory Architecture

A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory. Memories in other nodes are called remote memories.

Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.

Distributed-Memory Architecture

In a NORMA machine:
  The node memories have separate address spaces.
  A node can't directly access remote memory.
  The only way to access remote data is by passing messages.

Distributed-Memory Architecture

In an NCC-NUMA machine (a typical example is the Cray T3E), each node has, besides its local memory, a set of node-level registers called E-registers. Other NCC-NUMA systems may allow loading a remote value directly into a processor register.

Distributed-Memory Architecture

In a COMA machine, all local memories are structured as caches (called COMA caches). Such a cache has much larger capacity than the level-2 cache or the remote cache of a node. COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.

NCC-NUMA versus CC-NUMA and COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, whereas CC-NUMA and COMA provide cache-coherency support in hardware. It is therefore easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.

CC-NUMA versus COMA

In CC-NUMA, main memory consists of all the local memories. In COMA, main memory consists of all the COMA caches. All this extra complexity makes a COMA system more expensive to implement than a CC-NUMA machine.

Characteristics of five distributed-memory architectures (table)


Semaphore

When a process reaches its V(s) operation, it sets the semaphore s to 1, and one of the processes waiting on s is allowed to proceed into the critical section.

Monitor

Disadvantages of semaphores:
• Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors (e.g. deadlock) that are difficult to detect, since these errors happen only if particular execution sequences take place, and these sequences do not always occur.
• Incorrect use may be caused by an honest programming error or by an uncooperative programmer.

Monitor

Example of wrong semaphore use:

Right code:
  ...
  wait(mutex);
  /* critical section */
  signal(mutex);
  ...

Wrong code (signal before wait):
  ...
  signal(mutex);
  /* critical section */
  wait(mutex);
  ...

This wrong code violates the mutual exclusion condition.

Monitor

Right code:
  ...
  wait(mutex);
  /* critical section */
  signal(mutex);
  ...

Wrong code (wait twice, no signal):
  ...
  wait(mutex);
  /* critical section */
  wait(mutex);
  ...

This wrong code causes deadlock.

Monitor

• If the programmer omits the wait() or the signal() in a critical section, or both, either mutual exclusion is violated or a deadlock occurs.
• When both processes below are active at the same time they can deadlock: each acquires one semaphore and then waits forever for the other (note that the two processes take the semaphores in opposite order).

Process P1:
  ...
  wait(S);
  wait(Q);
  /* critical section */
  signal(S);
  signal(Q);

Process P2:
  ...
  wait(Q);
  wait(S);
  /* critical section */
  signal(Q);
  signal(S);

Monitor

• To deal with such errors, researchers have developed high-level language constructs; one fundamental high-level synchronization construct is the monitor type.
• A monitor is an approach to synchronizing operations on a shared resource. A monitor includes:
  - a set of procedures that provide the only way to access the shared resource;
  - mutual exclusion among those procedures;
  - the variables associated with the shared resource;
  - invariants that must hold, to avoid conflicts.

Monitor

• A type, or abstract data type, encapsulates private data with public methods to operate on that data.
• A monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor.
• The monitor type also contains the declarations of variables whose values define the state of an instance of that type, along with the bodies of procedures or functions that operate on those variables.

Monitor

• The structure of a monitor type:

  monitor monitor_name {
      /* shared variable declarations */

      procedure P1 (...) { ... }
      procedure P2 (...) { ... }
      ...
      procedure Pn (...) { ... }

      initialization_code (...) { ... }
  }

Usage of a Monitor

• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.
• A monitor as defined so far is not sufficiently powerful to model some synchronization schemes; additional, tailor-made synchronization mechanisms need to be defined.
• Such tailor-made synchronization uses the condition construct.

Conditional type

bull Declare conditional xybull Only operations that can be invoke on a

conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until

another process invokesxsignal()the process invoking this operation resumes exatly one

suspended processbull if no process is suspended xsignal() has

no effect

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor conditional type

Condition variables

Condition variablesAllow threads to synchronize based

upon the actual value of data Without condition variables the

programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity

Always used in conjunction with a mutex lock

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified

condition variable cond (if any threads are blocked on cond)

int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)

initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised

Pthread_cond_t cond Declare condition variable

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable

cond

int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in

effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined

int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)

int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)

block on a condition variable the second one alow to appoint timeout

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK

main

Thread 23

void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)

void watch_count(void t) pthread_mutex_lock(ampcount_mutex)

if (countltCOUNT_LIMIT)

pthread_cond_wait (hellip)

count += 125

pthread_mutex_unlock(hellip) pthread_exit(NULL)

Thread 1

Sharing Data

The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment

Creating Shared Data

Each process has its own virtual address space within the virtualmemory management system

Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space

Creating Shared Data (tt)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach shared segment to the data segmentof the calling process

Return the starting address of the data segment

Accessing Shared Data

Accessing Shared Data needs careful control if the data is everaltered by a process

Problem CONFLICT

o Reading the variable by different process does not cause CONFLICT

o But writing new value CONFLICT

EX Consider two processes each of which is to add 1 two a shareddata item x

Instruction Process 1 Process 2

x = x + 1 read x read x

Compute x + 1 Compute x + 1

write to x write to xtime

Conflict in accessing shared data

Shared variable x

+1 +1

read read

write write

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This Mechanism Mutual exclusion

lock

The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock

lock is variale contain value 0 1

lock = 1 process entered Critical Section

lock = 0 no process is in Critical Section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section

hellipcritical sectionhellip leave critical section

lock = 0

lock spin lock

Mechanism Busy waiting

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 37: Seminar Shared memory Programming

Monitor

Disadvange of Semaphorebull Although semaphore provide a

convenient and effective mechanism for process synchronization using them incorrectly can result in timing error that are difficult to detect (deadlock) since these errors happen only if some particular execution sequences take place and these sequences do not always occur

bull Using incorrectly may be caused by an honest programming error or an uncooperative programmer

Shared memory multiprocessor

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur

bull Both processes are simultaneously active will cause a deadlock

Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type

bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only

method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events

conflict

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A type or abstract data type encapsulates private data with public method to operate on that data

bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor

bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()

procedure P2()

procedure Pn()

initialization_code ()

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Using the Monitor

• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.

• The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; some additional "tailor-made" synchronization mechanisms need to be defined.

• Such additional tailor-made synchronization uses the condition construct.

Condition type

• Declaration: condition x, y;

• The only operations that can be invoked on a condition variable are wait() and signal():

  x.wait(): the process invoking this operation is suspended until another process invokes x.signal().

  x.signal(): the process invoking this operation resumes exactly one suspended process.

• If no process is suspended, x.signal() has no effect.

Structure of a monitor with condition variables (figure)

Condition variables

Condition variables:

• Allow threads to synchronize based upon the actual value of data.

• Without condition variables, the programmer would need threads to poll continually (possibly in a critical section) to check whether the condition is met. This is very time-consuming and unproductive, since the thread stays continuously busy in this activity.

• A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

pthread_cond_t cond;  /* declare a condition variable */

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
Initialises the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the condition variable becomes initialised.

int pthread_cond_signal(pthread_cond_t *cond);
Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

Sharing Data

Pthread Condition Variables (cont.)

int pthread_cond_broadcast(pthread_cond_t *cond);
Unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
Destroys the given condition variable; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
Block on a condition variable; the timed variant additionally allows a timeout to be specified.

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines. The main routine creates three threads. Two of the threads perform work and update a count variable; the third thread waits until the count variable reaches a specified value.

Sharing Data

Sequence for using condition variable - example (cont)

/* main */
#include <pthread.h>

int count = 0;                       /* global, shared by all threads */
pthread_mutex_t count_mutex;
pthread_cond_t  count_threshold_cv;
...
pthread_mutex_init(&count_mutex, NULL);          /* initialize */
pthread_cond_init(&count_threshold_cv, NULL);
...
pthread_create(...);                 /* create threads to do the work */

/* threads 2 and 3 */
void *inc_count(void *t)
{
    ...
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(...);
        ...
        pthread_mutex_unlock(&count_mutex);
        /* do some work so the threads can alternate on the mutex lock */
        sleep(1);
    }
    pthread_exit(NULL);
}

/* thread 1 */
void *watch_count(void *t)
{
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)      /* re-check: wakeups can be spurious */
        pthread_cond_wait(...);
    count += 125;
    pthread_mutex_unlock(...);
    pthread_exit(NULL);
}

Sharing Data

The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared-memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget():
- Creates a shared memory segment.
- Its return value is the shared memory ID.

shmat():
- Attaches the shared segment to the data segment of the calling process.
- Returns the starting address of the segment.

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: CONFLICT

o Reading the variable by different processes does not cause a conflict.

o But writing a new value does.

Example: consider two processes, each of which is to add 1 to a shared data item x.

Instruction    Process 1        Process 2
x = x + 1      read x           read x
               compute x + 1    compute x + 1
               write to x       write to x
(time runs downward)

Figure: conflict in accessing shared variable x. Both processes read x, add 1, and write the result back, so one of the two increments is lost.

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This mechanism is known as mutual exclusion.

lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:

lock = 1: a process has entered the critical section

lock = 0: no process is in the critical section

The lock operates much like a door lock.

Suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section. It now has to wait until it is allowed to enter:

while (lock == 1) ;   /* do nothing: no operation in while loop */
lock = 1;             /* enter critical section */
  ...critical section...
lock = 0;             /* leave critical section */

A lock that is waited on in this way is called a spin lock, and the mechanism is called busy waiting.

In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead. This incurs overhead in saving and restoring process information, and it then becomes necessary to choose the best or highest-priority process to enter the critical section.

Process 1: spins in while (lock == 1) do_nothing; until the lock is free, sets lock = 1, executes its critical section, then sets lock = 0.

Process 2: meanwhile spins in while (lock == 1) do_nothing; once Process 1 releases the lock, it sets lock = 1, enters its own critical section, and finally sets lock = 0.

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).

They resolve synchronization problems among threads:

A mutex is used to share resources among threads in an orderly fashion.

It provides mutual exclusion between threads.

Note: a mutex can only be used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.

www.themegallery.com

#include <pthread.h>   /* header containing the functions used for mutexes */

Declare a variable: pthread_mutex_t mutex;

Initialize statically:

mutex = PTHREAD_MUTEX_INITIALIZER;

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Or initialize with a function:

int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);

The important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);

int pthread_mutex_unlock(pthread_mutex_t *mutex);

int pthread_mutex_trylock(pthread_mutex_t *mutex);

int pthread_mutex_destroy(pthread_mutex_t *mutex);

Dependency analysis

One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.


Example

Ex. 1: forall (i = 0; i < 5; i++) a[i] = 0;

All instances can be executed simultaneously.

Ex. 2: forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered by process P(m).

If the following three conditions are all satisfied, the two processes can be executed concurrently:

I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅


Dependency analysis

Example 1: Suppose the two statements are (in C):
a = x + y;
b = x + z;
Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, so
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅
- the two statements can be executed simultaneously.


Dependency analysis

Example 2: Suppose the two statements are (in C):
a = x + y;
b = a + b;
Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}; here I2 ∩ O1 = {a} ≠ ∅
- the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared as

shared int x;

whereas in C/C++ a global declaration such as int x; already makes the variable accessible to all threads.

par Construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

par { s1; s2; ...; sn; }

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

par { proc1(); proc2(); ...; procn(); }

forall Construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) { s1; s2; ...; sm; }

which generates n processes, each consisting of the statements forming the body of the loop, s1; s2; ...; sm. Each process uses a different value of i. For example,

forall (i = 0; i < 5; i++) a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.

Shared Data in Systems with Caches


Shared Data in Systems with Caches

Cache coherence protocols:

In the update policy, copies of data in all caches are updated at the time one copy is altered.

In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated. These copies are updated only when the associated processor next makes reference to the data.


Shared Data in Systems with Caches

False sharing:

The key characteristic involved is that caches are organized in blocks of contiguous locations.

False sharing occurs when different processors need different parts of the same block, but not the same bytes.


Shared Data in Systems with Caches

A solution for false sharing is for the compiler to alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central-memory or a distributed-memory architecture.

Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.

Central memory systems are also known as UMA (uniform memory access) systems


Central memory versus distributed memory

In a UMA system, all memory locations are at an equal distance from any processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types:

PVP: the parallel vector processor, also called a vector supercomputer.

SMP: the symmetric multiprocessor.


Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.


Distributed-Memory architecture

In a NORMA machine:

- The node memories have separate address spaces.
- A node can't directly access remote memory.
- The only way to access remote data is by passing messages.


Distributed-Memory architecture

In an NCC-NUMA machine (a typical example is the Cray T3E):

- Besides the local memory, each node has a set of node-level registers called E-registers.
- Other NCC-NUMA systems may allow loading a remote value directly into a processor register.


Distributed-Memory architecture

In a COMA machine:

- All local memories are structured as caches (called COMA caches).
- Such a cache has a much larger capacity than the level-2 cache or the remote cache of a node.
- COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA and COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, whereas CC-NUMA and COMA provide cache coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.


CC-NUMA versus COMA

CC-NUMA: main memory consists of all the local memories.

COMA: main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.


Characteristics of the five distributed-memory architectures (table)


Page 38: Seminar Shared memory Programming

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipSignal(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra vi phạm điều kiện loại trừ lẫn nhau

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur

bull Both processes are simultaneously active will cause a deadlock

Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type

bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only

method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events

conflict

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A type or abstract data type encapsulates private data with public method to operate on that data

bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor

bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()

procedure P2()

procedure Pn()

initialization_code ()

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Usage Monitor

bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly

bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization

bull Some addiontional synchronization ldquotailor moderdquo use conditional construct

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Conditional type

bull Declare conditional xybull Only operations that can be invoke on a

conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until

another process invokesxsignal()the process invoking this operation resumes exatly one

suspended processbull if no process is suspended xsignal() has

no effect

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor conditional type

Condition variables

Condition variablesAllow threads to synchronize based

upon the actual value of data Without condition variables the

programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity

Always used in conjunction with a mutex lock

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified

condition variable cond (if any threads are blocked on cond)

int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)

initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised

Pthread_cond_t cond Declare condition variable

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable

cond

int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in

effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined

int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)

int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)

block on a condition variable the second one alow to appoint timeout

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK

main

Thread 23

void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)

void watch_count(void t) pthread_mutex_lock(ampcount_mutex)

if (countltCOUNT_LIMIT)

pthread_cond_wait (hellip)

count += 125

pthread_mutex_unlock(hellip) pthread_exit(NULL)

Thread 1

Sharing Data

The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment

Creating Shared Data

Each process has its own virtual address space within the virtualmemory management system

Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space

Creating Shared Data (tt)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach shared segment to the data segmentof the calling process

Return the starting address of the data segment

Accessing Shared Data

Accessing Shared Data needs careful control if the data is everaltered by a process

Problem CONFLICT

o Reading the variable by different process does not cause CONFLICT

o But writing new value CONFLICT

EX Consider two processes each of which is to add 1 two a shareddata item x

Instruction Process 1 Process 2

x = x + 1 read x read x

Compute x + 1 Compute x + 1

write to x write to xtime

Conflict in accessing shared data

Shared variable x

+1 +1

read read

write write

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This Mechanism Mutual exclusion

lock

The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock

lock is variale contain value 0 1

lock = 1 process entered Critical Section

lock = 0 no process is in Critical Section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section

hellipcritical sectionhellip leave critical section

lock = 0

lock spin lock

Mechanism Busy waiting

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 39: Seminar Shared memory Programming

wwwthemegallerycom Company Logo

Monitor

Right codehellipWait(mutex)critical sectionSignal(mutex)hellip

Example wrong Semaphore

Wrong codehellipwait(mutex)critical codeWait(mutex)hellipĐoạn matilde sai nagravey gacircy ra bế tắc

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur

bull Both processes are simultaneously active will cause a deadlock

Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type

bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only

method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events

conflict

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A type or abstract data type encapsulates private data with public method to operate on that data

bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor

bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables


Monitor

• The structure of a monitor type:

    monitor monitor_name
    {
        /* shared variable declarations */

        procedure P1(...) { ... }
        procedure P2(...) { ... }
        ...
        procedure Pn(...) { ... }

        initialization_code(...) { ... }
    }

Structure of a monitor

Usage of monitors

• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.

• The monitor as defined so far is not sufficiently powerful for modeling some synchronization schemes; some additional, tailor-made synchronization mechanisms need to be defined.

• These additional tailor-made synchronization mechanisms use the condition construct.


Conditional type

• Declaration: condition x, y;

• The only operations that can be invoked on a condition variable are wait() and signal():

  x.wait(): the process invoking this operation is suspended until another process invokes x.signal().

  x.signal(): the process invoking this operation resumes exactly one suspended process.

• If no process is suspended, x.signal() has no effect.

Structure of a monitor with condition variables

Condition variables

Condition variables allow threads to synchronize based upon the actual value of data.

Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be a very time-consuming and unproductive exercise, since the thread would be continuously busy in this activity.

A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

pthread_cond_t cond;   /* declare a condition variable */

int pthread_cond_signal(pthread_cond_t *cond);
This call unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
Initialises the condition variable referenced by cond with attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.

Sharing Data

Pthread Condition Variables (cont.)

int pthread_cond_broadcast(pthread_cond_t *cond);
This call unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
Destroys the given condition variable specified by cond; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
Both block on a condition variable; the second one allows a timeout to be specified.

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variables - example (cont.)

main:

    #include <pthread.h>

    int count = 0;                      /* global variable; DECLARE */
    pthread_mutex_t count_mutex;
    pthread_cond_t count_threshold_cv;
    …
    pthread_mutex_init(&count_mutex, NULL);        /* INITIALIZE */
    pthread_cond_init(&count_threshold_cv, NULL);
    …
    pthread_create(…);                  /* CREATE THREADS TO DO WORK */

Threads 2, 3:

    void *inc_count(void *t)
    {
        …
        for (i = 0; i < TCOUNT; i++) {
            pthread_mutex_lock(&count_mutex);
            count++;
            if (count == COUNT_LIMIT)
                pthread_cond_signal(…);
            …
            pthread_mutex_unlock(&count_mutex);
            /* Do some work so threads can alternate on the mutex lock */
            sleep(1);
        }
        pthread_exit(NULL);
    }

Thread 1:

    void *watch_count(void *t)
    {
        pthread_mutex_lock(&count_mutex);
        if (count < COUNT_LIMIT)
            pthread_cond_wait(…);
        count += 125;
        pthread_mutex_unlock(…);
        pthread_exit(NULL);
    }
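The elided sketch above can be completed into a compilable program. This is a hedged reconstruction, not the original code: TCOUNT = 10 and COUNT_LIMIT = 12 are illustrative values, the sleep(1) calls are dropped so it runs instantly, and the watcher uses while rather than if to guard against spurious wakeups.

```c
#include <pthread.h>

#define TCOUNT      10   /* increments per worker (illustrative value) */
#define COUNT_LIMIT 12   /* threshold the watcher waits for            */

static int count = 0;
static pthread_mutex_t count_mutex        = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  count_threshold_cv = PTHREAD_COND_INITIALIZER;

static void *inc_count(void *t)
{
    (void)t;
    for (int i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)              /* threshold reached: wake watcher */
            pthread_cond_signal(&count_threshold_cv);
        pthread_mutex_unlock(&count_mutex);
    }
    return NULL;
}

static void *watch_count(void *t)
{
    (void)t;
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)                /* while, not if: spurious wakeups */
        pthread_cond_wait(&count_threshold_cv, &count_mutex);
    count += 125;
    pthread_mutex_unlock(&count_mutex);
    return NULL;
}

/* Returns the final count: 2*TCOUNT increments plus the watcher's 125. */
int run_demo(void)
{
    pthread_t w, t1, t2;
    count = 0;
    pthread_create(&w,  NULL, watch_count, NULL);
    pthread_create(&t1, NULL, inc_count, NULL);
    pthread_create(&t2, NULL, inc_count, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_join(w, NULL);
    return count;
}
```

Whatever the interleaving, the final count is 2*TCOUNT + 125: the while-guard means the watcher either sees the threshold already passed or waits for the one signal fired when count hits COUNT_LIMIT.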

Sharing Data

The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget()

Creates a shared memory segment.

Return value is the shared memory ID.

shmat()

Attaches the shared segment to the data segment of the calling process.

Returns the starting address of the attached segment.

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: CONFLICT

o Reading the variable by different processes does not cause conflict.

o But writing a new value does: CONFLICT.

Ex: Consider two processes, each of which is to add 1 to a shared data item, x.

Instruction: x = x + 1

    Process 1        Process 2
    read x           read x
    compute x + 1    compute x + 1
    write to x       write to x
                               (time runs downward)

Conflict in accessing shared data: starting from the same value of the shared variable x, both processes read x, each computes x + 1, and both write the result back, so one of the two increments is lost.
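Using the mutual-exclusion machinery introduced below (a Pthreads mutex; all names here are illustrative), the lost-update problem disappears: each read-compute-write of x = x + 1 becomes indivisible.

```c
#include <pthread.h>

#define N 100000

static int x;
static pthread_mutex_t x_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Each thread performs x = x + 1 N times inside a critical section,
   so the read / compute / write sequence can never interleave. */
static void *add_one_locked(void *arg)
{
    (void)arg;
    for (int i = 0; i < N; i++) {
        pthread_mutex_lock(&x_mutex);
        x = x + 1;
        pthread_mutex_unlock(&x_mutex);
    }
    return NULL;
}

/* Two "processes" (threads here) each add 1 to x N times; with the
   lock the result is always 2*N, while the unlocked version of the
   same loop may lose updates. */
int run_two_adders(void)
{
    pthread_t p1, p2;
    x = 0;
    pthread_create(&p1, NULL, add_one_locked, NULL);
    pthread_create(&p2, NULL, add_one_locked, NULL);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    return x;
}
```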

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This mechanism is called mutual exclusion.

Lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:

lock = 1: a process has entered the critical section.

lock = 0: no process is in the critical section.

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

    while (lock == 1) do_nothing;   /* no operation in while loop */
    lock = 1;                       /* enter critical section     */
    …
    critical section
    …
    lock = 0;                       /* leave critical section     */

Such a lock is called a spin lock.

This mechanism is known as busy waiting.
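The plain while (lock == 1) … lock = 1 sequence above is not actually safe, because the test and the set are two separate steps that can interleave between processes. A sketch of a safe spin lock using a C11 atomic test-and-set (an implementation choice made here, not something the slides prescribe):

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_flag lk = ATOMIC_FLAG_INIT;
static int counter = 0;

/* Spin until the test-and-set finds the flag clear: this is the
   "while (lock == 1) do_nothing; lock = 1;" pair done atomically. */
static void spin_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set(l))
        ;                       /* busy wait */
}

static void spin_unlock(atomic_flag *l)
{
    atomic_flag_clear(l);       /* lock = 0 */
}

static void *spin_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 50000; i++) {
        spin_lock(&lk);
        counter++;              /* critical section */
        spin_unlock(&lk);
    }
    return NULL;
}

/* Two threads contend on the spin lock; returns the final counter. */
int run_spin_demo(void)
{
    pthread_t a, b;
    counter = 0;
    pthread_create(&a, NULL, spin_worker, NULL);
    pthread_create(&b, NULL, spin_worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;
}
```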

In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead. This has costs:

Overhead in saving and restoring process information.

It is also necessary to choose the best or highest-priority process to enter the critical section.

Process 1:

    while (lock == 1) do_nothing;
    lock = 1;
    critical section
    lock = 0;

Process 2:

    while (lock == 1) do_nothing;   /* spins until Process 1 sets lock = 0 */
    lock = 1;
    critical section
    lock = 0;

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).

A mutex resolves synchronization problems between threads: it is used to grant threads access to shared resources in turn, providing mutual exclusion between them.

Note: these mutexes are used to synchronize threads within the same process; they do not synchronize threads belonging to different processes.


#include <pthread.h>   /* library containing the mutex functions */

Declare a variable:

    pthread_mutex_t mutex;

Static initialization:

    mutex = PTHREAD_MUTEX_INITIALIZER;
    mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function:

    int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);


Important functions:

    int pthread_mutex_lock(pthread_mutex_t *mutex);
    int pthread_mutex_unlock(pthread_mutex_t *mutex);
    int pthread_mutex_trylock(pthread_mutex_t *mutex);
    int pthread_mutex_destroy(pthread_mutex_t *mutex);
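A small exercise of these four routines (illustrative; it relies on pthread_mutex_trylock() returning EBUSY when the mutex is already held):

```c
#include <errno.h>
#include <pthread.h>

/* Exercise init, lock, trylock, unlock and destroy.
   Returns 1 if trylock behaved as expected in both states. */
int mutex_routines_demo(void)
{
    pthread_mutex_t m;
    pthread_mutex_init(&m, NULL);     /* attr NULL: default attributes */

    pthread_mutex_lock(&m);
    int busy = (pthread_mutex_trylock(&m) == EBUSY);  /* already held  */
    pthread_mutex_unlock(&m);

    int free_now = (pthread_mutex_trylock(&m) == 0);  /* now acquirable */
    pthread_mutex_unlock(&m);

    pthread_mutex_destroy(&m);
    return busy && free_now;
}
```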


Dependency analysis

One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.


Example

Ex1:

    forall (i = 0; i < 5; i++)
        a[i] = 0;

All instances can be executed simultaneously.

Ex2:

    forall (i = 2; i < 6; i++) {
        x = i - 2*i + i*i;
        a[i] = a[x];
    }

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

- I(n) is the set of memory locations read by process P(n).
- O(m) is the set of memory locations altered by process P(m).

If the three conditions below are all satisfied, the two processes can be executed concurrently:

    I1 ∩ O2 = ∅
    I2 ∩ O1 = ∅
    O1 ∩ O2 = ∅


Dependency analysis

Example 1: Suppose the two statements are (in C):

    a = x + y;
    b = x + z;

Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}.

I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅ → the two statements can be executed simultaneously.


Dependency analysis

Example 2: Suppose the two statements are (in C):

    a = x + y;
    b = a + b;

Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}.

I2 ∩ O1 = {a} ≠ ∅ → the two statements cannot be executed simultaneously.
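The two examples above can be checked mechanically. A sketch that encodes the read and write sets as bitmasks (the encoding over variable names a..z is an assumption made here for illustration):

```c
/* Map a variable name a..z to a bit in a mask (illustrative encoding). */
unsigned var(char c)
{
    return 1u << (unsigned)(c - 'a');
}

/* Bernstein's conditions: the two statements may run concurrently
   iff I1∩O2, I2∩O1 and O1∩O2 are all empty. */
int bernstein_ok(unsigned I1, unsigned O1, unsigned I2, unsigned O2)
{
    return (I1 & O2) == 0 && (I2 & O1) == 0 && (O1 & O2) == 0;
}
```

For Example 1 the function reports the statements independent; for Example 2 it reports the I2 ∩ O1 violation.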

Language Constructs for Parallelism

Shared data: In a parallel programming language supporting shared memory, a shared variable might be declared as:

    shared int x;

whereas an ordinary global declaration in C/C++ is simply:

    int x;

par Construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

    par {
        s1; s2; …; sn;
    }

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

    par {
        proc1(); proc2(); …; procn();
    }

forall Construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

    forall (i = 0; i < n; i++) {
        s1; s2; …; sm;
    }

which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, …, sm. Each process uses a different value of i. For example:

    forall (i = 0; i < 5; i++)
        a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
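Standard C has no forall, but its effect can be emulated by creating one thread per iteration, each receiving its own value of i (a sketch; one thread per iteration is for illustration, not efficiency):

```c
#include <pthread.h>

#define NPROC 5
static int a[NPROC];

/* Body of the forall: each created thread gets its own value of i. */
static void *forall_body(void *arg)
{
    int i = (int)(long)arg;
    a[i] = 0;                 /* the statements s1 ... sm for this instance */
    return NULL;
}

/* forall (i = 0; i < 5; i++) a[i] = 0;  emulated with one thread per i.
   Returns the sum of the array afterwards: 0 iff every element was cleared. */
int forall_clear_and_sum(void)
{
    pthread_t tid[NPROC];
    for (int i = 0; i < NPROC; i++)
        a[i] = -1;                                  /* mark as dirty first */
    for (int i = 0; i < NPROC; i++)
        pthread_create(&tid[i], NULL, forall_body, (void *)(long)i);
    for (int i = 0; i < NPROC; i++)                 /* implicit barrier at  */
        pthread_join(tid[i], NULL);                 /* the end of a forall  */
    int sum = 0;
    for (int i = 0; i < NPROC; i++)
        sum += a[i];
    return sum;
}
```

The joins play the role of the implicit barrier at the end of the forall: no statement after it runs until every instance of the body has finished.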

Shared Data in Systems with Caches

Shared Data in Systems with Caches

Cache coherence protocols:

In the update policy, copies of data in all caches are updated at the time one copy is altered.

In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated. These copies are only updated when the associated processor makes a reference to the data.


Shared Data in Systems with Caches

False sharing:

The key characteristic involved is that caches are organized in blocks of contiguous locations. False sharing occurs when different processors require different parts of the same block, but not the same bytes.


Shared Data in Systems with Caches

Solution for false sharing:

Have the compiler alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

The only way to avoid false sharing completely would be to place each element in a different block, which would create significant wastage of storage for a large array.
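The separation into different blocks can be done by padding each processor's data out to the block size. A sketch (the 64-byte block size and the struct name are assumptions for illustration):

```c
#include <stddef.h>

#define CACHE_LINE 64   /* typical cache block size; an assumption here */

/* A per-processor counter padded so that each one occupies its own
   cache block, mirroring the layout change described above. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

/* With padding, consecutive array elements can no longer share a block.
   Returns 1 if the two counters could still fall in one block, else 0. */
int counters_share_a_block(void)
{
    struct padded_counter c[2];
    ptrdiff_t gap = (char *)&c[1].value - (char *)&c[0].value;
    return gap < CACHE_LINE;
}
```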


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non-uniform memory access) and NORMA (no-remote memory access) architectures.

Central memory systems are also known as UMA (uniform memory access) systems


Central memory versus distributed memory

In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types:

PVP: the parallel vector processor, also called a vector supercomputer.

SMP: the symmetric multiprocessor.


Distributed-Memory architecture

A distributed memory computer contains multiple nodes, each having one or more processors and a local memory.

Memories in other nodes are called remote memories.

Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA and COMA.


Distributed-Memory architecture

In a NORMA machine:

The node memories have separate address spaces.

A node can't directly access remote memory.

The only way to access remote data is by passing messages.


Distributed-Memory architecture

In an NCC-NUMA machine:

A typical example is the Cray T3E.

Besides the local memory, each node has a set of node-level registers called E-registers.

Other NCC-NUMA systems may allow loading a remote value directly into a processor register.


Distributed-Memory architecture

In a COMA machine:

All local memories are structured as caches (called COMA caches).

Such a cache has a much larger capacity than the level-2 cache or the remote cache of a node.

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache-coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system


CC-NUMA versus COMA

CC-NUMA: Main memory consists of all the local memories.

COMA: Main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.

Characteristics of five Distributed-Memory architectures

LOGO

Page 40: Seminar Shared memory Programming

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull If programmer omits the wait() or the signal() in critical section or both either mutual exclusion is violated or a deadlock will occur

bull Both processes are simultaneously active will cause a deadlock

Process P1hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Process P2hellipWait(S)Wait(Q)critical sectionSignal(S)Signal(Q)

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type

bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only

method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events

conflict

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A type or abstract data type encapsulates private data with public method to operate on that data

bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor

bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()

procedure P2()

procedure Pn()

initialization_code ()

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Usage Monitor

bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly

bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization

bull Some addiontional synchronization ldquotailor moderdquo use conditional construct

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Conditional type

bull Declare conditional xybull Only operations that can be invoke on a

conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until

another process invokesxsignal()the process invoking this operation resumes exatly one

suspended processbull if no process is suspended xsignal() has

no effect

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor conditional type

Condition variables

Condition variablesAllow threads to synchronize based

upon the actual value of data Without condition variables the

programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity

Always used in conjunction with a mutex lock

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified

condition variable cond (if any threads are blocked on cond)

int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)

initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised

Pthread_cond_t cond Declare condition variable

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable

cond

int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in

effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined

int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)

int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)

block on a condition variable the second one alow to appoint timeout

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK

main

Thread 23

void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)

void watch_count(void t) pthread_mutex_lock(ampcount_mutex)

if (countltCOUNT_LIMIT)

pthread_cond_wait (hellip)

count += 125

pthread_mutex_unlock(hellip) pthread_exit(NULL)

Thread 1

Sharing Data

The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment

Creating Shared Data

Each process has its own virtual address space within the virtualmemory management system

Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space

Creating Shared Data (tt)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach shared segment to the data segmentof the calling process

Return the starting address of the data segment

Accessing Shared Data

Accessing Shared Data needs careful control if the data is everaltered by a process

Problem CONFLICT

o Reading the variable by different process does not cause CONFLICT

o But writing new value CONFLICT

EX Consider two processes each of which is to add 1 two a shareddata item x

Instruction Process 1 Process 2

x = x + 1 read x read x

Compute x + 1 Compute x + 1

write to x write to xtime

Conflict in accessing shared data

Shared variable x

+1 +1

read read

write write

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This Mechanism Mutual exclusion

lock

The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock

lock is variale contain value 0 1

lock = 1 process entered Critical Section

lock = 0 no process is in Critical Section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section

hellipcritical sectionhellip leave critical section

lock = 0

lock spin lock

Mechanism Busy waiting

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 41: Seminar Shared memory Programming

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull To deal with such errors researches have developed high-level language constructs one fundamental high-level synchronization construct ndash the monitor type

bull Monitor is an approach to synchronize operations on computer when use shared source Monitor includes A suit of procedure that provides the only

method to access a shared resource Key eliminate together Variables correlative shared source Some unchanged assume to avoid events

conflict

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A type or abstract data type encapsulates private data with public method to operate on that data

bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor

bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()

procedure P2()

procedure Pn()

initialization_code ()


Structure Monitor


Usage Monitor

• The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly.

• The monitor, as defined so far, is not sufficiently powerful for modeling some synchronization schemes. Some additional "tailor-made" synchronization mechanisms need to be defined.

• Such additional "tailor-made" synchronization uses the condition construct.


Condition type

• Declaration: condition x, y;

• The only operations that can be invoked on a condition variable are wait() and signal():
  x.wait(): the process invoking this operation is suspended until another process invokes x.signal().
  x.signal(): the process invoking this operation resumes exactly one suspended process.

• If no process is suspended, x.signal() has no effect.


Structure Monitor conditional type

Condition variables

Condition variables allow threads to synchronize based upon the actual value of data.

Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be a very time-consuming and unproductive exercise, since the thread would be continuously busy in this activity.

A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t *cond);
unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
initialises the condition variable referenced by cond with attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.

pthread_cond_t cond;
declares a condition variable.

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t *cond);
unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
destroys the given condition variable specified by cond; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);

int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);

Both block on a condition variable; the second one allows a timeout to be specified.

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

#include <pthread.h>
int count = 0;                           /* global variable: DECLARE */
pthread_mutex_t count_mutex;
pthread_cond_t count_threshold_cv;
…
pthread_mutex_init(&count_mutex, NULL);  /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
…
pthread_create(…);                       /* CREATE THREADS TO DO WORK */

main

Threads 2, 3

void *inc_count(void *t) {
    …
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT) {
            pthread_cond_signal(…);
            …
        }
        pthread_mutex_unlock(&count_mutex);
        /* Do some work so threads can alternate on the mutex lock */
        sleep(1);
    }
    pthread_exit(NULL);
}

void *watch_count(void *t) {
    pthread_mutex_lock(&count_mutex);
    if (count < COUNT_LIMIT)
        pthread_cond_wait(…);
    count += 125;
    pthread_mutex_unlock(…);
    pthread_exit(NULL);
}

Thread 1
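The fragments above can be assembled into a compact, runnable sketch. TCOUNT and COUNT_LIMIT are the illustrative values from the example, and a while loop replaces the if around pthread_cond_wait() to guard against spurious wakeups:

```c
#include <pthread.h>

/* Illustrative values, mirroring the slide's example */
#define TCOUNT 10
#define COUNT_LIMIT 12

static int count = 0;
static pthread_mutex_t count_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t count_threshold_cv = PTHREAD_COND_INITIALIZER;

static void *inc_count(void *t) {
    (void)t;
    for (int i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(&count_threshold_cv); /* wake the watcher */
        pthread_mutex_unlock(&count_mutex);
    }
    return NULL;
}

static void *watch_count(void *t) {
    (void)t;
    pthread_mutex_lock(&count_mutex);
    /* 'while' rather than 'if' guards against spurious wakeups */
    while (count < COUNT_LIMIT)
        pthread_cond_wait(&count_threshold_cv, &count_mutex);
    count += 125;
    pthread_mutex_unlock(&count_mutex);
    return NULL;
}

/* Runs two incrementers and one watcher; returns the final count. */
int run_count_demo(void) {
    pthread_t t1, t2, t3;
    count = 0;
    pthread_create(&t3, NULL, watch_count, NULL);
    pthread_create(&t1, NULL, inc_count, NULL);
    pthread_create(&t2, NULL, inc_count, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_join(t3, NULL);
    return count; /* 2 * TCOUNT increments plus 125 from the watcher */
}
```

Whatever the interleaving, the result is deterministic: the two incrementers add 2 × TCOUNT = 20, and the watcher adds 125, so the final count is 145.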

Sharing Data

The key aspect of shared-memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget()

Creates a shared memory segment.

Return value is the shared memory ID.

shmat()

Attaches the shared segment to the data segment of the calling process.

Returns the starting address of the data segment.
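A minimal sketch of these two calls on a POSIX system (shm_roundtrip, the segment size, and the value 42 are illustrative): the parent creates and attaches a segment, a forked child writes through the shared mapping, and the parent reads the value back.

```c
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a private segment, let a child write into it, read the value
   back in the parent, then detach and remove the segment. */
int shm_roundtrip(void) {
    int shmid = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
    if (shmid == -1)
        return -1;
    int *shared = (int *)shmat(shmid, NULL, 0); /* attach to our space */
    if (shared == (void *)-1)
        return -1;
    *shared = 0;
    pid_t pid = fork();
    if (pid == 0) {                /* child: write through its mapping */
        *shared = 42;
        _exit(0);
    }
    waitpid(pid, NULL, 0);         /* parent: wait, then read the write */
    int value = *shared;
    shmdt(shared);                 /* detach the segment */
    shmctl(shmid, IPC_RMID, NULL); /* mark the segment for removal */
    return value;
}
```

The child inherits the attachment across fork(), which is why its write is visible to the parent.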

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: CONFLICT

o Reading the variable by different processes does not cause a conflict.

o But writing a new value can cause a conflict.

Ex: Consider two processes, each of which is to add 1 to a shared data item, x.

Instruction        Process 1          Process 2
x = x + 1          read x             read x
                   compute x + 1      compute x + 1
                   write to x         write to x
(time increases downward)

Conflict in accessing shared data

(Diagram: both processes read the shared variable x, each computes +1, and both write the result back, so one of the two increments can be lost.)

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This mechanism is called mutual exclusion.

lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:

lock = 1: a process has entered the critical section

lock = 0: no process is in the critical section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) ;   /* do nothing: no operation in while loop */
lock = 1;             /* enter critical section */
  … critical section …
lock = 0;             /* leave critical section */

A lock that spins in this way is called a spin lock.

This mechanism is known as busy waiting.

In some cases it may be possible to deschedule the process from the processor and schedule another process:

there is overhead in saving and restoring process information;

it is necessary to choose the best or highest-priority process to enter the critical section.

Process 1:                        Process 2:

while (lock == 1) do_nothing;     while (lock == 1) do_nothing;
lock = 1;                         lock = 1;
/* critical section */            /* critical section */
lock = 0;                         lock = 0;
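A subtlety of the code above: the while test and the lock = 1 assignment are separate steps, so two processes can both pass the test before either sets the lock. Real spin locks therefore rely on an atomic test-and-set. A sketch using C11 atomics and Pthreads (the function names and iteration counts are illustrative):

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long shared_total = 0;

static void spin_lock(void) {
    /* atomically set the flag and learn its previous value in one step */
    while (atomic_flag_test_and_set(&lock))
        ;  /* busy-wait (spin) */
}

static void spin_unlock(void) {
    atomic_flag_clear(&lock);
}

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_lock();
        shared_total++;            /* critical section */
        spin_unlock();
    }
    return NULL;
}

/* Runs 4 workers; the total is exact because the atomic test-and-set
   makes the critical section mutually exclusive. */
long run_spin_demo(void) {
    pthread_t t[4];
    shared_total = 0;
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return shared_total;
}
```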

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).

They resolve synchronization problems between threads.

A mutex is used to share resources among threads in an orderly way.

It provides a means of mutual exclusion between threads.

Note:

A mutex is only used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.


#include <pthread.h>  /* header that declares the mutex functions */

Declare a variable: pthread_mutex_t mutex;

Static initialization:

mutex = PTHREAD_MUTEX_INITIALIZER;

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function:

int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);


Important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);

int pthread_mutex_unlock(pthread_mutex_t *mutex);

int pthread_mutex_trylock(pthread_mutex_t *mutex);

int pthread_mutex_destroy(pthread_mutex_t *mutex);
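A small runnable sketch of these routines (run_mutex_demo and its parameters are illustrative names, not part of the Pthreads API): several threads increment a shared counter, and the mutex makes the result exact.

```c
#include <pthread.h>

static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
static int counter = 0;

static void *add_n(void *arg) {
    int n = *(int *)arg;
    for (int i = 0; i < n; i++) {
        pthread_mutex_lock(&count_lock);   /* enter critical section */
        counter++;
        pthread_mutex_unlock(&count_lock); /* leave critical section */
    }
    return NULL;
}

/* Returns the final counter after nthreads threads each add iters. */
int run_mutex_demo(int nthreads, int iters) {
    pthread_t tid[8];
    counter = 0;
    for (int i = 0; i < nthreads && i < 8; i++)
        pthread_create(&tid[i], NULL, add_n, &iters);
    for (int i = 0; i < nthreads && i < 8; i++)
        pthread_join(tid[i], NULL);
    return counter;
}
```

Without the lock/unlock pair, the count++ would race exactly as in the x = x + 1 example earlier and the total would usually come out short.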


Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together. Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.


Example

Ex1: forall (i = 0; i < 5; i++) a[i] = 0;

All instances can be executed simultaneously

Ex2: forall (i = 2; i < 6; i++) {
    x = i - 2*i + i*i;
    a[i] = a[x];
}

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

- I(n) is the set of memory locations read by process P(n).

- O(m) is the set of memory locations altered by process P(m).

If the three conditions below are all satisfied, the two processes can be executed concurrently:

I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅


Dependency analysis

Example 1: Suppose the two statements are (in C):
a = x + y;
b = x + z;
------------------------------------------
I1 = (x, y), I2 = (x, z), O1 = (a), O2 = (b)
------------------------------------------
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅
→ the two statements can be executed simultaneously.


Dependency analysis

Example 2: Suppose the two statements are (in C):
a = x + y;
b = a + b;
------------------------------------------
I1 = (x, y), I2 = (a, b), O1 = (a), O2 = (b)
------------------------------------------
I2 ∩ O1 ≠ ∅
→ the two statements cannot be executed simultaneously.
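The two examples can be checked mechanically by encoding each read/write set as a bitmask (the encoding and the function name are illustrative):

```c
/* Encode each variable as one bit: x=1, y=2, z=4, a=8, b=16. */
enum { X = 1, Y = 2, Z = 4, A = 8, B = 16 };

/* Bernstein's conditions: two statements are independent iff
   I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, and O1 ∩ O2 = ∅.
   With bitmask sets, intersection is bitwise AND. */
int bernstein_independent(unsigned i1, unsigned o1,
                          unsigned i2, unsigned o2) {
    return (i1 & o2) == 0 && (i2 & o1) == 0 && (o1 & o2) == 0;
}
```

For Example 1, bernstein_independent(X|Y, A, X|Z, B) yields true; for Example 2, bernstein_independent(X|Y, A, A|B, B) yields false because I2 ∩ O1 = {a}.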

Language Constructs for Parallelism

Shared data: In a parallel programming language supporting shared memory, a variable might be declared as:

shared int x;

With C++:

int global x;

par Construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

par { s1; s2; …; sn; }

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads could be specified by listing the routines that are to be executed concurrently:

par { proc1; proc2; …; procn; }

forall Construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) { s1; s2; …; sm; }

which generates n processes, each consisting of the statements forming the body of the for loop, s1, s2, …, sm. Each process uses a different value of i.

forall (i = 0; i < 5; i++) a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
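Standard C has no forall, but its effect can be sketched with one Pthread per iteration (forall_clear is an illustrative name); this is legal here precisely because the loop-body instances are independent:

```c
#include <pthread.h>

static int a[5];

static void *clear_elem(void *arg) {
    long i = (long)arg;  /* the iteration index for this "process" */
    a[i] = 0;
    return NULL;
}

/* Sketch of: forall (i = 0; i < 5; i++) a[i] = 0;
   Returns the sum of a[] afterwards (0 if every element was cleared). */
int forall_clear(void) {
    pthread_t t[5];
    for (long i = 0; i < 5; i++)
        a[i] = (int)i + 1;               /* make elements nonzero first */
    for (long i = 0; i < 5; i++)
        pthread_create(&t[i], NULL, clear_elem, (void *)i);
    for (long i = 0; i < 5; i++)
        pthread_join(t[i], NULL);
    int sum = 0;
    for (int i = 0; i < 5; i++)
        sum += a[i];
    return sum;
}
```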


Shared Data in Systems with Caches


Shared Data in Systems with Caches

Cache coherence protocols:

In the update policy, copies of data in all caches are updated at the time one copy is altered.

In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated. These copies are only updated when the associated processor makes a reference to them.


Shared Data in Systems with Caches

False sharing: the key characteristic involved is that caches are organized in blocks of contiguous locations.

Different parts of a block may be required by different processors, but not the same bytes.


Shared Data in Systems with Caches

Solution for false sharing: have the compiler alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

The only way to avoid false sharing entirely would be to place each element in a different block, which would create significant wastage of storage for a large array.
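The block-layout fix can be sketched in C11 by aligning each processor's data to its own cache block (the 64-byte block size is a typical assumption, not universal, and the struct names are illustrative):

```c
#include <stdalign.h>
#include <stddef.h>

/* Assume 64-byte cache blocks (typical, but architecture-dependent). */
#define BLOCK_SIZE 64

/* Unpadded: both counters land in one cache block, so two processors
   updating different counters still invalidate each other's copy. */
struct shared_counters {
    long p0_count;
    long p1_count;
};

/* Padded: each counter is aligned to its own block, so updates by
   different processors touch different blocks and no false sharing
   occurs. */
struct padded_counters {
    alignas(BLOCK_SIZE) long p0_count;
    alignas(BLOCK_SIZE) long p1_count;
};
```

The offsetof macro confirms the layout: in the unpadded struct the two counters are a few bytes apart, while in the padded struct they are at least one full block apart.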


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central-memory or a distributed-memory architecture.

Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote-memory access) architectures.

Central-memory systems are also known as UMA (uniform memory access) systems.


Central memory versus distributed memory

In a UMA system, all memory locations are an equal distance from any processor, and all memory accesses take roughly the same amount of time.

UMA systems are of two types:

PVP: the parallel vector processor, also called a vector supercomputer.

SMP: the symmetric multiprocessor.


Distributed-Memory architecture

A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory.

Memories in other nodes are called remote memories.

Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.


Distributed-Memory architecture


Distributed-Memory architecture

In a NORMA machine:

the node memories have separate address spaces;

a node can't directly access remote memory;

the only way to access remote data is by passing messages.


Distributed-Memory architecture

In an NCC-NUMA machine (a typical example is the Cray T3E):

besides the local memory, each node has a set of node-level registers called E-registers;

other NCC-NUMA systems may allow loading a remote value directly into a processor register.


Distributed-Memory architecture

In a COMA machine:

all local memories are structured as caches (called COMA caches);

a COMA cache has much larger capacity than the level-2 cache or the remote cache of a node;

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA and COMA

An NCC-NUMA system doesn't have hardware support for cache coherence, but CC-NUMA and COMA provide cache-coherence support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system


CC-NUMA versus COMA

CC-NUMA: main memory consists of all the local memories.

COMA: main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.


Characteristics of five distributed-memory architectures

LOGO

Page 42: Seminar Shared memory Programming

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A type or abstract data type encapsulates private data with public method to operate on that data

bull Monitor type presents a set of programmer-defined operations that are provided mutual exclusion within monitor

bull The monitor type also contains the declaration of variables whose values define the state of an instance of that type along with the bodies of procedures or function that operate on those variables

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()

procedure P2()

procedure Pn()

initialization_code ()

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Usage Monitor

bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly

bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization

bull Some addiontional synchronization ldquotailor moderdquo use conditional construct

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Conditional type

bull Declare conditional xybull Only operations that can be invoke on a

conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until

another process invokesxsignal()the process invoking this operation resumes exatly one

suspended processbull if no process is suspended xsignal() has

no effect

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor conditional type

Condition variables

Condition variablesAllow threads to synchronize based

upon the actual value of data Without condition variables the

programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity

Always used in conjunction with a mutex lock

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified

condition variable cond (if any threads are blocked on cond)

int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)

initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised

Pthread_cond_t cond Declare condition variable

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable

cond

int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in

effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined

int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)

int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)

block on a condition variable the second one alow to appoint timeout

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK

main

Thread 23

void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)

void watch_count(void t) pthread_mutex_lock(ampcount_mutex)

if (countltCOUNT_LIMIT)

pthread_cond_wait (hellip)

count += 125

pthread_mutex_unlock(hellip) pthread_exit(NULL)

Thread 1

Sharing Data

The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment

Creating Shared Data

Each process has its own virtual address space within the virtualmemory management system

Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space

Creating Shared Data (tt)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach shared segment to the data segmentof the calling process

Return the starting address of the data segment

Accessing Shared Data

Accessing Shared Data needs careful control if the data is everaltered by a process

Problem CONFLICT

o Reading the variable by different process does not cause CONFLICT

o But writing new value CONFLICT

EX Consider two processes each of which is to add 1 two a shareddata item x

Instruction Process 1 Process 2

x = x + 1 read x read x

Compute x + 1 Compute x + 1

write to x write to xtime

Conflict in accessing shared data

Shared variable x

+1 +1

read read

write write

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This Mechanism Mutual exclusion

lock

The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock

lock is variale contain value 0 1

lock = 1 process entered Critical Section

lock = 0 no process is in Critical Section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section

hellipcritical sectionhellip leave critical section

lock = 0

lock spin lock

Mechanism Busy waiting

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 43: Seminar Shared memory Programming

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Monitor

bull A structure of monitor typemonitor monitor_name shared variable declarationprocedure P1()

procedure P2()

procedure Pn()

initialization_code ()

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Usage Monitor

bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly

bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization

bull Some addiontional synchronization ldquotailor moderdquo use conditional construct

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Conditional type

bull Declare conditional xybull Only operations that can be invoke on a

conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until

another process invokesxsignal()the process invoking this operation resumes exatly one

suspended processbull if no process is suspended xsignal() has

no effect

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor conditional type

Condition variables

Condition variablesAllow threads to synchronize based

upon the actual value of data Without condition variables the

programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity

Always used in conjunction with a mutex lock

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t *cond);

Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);

Initialises the condition variable referenced by cond with attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialisation, the state of the condition variable becomes initialised.

pthread_cond_t cond;  /* declare a condition variable */

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t *cond);

Unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);

Destroys the given condition variable; the object becomes, in effect, uninitialised. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialised using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);

int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);

Both block on a condition variable; the second one additionally allows specifying a timeout (abstime).

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines. The main routine creates three threads. Two of the threads perform work and update a count variable; the third thread waits until the count variable reaches a specified value.

Sharing Data

Sequence for using condition variable - example (cont)

/* main */
#include <pthread.h>

int count = 0;                                 /* global variable (DECLARE) */
pthread_mutex_t count_mutex;
pthread_cond_t count_threshold_cv;
...
pthread_mutex_init(&count_mutex, NULL);        /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
...
pthread_create(...);                           /* CREATE THREADS TO DO WORK */

/* Threads 2 and 3 */
void *inc_count(void *t)
{
    ...
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(...);
        ...
        pthread_mutex_unlock(&count_mutex);
        /* Do some work so threads can alternate on the mutex lock */
        sleep(1);
    }
    pthread_exit(NULL);
}

/* Thread 1 */
void *watch_count(void *t)
{
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)
        pthread_cond_wait(...);
    count += 125;
    pthread_mutex_unlock(...);
    pthread_exit(NULL);
}
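Filling in the elided pieces, a complete, compilable version of this pattern might look as follows. This is a sketch, not the original program: the values TCOUNT = 10 and COUNT_LIMIT = 12 are illustrative assumptions, the sleep(1) calls are dropped so the demo finishes instantly, and the work is wrapped in a run_count_demo() function for easy checking.

```c
#include <pthread.h>

#define TCOUNT 10       /* assumed: increments per worker thread */
#define COUNT_LIMIT 12  /* assumed: threshold that wakes the watcher */

static int count = 0;
static pthread_mutex_t count_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t count_threshold_cv = PTHREAD_COND_INITIALIZER;

static void *inc_count(void *t)
{
    for (int i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)              /* wake the watcher exactly once */
            pthread_cond_signal(&count_threshold_cv);
        pthread_mutex_unlock(&count_mutex);
    }
    return NULL;
}

static void *watch_count(void *t)
{
    pthread_mutex_lock(&count_mutex);
    while (count < COUNT_LIMIT)                /* loop guards against spurious wakeups */
        pthread_cond_wait(&count_threshold_cv, &count_mutex);
    count += 125;
    pthread_mutex_unlock(&count_mutex);
    return NULL;
}

int run_count_demo(void)
{
    pthread_t t1, t2, t3;
    pthread_create(&t1, NULL, watch_count, NULL);
    pthread_create(&t2, NULL, inc_count, NULL);
    pthread_create(&t3, NULL, inc_count, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_join(t3, NULL);
    return count;  /* 2 * TCOUNT increments plus 125 from the watcher = 145 */
}
```

Because the watcher re-checks the predicate in a while loop, the result is deterministic even if both workers finish before the watcher first acquires the mutex.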

Sharing Data

The key aspect of shared-memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass the data in messages as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual memory management system.

Shared memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget()

Creates a shared memory segment; the return value is the shared memory ID.

shmat()

Attaches the shared segment to the data segment of the calling process; returns the starting address of the segment.
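A minimal sketch of these two calls in action is shown below. The segment size, permissions, and the shared_hello() wrapper are illustrative assumptions; in a real program the segment would typically be created before fork() so that parent and child both attach it.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Create a shared segment, attach it, write through it, then clean up. */
char *shared_hello(void)
{
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600); /* create segment */
    if (shmid < 0) { perror("shmget"); exit(1); }

    char *mem = (char *)shmat(shmid, NULL, 0);               /* attach it */
    if (mem == (char *)-1) { perror("shmat"); exit(1); }

    strcpy(mem, "hello");        /* any process attaching shmid would see this */
    static char out[16];
    strcpy(out, mem);

    shmdt(mem);                                              /* detach */
    shmctl(shmid, IPC_RMID, NULL);                           /* mark for removal */
    return out;
}
```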

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem CONFLICT

o Reading the variable by different process does not cause CONFLICT

o But writing new value CONFLICT

Example: consider two processes, each of which is to add 1 to a shared data item x.

Instruction      Process 1          Process 2
x = x + 1        read x             read x
                 compute x + 1      compute x + 1
                 write to x         write to x
(time runs downward)

(Figure: conflict in accessing shared variable x — both processes read the same value, each computes +1, and both write back, so one of the two increments is lost.)

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This mechanism is called mutual exclusion.

lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:

lock = 1: a process has entered the critical section

lock = 0: no process is in the critical section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing;  /* no operation in while loop */
lock = 1;                      /* enter critical section */
...critical section...
lock = 0;                      /* leave critical section */

A lock that is waited on in this way is called a spin lock. Note that the read of lock and the subsequent write must together be atomic (e.g., a test-and-set instruction); otherwise two processes can both observe lock == 0 and enter the critical section together.

This mechanism is known as busy waiting.

In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead. However:

There is overhead in saving and restoring process information.

It is necessary to choose the best or highest-priority process to enter the critical section.

Process 1                        Process 2
while (lock == 1) do_nothing;
lock = 1;
/* critical section */           while (lock == 1) do_nothing;
lock = 0;
                                 lock = 1;
                                 /* critical section */
                                 lock = 0;
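The naive while/assign lock above is racy because testing and setting lock are separate steps. A hedged sketch of a correct spin lock, using C11's atomic_flag so that test-and-set is a single atomic operation (C11 atomics are an assumption here — the slides do not prescribe an implementation):

```c
#include <stdatomic.h>
#include <pthread.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;
static long counter = 0;

/* atomic_flag_test_and_set reads and sets the flag in one step,
   closing the window in the naive version. */
static void spin_lock(void)   { while (atomic_flag_test_and_set(&lock_flag)) /* spin */; }
static void spin_unlock(void) { atomic_flag_clear(&lock_flag); }

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        spin_lock();
        counter++;              /* critical section */
        spin_unlock();
    }
    return NULL;
}

long spin_demo(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;             /* always 200000 with the lock in place */
}
```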

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).

They resolve synchronization problems among threads.

A mutex is used to grant threads access to shared resources in turn.

It provides mutual exclusion between threads.

Note:

A mutex can only be used to synchronize threads within the same process; it cannot synchronize threads belonging to different processes.


#include <pthread.h>  /* header containing the mutex functions */

Declare a variable: pthread_mutex_t mutex;

Static initialization:

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

pthread_mutex_t mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function:

int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);


Important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);

int pthread_mutex_unlock(pthread_mutex_t *mutex);

int pthread_mutex_trylock(pthread_mutex_t *mutex);

int pthread_mutex_destroy(pthread_mutex_t *mutex);
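The functions above can be tied together in a minimal sketch: two threads increment a shared counter under a mutex, so no updates are lost (the thread count and iteration count are illustrative choices, not from the slides).

```c
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static long total = 0;

static void *add_many(void *arg)
{
    for (int i = 0; i < 50000; i++) {
        pthread_mutex_lock(&m);     /* enter critical section */
        total++;
        pthread_mutex_unlock(&m);   /* leave critical section */
    }
    return NULL;
}

long mutex_demo(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, add_many, NULL);
    pthread_create(&t2, NULL, add_many, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_mutex_destroy(&m);
    return total;                   /* 100000: no increments lost */
}
```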


Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.


Example

Ex1: forall (i = 0; i < 5; i++) a[i] = 0;

All instances can be executed simultaneously.

Ex2: forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

- I(n) is the set of memory locations read by process P(n).

- O(m) is the set of memory locations altered by process P(m).

If the three conditions below are all satisfied, the two processes can be executed concurrently:

I1 ∩ O2 = ∅    I2 ∩ O1 = ∅    O1 ∩ O2 = ∅
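Bernstein's conditions can be checked mechanically. In the sketch below (an illustration, not part of the slides), each variable is a single character, so a read or write set is just a string such as "xy"; bernstein() returns 1 when the two statements may run concurrently.

```c
#include <string.h>

/* 1 if the two character sets share no variable */
static int disjoint(const char *a, const char *b)
{
    for (; *a; a++)
        if (strchr(b, *a))      /* shared location found */
            return 0;
    return 1;
}

int bernstein(const char *I1, const char *O1,
              const char *I2, const char *O2)
{
    return disjoint(I1, O2) &&  /* I1 ∩ O2 = ∅ */
           disjoint(I2, O1) &&  /* I2 ∩ O1 = ∅ */
           disjoint(O1, O2);    /* O1 ∩ O2 = ∅ */
}
```

For a = x + y; b = x + z; this gives bernstein("xy", "a", "xz", "b") == 1, while for a = x + y; b = a + b; it gives 0, matching the two worked examples that follow.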


Dependency analysis

Example 1: suppose the two statements are (in C):

a = x + y;
b = x + z;
------------------------------------------
I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}
------------------------------------------
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅ → the two statements can be executed simultaneously.


Dependency analysis

Example 2: suppose the two statements are (in C):

a = x + y;
b = a + b;
------------------------------------------
I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}
------------------------------------------
I2 ∩ O1 = {a} ≠ ∅ → the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data: in a parallel programming language supporting shared memory, a shared variable might be declared as:

shared int x;

With C++:

int global x;

par Construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

par { s1; s2; ...; sn; }

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

par { proc1(); proc2(); ...; procn(); }

forall Construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) { s1; s2; ...; sm; }

which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, ..., sm. Each process uses a different value of i.

forall (i = 0; i < 5; i++) a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.
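Neither par nor forall exists in C. A hedged sketch of how the forall example above could be emulated with Pthreads — one thread per index, each receiving its own private copy of i:

```c
#include <pthread.h>

#define N 5
static int a[N];

static void *body(void *arg)
{
    int i = *(int *)arg;   /* this thread's private loop index */
    a[i] = 0;
    return NULL;
}

/* Emulates: forall (i = 0; i < 5; i++) a[i] = 0; */
void forall_clear(void)
{
    pthread_t t[N];
    int idx[N];
    for (int i = 0; i < N; i++)
        a[i] = 99;                               /* nonzero to start */
    for (int i = 0; i < N; i++) {
        idx[i] = i;                              /* idx[] outlives the threads */
        pthread_create(&t[i], NULL, body, &idx[i]);
    }
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
}
```

Creating one thread per element is far too fine-grained in practice; real implementations would chunk the iteration space, but the sketch mirrors the construct's semantics.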


Shared Data in Systems with Caches


Shared Data in Systems with Caches

Cache coherence protocols:

In the update policy, copies of data in all caches are updated at the time one copy is altered.

In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated. These copies are updated only when the associated processor next references them.


Shared Data in Systems with Caches

False sharing: the key characteristic involved is that caches are organized in blocks of contiguous locations.

Different processors may require different parts of the same block, but not the same bytes.

Shared Data in Systems with Caches

(Figure: false sharing — different processors write different words within the same cache block, forcing coherence traffic even though no byte is truly shared.)

Shared Data in Systems with Caches

Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

The only way to avoid false sharing entirely would be to place each element in a different block, which would waste significant storage for a large array.
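The compiler's remedy can also be applied by hand. A minimal sketch, assuming a 64-byte cache block (a common size, but an assumption here): pad each per-thread counter so it occupies its own block and neighbouring counters can no longer false-share.

```c
#define CACHE_BLOCK 64   /* assumed cache block size in bytes */

/* Each counter fills one whole cache block, so updates by different
   processors never touch the same block. */
struct padded_counter {
    long value;
    char pad[CACHE_BLOCK - sizeof(long)];  /* push neighbours into other blocks */
};

struct padded_counter counters[4];         /* one per processor */
```

The cost is exactly the storage wastage the slide mentions: each 8-byte counter now consumes 64 bytes.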


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central-memory or a distributed-memory architecture.

Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote memory access) architectures.

Central-memory systems are also known as UMA (uniform memory access) systems.


Central memory versus distributed memory

In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types:

PVP: the parallel vector processor, also called a vector supercomputer.

SMP: the symmetric multiprocessor.


Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.


Distributed-Memory architecture


Distributed-Memory architecture

In a NORMA machine:

The node memories have separate address spaces.

A node can't directly access remote memory.

The only way to access remote data is by passing messages.


Distributed-Memory architecture

In an NCC-NUMA machine (a typical example is the Cray T3E):

Besides the local memory, each node has a set of node-level registers called E-registers.

Other NCC-NUMA systems may allow loading a remote value directly into a processor register.


Distributed-Memory architecture

In a COMA machine:

All local memories are structured as caches (called COMA caches).

Such a cache has much larger capacity than the level-2 cache or the remote cache of a node.

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache-coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.


CC-NUMA versus COMA

CC-NUMA: main memory consists of all the local memories.

COMA: main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.


Characteristics of the five distributed-memory architectures

LOGO

Page 44: Seminar Shared memory Programming

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Usage Monitor

bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly

bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization

bull Some addiontional synchronization ldquotailor moderdquo use conditional construct

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Conditional type

bull Declare conditional xybull Only operations that can be invoke on a

conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until

another process invokesxsignal()the process invoking this operation resumes exatly one

suspended processbull if no process is suspended xsignal() has

no effect

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor conditional type

Condition variables

Condition variablesAllow threads to synchronize based

upon the actual value of data Without condition variables the

programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity

Always used in conjunction with a mutex lock

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified

condition variable cond (if any threads are blocked on cond)

int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)

initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised

Pthread_cond_t cond Declare condition variable

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable

cond

int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in

effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined

int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)

int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)

block on a condition variable the second one alow to appoint timeout

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK

main

Thread 23

void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)

void watch_count(void t) pthread_mutex_lock(ampcount_mutex)

if (countltCOUNT_LIMIT)

pthread_cond_wait (hellip)

count += 125

pthread_mutex_unlock(hellip) pthread_exit(NULL)

Thread 1

Sharing Data

The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment

Creating Shared Data

Each process has its own virtual address space within the virtualmemory management system

Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space

Creating Shared Data (tt)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach shared segment to the data segmentof the calling process

Return the starting address of the data segment

Accessing Shared Data

Accessing Shared Data needs careful control if the data is everaltered by a process

Problem CONFLICT

o Reading the variable by different process does not cause CONFLICT

o But writing new value CONFLICT

EX Consider two processes each of which is to add 1 two a shareddata item x

Instruction Process 1 Process 2

x = x + 1 read x read x

Compute x + 1 Compute x + 1

write to x write to xtime

Conflict in accessing shared data

Shared variable x

+1 +1

read read

write write

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This Mechanism Mutual exclusion

lock

The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock

lock is variale contain value 0 1

lock = 1 process entered Critical Section

lock = 0 no process is in Critical Section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section

hellipcritical sectionhellip leave critical section

lock = 0

lock spin lock

Mechanism Busy waiting

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 45: Seminar Shared memory Programming

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Usage Monitor

bull The monitor construct ensure that only one process at a time can be active within the monitor Consequently the programmer does not need to code this synchronization constraint explicitly

bull Monitor as defined so far is not sufficiently powerful for modeling some synchronization scheme Need to define additional some machanisms ldquotailor moderdquo about synchronization

bull Some addiontional synchronization ldquotailor moderdquo use conditional construct

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Conditional type

bull Declare conditional xybull Only operations that can be invoke on a

conditional variable are wait() and signal()xwait() the process invoking this operation is suspended until

another process invokesxsignal()the process invoking this operation resumes exatly one

suspended processbull if no process is suspended xsignal() has

no effect

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor conditional type

Condition variables

Condition variablesAllow threads to synchronize based

upon the actual value of data Without condition variables the

programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity

Always used in conjunction with a mutex lock

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified

condition variable cond (if any threads are blocked on cond)

int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)

initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised

Pthread_cond_t cond Declare condition variable

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable

cond

int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in

effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined

int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)

int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)

block on a condition variable the second one alow to appoint timeout

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK

main

Thread 23

void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)

void watch_count(void t) pthread_mutex_lock(ampcount_mutex)

if (countltCOUNT_LIMIT)

pthread_cond_wait (hellip)

count += 125

pthread_mutex_unlock(hellip) pthread_exit(NULL)

Thread 1

Sharing Data

The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment

Creating Shared Data

Each process has its own virtual address space within the virtualmemory management system

Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space

Creating Shared Data (tt)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach shared segment to the data segmentof the calling process

Return the starting address of the data segment

Accessing Shared Data

Accessing Shared Data needs careful control if the data is everaltered by a process

Problem CONFLICT

o Reading the variable by different process does not cause CONFLICT

o But writing new value CONFLICT

EX Consider two processes each of which is to add 1 two a shareddata item x

Instruction Process 1 Process 2

x = x + 1 read x read x

Compute x + 1 Compute x + 1

write to x write to xtime

Conflict in accessing shared data

Shared variable x

+1 +1

read read

write write

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one critical section is executed at a time.

This mechanism is called mutual exclusion.

Lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable that holds the value 0 or 1:

lock = 1: a process has entered the critical section.

lock = 0: no process is in the critical section.

The lock operates much like a door lock: suppose a process reaches a lock that is set, indicating that the process is excluded from the critical section. It then has to wait until it is allowed to enter the critical section.

  while (lock == 1) ;   /* no operation in while loop: spin */
  lock = 1;             /* enter critical section */
    ...critical section...
  lock = 0;             /* leave critical section */

A lock that busy-waits in this way is called a spin lock.

Mechanism: busy waiting

In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead, but:

- there is overhead in saving and restoring process information;

- it is then necessary to choose the best or highest-priority process to enter the critical section.

Process 1:
  while (lock == 1) ;   /* spin */
  lock = 1;
  /* critical section */
  lock = 0;

Process 2:
  while (lock == 1) ;   /* spin */
  lock = 1;
  /* critical section */
  lock = 0;

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes):

- They resolve synchronization problems between threads.

- A mutex is used to grant threads access to a shared resource in turn.

- It provides mutual exclusion between threads.

Note:

A mutex is only used to synchronize threads within a single process; it cannot synchronize threads belonging to different processes.


#include <pthread.h>   /* header declaring the mutex functions */

Declaration: pthread_mutex_t mutex;

Static initialization:

  pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
  pthread_mutex_t mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialization by function:

  int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);


Key functions:

  int pthread_mutex_lock(pthread_mutex_t *mutex);
  int pthread_mutex_unlock(pthread_mutex_t *mutex);
  int pthread_mutex_trylock(pthread_mutex_t *mutex);
  int pthread_mutex_destroy(pthread_mutex_t *mutex);


Dependency analysis

One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.


Example

Ex. 1:

  forall (i = 0; i < 5; i++)
      a[i] = 0;

All instances can be executed simultaneously.

Ex. 2:

  forall (i = 2; i < 6; i++) {
      x = i - 2*i + i*i;
      a[i] = a[x];
  }

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

- I(n) is the set of memory locations read by process P(n).

- O(m) is the set of memory locations altered by process P(m).

If the following three conditions are all satisfied, the two processes can be executed concurrently:

  I1 ∩ O2 = ∅
  I2 ∩ O1 = ∅
  O1 ∩ O2 = ∅



Example 1: Suppose the two statements are (in C):

  a = x + y;
  b = x + z;

Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}.

Since I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, and O1 ∩ O2 = ∅, the two statements can be executed simultaneously.



Example 2: Suppose the two statements are (in C):

  a = x + y;
  b = a + b;

Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}.

Since I2 ∩ O1 = {a} ≠ ∅, the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data: In a parallel programming language supporting shared memory, a shared variable might be declared as:

  shared int x;

(In C/C++ with threads, an ordinary global variable, e.g. int x;, is already shared among the threads of a process.)

par Construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

  par {
      s1;
      s2;
      ...
      sn;
  }

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

  par {
      proc1();
      proc2();
      ...
      procn();
  }

forall Construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

  forall (i = 0; i < n; i++) {
      s1;
      s2;
      ...
      sm;
  }

which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, ..., sm. Each process uses a different value of i. For example,

  forall (i = 0; i < 5; i++)
      a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.

Shared Data in Systems with Caches

Cache coherence protocols:

- In the update policy, copies of data in all caches are updated at the time one copy is altered.

- In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are only updated when the associated processor next references the data.


False sharing:

- The key characteristic involved is that caches are organized in blocks of contiguous locations.

- False sharing occurs when different processors require different parts of the same block, but not the same bytes.


Solution for false sharing:

- The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

- The only way to avoid false sharing completely would be to place each element in a different block, which would waste significant storage for a large array.


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central-memory or a distributed-memory architecture.

Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no-remote memory access) architectures.

Central-memory systems are also known as UMA (uniform memory access) systems.


In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types:

- PVP: the parallel vector processor, also called a vector supercomputer;

- SMP: the symmetric multiprocessor.


Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.


In a NORMA machine:

- The node memories have separate address spaces.

- A node can't directly access remote memory.

- The only way to access remote data is by passing messages.


In an NCC-NUMA machine:

- A typical example is the Cray T3E. Besides its local memory, each node has a set of node-level registers called E-registers.

- Other NCC-NUMA systems may allow loading a remote value directly into a processor register.


In a COMA machine:

- All local memories are structured as caches (called COMA caches).

- A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node.

- COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA and COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, whereas CC-NUMA and COMA provide cache coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.


CC-NUMA versus COMA

- CC-NUMA: main memory consists of all the local memories.

- COMA: main memory consists of all the COMA caches.

- All this complexity makes a COMA system more expensive to implement than a NUMA machine.


Characteristics of five distributed-memory architectures


Faculty of Computer Science & Engineering, Ho Chi Minh City University of Technology (Đại học Bách Khoa Tp.HCM)

Condition type

- Declare: condition x, y;

- The only operations that can be invoked on a condition variable are wait() and signal():

  x.wait(): the process invoking this operation is suspended until another process invokes x.signal().

  x.signal(): the process invoking this operation resumes exactly one suspended process.

- If no process is suspended, x.signal() has no effect.


Structure of a monitor with condition variables

Condition variables

- Condition variables allow threads to synchronize based upon the actual value of data.

- Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section) to check whether the condition is met. This can be very time-consuming and unproductive, since the thread would be continuously busy in this activity.

- A condition variable is always used in conjunction with a mutex lock.


Pthread Condition Variables

  pthread_cond_t cond;   /* declare a condition variable */

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
- Initializes the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition variable attributes are used; the effect is the same as passing the address of a default condition variable attributes object. Upon successful initialization, the state of the condition variable becomes initialized.

int pthread_cond_signal(pthread_cond_t *cond);
- Unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).


Pthread Condition Variables (cont.)

int pthread_cond_broadcast(pthread_cond_t *cond);
- Unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
- Destroys the given condition variable; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable object can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
- Block on a condition variable; the second one allows a timeout to be specified.


Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value


  #include <pthread.h>

  int count = 0;                        /* global variable:  DECLARE */
  pthread_mutex_t count_mutex;
  pthread_cond_t count_threshold_cv;
  ...
  pthread_mutex_init(&count_mutex, NULL);        /* INITIALIZE */
  pthread_cond_init(&count_threshold_cv, NULL);
  ...
  pthread_create(...);                  /* CREATE THREADS TO DO WORK */

  /* Threads 2 and 3 */
  void *inc_count(void *t)
  {
      ...
      for (i = 0; i < TCOUNT; i++) {
          pthread_mutex_lock(&count_mutex);
          count++;
          if (count == COUNT_LIMIT)
              pthread_cond_signal(...);
          ...
          pthread_mutex_unlock(&count_mutex);
          /* Do some work so threads can alternate on the mutex lock */
          sleep(1);
      }
      pthread_exit(NULL);
  }

  /* Thread 1 */
  void *watch_count(void *t)
  {
      pthread_mutex_lock(&count_mutex);
      while (count < COUNT_LIMIT)
          pthread_cond_wait(...);
      count += 125;
      pthread_mutex_unlock(...);
      pthread_exit(NULL);
  }

Sharing Data

The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment

Creating Shared Data

Each process has its own virtual address space within the virtualmemory management system

Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space

Creating Shared Data (tt)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach shared segment to the data segmentof the calling process

Return the starting address of the data segment

Accessing Shared Data

Accessing Shared Data needs careful control if the data is everaltered by a process

Problem CONFLICT

o Reading the variable by different process does not cause CONFLICT

o But writing new value CONFLICT

EX Consider two processes each of which is to add 1 two a shareddata item x

Instruction Process 1 Process 2

x = x + 1 read x read x

Compute x + 1 Compute x + 1

write to x write to xtime

Conflict in accessing shared data

Shared variable x

+1 +1

read read

write write

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This Mechanism Mutual exclusion

lock

The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock

lock is variale contain value 0 1

lock = 1 process entered Critical Section

lock = 0 no process is in Critical Section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section

hellipcritical sectionhellip leave critical section

lock = 0

lock spin lock

Mechanism Busy waiting

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 47: Seminar Shared memory Programming

Khoa Khoa Học amp Kĩ thuật maacutey tiacutenh - Đại học Baacutech Khoa TpHCM

Structure Monitor conditional type

Condition variables

Condition variablesAllow threads to synchronize based

upon the actual value of data Without condition variables the

programmer would need to have threads continually polling (possibly in a critical section) to check if the condition is met This can be very time consuming amp unproductive execise since the thread would be continuously busy in this activity

Always used in conjunction with a mutex lock

Sharing Data

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified

condition variable cond (if any threads are blocked on cond)

int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)

initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised

Pthread_cond_t cond Declare condition variable

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable

cond

int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in

effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined

int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)

int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)

block on a condition variable the second one alow to appoint timeout

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK

main

Thread 23

void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)

void watch_count(void t) pthread_mutex_lock(ampcount_mutex)

if (countltCOUNT_LIMIT)

pthread_cond_wait (hellip)

count += 125

pthread_mutex_unlock(hellip) pthread_exit(NULL)

Thread 1

Sharing Data

The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment

Creating Shared Data

Each process has its own virtual address space within the virtualmemory management system

Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space

Creating Shared Data (tt)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach shared segment to the data segmentof the calling process

Return the starting address of the data segment

Accessing Shared Data

Accessing Shared Data needs careful control if the data is everaltered by a process

Problem CONFLICT

o Reading the variable by different process does not cause CONFLICT

o But writing new value CONFLICT

EX Consider two processes each of which is to add 1 two a shareddata item x

Instruction Process 1 Process 2

x = x + 1 read x read x

Compute x + 1 Compute x + 1

write to x write to xtime

Conflict in accessing shared data

Shared variable x

+1 +1

read read

write write

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This Mechanism Mutual exclusion

lock

The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock

lock is variale contain value 0 1

lock = 1 process entered Critical Section

lock = 0 no process is in Critical Section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section

hellipcritical sectionhellip leave critical section

lock = 0

lock spin lock

Mechanism Busy waiting

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocols: In the update policy, copies of data in all caches are updated at the time one copy is altered.

In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next references that data.

Shared Data in Systems with Caches

False sharing: The key characteristic here is that caches are organized in blocks of contiguous locations. False sharing occurs when different processors require different parts of the same block, but not the same bytes.

Shared Data in Systems with Caches

Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

For a shared array, the only way to avoid false sharing entirely would be to place each element in a different block, which would waste significant storage for a large array.

Different types of memory architecture

Central memory versus distributed memory

A parallel computer has either a central-memory or a distributed-memory architecture.

Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.

Central-memory systems are also known as UMA (uniform memory access) systems.


Central memory versus distributed memory

In a UMA system all memory locations are an equal distance from any processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types: the PVP (parallel vector processor, also called a vector supercomputer) and the SMP (symmetric multiprocessor).


Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.

Distributed-Memory architecture

In a NORMA machine: the node memories have separate address spaces. A node can't directly access remote memory; the only way to access remote data is by passing messages.


Distributed-Memory architecture

In an NCC-NUMA machine (a typical example is the Cray T3E): besides the local memory, each node has a set of node-level registers called E-registers. Other NCC-NUMA systems may allow loading a remote value directly into a processor register.


Distributed-Memory architecture

In a COMA machine: all local memories are structured as caches (called COMA caches). A COMA cache has much larger capacity than the level-2 cache or the remote cache of a node.

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA and COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache-coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.


CC-NUMA versus COMA

CC-NUMA: main memory consists of all the local memories.

COMA: main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.

Characteristics of five distributed-memory architectures


Page 48: Seminar Shared memory Programming

Condition variables

Condition variables allow threads to synchronize based on the actual value of data. Without condition variables, the programmer would need threads continually polling (possibly in a critical section) to check whether the condition is met. This can be very time-consuming and unproductive, since the thread would be continuously busy in this activity.

A condition variable is always used in conjunction with a mutex lock.

Sharing Data

Pthread Condition Variables

pthread_cond_t cond;   /* declare a condition variable */

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
initializes the condition variable referenced by cond with the attributes referenced by attr. If attr is NULL, the default condition-variable attributes are used; the effect is the same as passing the address of a default condition-variable attributes object. Upon successful initialization, the state of the condition variable becomes initialized.

int pthread_cond_signal(pthread_cond_t *cond);
unblocks at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).

Sharing Data

Pthread Condition Variables (cont.)

int pthread_cond_broadcast(pthread_cond_t *cond);
unblocks all threads currently blocked on the specified condition variable cond.

int pthread_cond_destroy(pthread_cond_t *cond);
destroys the given condition variable; the object becomes, in effect, uninitialized. An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value. A destroyed condition variable can be re-initialized using pthread_cond_init(); the results of otherwise referencing the object after it has been destroyed are undefined.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
block on a condition variable; the second one allows a timeout to be specified.

Sharing Data

Sequence for using condition variables - example

This simple example demonstrates the use of several Pthread condition-variable routines. The main routine creates three threads. Two of the threads perform work and update a count variable; the third thread waits until the count variable reaches a specified value.

Sharing Data

Sequence for using condition variables - example (cont.)

#include <pthread.h>

int count = 0;                          /* global variable: DECLARE */
pthread_mutex_t count_mutex;
pthread_cond_t count_threshold_cv;
...
pthread_mutex_init(&count_mutex, NULL);          /* INITIALIZE */
pthread_cond_init(&count_threshold_cv, NULL);
...
pthread_create(...);                    /* CREATE THREADS TO DO WORK */

/* Threads 2 and 3 */
void *inc_count(void *t)
{
    ...
    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;
        if (count == COUNT_LIMIT)
            pthread_cond_signal(...);
        ...
        pthread_mutex_unlock(&count_mutex);
        /* Do some work so threads can alternate on the mutex lock */
        sleep(1);
    }
    pthread_exit(NULL);
}

/* Thread 1 */
void *watch_count(void *t)
{
    pthread_mutex_lock(&count_mutex);
    if (count < COUNT_LIMIT)
        pthread_cond_wait(...);
    count += 125;
    pthread_mutex_unlock(...);
    pthread_exit(NULL);
}

Sharing Data

The key aspect of shared-memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor, rather than having to pass data in messages, as in a message-passing environment.

Creating Shared Data

Each process has its own virtual address space within the virtual-memory management system.

Shared-memory system calls allow processes to attach a segment of physical memory to their virtual memory space.

Creating Shared Data (cont.)

shmget()

creates a shared memory segment; the return value is the shared memory ID.

shmat()

attaches the shared segment to the data segment of the calling process and returns the starting address of the segment.

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

The problem: CONFLICT.

o Reading the variable by different processes does not cause a conflict.
o But writing a new value does.

Example: Consider two processes, each of which is to add 1 to a shared data item, x. Executing x = x + 1, the two processes may interleave as follows (time running downward):

Process 1          Process 2
read x             read x
compute x + 1      compute x + 1
write to x         write to x

(Figure: conflict in accessing the shared variable x — both processes read the same value of x, add 1, and write back, so one of the two increments is lost.)

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.

This mechanism is called mutual exclusion.

Lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:
lock = 1: a process has entered the critical section
lock = 0: no process is in the critical section

The lock operates much like a door lock. Suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section. It then has to wait until it is allowed to enter the critical section:

while (lock == 1) ;   /* do nothing: no operation in the while loop */
lock = 1;             /* enter critical section */
...critical section...
lock = 0;             /* leave critical section */

A lock used this way is called a spin lock.

Mechanism: busy waiting

In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead. This incurs overhead in saving and restoring process information, and it becomes necessary to choose the best or highest-priority process to enter the critical section.

Process 1                      Process 2

while (lock == 1) ;
lock = 1;
  critical section             while (lock == 1) ;   /* spins */
lock = 0;
                               lock = 1;
                                 critical section
                               lock = 0;

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual-exclusion lock variables (mutexes), which resolve thread-synchronization problems.

A mutex is used to share resources among threads in an ordered way, providing mutual exclusion between threads.

Note: a mutex is used to synchronize threads within the same process; by default it cannot synchronize threads belonging to different processes.

#include <pthread.h>   /* header containing the mutex functions */

Declare the variable:
pthread_mutex_t mutex;

Initialize it statically:
mutex = PTHREAD_MUTEX_INITIALIZER;
mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

or initialize it with a function:
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);

Important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);

Dependency analysis

One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.

Example

Ex1: forall (i = 0; i < 5; i++) a[i] = 0;

All instances can be executed simultaneously.

Ex2: forall (i = 2; i < 6; i++) {
       x = i - 2*i + i*i;
       a[i] = a[x];
     }

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 49: Seminar Shared memory Programming

Pthread Condition Variables

int pthread_cond_signal(pthread_cond_t cond) call unblocks at least one of the threads that are blocked on the specified

condition variable cond (if any threads are blocked on cond)

int pthread_cond_init(pthread_cond_t cond const pthread_condattr_t attr)

initialises the condition variable referenced by cond with attributes referenced by attr If attr is NULL the default condition variable attributes are used the effect is the same as passing the address of a default condition variable attributes object Upon successful initialisation the state of the condition variable becomes initialised

Pthread_cond_t cond Declare condition variable

Sharing Data

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable

cond

int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in

effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined

int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)

int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)

block on a condition variable the second one alow to appoint timeout

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK

main

Thread 23

void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)

void watch_count(void t) pthread_mutex_lock(ampcount_mutex)

if (countltCOUNT_LIMIT)

pthread_cond_wait (hellip)

count += 125

pthread_mutex_unlock(hellip) pthread_exit(NULL)

Thread 1

Sharing Data

The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment

Creating Shared Data

Each process has its own virtual address space within the virtualmemory management system

Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space

Creating Shared Data (tt)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach shared segment to the data segmentof the calling process

Return the starting address of the data segment

Accessing Shared Data

Accessing Shared Data needs careful control if the data is everaltered by a process

Problem CONFLICT

o Reading the variable by different process does not cause CONFLICT

o But writing new value CONFLICT

EX Consider two processes each of which is to add 1 two a shareddata item x

Instruction Process 1 Process 2

x = x + 1 read x read x

Compute x + 1 Compute x + 1

write to x write to xtime

Conflict in accessing shared data

Shared variable x

+1 +1

read read

write write

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This Mechanism Mutual exclusion

lock

The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock

lock is variale contain value 0 1

lock = 1 process entered Critical Section

lock = 0 no process is in Critical Section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section

hellipcritical sectionhellip leave critical section

lock = 0

lock spin lock

Mechanism Busy waiting

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 50: Seminar Shared memory Programming

Pthread Condition Variables(cont)

int pthread_cond_broadcast(pthread_cond_t cond) call unblocks all threads currently blocked on the specified condition variable

cond

int pthread_cond_destroy(pthread_cond_t cond) destroys the given condition variable specified by cond the object becomes in

effect uninitialised An implementation may cause pthread_cond_destroy() to set the object referenced by cond to an invalid value A destroyed condition variable object can be re-initialised using pthread_cond_init() the results of otherwise referencing the object after it has been destroyed are undefined

int pthread_cond_wait(pthread_cond_t cond pthread_mutex_t mutex)

int pthread_cond_timedwait(pthread_cond_t cond pthread_mutex_t mutex const struct timespec abstime)

block on a condition variable the second one alow to appoint timeout

Sharing Data

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK

main

Thread 23

void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)

void watch_count(void t) pthread_mutex_lock(ampcount_mutex)

if (countltCOUNT_LIMIT)

pthread_cond_wait (hellip)

count += 125

pthread_mutex_unlock(hellip) pthread_exit(NULL)

Thread 1

Sharing Data

The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment

Creating Shared Data

Each process has its own virtual address space within the virtualmemory management system

Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space

Creating Shared Data (cont.)

shmget()

Creates a shared memory segment.

The return value is the shared memory ID.

shmat()

Attaches the shared segment to the data segment of the calling process.

Returns the starting address of the attached segment.

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: CONFLICT

o Reading the variable by different processes does not cause a conflict.

o But writing a new value does cause a CONFLICT.

EX: Consider two processes, each of which is to add 1 to a shared data item x.

Instruction: x = x + 1

Process 1: read x; compute x + 1; write to x

Process 2: read x; compute x + 1; write to x

If both processes read x before either writes back, one increment is lost.

[Figure: conflict in accessing shared variable x: each process reads x, adds 1, and writes the result back over time]

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.

This mechanism is called mutual exclusion.

lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:

lock = 1: a process has entered the critical section

lock = 0: no process is in the critical section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing;   /* no operation in while loop */
lock = 1;                       /* enter critical section */
  ...critical section...
lock = 0;                       /* leave critical section */

A lock that busy-waits like this is called a spin lock.

This mechanism is known as busy waiting.

In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead. This incurs:

Overhead in saving and restoring process information.

The need to choose the best or highest-priority process to enter the critical section.

Process 1:                          Process 2:
  while (lock == 1) do_nothing;       while (lock == 1) do_nothing;
  lock = 1;                           lock = 1;
  /* critical section */              /* critical section */
  lock = 0;                           lock = 0;

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).

They resolve synchronisation problems between threads.

A mutex is used to share resources among threads in an orderly fashion.

It provides mutual exclusion between threads.

Note:

A mutex is only used to synchronise threads within the same process; it cannot synchronise threads belonging to different processes.

wwwthemegallerycom

#include <pthread.h>   /* header declaring the mutex functions */

Declare a variable: pthread_mutex_t mutex;

Static initialisation:

mutex = PTHREAD_MUTEX_INITIALIZER;

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER;

Initialisation by function:

int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);


Important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);

int pthread_mutex_unlock(pthread_mutex_t *mutex);

int pthread_mutex_trylock(pthread_mutex_t *mutex);

int pthread_mutex_destroy(pthread_mutex_t *mutex);


Dependency analysis

One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.


Example

Ex1: forall (i = 0; i < 5; i++) a[i] = 0;

All instances can be executed simultaneously

Ex2: forall (i = 2; i < 6; i++) {

x = i - 2*i + i*i; a[i] = a[x];

}

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

- I(n) is the set of memory locations read by process P(n).

- O(m) is the set of memory locations altered by process P(m).

If the following three conditions are all satisfied, the two processes can be executed concurrently:

I1 ∩ O2 = ∅

I2 ∩ O1 = ∅

O1 ∩ O2 = ∅


Dependency analysis

Example 1: Suppose the two statements are (in C):

a = x + y;

b = x + z;

I1 = (x, y), I2 = (x, z), O1 = (a), O2 = (b)

I1 ∩ O2 = I2 ∩ O1 = O1 ∩ O2 = ∅, so the two statements can be executed simultaneously.


Dependency analysis

Example 2: Suppose the two statements are (in C):

a = x + y;

b = a + b;

I1 = (x, y), I2 = (a, b), O1 = (a), O2 = (b)

I2 ∩ O1 = (a) ≠ ∅, so the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data: In a parallel programming language supporting shared memory, a shared variable might be declared as:

shared int x;

With C++:

global int x;

par construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

par { s1; s2; ...; sn; }

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

par { proc1; proc2; ...; procn; }

forall construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) { s1; s2; ...; sm; }

which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, ..., sm. Each process uses a different value of i. For example:

forall (i = 0; i < 5; i++) a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.


Shared Data in Systems with Caches


Shared Data in Systems with Caches

Cache coherence protocols:

In the update policy, copies of data in all caches are updated at the time one copy is altered.

In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated. These copies are updated only when the associated processor makes reference to the data.


Shared Data in Systems with Caches

False sharing:

The key characteristic here is that caches are organised in blocks of contiguous locations.

False sharing occurs when different processors require different parts of the same block, but not the same bytes.


Shared Data in Systems with Caches

Solution for false sharing:

The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

The only way to avoid false sharing completely would be to place each element in a different block, which would create significant wastage of storage for a large array.


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.

Central memory systems are also known as UMA (uniform memory access) systems


Central memory versus distributed memory

In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types:

PVP: the parallel vector processor, also called a vector supercomputer.

SMP: the symmetric multiprocessor.


Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA and COMA.


Distributed-Memory architecture

In a NORMA machine:

The node memories have separate address spaces.

A node can't directly access remote memory.

The only way to access remote data is by passing messages.


Distributed-Memory architecture

In an NCC-NUMA machine:

A typical example is the Cray T3E.

Besides the local memory, each node has a set of node-level registers called E-registers.

Other NCC-NUMA systems may allow loading a remote value directly into a processor register.


Distributed-Memory architecture

In a COMA machine:

All local memories are structured as caches (called COMA caches).

A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node.

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesn't have hardware support for cache coherence, but CC-NUMA and COMA provide cache coherence support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.


CC-NUMA versus COMA

CC-NUMA: main memory consists of all the local memories.

COMA: main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.


Characteristics of five distributed-memory architectures

LOGO

Page 51: Seminar Shared memory Programming

Sequence for using condition variable - example

This simple example code demonstrates the use of several Pthread condition variable routines The main routine creates three threads Two of the threads perform work and update a count variable The third thread waits until the count variable reaches a specified value

Sharing Data

Sequence for using condition variable - example (cont)

include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK

main

Thread 23

void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)

void watch_count(void t) pthread_mutex_lock(ampcount_mutex)

if (countltCOUNT_LIMIT)

pthread_cond_wait (hellip)

count += 125

pthread_mutex_unlock(hellip) pthread_exit(NULL)

Thread 1

Sharing Data

The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment

Creating Shared Data

Each process has its own virtual address space within the virtualmemory management system

Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space

Creating Shared Data (tt)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach shared segment to the data segmentof the calling process

Return the starting address of the data segment

Accessing Shared Data

Accessing Shared Data needs careful control if the data is everaltered by a process

Problem CONFLICT

o Reading the variable by different process does not cause CONFLICT

o But writing new value CONFLICT

EX Consider two processes each of which is to add 1 two a shareddata item x

Instruction Process 1 Process 2

x = x + 1 read x read x

Compute x + 1 Compute x + 1

write to x write to xtime

Conflict in accessing shared data

Shared variable x

+1 +1

read read

write write

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This Mechanism Mutual exclusion

lock

The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock

lock is variale contain value 0 1

lock = 1 process entered Critical Section

lock = 0 no process is in Critical Section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section

hellipcritical sectionhellip leave critical section

lock = 0

lock spin lock

Mechanism Busy waiting

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 52: Seminar Shared memory Programming

Sequence for using condition variable - example (cont)

include ltpthreadhgtint count = 0 global var DECLAREpthread_mutex_t count_mutex pthread_cond_t count_threshold_cv hellippthread_mutex_init(ampcount_mutex NULL) INITIALIZEpthread_cond_init (ampcount_threshold_cv NULL) hellippthread_create (hellip)CREATE THREAD TO DO WORK

main

Thread 23

void inc_count(void t) hellip for (i=0 iltTCOUNT i++) pthread_mutex_lock(ampcount_mutex) count++ if (count == COUNT_LIMIT) pthread_cond_signal( hellip )hellip pthread_mutex_unlock(ampcount_mutex) Do some work so threads can alternate on mutex lock sleep(1)pthread_exit(NULL)

void watch_count(void t) pthread_mutex_lock(ampcount_mutex)

if (countltCOUNT_LIMIT)

pthread_cond_wait (hellip)

count += 125

pthread_mutex_unlock(hellip) pthread_exit(NULL)

Thread 1

Sharing Data

The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment

Creating Shared Data

Each process has its own virtual address space within the virtualmemory management system

Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space

Creating Shared Data (tt)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach shared segment to the data segmentof the calling process

Return the starting address of the data segment

Accessing Shared Data

Accessing Shared Data needs careful control if the data is everaltered by a process

Problem CONFLICT

o Reading the variable by different process does not cause CONFLICT

o But writing new value CONFLICT

EX Consider two processes each of which is to add 1 two a shareddata item x

Instruction Process 1 Process 2

x = x + 1 read x read x

Compute x + 1 Compute x + 1

write to x write to xtime

Conflict in accessing shared data

Shared variable x

+1 +1

read read

write write

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This Mechanism Mutual exclusion

lock

The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock

lock is variale contain value 0 1

lock = 1 process entered Critical Section

lock = 0 no process is in Critical Section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section

hellipcritical sectionhellip leave critical section

lock = 0

lock spin lock

Mechanism Busy waiting

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 53: Seminar Shared memory Programming

The key aspect of shared memory progamming is that shared memory provides the possibility of creating variables ang data structures that can be accessed directly by every processor rather than having to pass the data in messages as in message ndash passing enviroment

Creating Shared Data

Each process has its own virtual address space within the virtualmemory management system

Shared Memory System Call allow processes to attach a segmentof physical memory to their virtual memory space

Creating Shared Data (cont.)

shmget()
Creates a shared memory segment.
Its return value is the shared memory ID.

shmat()
Attaches the shared segment to the data segment of the calling process.
Returns the starting address of the attached segment.

Accessing Shared Data

Accessing shared data needs careful control if the data is ever altered by a process.

Problem: CONFLICT
o Reading the variable from different processes does not cause a conflict.
o But writing a new value does: CONFLICT.

Ex: Consider two processes, each of which is to add 1 to a shared data item, x.

Instruction: x = x + 1

         Process 1         Process 2
time     read x            read x
  |      compute x + 1     compute x + 1
  v      write to x        write to x

(Figure: conflict in accessing shared variable x — each process reads x, adds 1, and writes the result back, so the two writes conflict and one increment is lost.)

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish the sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.

This mechanism is called mutual exclusion.

Lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable holding the value 0 or 1:
lock = 1: a process has entered the critical section
lock = 0: no process is in the critical section

The lock operates much like a door lock. Suppose a process reaches a lock that is set, indicating that the process is excluded from the critical section; it then has to wait until it is allowed to enter:

while (lock == 1) do_nothing;  /* no operation in while loop */
lock = 1;                      /* enter critical section */
...critical section...
lock = 0;                      /* leave critical section */

A lock that spins like this is called a spin lock.

Mechanism: busy waiting

In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead, but this has overhead in saving and restoring process information, and it then becomes necessary to choose the best or highest-priority process to enter the critical section.

Process 1                         Process 2
while (lock == 1) do_nothing;
lock = 1;
  critical section                while (lock == 1) do_nothing;  /* busy-waits */
lock = 0;
                                  lock = 1;
                                    critical section
                                  lock = 0;

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variables (mutexes).

Mutexes resolve synchronization problems between threads: a mutex is used to share resources among threads in an orderly way, providing mutual exclusion between the threads.

Note: a mutex is (by default) only used to synchronize threads within a single process; it does not synchronize threads belonging to different processes.

#include <pthread.h>  /* header containing the mutex functions */

Declare the variable: pthread_mutex_t mutex;

Static initialization:
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER_NP;  /* recursive mutex, GNU extension */

Initialization by function:
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *mutexattr);

Important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);

Dependency analysis

One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.

Example

Ex1: forall (i = 0; i < 5; i++) a[i] = 0;
All instances can be executed simultaneously.

Ex2: forall (i = 2; i < 6; i++) {
  x = i - 2*i + i*i;
  a[i] = a[x];
}
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's Conditions

I(n) is the set of memory locations read by process P(n).
O(m) is the set of memory locations altered by process P(m).

If the three conditions
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
are all satisfied, the two processes can be executed concurrently.

Dependency analysis

Example 1: Suppose the two statements are (in C):
a = x + y;
b = x + z;
------------------------------------------
I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}
------------------------------------------
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅ → the two statements can be executed simultaneously.

Dependency analysis

Example 2: Suppose the two statements are (in C):
a = x + y;
b = a + b;
------------------------------------------
I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}
------------------------------------------
I2 ∩ O1 = {a} ≠ ∅ → the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data: in a parallel programming language supporting shared memory, a variable might be declared as
shared int x;
(or, in a C++-style notation, global int x;)

par construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:
par { s1; s2; ...; sn; }
The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:
par { proc1(); proc2(); ...; procn(); }

forall construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):
forall (i = 0; i < n; i++) { s1; s2; ...; sm; }
which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, ..., sm. Each process uses a different value of i.

Ex: forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.

Shared Data in Systems with Caches

Cache coherence protocols:
In the update policy, copies of the data in all caches are updated at the time one copy is altered.
In the invalidate policy, when one copy of the data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next references the data.

Shared Data in Systems with Caches

False sharing:
The key characteristic here is that caches are organized in blocks of contiguous locations. False sharing occurs when different processors require different parts of the same block, but not the same bytes.


Shared Data in Systems with Caches

Solution for false sharing:
The compiler can alter the layout of the data stored in main memory, separating data altered by only one processor into different blocks.
The only way to avoid false sharing entirely would be to place each element in a different block, which would waste significant storage for a large array.


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central-memory or a distributed-memory architecture.
Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.
Central-memory systems are also known as UMA (uniform memory access) systems.

Central memory versus distributed memory

In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.
UMA systems come in two types: the PVP (parallel vector processor, also called a vector supercomputer) and the SMP (symmetric multiprocessor).

Distributed-Memory architecture

A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory. Memories in other nodes are called remote memories.
Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.

Distributed-Memory architecture

In a NORMA machine, the node memories have separate address spaces. A node cannot directly access remote memory; the only way to access remote data is by passing messages.

Distributed-Memory architecture

A typical NCC-NUMA machine is the Cray T3E. Besides the local memory, each node has a set of node-level registers called E-registers. Other NCC-NUMA systems may allow loading a remote value directly into a processor register.

Distributed-Memory architecture

In a COMA machine, all local memories are structured as caches (called COMA caches). Such a cache has a much larger capacity than the level-2 cache or the remote cache of a node. COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.

NCC-NUMA versus CC-NUMA and COMA

An NCC-NUMA system does not have hardware support for cache coherency, whereas CC-NUMA and COMA provide cache-coherency support in hardware. It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.

CC-NUMA versus COMA

In CC-NUMA, main memory consists of all the local memories; in COMA, main memory consists of all the COMA caches. All this complexity makes a COMA system more expensive to implement than a NUMA machine.

Characteristics of the five distributed-memory architectures

LOGO

Page 54: Seminar Shared memory Programming

Creating Shared Data (tt)

shmget ()

Create shared memory segment

Return value is shared memory ID

shmat()

Attach shared segment to the data segmentof the calling process

Return the starting address of the data segment

Accessing Shared Data

Accessing Shared Data needs careful control if the data is everaltered by a process

Problem CONFLICT

o Reading the variable by different process does not cause CONFLICT

o But writing new value CONFLICT

EX Consider two processes each of which is to add 1 two a shareddata item x

Instruction Process 1 Process 2

x = x + 1 read x read x

Compute x + 1 Compute x + 1

write to x write to xtime

Conflict in accessing shared data

Shared variable x

+1 +1

read read

write write

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This Mechanism Mutual exclusion

lock

The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock

lock is variale contain value 0 1

lock = 1 process entered Critical Section

lock = 0 no process is in Critical Section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section

hellipcritical sectionhellip leave critical section

lock = 0

lock spin lock

Mechanism Busy waiting

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 55: Seminar Shared memory Programming

Accessing Shared Data

Accessing Shared Data needs careful control if the data is everaltered by a process

Problem CONFLICT

o Reading the variable by different process does not cause CONFLICT

o But writing new value CONFLICT

EX Consider two processes each of which is to add 1 two a shareddata item x

Instruction Process 1 Process 2

x = x + 1 read x read x

Compute x + 1 Compute x + 1

write to x write to xtime

Conflict in accessing shared data

Shared variable x

+1 +1

read read

write write

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This Mechanism Mutual exclusion

lock

The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock

lock is variale contain value 0 1

lock = 1 process entered Critical Section

lock = 0 no process is in Critical Section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section

hellipcritical sectionhellip leave critical section

lock = 0

lock spin lock

Mechanism Busy waiting

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 56: Seminar Shared memory Programming

Instruction Process 1 Process 2

x = x + 1 read x read x

Compute x + 1 Compute x + 1

write to x write to xtime

Conflict in accessing shared data

Shared variable x

+1 +1

read read

write write

The problem of accessing shared data can be generalized by considering shared resources

Mechanism

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections and arrange that only one such critical section is executed at a time

This Mechanism Mutual exclusion

lock

The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock

lock is variale contain value 0 1

lock = 1 process entered Critical Section

lock = 0 no process is in Critical Section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section

hellipcritical sectionhellip leave critical section

lock = 0

lock spin lock

Mechanism Busy waiting

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

Bernstein's Conditions

- Ii is the set of memory locations read by process Pi.

- Oj is the set of memory locations altered by process Pj.

For two processes P1 and P2, if the three conditions

I1 ∩ O2 = ∅,  I2 ∩ O1 = ∅,  O1 ∩ O2 = ∅

are all satisfied, the two processes can be executed concurrently.


Dependency analysis

Example 1: Suppose the two statements are (in C):

a = x + y;
b = x + z;

Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, and

I1 ∩ O2 = I2 ∩ O1 = O1 ∩ O2 = ∅

so the two statements can be executed simultaneously.


Dependency analysis

Example 2: Suppose the two statements are (in C):

a = x + y;
b = a + b;

Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, and

I2 ∩ O1 = {a} ≠ ∅

so the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data: In a parallel programming language supporting shared memory, a variable might be declared as:

shared int x;

(In C/C++, a global variable such as int x; plays this role: it is shared by all threads of the process.)

par Construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

par { s1; s2; …; sn; }

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

par { proc1(); proc2(); …; procn(); }

forall Construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) { s1; s2; …; sm; }

which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, …, sm. Each process uses a different value of i. For example:

forall (i = 0; i < 5; i++) a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.


Shared Data in Systems with Caches

Cache coherence protocols: In the update policy, copies of data in all caches are updated at the time one copy is altered.

In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next references them.


False sharing: the key characteristic here is that caches are organized in blocks of contiguous locations. False sharing occurs when different processors require different parts of the same block, but not the same bytes.


Solution for false sharing: have the compiler alter the layout of the data stored in main memory, separating data altered by only one processor into different blocks.

The only way to avoid false sharing completely would be to place each element in a different block, which would waste significant storage for a large array.


Different types of memory architecture


Central memory versus distributed memory

A parallel computer has either a central memory or a distributed memory architecture.

Distributed memory systems include NUMA (non-uniform memory access) and NORMA (no-remote memory access) architectures.

Central memory systems are also known as UMA (uniform memory access) systems


In a UMA system, all memory locations are an equal distance away from any processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types: PVP, the parallel vector processor (also called a vector supercomputer), and SMP, the symmetric multiprocessor.


Distributed-Memory architecture

A distributed memory computer contains multiple nodes, each having one or more processors and a local memory.

Memories in other nodes are called remote memories.

Types of distributed memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.


In a NORMA machine: the node memories have separate address spaces; a node can't directly access remote memory; the only way to access remote data is by passing messages.


In an NCC-NUMA machine: a typical example is the Cray T3E. Besides the local memory, each node has a set of node-level registers called E-registers.

Other NCC-NUMA systems may allow loading a remote value directly into a processor register.


In a COMA machine: all local memories are structured as caches (called COMA caches). A COMA cache has much larger capacity than the level-2 cache or the remote cache of a node.

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA and COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system


CC-NUMA versus COMA

CC-NUMA: main memory consists of all the local memories.

COMA: main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.


Characteristics of five distributed-memory architectures


The problem of accessing shared data can be generalized by considering shared resources.

Mechanism:

A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time.

This mechanism is called mutual exclusion.

Lock

The simplest mechanism for ensuring mutual exclusion of critical sections is the use of a lock.

A lock is a variable containing the value 0 or 1:

lock = 1: a process has entered the critical section.

lock = 0: no process is in the critical section.

The lock operates much like a door lock.

Suppose that a process reaches a lock that is set, indicating that the process is excluded from the critical section. It now has to wait until it is allowed to enter the critical section:

while (lock == 1) do_nothing;  /* no operation in while loop */
lock = 1;                      /* enter critical section */
  ... critical section ...
lock = 0;                      /* leave critical section */

A lock of this kind is called a spin lock.

This mechanism is called busy waiting.

In some cases it may be possible to deschedule the waiting process from the processor and schedule another process instead, but this has costs:

Overhead in saving and restoring process information.

It may be necessary to choose the best or highest-priority process to enter the critical section.

Process 1:                        Process 2:

while (lock == 1) do_nothing;    while (lock == 1) do_nothing;
lock = 1;                        lock = 1;
  /* critical section */           /* critical section */
lock = 0;                        lock = 0;

Note that the test and the set are separate operations here: both processes can observe lock == 0 before either sets it to 1, so both enter the critical section. A correct spin lock needs an atomic test-and-set.

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 58: Seminar Shared memory Programming

lock

The simplest mechanism for ensuring mutual exclution of critical section is by the use of a lock

lock is variale contain value 0 1

lock = 1 process entered Critical Section

lock = 0 no process is in Critical Section

The lock operates much like that of a door lock

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section

hellipcritical sectionhellip leave critical section

lock = 0

lock spin lock

Mechanism Busy waiting

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 59: Seminar Shared memory Programming

Suppose that a process reaches a lock that is set indicating that the process is excluded from the critical section

It now has to wait until it is allowed to enter the critical section

while (lock == 1) do_nothing no operation in while looplock = 1 enter critical section

hellipcritical sectionhellip leave critical section

lock = 0

lock spin lock

Mechanism Busy waiting

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 60: Seminar Shared memory Programming

In some case int may be possile to deschedule the process from the processor and schedule another process

Overhead in saving and reading process information

Necessary to choose the best or highest-priority process to enterthe critical section

Process 1 Process 2

while (lock == 1) do_nothinglock = 1

lock = 0

Critical Section

while (lock == 1) do_nothing

lock = 1

Lock = 0

Critical Section

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

Important functions:

int pthread_mutex_lock(pthread_mutex_t *mutex);

int pthread_mutex_unlock(pthread_mutex_t *mutex);

int pthread_mutex_trylock(pthread_mutex_t *mutex);

int pthread_mutex_destroy(pthread_mutex_t *mutex);

Dependency analysis

One of the key issues in all parallel programming is to identify which processes can be executed together. Processes cannot be executed together if there is some dependence between them that requires them to be executed in a sequential order. The process of finding the dependencies in a program is called dependency analysis.

Example

Ex1: forall (i = 0; i < 5; i++) a[i] = 0;

All instances can be executed simultaneously.

Ex2: forall (i = 2; i < 6; i++) { x = i - 2*i + i*i; a[i] = a[x]; }

In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's Conditions

I(n) is the set of memory locations read by process P(n).

O(m) is the set of memory locations altered by process P(m).

If the following three conditions are all satisfied, the two processes can be executed concurrently:

I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅

Dependency analysis

Example 1: Suppose the two statements are (in C):

a = x + y;
b = x + z;

I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}

I1 ∩ O2 = I2 ∩ O1 = O1 ∩ O2 = ∅, so the two statements can be executed simultaneously.

Dependency analysis

Example 2: Suppose the two statements are (in C):

a = x + y;
b = a + b;

I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}

I2 ∩ O1 = {a} ≠ ∅, so the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data: In a parallel programming language supporting shared memory, a shared variable might be declared as:

shared int x;

In C/C++, a global declaration such as int x; gives every thread of the process access to x.

par Construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

par { S1; S2; ...; Sn; }

The keyword par indicates that the statements in the body are to be executed concurrently.

Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

par { proc1(); proc2(); ...; procn(); }

forall Construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) { S1; S2; ...; Sm; }

which generates n processes, each consisting of the statements forming the body of the loop, S1, S2, ..., Sm. Each process uses a different value of i.

forall (i = 0; i < 5; i++) a[i] = 0;

clears a[0], a[1], a[2], a[3], a[4] to zero concurrently.

Shared Data in Systems with Caches

Shared Data in Systems with Caches

Cache coherence protocols:

In the update policy, copies of data in all caches are updated at the time one copy is altered.

In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated. These copies are updated only when the associated processor next makes a reference to the data.

Shared Data in Systems with Caches

False sharing:

The key characteristic involved is that caches are organized in blocks of contiguous locations.

False sharing occurs when different processors need different parts of the same block, but not the same bytes.


Shared Data in Systems with Caches

Solution for false sharing:

The compiler can alter the layout of the data stored in main memory, separating data altered by only one processor into different blocks.

The only way to avoid false sharing completely would be to place each element in a different block, which would create significant wastage of storage for a large array.

Different types of memory architecture

Central memory versus distributed memory

A parallel computer has either a central-memory or a distributed-memory architecture.

Distributed-memory systems include the NUMA (non-uniform memory access) and NORMA (no remote memory access) architectures.

Central-memory systems are also known as UMA (uniform memory access) systems.

Central memory versus distributed memory

In a UMA system, all memory locations are an equal distance from any processor, and all memory accesses take roughly the same amount of time.

UMA systems come in two types:

PVP: the parallel vector processor, also called a vector supercomputer.

SMP: the symmetric multiprocessor.

Distributed-Memory architecture

A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory.

Memories in other nodes are called remote memories.

Types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.


Distributed-Memory architecture

In a NORMA machine:

The node memories have separate address spaces.

A node cannot directly access remote memory.

The only way to access remote data is by passing messages.

Distributed-Memory architecture

In an NCC-NUMA machine:

A typical example is the Cray T3E. Besides the local memory, each node has a set of node-level registers called E-registers.

Other NCC-NUMA systems may allow loading a remote value directly into a processor register.

Distributed-Memory architecture

In a COMA machine:

All local memories are structured as caches (called COMA caches).

A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node.

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.

NCC-NUMA versus CC-NUMA and COMA

An NCC-NUMA system does not have hardware support for cache coherency; CC-NUMA and COMA provide cache coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.

CC-NUMA versus COMA

CC-NUMA: Main memory consists of all the local memories.

COMA: Main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.

Characteristics of five Distributed-Memory architectures

Page 61: Seminar Shared memory Programming

Pthread Lock Routines

Locks are implemented in Pthreads with what are called mutual exclusion lock variable (MUTEX)

Resolve synchronous problem of threads

Mutex dugraveng để chia sẽ tagravei nguyecircn cho caacutec threads theo thứ tự

Cung cấp caacutech thức loại trừ tương hổ giữa caacutec thread

Note

Mutex chỉ sử dụng để đồng bộ caacutec threads trong cugraveng1 proccess khocircng thể đồng bộ giữa caacutec thread khaacutecproccess với nhau

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 62: Seminar Shared memory Programming

wwwthemegallerycom

includeltpthreadhgt thư viện chứa caacutec hagravem dugraveng cho Mutex

Khai baacuteo biến pthread_mutex_t mutex

Khởi động trị ban đầu

mutex = PTHREAD_MUTEX_INITIALIZER

mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER

Khởi động bằng hagravem

int pthread_mutex_init(pthread_mutex_t mutexconst pthread_mutexattr_t mutexattr )

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 63: Seminar Shared memory Programming

wwwthemegallerycom

Caacutec hagravem quan trọng

int pthread_mutex_lock( pthread_mutex_t mutex)

int pthread_mutex_unlock( pthread_mutex_t mutex)

int pthread_mutex_trylock( pthread_mutex_t mutex)

int pthread_mutex_destroy( pthread_mutex_t mutex)

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 64: Seminar Shared memory Programming

wwwthemegallerycom Company Logo

Dependency analysis

One of the key issues in all parallel programming is to identify which processes could be executed together Processes cannot be executed together if there is some dependence between them that requires the processes to be executed in a sequential orderThe process of finding the

dependencies in a program is call dependency analysis

wwwthemegallerycom Company Logo

Example

Ex1 forall(i=0ilt5i++) a[i] = 0

All instances can be executed simultaneously

Ex2 forall (I = 2 ilt6 i++)

x = I ndash 2I + iI a[i] = a[x]

In this case it is not at all obvious whether

different instances of the body can be executed simultaneously

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed-memory computer contains multiple nodes, each having one or more processors and a local memory. Memories in other nodes are called remote memories. There are four types of distributed-memory architecture: NORMA, NCC-NUMA, CC-NUMA, and COMA.

In a NORMA machine, the node memories have separate address spaces, and a node cannot directly access remote memory; the only way to access remote data is by passing messages.

In an NCC-NUMA machine (a typical example is the Cray T3E), each node has, besides its local memory, a set of node-level registers called E-registers. Other NCC-NUMA systems may allow loading a remote value directly into a processor register.

In a COMA machine, all local memories are structured as caches (called COMA caches). A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node. COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.

NCC-NUMA versus CC-NUMA and COMA

An NCC-NUMA system has no hardware support for cache coherence, whereas CC-NUMA and COMA provide cache-coherence support in hardware. Consequently, it is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.

CC-NUMA versus COMA

In CC-NUMA, main memory consists of all the local memories; in COMA, main memory consists of all the COMA caches. All this complexity makes a COMA system more expensive to implement than a NUMA machine.

Characteristics of five distributed-memory architectures

Seminar: Shared Memory Programming

Example

Ex1:
forall (i = 0; i < 5; i++)
    a[i] = 0;
All instances can be executed simultaneously.

Ex2:
forall (i = 2; i < 6; i++) {
    x = i - 2*i + i*i;
    a[i] = a[x];
}
In this case it is not at all obvious whether different instances of the body can be executed simultaneously.

Bernstein's conditions

I(n) is the set of memory locations read by process P(n); O(m) is the set of memory locations altered by process P(m). If the following three conditions are all satisfied, the two processes P1 and P2 can be executed concurrently:

I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅

Dependency analysis

Example 1. Suppose the two statements are (in C):

a = x + y;
b = x + z;

Then I1 = {x, y}, I2 = {x, z}, O1 = {a}, O2 = {b}, so
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅: the two statements can be executed simultaneously.

Example 2. Suppose the two statements are (in C):

a = x + y;
b = a + b;

Then I1 = {x, y}, I2 = {a, b}, O1 = {a}, O2 = {b}, and
I2 ∩ O1 = {a} ≠ ∅: the two statements cannot be executed simultaneously.

Language Constructs for Parallelism

Shared data. In a parallel programming language supporting shared memory, a shared variable might be declared as:

shared int x;

With C/C++, a global variable is shared by all the threads of a process:

int global_x;

par Construct

Parallel languages offer the possibility of specifying concurrent statements, as in the par construct:

par {
    s1;
    s2;
    …
    sn;
}

The keyword par indicates that the statements in the body are to be executed concurrently. Multiple concurrent processes or threads can be specified by listing the routines that are to be executed concurrently:

par {
    proc1;
    proc2;
    …
    procn;
}

forall Construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) {
    s1;
    s2;
    …
    sm;
}

which generates n processes, each consisting of the statements forming the body of the loop, s1, s2, …, sm. Each process uses a different value of i. For example,

forall (i = 0; i < 5; i++)
    a[i] = 0;

clears a[0], a[1], a[2], a[3], and a[4] to zero concurrently.

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 66: Seminar Shared memory Programming

LOGO Bernsteinrsquos condition

-I(n) is the set of memory locations read by process P(n)

-O(m) is the set of memory locations altered by process P(m)

If the three conditions are all satisfiedthe two processes can be

executed concurrently

I1 O2 = I2 O1 =

O1 O2 =

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 67: Seminar Shared memory Programming

wwwthemegallerycom Company Logo

Dependency analysis

Example 1 Suppose the two statements are (in C) a = x + y

b = x + z------------------------------------------

I1(xy) I2(xz) O1(a) O2(b) -----------------------

I1 O2 = I2 O1 = O1 O2 = - two statements can be executed

simultaneously

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 68: Seminar Shared memory Programming

wwwthemegallerycom Company Logo

Dependency analysis

Example 2 Suppose the two statements are (in C) a = x + y

b = a + b------------------------------------------

I1(xy) I2(ab) O1(a) O2(b) ----------------------- I2 O1 0 - two statements cannot be executed

simultaneously

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 69: Seminar Shared memory Programming

Language Contructs for Parallelism

Shared dataIn a parallelism programming language surportingshared memory variable might be declared as

shared int xWith C++

Int gobal x

par Contruct

Parallel languages offer the possibility of thespecifying concurrent statement as in the parContruct

par s1 s2 hellip sn

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 70: Seminar Shared memory Programming

The keyword par indicates that statements in body areto be executed concurrently

Multiple concurrent processes or threads could bespecified by listing the routines that are to be executedconcurrently

par proc1 proc2 hellip procn

forall Contruct

Sometimes multiple similar processes need to bestarted together This can be optained with the forallConstruct ( or parfor contruct)

forall (i = 0 I lt n i++) s1 s2 hellip sm

Which generates n processes each consisting ofthe statements forming the body of the for loop s1 s2 hellip smEach process use different value of i

forall (i = 0 I lt 5 i++) a[i] = 0

Clear a[0] a[1] a[2] a[3] a[4] to zero concurrently

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register


In a COMA machine, all local memories are structured as caches (called COMA caches). A COMA cache has a much larger capacity than the level-2 cache or the remote cache of a node.

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.


NCC-NUMA versus CC-NUMA and COMA

An NCC-NUMA system does not have hardware support for cache coherency, but CC-NUMA and COMA provide cache-coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.


CC-NUMA versus COMA

In CC-NUMA, main memory consists of all the local memories; in COMA, main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.


Characteristics of five distributed-memory architectures


Page 71: Seminar Shared memory Programming

forall Construct

Sometimes multiple similar processes need to be started together. This can be obtained with the forall construct (or parfor construct):

forall (i = 0; i < n; i++) { s1; s2; ...; sm; }

which generates n processes, each consisting of the statements forming the body of the loop (s1, s2, ..., sm). Each process uses a different value of i.

forall (i = 0; i < 5; i++) a[i] = 0;

clears a[0], a[1], a[2], a[3], and a[4] to zero concurrently.


Shared Data in Systems with Caches


Cache coherence protocols: in the update policy, copies of data in all caches are updated at the time one copy is altered. In the invalidate policy, when one copy of data is altered, the same data in any other cache is invalidated; these copies are updated only when the associated processor next references the data.


False sharing: the key characteristic involved is that caches are organized in blocks of contiguous memory locations. False sharing occurs when different processors require different parts of the same block, but not the same bytes: each write invalidates the whole block in the other caches even though no data is actually shared.

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 72: Seminar Shared memory Programming

LOGO

Share DATA in systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 73: Seminar Shared memory Programming

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Cache coherence protocolsIn the update policy copies of data in all

caches are updated at the time one copy is altered

In the invalidate policy when one copy of data is altered the same data in any other cache is invalidated These copies are only updated when the associated processor makes reference for it

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 74: Seminar Shared memory Programming

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

False sharingThe key characteristic used is that

caches are organized in blocks of contiguous locations

Different parts of a block required by different processors but not the same bytes

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 75: Seminar Shared memory Programming

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 76: Seminar Shared memory Programming

wwwthemegallerycom Company Logo

Share Data in Systems with Caches

Solution for false sharingCompiler to alter the layout of the data

stored in the main memory separating data only altered by one processor into different blocks

The only way to avoid false sharing would be to place each element in a different block which would create significant wastage of storage for a large array

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 77: Seminar Shared memory Programming

wwwthemegallerycom Company Logo

Different types of memory architecture

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 78: Seminar Shared memory Programming

wwwthemegallerycom Company Logo

Central memory versus distributed memory

A parallel computer has either a central memory or distributed memory architecture

Distributed memory systems include NUMA (non uniform memory access) and NORMA (no-remote memory access) architecture

Central memory systems are also known as UMA (uniform memory access) systems

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 79: Seminar Shared memory Programming

wwwthemegallerycom Company Logo

Central memory versus distributed memory

In a UMA all memory locations are at an equal distance away from any processor and all memory accesses roughly take the same amount of time

UMA systems have two types PVP the parallel vector processor or

also called vector supercomputer SMP the symmetric multiprocessor

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

wwwthemegallerycom Company Logo

Distributed-Memory architecture

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NORMA machine The node memories have separate

address spaces A node cant directly access remote

memory The only way to access remote data

is by passing messages

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a NCC-NUMA machine A typical is Cray T3E Besides the local memory each

node has a set of node-level register called E-registers

Other NCC-NUMA systems may allow loading a remote value directly into a processor register

wwwthemegallerycom Company Logo

Distributed-Memory architecture

In a COMA machine All local memories are structured as

caches (called COMA caches) A cache has much larger capacity than the

level-2 cache or the remote cache of a node

COMA is the only architecture that provides hardware support for replicating the same block in multiple local memory

wwwthemegallerycom Company Logo

NCC-NUMA versus CC-NUMA COMA

An NCC-NUMA system doesnt have hardware support for cache coherency but CC-NUMA and COMA provide cache coherency support by hardware

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system

wwwthemegallerycom Company Logo

CC-NUMA versus COMA

CC-NUMA Main memory consists of all the

local memoriesCOMA

Main memory consists of all the COMA caches

All the complexity make a COMA system more expensive to implement than a NUMA machine

wwwthemegallerycom Company Logo

Characteristics of five Distributed-Memory

architecture

LOGO

Page 80: Seminar Shared memory Programming

wwwthemegallerycom Company Logo

Distributed-Memory architecture

A distributed memory computer contains multiple nodes each having one or more processors and a local memory

Memories in other nodes are called remote memories

Types of distributed memory architecture NORMA NCC-NUMA CC-NUMA and COMA

Page 81: Seminar Shared memory Programming

Distributed-Memory architecture

Page 82: Seminar Shared memory Programming

Distributed-Memory architecture

In a NORMA machine: The node memories have separate address spaces. A node can't directly access remote memory; the only way to access remote data is by passing messages.
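The NORMA point above, that remote data is reachable only via messages, can be sketched with threads standing in for nodes. This is a toy illustration with hypothetical class and function names, not a real NORMA API: each "node" owns a private memory dict, and the only way another node can read it is to send a request message and wait for the reply.

```python
import threading
import queue

class Node(threading.Thread):
    def __init__(self, name, memory):
        super().__init__(daemon=True)
        self.name = name
        self.memory = memory          # local memory: private to this node
        self.inbox = queue.Queue()    # incoming request messages

    def run(self):
        while True:
            msg = self.inbox.get()
            if msg is None:           # shutdown message
                break
            key, reply_box = msg
            reply_box.put(self.memory.get(key))  # serve the remote read

def remote_read(owner, key):
    """Read a value from another node's memory via message passing only."""
    reply_box = queue.Queue()
    owner.inbox.put((key, reply_box))
    return reply_box.get()

node_a = Node("A", {"x": 42})
node_a.start()
value = remote_read(node_a, "x")   # no direct access to node_a.memory
node_a.inbox.put(None)             # tell the node thread to stop
print(value)
```

In a real NORMA system the request/reply pair would be explicit sends and receives over an interconnect (as in message-passing libraries such as MPI); the queue here merely stands in for that channel.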

Page 83: Seminar Shared memory Programming

Distributed-Memory architecture

In an NCC-NUMA machine: A typical example is the Cray T3E. Besides the local memory, each node has a set of node-level registers called E-registers. Other NCC-NUMA systems may allow loading a remote value directly into a processor register.
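The E-register idea can be illustrated with a toy one-sided remote load. All names here are hypothetical (a real T3E issues these operations in hardware, not software): the reader pulls a word out of another node's memory into a local register-like slot, with no caching and no coherency.

```python
class NccNumaNode:
    def __init__(self, node_id, memory):
        self.node_id = node_id
        self.memory = memory       # globally addressable local memory
        self.e_register = None     # stand-in for a node-level E-register

def remote_load(reader, owner, address):
    """One-sided load: copy one word from owner's memory into reader's register."""
    reader.e_register = owner.memory[address]   # direct access, no coherency
    return reader.e_register

n0 = NccNumaNode(0, {0x10: 7})
n1 = NccNumaNode(1, {})
v = remote_load(n1, n0, 0x10)
print(v)

# The catch: if n0 later writes 0x10, n1's copy is stale.
# NCC-NUMA hardware does nothing to keep the two in sync.
n0.memory[0x10] = 8
print(n1.e_register)
```

The second print shows the stale value surviving the remote write, which is exactly the non-cache-coherent behaviour the "NCC" in NCC-NUMA refers to.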

Page 84: Seminar Shared memory Programming

Distributed-Memory architecture

In a COMA machine: All local memories are structured as caches (called COMA caches). A COMA cache has much larger capacity than the level-2 cache or the remote cache of a node. COMA is the only architecture that provides hardware support for replicating the same block in multiple local memories.
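That replication behaviour can be sketched as a toy "attraction memory". The classes below are hypothetical; real COMA hardware does this per cache block with a coherency protocol. The point is only that a block migrates into the local memory of whichever node reads it, so data gravitates toward its users.

```python
class ComaNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.attraction_memory = {}   # local memory structured as a cache

class ComaMachine:
    def __init__(self, num_nodes):
        self.nodes = [ComaNode(i) for i in range(num_nodes)]

    def read(self, node_id, block):
        node = self.nodes[node_id]
        if block in node.attraction_memory:        # local hit
            return node.attraction_memory[block]
        for other in self.nodes:                   # find a copy elsewhere
            if block in other.attraction_memory:
                value = other.attraction_memory[block]
                node.attraction_memory[block] = value   # replicate locally
                return value
        raise KeyError(block)

machine = ComaMachine(2)
machine.nodes[0].attraction_memory["B0"] = 99   # block starts on node 0
result = machine.read(1, "B0")                  # node 1 reads it...
print("B0" in machine.nodes[1].attraction_memory)  # ...and now holds a copy
```

After the read, both nodes hold block "B0"; keeping those copies consistent on writes is what the COMA coherency hardware is for.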

Page 85: Seminar Shared memory Programming

NCC-NUMA versus CC-NUMA and COMA

An NCC-NUMA system doesn't have hardware support for cache coherency, but CC-NUMA and COMA provide cache coherency support in hardware.

It is easier to build a scalable NCC-NUMA system than either a CC-NUMA or a COMA system.
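What that hardware support buys can be sketched with a heavily simplified write-invalidate protocol. The classes are hypothetical and not any specific machine's protocol: when one node writes a block, every other cached copy is invalidated, so no node can go on reading a stale value.

```python
class Cache:
    def __init__(self):
        self.lines = {}

class CoherentMemory:
    def __init__(self, num_nodes):
        self.memory = {}
        self.caches = [Cache() for _ in range(num_nodes)]

    def read(self, node, addr):
        cache = self.caches[node]
        if addr not in cache.lines:
            cache.lines[addr] = self.memory[addr]   # fill on miss
        return cache.lines[addr]

    def write(self, node, addr, value):
        for i, cache in enumerate(self.caches):
            if i != node:
                cache.lines.pop(addr, None)         # invalidate other copies
        self.memory[addr] = value
        self.caches[node].lines[addr] = value

m = CoherentMemory(2)
m.memory[0] = "old"
m.read(1, 0)            # node 1 caches "old"
m.write(0, 0, "new")    # node 0 writes; node 1's copy is invalidated
print(m.read(1, 0))     # node 1 misses and refetches "new", not stale "old"
```

Tracking which caches hold which blocks (here, the loop over all caches) is the part that gets expensive at scale, which is why a scalable NCC-NUMA system is easier to build.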

Page 86: Seminar Shared memory Programming

CC-NUMA versus COMA

CC-NUMA: Main memory consists of all the local memories.

COMA: Main memory consists of all the COMA caches.

All this complexity makes a COMA system more expensive to implement than a NUMA machine.

Page 87: Seminar Shared memory Programming

Characteristics of five Distributed-Memory architectures

Page 88: Seminar Shared memory Programming
