posix threads: a first step toward parallel...

Post on 25-Aug-2020

49 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

POSIX Threads: a first step toward parallel programming

George Bosilcabosilca@icl.utk.edu

Process vs. Thread• A process is a collection of

virtual memory space, code, data, and system resources.

• A thread (lightweight process) is code that is to be serially executed within a process.

• A process can have several threads.

Threads executing the same block of code maintain separate stacks. Each thread in a process shares that

process's global variables and resources.

Possible to create more efficient applications ?

ProcessState:register,SP,PC,…

CodeSegment

DataSegment

Heap

Stack

Process

Code

Data

Process

State

Stack

ThreadLocalStorage

ThreadState

Stack

ThreadLocalStorage

Thread

HardwareThreads

• Hardwarecontrolswitchingbetweenthreadstohidelatencies(memoryaccesses/operations)– Differentswitchingpolicies:cachemiss,aftereachoperation

– Hardwaremaintainindependentstateforeachthread(registers)

– Possibletoswitchingthreadsinonecycle(TERAmachine)

Terminology• LightweightProcess(LWP):orkernelthread• X-to-Ymodel:themappingbetweentheLWPandtheuserthreads

(1:X– Unix&co.,X:1– Userlevelthreads,X:Y – Windows7).• ContentionScope:howthreadscompeteforsystemresources• Thread-safeaprogramthatprotectstheshareddataforitsthreads

(mutualexclusion)• Reentrantcode:aprogramthatcanhavemorethanonethread

executingconcurrently.• Async-safemeansthatafunctionisreentrantwhilehandlinga

signal(i.e.canbecalledfromasignalhandler).• Concurrencyvs.Parallelism- Theyarenotthesame!Parallelism

impliessimultaneousrunningofcode(whichisnotpossible,inthestrictsense,onuniprocessormachines)whileconcurrencyimpliesthatmanytaskscanruninanyorderandpossiblyinparallel.

Threadvs.Process• Threadssharetheaddressspaceoftheprocessthatcreatedit;

processeshavetheirownaddressspace.• Threadshavedirectaccesstothedatasegmentofitsprocess;

processeshavetheirowncopyofthedatasegmentoftheparentprocess.

• Threadscandirectlycommunicatewithotherthreadsofitsprocess;processesmustuseinterprocess communicationtocommunicatewithsiblingprocesses.

• Threadshavealmostnooverhead;processeshaveconsiderableoverhead.

• Newthreadsareeasilycreated;newprocessesrequireduplicationoftheparentprocess.

• Threadscanexerciseconsiderablecontroloverthreadsofthesameprocess;processescanonlyexercisecontroloverchildprocesses.

• Changestothemainthread(cancellation,prioritychange,etc.)mayaffectthebehavioroftheotherthreadsoftheprocess;changestotheparentprocessdoesnotaffectchildprocesses.

2.2.2TheClassicalThreadModel in ModernOperatingSystems3e byTanenbaum

Process vs. Thread• Multithreaded applications must avoid two

threading problems: deadlocks and races.• A deadlock occurs when each thread is waiting

for the other to do something.• A race condition occurs when one thread

finishes before another on which it depends, causing the former to use a bogus value because the latter has not yet supplied a valid one.

The key is synchronization

• Ordering memory accesses– Flush / memory barrier– Volatile (poor-man solution in C)

• Synchronization = gaining access to a shared resource.• Synchronization REQUIRE cooperation.

x = 1;barrier();y = 5;x = 2;

Thread1

y = 1;barrier();while(x==1);printf( “%d\n”, y );

Thread2

POSIX Thread

• What’s POSIX ?– Widely used UNIX specification– Most of the UNIX flavor operating systems

POSIX is the Portable Operating System Interface, the open operating interface standard accepted world-wide. It is produced by IEEE and recognized by ISO and ANSI.

Pthread APIPrefix Use

pthread_ Threadmanagement(create/destroy/cancel/join/exit)

pthread_attr_ Threadattributes

pthread_mutex_ Mutexes

pthread_mutexattr_ Mutexes attributes

pthread_cond_ Conditionvariables

pthread_condattr_ Conditionattributes

pthread_key_ Thread-specificdatakey(TLS)

pthread_rwlock_ Read/writelocks

pthread_barrier_ Synchronizationbarriers

ThreadManagement(create)

Attributes:Detachedorjoinablestate,Schedulinginheritance,Schedulingpolicy,Schedulingparameters,Schedulingcontentionscope,Stacksize,Stackaddress,Stackguard(overflow)size

int pthread_create (pthread_t *restrictthread,[OUT]threadidconst pthread_attr_t *restrictattr,[IN]attributesvoid*(*start_routine)(void*),[IN]threadfunctionvoid*restrictarg)[IN]argumentforthreadfunction

int pthread_exit (void*value_ptr)[OUT]Returntocaller/joinerint pthread_cancel (pthread_t thread)[IN]threadtobecancelledint pthread_attr_init (pthread_attr_t *attr)[OUT]attributestobeinitializedint pthread_attr_set*(pthread_attr_t *restrictattr,*)[IN]setattributed(state,stack)int pthread_attr_destroy (pthread_attr_t *attr)[IN]attributedtobedestroyed

Questions:- Oncecreatedwhatwillbethestatusofthethreadandhowitwillbescheduled

bytheOS?(usesched_setscheduler)- Whereitwillberun?(usesched_setaffinity orHWLOC)

ThreadManagement(join)

Joinblocksthecallingthreaduntilthetargetthreadterminates,andreturnsit’sthread_exit argument.Itisimpossible tojoinathreadinadetachedstate.Itisalsoimpossibletoreattachit

intpthread_join (pthread_t thread,void**value_ptr)[OUT]threadreturnvalue

int pthread_detach (pthread_t thread)int pthread_attr_setdetachstate (pthread_attr_t *attr, [IN/OUT]attribute

int detachstate)[IN]statetobesetint pthread_attr_getdetachstate (const pthread_attr_t *attr,

int *detachstate)[OUT]detachstatevalue

ThreadManagement(state)

ThePOSIXstandarddoesnotdictatethesizeofathread'sstack!

int pthread_attr_getstacksize (const pthread_attr_t *restrictattr,size_t *restrictstacksize)

int pthread_attr_setstacksize (pthread_attr_t *attr,size_t stacksize)int pthread_attr_getstackaddr (const pthread_attr_t *restrictattr,

void**restrictstackaddr)int pthread_attr_setstackaddr (pthread_attr_t *attr,void*stackaddr)pthread_t pthread_self (void)int pthread_equal (pthread_t t1,pthread_t t2)

Threading- example• NumericalsolutiontoLaplace’sequation

( )nji

nji

ni

nji

nji UUUUU 1,1,1,11

, 41

+−+−+ +++=

i,j+1

i,j-1

i+1,ji-1,j

for j = 1 to jmaxfor i = 1 to imaxUnew(i,j) = 0.25 * ( U(i-1,j) + U(i+1,j)

+ U(i,j-1) + U(i,j+1))end for

end for

Threading- example• Theapproachtomakeitparallelisbypartitioningthedata

Threading- example• Theapproachtomakeitparallelisbypartitioningthedata

Overlapping the data boundaries allow computation without communication for each superstep

On the communication step each processor update the corresponding columns on the remote processors.

Threading- examplefor j = 1 to jmaxfor i = 1 to imaxunew(i,j) = 0.25 * ( U(i-1,j) + U(i+1,j)

+ U(i,j-1) + U(i,j+1))end for

end for

MemoryConsistencyModelsMemoryLevel Size Response

CPUregisters ≈100B 0.5ns(1 cycles)

L1Cache 64KB– 1M 1ns(fewcycles)

L2Cache ≈1-30 MB 10ns(tensofcycles)

MainMemory ≈? GB 150ns(hundreds ofcycles)

HardDisk ≈? TB 10ms(thousandsofcycles)

NetworkStorage ≈ ?PB 100msto 1s(muchmore)

• Definingaconsistentmemorymodelsisdifficultandnotnecessarilyrequiredforcorrectness– Weakerdefinitionsthatareeasiertoimplementandgoodenoughtoimplementpredictableanddeterministicapplications

Strict(atomic)consistency• Definition:anyreadtoamemorylocationXreturnsthevaluestoredbythemostrecentwriteoperationtoX– themostrecentcoversallcomputingunitsinthesystem

x =1[W(x)1]P1

?x… [R(x)?]P2

?x… [R(x)?]

P1

P2

W(x)1

R(x)1 R(x)1

P1

P2

W(x)1

R(x)0 R(x)1

P1

P2

W(x)1

R(x)0 R(x)1

SequentialConsistency• Definition:theresultofanyexecutionisthesameasifthereadsandwritesoccurredinsomeorder,andtheoperationsofeachindividualprocessorappearinthissequenceintheorderspecifiedbyitsprogram– Lamport ordering– expandingfromthesetsofreadsandwritesthat actually happenedtothesetsthat could havehappened,wecanreasonmoreeffectivelyabouttheprogram

x =1[W(x)1]P1

?x… [R(x)?]P2

?x… [R(x)?]

P1

P2

W(x)1

R(x)0 R(x)1

SequentialConsistencyx =1[W(x)1]

P1?x… [R(x)?]

P2

?x… [R(x)?]

x =2[W(x)2] ?x… [R(x)?]

?x… [R(x)?]

P3 P4

P1 W(x)1

P2 R(x)1 R(x)2

P3 W(x)2

P4 R(x)1 R(x)2

P1 W(x)1

P2 R(x)1 R(x)2

P3 W(x)2

P4 R(x)1 R(x)2

P1 W(x)1

P2 R(x)1 R(x)2

P3 W(x)2

P4 R(x)2 R(x)1

CachecoherencyisNOTsequentialconsistencybecauseseq.consistencyrequiresaglobally consistencyviewofmemoryoperationswhilecachecoherencyonlyrequiresthemlocally

CacheCoherenceP1 W(x)1

P2 R(x)0 R(x)2

P3 W(x)2

P4 R(y)0 R(y)1

W(y)2

W(y)1

R(y)0

R(x)1

R(y)1Can’thappenwithasnoopy-cacheschemebutitcanwithadirectory-basedcache

Typesofmemoryaccesses:• SharedAccess:wecanhavesharedaccesstovariables vs. privateaccess.But

thequestionswe'reconsideringareonlyrelevantforsharedaccesses.• Competing vs. Non-Competing:Ifwehavetwoaccessesfromdifferent

processors,andatleastoneisawrite,theyarecompetingaccesses.Theyareconsideredascompetingaccessesbecausetheresultdependsonwhichaccessoccursfirst(iftherearetwoaccesses,butthey'rebothreads,itdoesn'tmatterwhichisfirst).

• Synchronizing vs. Non-Synchronizing:Ordinarycompetingaccesses,suchasvariableaccesses,arenon-synchronizingaccesses.Accessesusedinsynchronizingtheprocessesare(ofcourse)synchronizingaccesses.

• Acquire vs. Release:Finally,wecandividesynchronizationaccessesintoaccessestoacquirelocks,andaccessestoreleaselocks.

WeakConsistency• Weakconsistencyresultsifweonlyconsidercompeting

accessesasbeingdividedintosynchronizingandnon-synchronizingaccesses,andrequirethefollowingproperties:– Accessestosynchronizationvariablesaresequentiallyconsistent.– Noaccesstoasynchronizationvariableisallowedtobeperformed

untilallpreviouswriteshavecompletedeverywhere.– Nodataaccess(readorwrite)isallowedtobeperformeduntilall

previousaccessestosynchronizationvariableshavebeenperformed.

P2 R(x)0 R(x)2 S R(x)2P3 R(x)1 R(x)2S

P1 W(x)1 W(x)2 S

Releaseconsistency• WeakConsistency(viasynchronization)requiresthatwhena

synchronizationoccurs,allprocessorsgloballyupdatememory–eachlocalchangemustbepropagatedtoallprocessorswithacopyofthesharedvariable,andeachprocessorneedtoobtainallchangesfromtheothers

• Releaseconsistencyconsiderfinergrainlocksofmemoryregions,andonlypropagatesthelockedmemory(asneeded)– Beforeanordinaryaccesstoasharedvariableisperformed,all

previousacquiresdonebytheprocessmusthavecompletedsuccessfully.

– Beforeareleaseisallowedtobeperformed,allpreviousreadsandwritesdonebytheprocessmusthavecompleted.

– Theacquireandreleaseaccessesmustbesequentiallyconsistent.

Mutual exclusion

• Simple lock primitive with 2 states: lock and unlock

• Only one thread can lock the mutex.• Several politics: FIFO, random,

recursive

lock

unlock…

…Thread 1

lock

unlock…

…Thread 2

lock

unlock…

…Thread 3

Active threads

Mutual exclusion

• Simple lock primitive with 2 states: lock and unlock

• Only one thread can lock the mutex.• Several politics: FIFO, random,

recursive

lock

unlock…

…Thread 1

lock

unlock…

…Thread 3

Active threads

lock

unlock…

…Thread 2

Mutual exclusion

• Simple lock primitive with 2 states: lock and unlock

• Only one thread can lock the mutex.• Several politics: FIFO, random,

recursive

lock

unlock…

…Thread 1

lock

unlock…

…Thread 3

Active threads

lock

unlock…

…Thread 2

Mutual exclusion

• Simple lock primitive with 2 states: lock and unlock

• Only one thread can lock the mutex.• Several politics: FIFO, random,

recursive

lock

unlock…

…Thread 1

lock

unlock…

…Thread 3

Active threads

lock

unlock…

…Thread 2

Mutual exclusion

• Simple lock primitive with 2 states: lock and unlock

• Only one thread can lock the mutex.• Several politics: FIFO, random,

recursive

lock

unlock…

…Thread 1

lock

unlock…

…Thread 3

Active threads

lock

unlock…

…Thread 2

Mutual exclusion

• Simple lock primitive with 2 states: lock and unlock

• Only one thread can lock the mutex.• Several politics: FIFO, random,

recursive

lock

unlock…

…Thread 1

lock

unlock…

…Thread 3

Active threads

lock

unlock…

…Thread 2

Mutual exclusion• Spin vs. sleep ?• What’s the desired lock grain ?

– Fine grain – spin mutex– Coarse grain – sleep mutex

• Spin mutex: use CPU cycles and increase the memory bandwidth, but when the mutex is unlock the thread continue his execution immediately.

Shared/Exclusive Locks• ReadWrite Mutual exclusion • Extension used by the reader/writer model• 4 states: write_lock, write_unlock, read_lock and

read_unlock.• multiple threads may hold a shared lock

simultaneously, but only one thread may hold an exclusive lock.

• if one thread holds an exclusive lock, no threads may hold a shared lock.

Shared/Exclusive Locks

rw_lock

rw_unlock…

…Writer 1

rw_lock

rw_unlock…

…Writer 2

rd_lock

rd_unlock…

…Reader 1

rd_lock

rd_unlock…

…Reader 2

rw_lock

rw_unlock…

…Writer 1

rw_lock

rw_unlock…

…Writer 2

rd_lock

rd_unlock…

…Reader 1

rd_lock

rd_unlock…

…Reader 2

Active threadSleeping thread

Legend

Step 1

Step 2

Shared/Exclusive Locks

Active threadSleeping thread

Legend

rw_lock

rw_unlock…

…Writer 1

rw_lock

rw_unlock…

…Writer 2

rd_lock

rd_unlock…

…Reader 1

rd_lock

rd_unlock…

…Reader 2

rw_lock

rw_unlock…

…Writer 1

rw_lock

rw_unlock…

…Writer 2

rd_lock

rd_unlock…

…Reader 1

rd_lock

rd_unlock…

…Reader 2

Step 3

Step 4

Shared/Exclusive Locks

Active threadSleeping thread

Legend

rw_lock

rw_unlock…

…Writer 1

rw_lock

rw_unlock…

…Writer 2

rd_lock

rd_unlock…

…Reader 1

rd_lock

rd_unlock…

…Reader 2

rw_lock

rw_unlock…

…Writer 1

rw_lock

rw_unlock…

…Writer 2

rd_lock

rd_unlock…

…Reader 1

rd_lock

rd_unlock…

…Reader 2

Step 5

Step 6

… …

Shared/Exclusive Locks

Active threadSleeping thread

Legend

rw_lock

rw_unlock…

…Writer 1

rw_lock

rw_unlock…

…Writer 2

rd_lock

rd_unlock…

…Reader 1

rd_lock

rd_unlock…

…Reader 2

Step 7… …

Condition Variable

• Block a thread while waiting for a condition• Condition_wait / condition_signal• Several thread can wait for the same

condition, they all get the signal

signal…

…Thread 1

Active threads Sleeping threads

conditionwait…

…Thread 2

wait…

…Thread 3

Condition Variable

• Block a thread while waiting for a condition• Condition_wait / condition_signal• Several thread can wait for the same

condition, they all get the signal

signal…

…Thread 1

Active threads

wait…

…Thread 3

…Thread 2

wait

Condition Variable

• Block a thread while waiting for a condition• Condition_wait / condition_signal• Several thread can wait for the same

condition, they all get the signal

signal…

…Thread 1

Active threads

wait…

…Thread 3

…Thread 2

wait

Condition Variable

• Block a thread while waiting for a condition• Condition_wait / condition_signal• Several thread can wait for the same

condition, they all get the signal

signal…

…Thread 1

Active threads

wait…

…Thread 3

…Thread 2

wait

Condition Variable

• Block a thread while waiting for a condition• Condition_wait / condition_signal• Several thread can wait for the same

condition, they all get the signal

signal…

…Thread 1

Active threads

wait…

…Thread 2

wait…

…Thread 3

Semaphores• simple counting mutexes• The semaphore can be hold by as many

threads as the initial value of the semaphore.

• When a thread get the semaphore it decrease the internal value by 1.

• When a thread release the semaphore it increase the internal value by 1.

Semaphores

get

release…

…Thread 1

Semaphore (2)get

release…

…Thread 2

get

release…

…Thread 3

get

release…

…Thread 1

Semaphore (1)get

release…

…Thread 2

get

release…

…Thread 3

Semaphores

get

release…

…Thread 1

Semaphore (0)get

release…

…Thread 2

get

release…

…Thread 3

get

release…

…Thread 1

Semaphore (0)get

release…

…Thread 2

get

release…

…Thread 3

Semaphores

get

release…

…Thread 1

Semaphore (1)get

release…

…Thread 2

get

release…

…Thread 1

Semaphore (1)get

release…

…Thread 2

get

release…

…Thread 3

get

release…

…Thread 3

Semaphores

get

release…

…Thread 1

Semaphore (1)get

release…

…Thread 2

get

release…

…Thread 1

Semaphore (2)get

release…

…Thread 2

get

release…

…Thread 3

get

release…

…Thread 3

Atomic instruction• Is any operation that a CPU can perform such

that all results will be made visible to each CPU at the same time and whose operation is safe from interference by other CPUs– TestAndSet– CompareAndSwap– DoubleCompareAndSwap– Atomic increment– Atomic decrement

Example:AProducer– ConsumerQueue

• Wehaveaboundedqueuewhereproducersstoretheiroutputandfromwhereconsumerstaketheirinput

• Protectthestructureagainstintensiveunnecessaryaccesses– Detectboundaryconditions:queueemptyandqueuefull

Boun

dedQue

ue

…producers

…consumers

Example:dot-product

• Dividethearraysbetweenparticipantstoload-balancethework– Eachwillthencomputeapartialsum

• Addallthepartialsumstogetherforthefinalresult(reduceoperation)

• Technicaldetails:costofmanagingthethreadsvs.costofthealgorithm?Howtominimizethemanagementcost?

Scalarproduct,innerproduct

=PN

n=1 ai.bi

a

b

a

b

Thr 1 Thr 2 Thr 3

Partialsum

Partialsum

Partialsum

+

+

top related