TRANSCRIPT
Lecture 04-06: Programming with OpenMP
Concurrent and Multicore Programming, CSE536
Department of Computer Science and Engineering, Yonghong Yan
[email protected]/~yan
1
Topics (Part 1)
• Introduction
• Programming on shared memory system (Chapter 7)
  – OpenMP
• Principles of parallel algorithm design (Chapter 3)
• Programming on shared memory system (Chapter 7)
  – Cilk/Cilkplus (?)
  – PThread, mutual exclusion, locks, synchronizations
• Analysis of parallel program executions (Chapter 5)
  – Performance Metrics for Parallel Systems
    • Execution Time, Overhead, Speedup, Efficiency, Cost
  – Scalability of Parallel Systems
  – Use of performance tools
2
Outline
• OpenMP Introduction
• Parallel Programming with OpenMP
  – OpenMP parallel region, and worksharing
  – OpenMP data environment, tasking and synchronization
• OpenMP Performance and Best Practices
• More Case Studies and Examples
• Reference Materials
3
What is OpenMP
• Standard API to write shared memory parallel applications in C, C++, and Fortran
  – Compiler directives, Runtime routines, Environment variables
• OpenMP Architecture Review Board (ARB)
  – Maintains OpenMP specification
  – Permanent members
    • AMD, Cray, Fujitsu, HP, IBM, Intel, NEC, PGI, Oracle, Microsoft, Texas Instruments, NVIDIA, Convey
  – Auxiliary members
    • ANL, ASC/LLNL, cOMPunity, EPCC, LANL, NASA, TACC, RWTH Aachen University, UH
  – http://www.openmp.org
• Latest Version 4.5 released Nov 2015
4
My role with OpenMP
5
“Hello World” - An Example/1

#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    printf("Hello World\n");
    return(0);
}

6

“Hello World” - An Example/2

#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    #pragma omp parallel
    {
        printf("Hello World\n");
    } // End of parallel region
    return(0);
}

7

“Hello World” - An Example/3

$ gcc -fopenmp hello.c
$ export OMP_NUM_THREADS=2
$ ./a.out
Hello World
Hello World
$ export OMP_NUM_THREADS=4
$ ./a.out
Hello World
Hello World
Hello World
Hello World
$
8
OpenMP Components

9

Directives:
• Parallel region
• Worksharing constructs
• Tasking
• Offloading
• Affinity
• Error Handling
• SIMD
• Synchronization
• Data-sharing attributes

Runtime Environment routines:
• Number of threads
• Thread ID
• Dynamic thread adjustment
• Nested parallelism
• Schedule
• Active levels
• Thread limit
• Nesting level
• Ancestor thread
• Team size
• Locking
• Wallclock timer

Environment Variables:
• Number of threads
• Scheduling type
• Dynamic thread adjustment
• Nested parallelism
• Stack size
• Idle threads
• Active levels
• Thread limit
“Hello World” - An Example/3

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    #pragma omp parallel
    {
        int thread_id = omp_get_thread_num();
        int num_threads = omp_get_num_threads();
        printf("Hello World from thread %d of %d\n", thread_id, num_threads);
    }
    return(0);
}

10

Directives: #pragma omp parallel
Runtime Environment routines: omp_get_thread_num(), omp_get_num_threads()

“Hello World” - An Example/4

11

Environment Variable: similar to program arguments, it is used to change the configuration of the execution without recompiling the program (e.g. OMP_NUM_THREADS).
NOTE: the order of the printed lines may differ from run to run.
The Design Principle Behind

• Each printf is a task
• A parallel region claims a set of cores for computation
  – Cores are presented as multiple threads
• Each thread executes a single task
  – Task id is the same as thread id
    • omp_get_thread_num()
  – Num_tasks is the same as the total number of threads
    • omp_get_num_threads()
• 1:1 mapping between task and thread
  – Every task/core does similar work in this simple example

12

#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();
    printf("Hello World from thread %d of %d\n", thread_id, num_threads);
}
OpenMP Parallel Computing Solution Stack

13

[Layered diagram:
  User layer: End User, Application
  Prog. layer (OpenMP API): Directives/Compiler, OpenMP library, Environment variables
  System layer: Runtime library, OS/system]
OpenMP Syntax

14

• Most OpenMP constructs are compiler directives using pragmas.
  – For C and C++, the pragmas take the form: #pragma …
• pragma vs language
  – A pragma is not part of the language and should not express program logic
  – It provides the compiler/preprocessor with additional information on how to process the directive-annotated code
  – Similar to #include, #define

OpenMP Syntax

15

• For C and C++, the pragmas take the form:
  #pragma omp construct [clause [clause]…]
• For Fortran, the directives take one of the forms:
  – Fixed form
    *$OMP construct [clause [clause]…]
    C$OMP construct [clause [clause]…]
  – Free form (but works for fixed form too)
    !$OMP construct [clause [clause]…]
• Include file and the OpenMP lib module
  #include <omp.h>
  use omp_lib
OpenMP Compiler
• OpenMP: thread programming at “high level”.
  – The user does not need to specify the details
    • Program decomposition, assignment of work to threads
    • Mapping tasks to hardware threads
• User makes strategic decisions
• Compiler figures out details
  – Compiler flags enable OpenMP (e.g. -openmp, -xopenmp, -fopenmp, -mp)
16
OpenMP Memory Model
• OpenMP assumes a shared memory
• Threads communicate by sharing variables.
• Synchronization protects data conflicts.
  – Synchronization is expensive.
• Change how data is accessed to minimize the need for synchronization.
17
OpenMP Fork-Join Execution Model
• Master thread spawns multiple worker threads as needed, together form a team
• Parallel region is a block of code executed by all threads in a team simultaneously
18
[Fork-join diagram: the master thread forks worker threads at each parallel region (including a nested parallel region) and joins them at the end of the region.]
OpenMP Parallel Regions
• In C/C++: a block is a single statement or a group of statements between { }
• In Fortran: a block is a single statement or a group of statements between directive/end-directive pairs.

19

C$OMP PARALLEL
 10   wrk(id) = garbage(id)
      res(id) = wrk(id)**2
      if(.not.conv(res(id))) goto 10
C$OMP END PARALLEL

C$OMP PARALLEL DO
      do i=1,N
        res(i) = bigComp(i)
      end do
C$OMP END PARALLEL DO

#pragma omp parallel
{
    id = omp_get_thread_num();
    res[id] = lots_of_work(id);
}

#pragma omp parallel for
for(i=0; i<N; i++) {
    res[i] = big_calc(i);
    A[i] = B[i] + res[i];
}
20
Scope of OpenMP Region

A parallel region can span multiple source files (bar.f and foo.f):

C$OMP PARALLEL
      call whoami
C$OMP END PARALLEL

The lexical extent of the parallel region is the code between the directive pair; the dynamic extent of the parallel region includes the lexical extent plus the code reached from it, such as:

      subroutine whoami
      external omp_get_thread_num
      integer iam, omp_get_thread_num
      iam = omp_get_thread_num()
C$OMP CRITICAL
      print*, 'Hello from ', iam
C$OMP END CRITICAL
      return
      end

Orphaned directives (like the CRITICAL above) can appear outside a parallel construct.
SPMD Program Models

21

• SPMD (Single Program, Multiple Data) for parallel regions
  – All threads of the parallel region execute the same code
  – Each thread has a unique ID
• Use the thread ID to diverge the execution of the threads
  – Different threads can follow different paths through the same code
• SPMD is by far the most commonly used pattern for structuring parallel programs
  – MPI, OpenMP, CUDA, etc.
if(my_id == x) { } else { }
Modify the Hello World Program so…
• Only one thread prints the total number of threads
  – gcc -fopenmp hello.c -o hello
• Only one thread reads the total number of threads and all threads print that info

22

#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();
    if (thread_id == 0)
        printf("Hello World from thread %d of %d\n", thread_id, num_threads);
    else
        printf("Hello World from thread %d\n", thread_id);
}

int num_threads = 99999;
#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    if (thread_id == 0)
        num_threads = omp_get_num_threads();
    #pragma omp barrier
    printf("Hello World from thread %d of %d\n", thread_id, num_threads);
}
Barrier
23
#pragma omp barrier
OpenMP Master
• Denotes a structured block executed by the master thread
• The other threads just skip it
  – no synchronization is implied

24

#pragma omp parallel private (tmp)
{
    do_many_things_together();
    #pragma omp master
    {
        exchange_boundaries_by_master_only();
    }
    #pragma omp barrier
    do_many_other_things_together();
}
OpenMP Single
• Denotes a block of code that is executed by only one thread.
  – Could be the master thread
• A barrier is implied at the end of the single block.

25

#pragma omp parallel private (tmp)
{
    do_many_things_together();
    #pragma omp single
    {
        exchange_boundaries_by_one();
    }
    do_many_other_things_together();
}
Using omp master/single to modify the Hello World Program so…
• Only one thread prints the total number of threads
• Only one thread reads the total number of threads and all threads print that info

26

#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();
    printf("Hello World from thread %d of %d\n", thread_id, num_threads);
}
Distributing Work Based on Thread ID

27

Sequential code:
for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region:
#pragma omp parallel shared (a, b)
{
    int id, i, Nthrds, istart, iend;
    id = omp_get_thread_num();
    Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend = (id+1) * N / Nthrds;
    for(i=istart; i<iend; i++) { a[i] = a[i] + b[i]; }
}

cat /proc/cpuinfo

Implementing axpy using OpenMP parallel
28
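The slide gives only the title; a minimal sketch of what such an axpy kernel might look like, following the manual thread-ID-based distribution from the previous slide (the function name, array names x, y and the scalar a are assumptions):

#include <omp.h>

/* y = a*x + y, parallelized with a bare parallel region and
   manual division of the iteration space among the threads */
void axpy(int n, double a, double *x, double *y)
{
    #pragma omp parallel shared(n, a, x, y)
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        int istart = id * n / nthrds;        /* this thread's first element */
        int iend = (id + 1) * n / nthrds;    /* one past its last element */
        for (int i = istart; i < iend; i++)
            y[i] = a * x[i] + y[i];
    }
}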
• Divides the execution of the enclosed code region among the members of the team
• The “for” worksharing construct splits up loop iterations among threads in a team
  – Each thread gets one or more “chunks” -> loop chunking

29

#pragma omp parallel
#pragma omp for
for(i=0; i<N; i++) {
    work(i);
}

By default, there is a barrier at the end of the “omp for”. Use the “nowait” clause to turn off the barrier:
#pragma omp for nowait

“nowait” is useful between two consecutive, independent omp for loops.

OpenMP Worksharing Constructs
Worksharing Constructs

30

Sequential code:
for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region:
#pragma omp parallel shared (a, b)
{
    int id, i, Nthrds, istart, iend;
    id = omp_get_thread_num();
    Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend = (id+1) * N / Nthrds;
    for(i=istart; i<iend; i++) { a[i] = a[i] + b[i]; }
}

OpenMP parallel region and a worksharing for construct:
#pragma omp parallel shared (a, b) private (i)
#pragma omp for schedule(static)
for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }
• schedule(static|dynamic|guided[,chunk])
• schedule(auto|runtime)

31

static: Distribute iterations in blocks of size "chunk" over the threads in a round-robin fashion
dynamic: Fixed portions of work; size is controlled by the value of chunk; when a thread finishes, it starts on the next portion of work
guided: Same dynamic behavior as "dynamic", but the size of the portion of work decreases exponentially
auto: The compiler (or runtime system) decides what is best to use; the choice could be implementation dependent
runtime: Iteration scheduling scheme is set at runtime through the environment variable OMP_SCHEDULE

OpenMP schedule Clause
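As an illustration (not from the slides), a loop whose iterations vary widely in cost might use a dynamic schedule with a small chunk size; the loop body, array name and chunk size below are assumptions:

/* iterations have very different costs, so hand them out dynamically in chunks of 4 */
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < N; i++)
    result[i] = expensive_and_irregular_work(i);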
OpenMP Sections
• Worksharing construct
• Gives a different structured block to each thread

32

#pragma omp parallel
#pragma omp sections
{
    #pragma omp section
        x_calculation();
    #pragma omp section
        y_calculation();
    #pragma omp section
        z_calculation();
}
By default, there is a barrier at the end of the “omp sections”. Use the “nowait” clause to turn off the barrier.
Loop Collapse
• Allows parallelization of perfectly nested loops without using nested parallelism
• The collapse clause on a for/do loop indicates how many loops should be collapsed

33

!$omp parallel do collapse(2) ...
      do i = il, iu, is
        do j = jl, ju, js
          do k = kl, ku, ks
            .....
          end do
        end do
      end do
!$omp end parallel do
Exercise: OpenMP Matrix Multiplication
• Parallel version
• Parallel for version
  – Experiment with different schedule policies and chunk sizes
    • #pragma omp parallel for
  – Experiment with collapse(2)

34

#pragma omp parallel shared (a, b) private (i)
#pragma omp for schedule(static)
for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }

#pragma omp parallel for schedule(static) private (i) num_threads(num_ths)
for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }

gcc -fopenmp mm.c -o mm
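A minimal sketch of one way the exercise could be approached, combining parallel for with collapse(2) over the two outer loops (the function name, row-major layout and square size N are assumptions, not the official solution):

/* C = A * B for square N x N matrices stored row-major */
void matmul(int N, const double *A, const double *B, double *C)
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i*N + k] * B[k*N + j];
            C[i*N + j] = sum;
        }
    }
}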
Barrier
• Barrier: Each thread waits until all threads arrive.
35
#pragma omp parallel shared (A, B, C) private(id)
{
    id = omp_get_thread_num();
    A[id] = big_calc1(id);
    #pragma omp barrier
    #pragma omp for
    for(i=0; i<N; i++) { C[i] = big_calc3(i, A); }
    // implicit barrier at the end of a for work-sharing construct
    #pragma omp for nowait
    for(i=0; i<N; i++) { B[i] = big_calc2(C, i); }
    // no implicit barrier due to nowait
    A[id] = big_calc3(id);
}
// implicit barrier at the end of a parallel region
• Most variables are shared by default
• Global variables are SHARED among threads
  – Fortran: COMMON blocks, SAVE variables, MODULE variables
  – C: File scope variables, static
• But not everything is shared...
  – Stack variables in sub-programs called from parallel regions are PRIVATE
  – Automatic variables defined inside the parallel region are PRIVATE.

36

Data Environment

OpenMP Data Environment

double a[size][size], b = 4;
#pragma omp parallel private (b)
{
    ....
}

[Diagram: a[size][size] is shared data visible to threads T0–T3; each thread gets its own uninitialized private copy b', and b becomes undefined inside the region.]
37
38
      program sort
      common /input/ A(10)
      integer index(10)
C$OMP PARALLEL
      call work (index)
C$OMP END PARALLEL
      print*, index(1)

      subroutine work (index)
      common /input/ A(10)
      integer index(*)
      real temp(10)
      integer count
      save count
      …………

A, index and count are shared by all threads.
temp is local to each thread.

OpenMP Data Environment
Data Environment: Changing storage attributes
• Selectively change the storage attributes of constructs using the following clauses
  – SHARED
  – PRIVATE
  – FIRSTPRIVATE
  – THREADPRIVATE
• The value of a private variable inside a parallel loop and the global value outside the loop can be exchanged with
  – FIRSTPRIVATE, and LASTPRIVATE
• The default status can be modified with:
  – DEFAULT(PRIVATE|SHARED|NONE)
39
OpenMP Private Clause
• private(var) creates a local copy of var for each thread.
  – The value is uninitialized
  – The private copy is not storage-associated with the original
  – The original is undefined at the end

40

      IS = 0
C$OMP PARALLEL DO PRIVATE(IS)
      DO J=1,1000
        IS = IS + J
      END DO
C$OMP END PARALLEL DO
      print *, IS
OpenMP Private Clause
• private(var) creates a local copy of var for each thread.
  – The value is uninitialized
  – The private copy is not storage-associated with the original
  – The original is undefined at the end

41

      IS = 0
C$OMP PARALLEL DO PRIVATE(IS)
      DO J=1,1000
        IS = IS + J        ! ✗ IS was not initialized
      END DO
C$OMP END PARALLEL DO
      print *, IS          ! ✗ IS is undefined here
Firstprivate Clause
• firstprivate is a special case of private.
  – Initializes each private copy with the corresponding value from the master thread.

42

      IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
      DO 20 J=1,1000
        IS = IS + J
 20   CONTINUE
C$OMP END PARALLEL DO
      print *, IS
Firstprivate Clause
• firstprivate is a special case of private.
  – Initializes each private copy with the corresponding value from the master thread.

43

      IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
      DO 20 J=1,1000
        IS = IS + J        ! ✔ Each thread gets its own IS with an initial value of 0
 20   CONTINUE
C$OMP END PARALLEL DO
      print *, IS          ! ✗ Regardless of initialization, IS is undefined at this point
Lastprivate Clause
• Lastprivate passes the value of a private variable from the last iteration to the variable of the master thread

44

      IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
C$OMP& LASTPRIVATE(IS)
      DO 20 J=1,1000
        IS = IS + J        ! Each thread gets its own IS with an initial value of 0
 20   CONTINUE
C$OMP END PARALLEL DO
      print *, IS          ! IS is defined as its value at the last iteration (i.e. for J=1000)

✔ Is this code meaningful?
OpenMP Reduction

45

• Here is the correct way to parallelize this code.

      IS = 0
C$OMP PARALLEL DO REDUCTION(+:IS)
      DO 20 J=1,1000
        IS = IS + J
 20   CONTINUE
      print *, IS

Reduction does NOT imply firstprivate; where does the initial 0 come from?
Reduction operands / initial values
• Associative operands used with reduction
• Initial values are the ones that make sense mathematically

46

Operand   Initial value
+         0
*         1
-         0
.AND.     All 1's
.OR.      0
MAX       1
MIN       0
//        All 0's
Exercise: OpenMP Sum.c
• Two versions
  – Parallel for with reduction
  – Parallel version, not using “omp for” or “reduction” clause
47
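A minimal sketch of what the two versions might look like (the array argument and the use of a critical section to combine partial sums in the hand-written version are assumptions, not the official solution):

#include <omp.h>

/* Version 1: parallel for with reduction */
double sum_reduction(int n, const double *a)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Version 2: plain parallel region, no "omp for" and no reduction clause;
   each thread sums its own slice, then a critical section combines the results */
double sum_manual(int n, const double *a)
{
    double sum = 0.0;
    #pragma omp parallel shared(sum)
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        double local = 0.0;
        for (int i = id * n / nthrds; i < (id + 1) * n / nthrds; i++)
            local += a[i];
        #pragma omp critical
        sum += local;
    }
    return sum;
}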
OpenMP Threadprivate
• Makes global data private to a thread, thus crossing the parallel region boundary
  – Fortran: COMMON blocks
  – C: File scope and static variables
• Different from making them PRIVATE
  – With PRIVATE, global variables are masked.
  – THREADPRIVATE preserves global scope within each thread
• Threadprivate variables can be initialized using COPYIN or by using DATA statements.
48
Threadprivate / copyin

49

      parameter (N=1000)
      common/buf/A(N)
C$OMP THREADPRIVATE(/buf/)

C     Initialize the A array
      call init_data(N,A)

C$OMP PARALLEL COPYIN(A)
      … Now each thread sees threadprivate array A initialized
      … to the global value set in the subroutine init_data()
C$OMP END PARALLEL
      ....
C$OMP PARALLEL
      ... Values of threadprivate are persistent across parallel regions
C$OMP END PARALLEL
• You initialize threadprivate data using a copyin clause.
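For C, a corresponding sketch might look like the following (the counter variable and what is done with it are assumptions for illustration):

#include <omp.h>
#include <stdio.h>

/* a file-scope variable made threadprivate: each thread keeps its own copy,
   and that copy persists across parallel regions */
static int counter = 0;
#pragma omp threadprivate(counter)

int main(void)
{
    counter = 100;                           /* set in the master thread */

    #pragma omp parallel copyin(counter)     /* copyin: copy the master's value into every thread */
    {
        counter += omp_get_thread_num();
    }

    #pragma omp parallel
    {
        /* each thread still sees the value it wrote in the previous region */
        printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
    }
    return 0;
}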
OpenMP Synchronization
• High level synchronization:
  – critical section
  – atomic
  – barrier
  – ordered
• Low level synchronization
  – flush
  – locks (both simple and nested)
50
Critical section
• Only one thread at a time can enter a critical section.

51

C$OMP PARALLEL DO PRIVATE(B)
C$OMP& SHARED(RES)
      DO 100 I=1,NITERS
        B = DOIT(I)
C$OMP CRITICAL
        CALL CONSUME (B, RES)
C$OMP END CRITICAL
 100  CONTINUE
C$OMP END PARALLEL DO
Atomic
• Atomic is a special case of a critical section that can be used for certain simple statements
• It applies only to the update of a memory location

52

C$OMP PARALLEL PRIVATE(B)
      B = DOIT(I)
      tmp = big_ugly()
C$OMP ATOMIC
      X = X + tmp
C$OMP END PARALLEL
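In C, the same idea might look like the following fragment (the shared accumulator x and the per-thread work function are assumptions):

/* atomic protects just this single update of x, which is cheaper than
   a full critical section when the statement is this simple */
#pragma omp parallel shared(x)
{
    double tmp = big_ugly_computation();   /* assumed per-thread work */
    #pragma omp atomic
    x += tmp;
}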
OpenMP Tasks

Define a task:
  – C/C++: #pragma omp task
  – Fortran: !$omp task

53

• A task is generated when a thread encounters a task construct
  – Contains a task region and its data environment
  – Tasks can be nested
• A task region is a region consisting of all code encountered during the execution of a task.
• The data environment consists of all the variables associated with the execution of a given task.
  – It is constructed when the task is generated
Task completion and synchronization
• Task completion occurs when the task reaches the end of the task region code
• Multiple tasks are joined to completion through the use of task synchronization constructs
  – taskwait
  – barrier construct
• taskwait constructs:
  – #pragma omp taskwait
  – !$omp taskwait

54

int fib(int n) {
    int x, y;
    if (n < 2) return n;
    else {
        #pragma omp task shared(x)
        x = fib(n-1);
        #pragma omp task shared(y)
        y = fib(n-2);
        #pragma omp taskwait
        return x + y;
    }
}
Example: A Linked List

55

(Example from “An Overview of OpenMP”, RvdP/V1 Tutorial, IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010)
........
while(my_pointer) {
(void) do_independent_work (my_pointer); my_pointer = my_pointer->next ;
} // End of while loop
........
Example: A Linked List with Tasking

56

my_pointer = listhead;
#pragma omp parallel
{
    #pragma omp single nowait
    {
        while(my_pointer) {
            #pragma omp task firstprivate(my_pointer)
            {
                (void) do_independent_work (my_pointer);   // the OpenMP task is specified here (executed in parallel)
            }
            my_pointer = my_pointer->next;
        }
    } // End of single - no implied barrier (nowait)
} // End of parallel region - implied barrier
Ordered
• The ordered construct enforces the sequential order for a block.
57
#pragma omp parallel private (tmp)
#pragma omp for ordered
for (i=0; i<N; i++) {
    tmp = NEAT_STUFF_IN_PARALLEL(i);
    #pragma omp ordered
    res += consum(tmp);
}
OpenMP Synchronization
• The flush construct denotes a sequence point where a thread tries to create a consistent view of memory.
  – All memory operations (both reads and writes) defined prior to the sequence point must complete.
  – All memory operations (both reads and writes) defined after the sequence point must follow the flush.
  – Variables in registers or write buffers must be updated in memory.
• Arguments to flush specify which variables are flushed. No arguments specifies that all thread-visible variables are flushed.
58
A flush example

59

• Pair-wise synchronization.

      integer ISYNC(NUM_THREADS)
C$OMP PARALLEL DEFAULT (PRIVATE) SHARED (ISYNC)
      IAM = OMP_GET_THREAD_NUM()
      ISYNC(IAM) = 0
C$OMP BARRIER
      CALL WORK()
      ISYNC(IAM) = 1        ! I'm all done; signal this to other threads
C$OMP FLUSH(ISYNC)          ! Make sure other threads can see my write.
      DO WHILE (ISYNC(NEIGH) .EQ. 0)
C$OMP FLUSH(ISYNC)          ! Make sure the read picks up a good copy from memory.
      END DO
C$OMP END PARALLEL
Note: flush is analogous to a fence in other shared memory APIs.
OpenMP Lock routines
• Simple Lock routines: a simple lock is available if it is unset.
  – omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock(), omp_destroy_lock()
• Nested Locks: a nested lock is available if it is unset, or if it is set but owned by the thread executing the nested lock function
  – omp_init_nest_lock(), omp_set_nest_lock(), omp_unset_nest_lock(), omp_test_nest_lock(), omp_destroy_nest_lock()
60
OpenMP Locks
• Protect resources with locks.

61

omp_lock_t lck;
omp_init_lock(&lck);
#pragma omp parallel private (tmp, id)
{
    id = omp_get_thread_num();
    tmp = do_lots_of_work(id);
    omp_set_lock(&lck);      // Wait here for your turn.
    printf("%d %d", id, tmp);
    omp_unset_lock(&lck);    // Release the lock so the next thread gets a turn.
}
omp_destroy_lock(&lck);      // Free up storage when done.
OpenMP Library Routines
• Modify/check the number of threads
  – omp_set_num_threads(), omp_get_num_threads(), omp_get_thread_num(), omp_get_max_threads()
• Are we in a parallel region?
  – omp_in_parallel()
• How many processors in the system?
  – omp_get_num_procs()
62
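A small sketch (not from the slides) exercising these query routines:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_num_threads(4);                          /* request 4 threads for later regions */
    printf("procs: %d, in parallel? %d\n",
           omp_get_num_procs(), omp_in_parallel());  /* prints 0: not in a parallel region */

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)
            printf("team size: %d, in parallel? %d\n",
                   omp_get_num_threads(), omp_in_parallel());  /* prints 1: inside a region */
    }
    return 0;
}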
OpenMP Environment Variables
• Set the default number of threads to use.
  – OMP_NUM_THREADS int_literal
• Control how “omp for schedule(RUNTIME)” loop iterations are scheduled.
  – OMP_SCHEDULE “schedule[, chunk_size]”
63
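For example, the two variables might be set from the shell before running a program (the executable name ./a.out is just a placeholder):

$ export OMP_NUM_THREADS=8
$ export OMP_SCHEDULE="dynamic,4"
$ ./a.out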
Outline
• OpenMP Introduction
• Parallel Programming with OpenMP
  – Worksharing, tasks, data environment, synchronization
• OpenMP Performance and Best Practices
• Case Studies and Examples
• Reference Materials
64
OpenMP Performance
• The relative ease of using OpenMP is a mixed blessing
• We can quickly write a correct OpenMP program, but without the desired level of performance.
• There are certain “best practices” to avoid common performance problems.
• Extra work is needed to program with large thread counts
65
Typical OpenMP Performance Issues

66

• Overheads of OpenMP constructs, thread management. E.g.
  – dynamic loop schedules have much higher overheads than static schedules
  – Synchronization is expensive, use NOWAIT if possible
  – Large parallel regions help reduce overheads, enable better cache usage and standard optimizations
• Overheads of runtime library routines
  – Some are called frequently
• Load balance
• Cache utilization and false sharing
Overheads of OpenMP Directives

[Chart: OpenMP overheads measured with the EPCC microbenchmarks on an SGI Altix 3600. Overhead (in cycles, up to ~1,400,000) grows with the number of threads (1 to 256) for PARALLEL, FOR, PARALLEL FOR, BARRIER, SINGLE, CRITICAL, LOCK/UNLOCK, ORDERED, ATOMIC and REDUCTION.]
67
OpenMP Best Practices
• Reduce usage of barriers with the nowait clause

#pragma omp parallel
{
    #pragma omp for
    for(i=0; i<n; i++)
        ….
    #pragma omp for nowait
    for(i=0; i<n; i++)
        ….
}
68
OpenMP Best Practices

#pragma omp parallel private(i)
{
    #pragma omp for nowait
    for(i=0; i<n; i++)
        a[i] += b[i];
    #pragma omp for nowait
    for(i=0; i<n; i++)
        c[i] += d[i];
    #pragma omp barrier
    #pragma omp for nowait reduction(+:sum)
    for(i=0; i<n; i++)
        sum += a[i] + c[i];
}
69
OpenMP Best Practices
• Avoid large ordered constructs
• Avoid large critical regions

#pragma omp parallel shared(a,b) private(c,d)
{
    ….
    #pragma omp critical
    {
        a += 2*c;
        c = d*d;      // Move this statement out of the critical region
    }
}

70
OpenMP Best Practices
• Maximize Parallel Regions

Multiple parallel regions:
#pragma omp parallel
{
    #pragma omp for
    for (…) { /* Work-sharing loop 1 */ }
}
opt = opt + N;   // sequential
#pragma omp parallel
{
    #pragma omp for
    for (…) { /* Work-sharing loop 2 */ }
    #pragma omp for
    for (…) { /* Work-sharing loop N */ }
}

One large parallel region:
#pragma omp parallel
{
    #pragma omp for
    for (…) { /* Work-sharing loop 1 */ }
    #pragma omp single nowait
    opt = opt + N;   // sequential
    #pragma omp for
    for (…) { /* Work-sharing loop 2 */ }
    #pragma omp for
    for (…) { /* Work-sharing loop N */ }
}
71
OpenMP Best Practices
• Single parallel region enclosing all work-sharing loops.

Parallel region per innermost loop:
for (i=0; i<n; i++)
    for (j=0; j<n; j++)
        #pragma omp parallel for private(k)
        for (k=0; k<n; k++) {
            ……
        }

Single enclosing parallel region:
#pragma omp parallel private(i,j,k)
{
    for (i=0; i<n; i++)
        for (j=0; j<n; j++)
            #pragma omp for
            for (k=0; k<n; k++) {
                …….
            }
}

72
OpenMP Best Practices

73

• Address load imbalances
• Use parallel for dynamic schedules and different chunk sizes

Smith-Waterman Sequence Alignment Algorithm
OpenMP Best Practices
• Smith-Waterman Algorithm
  – Default schedule is static (even chunks) → load imbalance

74

#pragma omp for
for (…)
    for (…)
        for (…)
            for (…) {
                /* compute alignments */
            }
#pragma omp critical
{
    . /* compute scores */
}
OpenMP Best Practices
Smith-Waterman Sequence Alignment Algorithm

[Plots: speedup vs. number of threads (2–128) for problem sizes 100, 600 and 1000 plus the ideal curve, comparing the default “#pragma omp for” with “#pragma omp for schedule(dynamic, 1)”. With the dynamic schedule the code reaches 128 threads with 80% efficiency.]

75
OpenMP Best Practices

[Charts: overheads of the OpenMP for construct with static and dynamic scheduling on an SGI Altix 3600, for 1–256 threads and several chunk sizes (default, 2, 8, 32, 128 for static; 1, 4, 16, 64 for dynamic). The static-schedule overheads stay below ~80,000 cycles, while the dynamic-schedule overheads reach millions of cycles for small chunks.]

• Address load imbalances by selecting the best schedule and chunk size
• Avoid selecting a small chunk size when the work in a chunk is small.
76
OpenMP Best Practices
• Pipeline processing to overlap I/O and computations

for (i=0; i<N; i++) {
    ReadFromFile(i,…);
    for (j=0; j<ProcessingNum; j++)
        ProcessData(i, j);
    WriteResultsToFile(i);
}
77
OpenMP Best Practices
• Pipeline Processing
• Pre-fetches I/O
• Threads reading or writing files join the computations

#pragma omp parallel
{
    #pragma omp single
    { ReadFromFile(0,...); }

    for (i=0; i<N; i++) {
        #pragma omp single nowait
        { if (i < N-1) ReadFromFile(i+1,….); }      // nowait: for dealing with the last file

        #pragma omp for schedule(dynamic)
        for (j=0; j<ProcessingNum; j++)
            ProcessChunkOfData(i, j);
        // The implicit barrier here is very important: 1) file i is finished so we can
        // write to file; 2) file i+1 is read in so we can process it in the next loop iteration

        #pragma omp single nowait
        { WriteResultsToFile(i); }
    }
}

78
OpenMP Best Practices
• single vs. master work-sharing
  – master is more efficient but requires thread 0 to be available
  – single is more efficient if the master thread is not available
  – single has an implicit barrier
79
Cache Coherence
• Real-world shared memory systems have caches between memory and CPU
• Copies of a single data item can exist in multiple caches
• Modification of a shared data item by one CPU leads to outdated copies in the cache of another CPU

[Diagram: the original data item lives in memory; CPU 0 and CPU 1 each hold a copy of the data item in their own cache.]
• False sharing
  – When at least one thread writes to a cache line while others access it
    • Thread 0: = A[1] (read)
    • Thread 1: A[0] = … (write)
• Solution: use array padding

int a[max_threads];
#pragma omp parallel for schedule(static,1)
for(int i=0; i<max_threads; i++)
    a[i] += i;

int a[max_threads][cache_line_size];
#pragma omp parallel for schedule(static,1)
for(int i=0; i<max_threads; i++)
    a[i][0] += i;
OpenMP Best Practices
(Slides from “Getting OpenMP Up To Speed”, RvdP/V1 Tutorial, IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010)
False Sharing
A store into a shared cache line invalidates the other copies of that line: the system is not able to distinguish between changes within one individual line.

[Diagram: threads T0 and T1 on different CPUs update different elements of array A that lie in the same cache line.]

81
Exercise: Feel the false sharing with axpy-papi.c
82
OpenMP Best Practices
• Data placement policy on NUMA architectures
• First Touch Policy
  – The process that first touches a page of memory causes that page to be allocated in the node on which the process is running
A generic cc-NUMA architecture
83
NUMA First-touch placement/1

84

int a[100];               // only reserves the virtual memory address

for (i=0; i<100; i++)
    a[i] = 0;

First touch: all array elements (a[0] … a[99]) end up in the memory of the processor executing this thread.
NUMA First-touch placement/2

85

#pragma omp parallel for num_threads(2)
for (i=0; i<100; i++)
    a[i] = 0;

First touch: both memories each have “their half” of the array (a[0] … a[49] and a[50] … a[99]).
OpenMP Best Practices
• First-touch in practice
  – Initialize data consistently with the computations

#pragma omp parallel for
for(i=0; i<N; i++) {
    a[i] = 0.0; b[i] = 0.0; c[i] = 0.0;
}
readfile(a, b, c);
#pragma omp parallel for
for(i=0; i<N; i++) {
    a[i] = b[i] + c[i];
}

86

OpenMP Best Practices
• Privatize variables as much as possible
  – Private variables are stored in the local stack of the thread
  – Private data is close to cache

Shared array indexed by thread id:
double a[MaxThreads][N][N];
#pragma omp parallel for
for(i=0; i<MaxThreads; i++) {
    for(int j …)
        for(int k …)
            a[i][j][k] = …
}

Privatized array:
double a[N][N];
#pragma omp parallel private(a)
{
    for(int j …)
        for(int k …)
            a[j][k] = …
}

87
OpenMP Best Practices

procedure diff_coeff() {
    array allocation by master thread
    initialization of shared arrays
    PARALLEL REGION {
        loop lower_bn[id], upper_bn[id]
        computation on shared arrays
        …..
    }
}

• CFD application pseudo-code
  – Shared arrays initialized incorrectly (first touch policy)
  – Delays in remote memory accesses are probably caused by saturation of the interconnect
88
[Chart: stall cycle breakdown for the non-privatized (NP) and privatized (P) versions of diff_coeff, and their difference (NP-P), counting D-cache stalls, branch misprediction, instruction miss stalls, FLP units and front-end flushes; cycles range up to ~5×10^10.]
OpenMP Best Practices
• Array privatization
  – Improved the performance of the whole program by 30%
  – Speedup of 10 for the procedure, now only 5% of total time
• Processor stalls are reduced significantly
89
• Avoid Thread Migration
  – It affects data locality
• Bind threads to cores.
• Linux:
  – numactl --cpubind=0 foobar
  – taskset -c 0,1 foobar
• SGI Altix:
  – dplace -x2 foobar

OpenMP Best Practices

90
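As a portable alternative to the OS-specific tools above, recent OpenMP versions (4.0 and later) also let you request binding through environment variables; a sketch, with ./foobar as a placeholder executable:

$ export OMP_PLACES=cores        # one place per physical core
$ export OMP_PROC_BIND=close     # bind threads to places close to the master thread
$ ./foobar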
OpenMP Sources of Errors
• Incorrect use of synchronization constructs
  – Less likely if users stick to directives
  – Erroneous use of NOWAIT
• Race conditions (true sharing)
  – Can be very hard to find
• Wrong “spelling” of sentinel
• Use tools to check for data races.
91
Outline
• OpenMP Introduction
• Parallel Programming with OpenMP
  – Worksharing, tasks, data environment, synchronization
• OpenMP Performance and Best Practices
• Hybrid MPI/OpenMP
• Case Studies and Examples
• Reference Materials
92
Matrix vector multiplication
93
[Figure: the sequential source and the OpenMP source of the matrix-vector multiplication (loop index i over rows, j over columns); the code itself did not survive the transcript.]
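Since the original sources did not survive the conversion, a minimal sketch of what the OpenMP version of such a matrix-vector product typically looks like (function and array names and row-major layout are assumptions):

/* b = A * x for an m x n matrix stored row-major */
void matvec(int m, int n, const double *A, const double *x, double *b)
{
    #pragma omp parallel for              /* one row of the result per iteration */
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[i*n + j] * x[j];
        b[i] = sum;
    }
}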
Performance – 2-socket Nehalem
94
[Performance chart for the 2-socket Nehalem system: not reproduced in this transcript.]

2-socket Nehalem

95

[Diagram: a two-socket Nehalem system; not reproduced in this transcript.]
Data initialization

96

[Figure: the matrix and vectors of the matrix-vector product being initialized (indices i, j).]
Initialization will cause the allocation of memory according to the first touch policy.
Exploit First Touch

97

[Figure: the initialization code rewritten to exploit first-touch placement; not reproduced in this transcript.]
A 3D matrix update

• Observation: no data dependency on 'I'
• Therefore we can split the 3D matrix in larger blocks and process these in parallel

[Figure: the 3D matrix with dimensions I, J, K.]
98
The idea

• We need to distribute the M iterations over the number of processors
• We do this by controlling the start (IS) and end (IE) value of the inner loop
• Each thread will calculate these values for its portion of the work

[Figure: the K-J-I iteration block split along I into per-thread ranges is..ie.]
99
A 3D matrix update

100

• The loops are correctly nested for serial performance
• Due to a data dependency on J and K, only the inner loop can be parallelized
• This will cause the barrier to be executed (N-1)^2 times

[Data dependency graph: element (j,k) depends on (j-1,k) and (j,k-1); there is no dependency along I.]
101
The performance

Dimensions: M=7,500, N=20; Footprint: ~24 MByte

[Plot: performance (Mflop/s) vs. number of threads. The inner loop over I has been parallelized; scaling is very poor, as is to be expected.]
Performance analyzer data

102

[Profiler output using 10 threads and using 20 threads: some routines scale somewhat, others do not scale at all.]

Question: Why is __mt_WaitForWork so high in the profile?
False sharing at work

103

This is False Sharing at work!
[Chart for P = 1, 2, 4, 8 threads: the P=1 case has no sharing; false sharing increases as we increase the number of threads.]
Performance compared

104

[Plot: performance (Mflop/s) vs. number of threads for M = 7,500 and M = 75,000. For a higher value of M, the program scales better.]
The first implementation

[Source of the first (manual chunking) implementation: not reproduced in this transcript.]
105
OpenMP version

106

Another Idea: Use OpenMP!
How this works

107

How this works on 2 threads:
[Diagram: each thread alternates between parallel-region and work-sharing phases, … etc …]
This splits the operation in a way that is similar to our manual implementation.
Performance
108
We have set M=7,500 and N=20.
• This problem size does not scale at all when we explicitly parallelized the inner loop over 'I'
• We have tested 4 versions of this program
  – Inner Loop Over 'I': our first OpenMP version
  – AutoPar: the automatically parallelized version of 'kernel'
  – OMP_Chunks: the manually parallelized version with our explicit calculation of the chunks
  – OMP_DO: the version with the OpenMP parallel region and work-sharing DO
Performance
109
The performance (M=7,500)

Dimensions: M=7,500, N=20; Footprint: ~24 MByte

[Plot: performance (Mflop/s) vs. number of threads for the Inner loop, OMP Chunks and OMP DO versions. The auto-parallelizing compiler does really well!]
Reference Material on OpenMP

110

• OpenMP Homepage www.openmp.org:
  – The primary source of information about OpenMP and its development.
• OpenMP User's Group (cOMPunity) Homepage
  – www.compunity.org
• Books:
  – Using OpenMP, Barbara Chapman, Gabriele Jost, Ruud Van Der Pas, Cambridge, MA: The MIT Press, 2007, ISBN: 978-0-262-53302-7
  – Parallel Programming in OpenMP, Rohit Chandra et al., San Francisco, Calif.: Morgan Kaufmann; London: Harcourt, 2000, ISBN: 1558606718
• Directives implemented via code modification and insertion of runtime library calls
  – Basic step is outlining of the code in the parallel region
• Runtime library responsible for managing threads
  – Scheduling loops
  – Scheduling tasks
  – Implementing synchronization
• Implementation effort is reasonable
OpenMP Code Translation
int main(void)
{
    int a, b, c;
    #pragma omp parallel private(c)
    do_sth(a, b, c);
    return 0;
}

_INT32 main()
{
    int a, b, c;
    /* microtask */
    void __ompregion_main1()
    {
        _INT32 __mplocal_c;
        /* shared variables are kept intact; accesses to the private variable are substituted */
        do_sth(a, b, __mplocal_c);
    }
    …
    /* OpenMP runtime calls */
    __ompc_fork(&__ompregion_main1);
    …
}

Each compiler has custom run-time support. The quality of the runtime system has a major impact on performance.
Standard OpenMP Implementation