Lecture: Manycore GPU Architectures and Programming, Part 4
-- Introducing OpenACC for Accelerators

CSCE 569 Parallel Computing
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
https://passlab.github.io/CSCE569/
Manycore GPU Architectures and Programming: Outline

• Introduction – GPU architectures, GPGPUs, and CUDA
• GPU execution model
• CUDA programming model
• Working with memory in CUDA – global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsics and library
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming model – OpenMP and OpenACC
OpenACC

• OpenACC's guiding principle is simplicity
– Want to remove as much burden from the programmer as possible
– No need to think about data movement, writing kernels, parallelism, etc.
– OpenACC compilers automatically handle all of that
• In reality, it isn't always that simple
– Don't expect to get massive speedups from very little work
• However, OpenACC can be an easy and straightforward programming model to start with
– http://www.openacc-standard.org/
OpenACC

• OpenACC shares a lot of principles with OpenMP
– Compiler #pragma based, and requires a compiler that supports OpenACC
– Express the type of parallelism, let the compiler and runtime handle the rest
– OpenACC also allows you to express data movement using compiler #pragmas

#pragma acc
OpenACC Directives

Program myscience
  ... serial code ...
!$acc kernels
  do k = 1, n1
    do i = 1, n2
      ... parallel code ...
    enddo
  enddo
!$acc end kernels
  ...
End Program myscience

[Figure: a simple compiler hint (the !$acc kernels region) marks the loop nest; the OpenACC compiler parallelizes the code, which works on many-core GPUs and multicore CPUs]
OpenACC

• Creating parallelism in OpenACC is possible with either of the following two compute directives:

#pragma acc kernels
#pragma acc parallel

• kernels and parallel each have their own strengths
– kernels is a higher abstraction with more automation
– parallel offers more low-level control but also requires more work from the programmer
OpenACC Compute Directives

• The kernels directive marks a code region that the programmer wants to execute on an accelerator
– The code region is analyzed for parallelizable loops by the compiler
– Necessary data movement is also automatically generated

#pragma acc kernels
{
    for (i = 0; i < N; i++)
        C[i] = A[i] + B[i];

    for (i = 0; i < N; i++)
        D[i] = C[i] * A[i];
}
OpenACC Compute Directives

• Like OpenMP, OpenACC compiler directives support clauses which can be used to modify the behavior of OpenACC #pragmas

#pragma acc kernels clause1 clause2 ...

• kernels supports a number of clauses, for example (combined in the sketch below):
– if(cond): Only run the parallel region on an accelerator if cond is true
– async(id): Don't wait for the parallel code region to complete on the accelerator before returning to the host application. Instead, id can be used to check for completion.
– wait(id): Wait for the async work associated with id to finish first
– ...
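A minimal sketch of how these clauses can combine on one directive; the size threshold, the async id, and the arrays are illustrative, not from the original slides:

#pragma acc kernels if(N > 1024) async(1)   // offload only for large N; return to the host immediately
for (int i = 0; i < N; i++)
    C[i] = A[i] + B[i];

#pragma acc kernels wait(1)                 // do not start until async task 1 has completed
for (int i = 0; i < N; i++)
    D[i] = C[i] * A[i];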
OpenACC Compute Directives

• Take a look at the simple-kernels.c example
– Compile with an OpenACC compiler, e.g. PGI:
$ pgcc -acc simple-kernels.c -o simple-kernels
– You may be able to add compiler-specific flags to print more diagnostic information on the accelerator code generation, e.g.:
$ pgcc -acc simple-kernels.c -o simple-kernels -Minfo=accel

We do not have this compiler on our systems
OpenACC Compute Directives

• On the other hand, the parallel compute directive offers much more control over exactly how a parallel code region is executed
– With just kernels, we have little control over which loops are parallelized or how they are parallelized
– Think of #pragma acc parallel similarly to #pragma omp parallel

#pragma acc parallel
OpenACC Compute Directives

• With parallel, all parallelism is created at the start of the parallel region and does not change until the end
– The execution mode of a parallel region changes depending on programmer-inserted #pragmas

• parallel supports similar clauses to kernels, plus the following (see the sketch below):
– num_gangs(g), num_workers(w), vector_length(v): Used to configure the amount of parallelism in a parallel region
– reduction(op:var1, var2, ...): Perform a reduction across gangs of the provided variables using the specified operation
– ...
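A hedged sketch of explicitly sizing a parallel region; the specific gang, worker, and vector counts are illustrative values, not tuned recommendations:

#pragma acc parallel num_gangs(32) num_workers(4) vector_length(128)
{
    // distribute the loop across the gangs, workers, and vector lanes created above
    #pragma acc loop gang worker vector
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}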
OpenACC

• Mapping from the abstract GPU execution model to OpenACC concepts and terminology
– OpenACC Vector element = a thread
• The use of "vector" in OpenACC terminology emphasizes that at the lowest level, OpenACC uses vector-parallelism
– OpenACC Worker = SIMT Group
• Each worker has a vector width and can contain many vector elements
– OpenACC Gang = SIMT Groups on the same SM
• One gang per OpenACC PU
• OpenACC supports multiple gangs executing concurrently
OpenACC

• Mapping to the CUDA threading model:
– Gang parallelism: work is run across multiple OpenACC PUs
• CUDA blocks
– Worker parallelism: work is run across multiple workers (i.e. SIMT groups)
• Threads per block
– Vector parallelism: work is run across vector elements (i.e. threads)
• Within a warp
OpenACC Compute Directives

• In addition to kernels and parallel, a third OpenACC compute directive can help control parallelism (but does not actually create threads):

#pragma acc loop

• The loop directive allows you to explicitly mark loops as parallel and control the type of parallelism used to execute them
OpenACC Compute Directives

• Using #pragma acc loop gang/worker/vector allows you to explicitly mark loops that should use gang, worker, or vector parallelism in your OpenACC application
– Can be used inside both parallel and kernels regions

• Using the independent clause on the loop directive allows you to explicitly mark loops as parallelizable, overriding any automatic compiler analysis (see the sketch below)
– Compilers must naturally be conservative when auto-parallelizing; the independent clause allows you to use detailed knowledge of the application to give hints to the compiler
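A minimal sketch of independent, assuming a hypothetical index array idx that is known to contain no duplicate entries; the indirection defeats automatic analysis, so independence is asserted by the programmer:

#pragma acc kernels
#pragma acc loop independent   // assert: no two iterations write the same element of B
for (int i = 0; i < N; i++)
    B[idx[i]] = A[i];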
OpenACC Compute Directives

• Consider simple-parallel.c, in which the loop and parallel directives are used to implement the same computation as simple-kernels.c

#pragma acc parallel
{
    #pragma acc loop
    for (i = 0; i < N; i++)
        ...
    #pragma acc loop
    for (i = 0; i < N; i++)
        ...
}
OpenACC Compute Directives

• As a syntactic nicety, you can combine parallel/kernels directives with loop directives:

#pragma acc kernels loop
for (i = 0; i < N; i++) {
    ...
}

#pragma acc parallel loop
for (i = 0; i < N; i++) {
    ...
}
OpenACC Compute Directives

• This combination has the same effect as a loop directive immediately following a parallel/kernels directive:

#pragma acc kernels
#pragma acc loop
for (i = 0; i < N; i++) { ... }

#pragma acc parallel
#pragma acc loop
for (i = 0; i < N; i++) { ... }
OpenACC Compute Directives

• In summary, the kernels, parallel, and loop directives all offer different ways to control the OpenACC parallelism of an application
– kernels is highly automated, but you rely heavily on the compiler to create an efficient parallelization strategy
• A short form of parallel/loop for the GPU
– parallel is more manual, but allows programmer knowledge about the application to improve the parallelization strategy
• Like OpenMP parallel
– loop allows you to take more manual control over both
• Like OpenMP worksharing
Suggested Readings

1. The sections on Using OpenACC and Using OpenACC Compute Directives in Chapter 8 of Professional CUDA C Programming
2. OpenACC Standard. 2013. http://www.openacc.org/sites/default/files/OpenACC.2.0a_1.pdf
3. Jeff Larkin. Introduction to Accelerated Computing Using Compiler Directives. 2014. http://on-demand.gputechconf.com/gtc/2014/presentations/S4167-intro-accelerated-computing-directives.pdf
4. Michael Wolfe. Performance Analysis and Optimization with OpenACC. 2014. http://on-demand.gputechconf.com/gtc/2014/presentations/S4472-performance-analysis-optimization-openacc-apps.pdf
OpenACC Data Directives

• #pragma acc data can be used to explicitly perform communication between a host program and accelerators
• The data directive is applied to a code region and defines the communication to be performed at the start and end of that code region
• The data directive alone does nothing, but it takes clauses which define the actual transfers to be performed
OpenACC Data Directives

• Common clauses used with #pragma acc data:

| Clause | Description |
| --- | --- |
| copy(list) | Transfer all variables in list to the accelerator at the start of the data region and back to the host at the end. |
| copyin(list) | Transfer all variables in list to the accelerator at the start of the data region. |
| copyout(list) | Transfer all variables in list back to the host at the end of the data region. |
| present_or_copy(list) | If the variables specified in list are not already on the accelerator, transfer them to it at the start of the data region and back at the end. |
| if(cond) | Only perform the operations defined by this data directive if cond is true. |
OpenACC Data Directives

• Consider the example in simple-data.c, which mirrors simple-parallel.c and simple-kernels.c:

#pragma acc data copyin(A[0:N], B[0:N]) copyout(C[0:N], D[0:N])
{
    #pragma acc parallel
    {
        #pragma acc loop
        for (i = 0; i < N; i++)
            ...
        #pragma acc loop
        for (i = 0; i < N; i++)
            ...
    }
}
OpenACC Data Directives

• OpenACC also supports:

#pragma acc enter data
#pragma acc exit data

• Rather than bracketing a code region, these #pragmas allow you to copy data to and from the accelerator at arbitrary points in time
– Data transferred to an accelerator with enter data will remain there until a matching exit data is reached or until the application terminates, as in the sketch below
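A minimal sketch of an unstructured data lifetime; the array A and the doubling loop are illustrative, and the present clause tells the compute region the data is already resident:

#pragma acc enter data copyin(A[0:N])      // A stays on the accelerator after this point

#pragma acc parallel loop present(A[0:N])  // no transfer needed here
for (int i = 0; i < N; i++)
    A[i] *= 2.0f;

#pragma acc exit data copyout(A[0:N])      // copy back and release the device copy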
OpenACC Data Directives

• Finally, OpenACC also allows you to specify data movement as part of the compute directives through data clauses:

#pragma acc data copyin(A[0:N], B[0:N]) copyout(C[0:N], D[0:N])
{
    #pragma acc parallel
    {
    }
}

– Or, attaching the clauses directly to the compute directive:

#pragma acc parallel copyin(A[0:N], B[0:N]) copyout(C[0:N], D[0:N])
OpenACC Data Specification

• You may have noticed that OpenACC data directives use an unusual array dimension specification, for example:

#pragma acc data copy(A[start:length])

• In some cases, data specifications may not even be necessary as the OpenACC compiler can infer the size of the array:

int a[5];
#pragma acc data copy(a)
{
    ...
}
OpenACC Data Specification

• If the compiler is unable to infer an array size, error messages like the one below will be emitted
– Example code:

int *a = (int *)malloc(sizeof(int) * 5);
#pragma acc data copy(a)
{
    ...
}

– Example error message:

PGCC-S-0155-Cannot determine bounds for array a
OpenACC Data Specification

• Instead, you must specify the full array bounds to be transferred

int *a = (int *)malloc(sizeof(int) * 5);
#pragma acc data copy(a[0:5])
{
    ...
}

– The lower bound is inclusive and, if not explicitly set, will default to 0
– The length must be provided if it cannot be inferred
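To make the [start:length] form concrete, a small sketch (the sizes are illustrative): this transfers only elements 2, 3, and 4 of a ten-element array:

int *a = (int *)malloc(sizeof(int) * 10);
#pragma acc data copy(a[2:3])   // start at index 2, transfer 3 elements
{
    ...
}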
Asynchronous Work in OpenACC

• In OpenACC, the default behavior is always to block the host while executing an acc region
– Host execution does not continue past a kernels/parallel region until all operations within it complete
– Host execution does not enter or exit a data region until all prescribed data transfers have completed
Asynchronous Work in OpenACC

• When the host blocks, host cycles are wasted:

[Timeline: a single-threaded host launches #pragma acc { ... } on an accelerator with many PUs, then sits idle in wasted cycles until the region completes]
Asynchronous Work in OpenACC

• In many cases this default can be overridden to perform operations asynchronously
– Asynchronously copy data to the accelerator
– Asynchronously execute computation
• As a result, host cycles are not wasted idling while the accelerator is working
Asynchronous Work in OpenACC

• Asynchronous work is created using the async clause on compute and data directives, and every asynchronous task has an id
– Run a kernels region asynchronously:
#pragma acc kernels async(id)
– Run a parallel region asynchronously:
#pragma acc parallel async(id)
– Perform an enter data asynchronously:
#pragma acc enter data async(id)
– Perform an exit data asynchronously:
#pragma acc exit data async(id)
– async is not supported on the data directive
Asynchronous Work in OpenACC

• Having asynchronous work means we also need a way to wait for it
– Note that every async clause on the previous slide took an id
– The asynchronous task created is uniquely identified by that id
• We can then wait on that id using either:
– The wait clause on compute or data directives
– The OpenACC Runtime API's asynchronous control functions
Asynchronous Work in OpenACC

• Adding a wait(id) clause to a compute or data directive makes the associated data transfer or computation wait until the asynchronous task associated with that id completes

• The OpenACC Runtime API supports explicitly waiting using:

void acc_wait(int id);
void acc_wait_all();

• You can also check if asynchronous tasks have completed using:

int acc_async_test(int id);
int acc_async_test_all();
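A hedged sketch of polling instead of blocking; foo() and do_some_host_work() are hypothetical placeholders, and acc_async_test returns nonzero once the task with the given id has completed:

#include <openacc.h>

#pragma acc kernels async(1)
for (int i = 0; i < N; i++)
    B[i] = foo(A[i]);

while (!acc_async_test(1)) {   // keep the host busy while the accelerator works
    do_some_host_work();
}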
Asynchronous Work in OpenACC

• Let's take a simple code snippet as an example:

#pragma acc data copyin(A[0:N]) copyout(B[0:N])
{
    #pragma acc kernels
    {
        for (i = 0; i < N; i++)
            B[i] = foo(A[i]);
    }
}                      // host is blocked here
do_work_on_host(C);    // host is working here
Asynchronous Work in OpenACC

[Timeline: the single-threaded host performs the copyin, idles while the accelerator with many PUs runs the acc kernels region, performs the copyout, and only then starts do_work_on_host]
Asynchronous Work in OpenACC

• Performing the transfer and compute asynchronously allows us to overlap the host and accelerator work:

#pragma acc enter data async(0) copyin(A[0:N]) create(B[0:N])
#pragma acc kernels wait(0) async(1)
{
    for (i = 0; i < N; i++)
        B[i] = foo(A[i]);
}
#pragma acc exit data wait(1) async(2) copyout(B[0:N])
do_work_on_host(C);
acc_wait(2);
Asynchronous Work in OpenACC

[Timeline: the accelerator performs the transfers and the acc kernels region while the single-threaded host concurrently runs do_work_on_host]
Reductions in OpenACC

• OpenACC supports the ability to perform automatic parallel reductions
– The reduction clause can be added to the parallel and loop directives, but has a subtle difference in meaning on each

#pragma acc parallel reduction(op:var1, var2, ...)
#pragma acc loop reduction(op:var1, var2, ...)

– op defines the reduction operation to perform
– The variable list defines a set of private variables created and initialized in the subsequent compute region
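A minimal sum-reduction sketch (the dot-product loop is illustrative): each parallel execution context gets a private copy of sum, and the copies are combined with + when the region ends:

float sum = 0.0f;
#pragma acc parallel loop reduction(+:sum)
for (int i = 0; i < N; i++)
    sum += A[i] * B[i];   // no race on sum thanks to the reduction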
Reductions in OpenACC

• When applied to a parallel region, reduction creates a private copy of each variable for each gang created for that parallel region
• When applied to a loop directive, reduction creates a private copy of each variable for each vector element in the loop region
• The resulting value is transferred back to the host once the current compute region completes
OpenACC Parallel Region Optimizations

• To some extent, optimizing the parallel code regions in OpenACC is contradictory to the whole OpenACC principle
– OpenACC wants programmers to focus on writing application logic and worry less about nitty-gritty optimization tricks
– Often, low-level code optimizations require intimate understanding of the hardware you are running on
• In OpenACC, optimizing is more about avoiding symptomatically horrible scenarios so that the compiler has the best code to work with, rather than making very low-level optimizations
– Memory access patterns
– Loop scheduling
OpenACC Parallel Region Optimizations

• GPUs are optimized for aligned, coalesced memory accesses
– Aligned: the lowest address accessed by the elements in a vector is 32- or 128-byte aligned (depending on architecture)
– Coalesced: neighboring vector elements access neighboring memory cells
OpenACC Parallel Region Optimizations

• Improving alignment in OpenACC is difficult because there is less visibility into how OpenACC threads are scheduled on the GPU
• Improving coalescing is also difficult; the OpenACC compiler may choose a number of different ways to schedule a loop across threads on the GPU
• In general, try to ensure that neighboring iterations of the innermost parallel loops reference neighboring memory cells (see the sketch below)
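A hedged sketch of this guideline for a row-major M×N matrix (the loop schedule is illustrative): letting the innermost, vector-parallel loop walk the contiguous j dimension keeps neighboring vector elements on neighboring cells:

#pragma acc parallel loop gang
for (int i = 0; i < M; i++) {
    #pragma acc loop vector
    for (int j = 0; j < N; j++)
        C[i * N + j] = A[i * N + j] + B[i * N + j];  // consecutive j -> consecutive addresses
}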
OpenACC Parallel Region Optimizations

• vecadd example using coalescing and non-coalescing access:

| CLI Flag | Average Compute Time |
| --- | --- |
| Without -b (coalescing) | 122.02 us |
| With -b (non-coalescing) | 624.04 ms |
OpenACC Parallel Region Optimizations

• The loop directive supports three special clauses that control how loops are parallelized: gang, worker, and vector
– The meaning of these clauses changes depending on whether they are used in a parallel or kernels region

• The gang clause:
– In a parallel region, causes the iterations of the loop to be parallelized across gangs created by the parallel region, transitioning from gang-redundant to gang-partitioned mode
– In a kernels region, does the same but also allows the user to specify the number of gangs to use, using gang(ngangs)
OpenACC Parallel Region Optimizations

• The worker clause:
– In a parallel region, causes the iterations of the loop to be parallelized across workers created by the parallel region, transitioning from worker-single to worker-partitioned mode
– In a kernels region, does the same but also allows the user to specify the number of workers per gang, using worker(nworkers)
OpenACC Parallel Region Optimizations

• The vector clause:
– In a parallel region, causes the iterations of the loop to be parallelized using vector/SIMD parallelism with the vector length specified by parallel, transitioning from vector-single to vector-partitioned mode
– In a kernels region, does the same but also allows the user to specify the vector length to use, using vector(vector_length)
OpenACC Parallel Region Optimizations

• Manipulating the gang, worker, and vector clauses results in different scheduling of loop iterations on the underlying hardware
– Can result in significant performance improvement or loss

• Consider the example of a loop schedule
– The gang and vector clauses are used to change the parallelization of two nested loops in a parallel region
– The # of gangs is set with the command-line flag -g, the vector width is set with -v
OpenACC Parallel Region Optimizations

• Try playing with -g and -v to see how gang and vector affect performance
– Options for gang and vector sizes:

#pragma acc parallel copyin(A[0:M * N], B[0:M * N]) copyout(C[0:M * N])
#pragma acc loop gang(gangs)
for (int i = 0; i < M; i++) {
    #pragma acc loop vector(vector_length)
    for (int j = 0; j < N; j++) {
        ...
    }
}
OpenACC Parallel Region Optimizations

Example results:

| -g | -v (constant) | Time |
| --- | --- | --- |
| 1 | 128 | 5.7590 ms |
| 2 | 128 | 2.8855 ms |
| 4 | 128 | 1.4478 ms |
| 8 | 128 | 730.11 us |
| 16 | 128 | 373.40 us |
| 32 | 128 | 202.89 us |
| 64 | 128 | 129.85 us |

| -g (constant) | -v | Time |
| --- | --- | --- |
| 32 | 2 | 9.3165 ms |
| 32 | 8 | 2.7953 ms |
| 32 | 32 | 716.45 us |
| 32 | 128 | 203.02 us |
| 32 | 256 | 129.76 us |
| 32 | 512 | 125.16 us |
| 32 | 1024 | 124.83 us |
OpenACC Parallel Region Optimizations

• Your options for optimizing OpenACC parallel regions are fairly limited
– The whole idea of OpenACC is that the compiler can handle that for you
• There are some things you can do to avoid poor code characteristics on the GPU that the compiler can't optimize away for you (memory access patterns)
• There are also tunables you can tweak which may improve performance (e.g. gang, worker, vector)
The Tile Clause

• Like the gang, worker, and vector clauses, the tile clause is used to control the scheduling of loop iterations
– Used on loop directives only
• It specifies how you would like loop iterations grouped across the iteration space
– Iteration grouping (more commonly called loop tiling) can be beneficial for locality on both CPUs and GPUs
The Tile Clause

• Suppose you have a loop like the following:

#pragma acc loop
for (int i = 0; i < N; i++) {
    ...
}

• The tile clause can be added like this:

#pragma acc loop tile(8)
for (int i = 0; i < N; i++) {
    ...
}
The Tile Clause

• Analogous to adding a second inner loop:

#pragma acc loop
for (int i = 0; i < N; i += 8) {
    for (int ii = 0; ii < 8; ii++) {
        ...
    }
}

– The same iterations are performed, but the compiler may choose to schedule them differently on hardware threads (a two-dimensional sketch follows)
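The tile clause also accepts one size per loop in a nest. A hedged sketch (the 32×32 tile size and the transpose-like loop body are illustrative, not from the original slides) grouping the iterations of two nested loops into 2D tiles for locality:

#pragma acc parallel loop tile(32, 32)   // one tile size per loop in the nest
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++)
        B[i * M + j] = A[j * N + i];     // tiling improves reuse across both dimensions
}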
The Cache Directive

• The cache directive is used to optimize memory accesses on the accelerator. It marks data which will be frequently accessed, and which therefore should be kept close in the cache hierarchy
• The cache directive is applied immediately inside of a loop that is being parallelized on the accelerator:
– Note the same data specification is used here as for data directives

#pragma acc loop
for (int i = 0; i < N; i++) {
    #pragma acc cache(A[i:1])
    ...
The Cache Directive

• For example, suppose you have an application where every thread i accesses cells i-1, i, and i+1 in a vector A

[Figure: a row of threads, each reading its own element of A plus the two neighboring elements]
The Cache Directive

• This results in lots of wasted memory accesses, as neighboring elements in the vector reference the same cells in the array A
• Instead, we can use the cache directive to indicate to the compiler which array elements we expect to benefit from caching:

#pragma acc parallel loop
for (int i = 0; i < N; i++) {
    B[i] = A[i-1] + A[i] + A[i+1];
}

#pragma acc parallel loop
for (int i = 0; i < N; i++) {
    #pragma acc cache(A[i-1:3])
    B[i] = A[i-1] + A[i] + A[i+1];
}
The Cache Directive

• Now, the compiler will automatically cache A[i-1], A[i], and A[i+1] and only load them from accelerator memory once

[Figure: the same row of threads, now reading A's elements through a cached copy rather than repeatedly from accelerator memory]
The Cache Directive

• The cache directive requires a lot of complex code analysis from the compiler to ensure this is a safe optimization
• As a result, it is not always possible to use the cache optimization with arbitrary application code
– Some restructuring may be necessary before the compiler is able to determine how to effectively use the cache optimization
The Cache Directive

• The cache directive can result in significant performance gains thanks to much-improved data locality
• However, for complex applications it generally requires significant code refactoring to expose the cache-ability of the code to the compiler
– Just like using shared memory in CUDA
Suggested Readings

1. OpenACC Standard. 2013. http://www.openacc.org/sites/default/files/OpenACC.2.0a_1.pdf
2. Peter Messmer. Optimizing OpenACC Codes. http://on-demand.gputechconf.com/gtc/2013/presentations/S3019-Optimizing-OpenACC-Codes.pdf