Efficient OpenMP Implementation and Translation for Multiprocessor System-on-Chip without Using OS
Woo-Chul Jeun and Soonhoi Ha, Seoul National University, Korea
Contents
Background
OpenMP implementation for MPSoC configurations
Our OpenMP implementation and translation for a target platform
Conclusions & Future directions
Background
MPSoC (Multiprocessor System-on-Chip)

An attractive platform for computation-intensive applications such as multimedia encoders/decoders.
Can have various architectures according to the target applications:
• Memory: shared, distributed, and hybrid
• OS: SMP (symmetric multiprocessor) kernel Linux, small operating systems, or no operating system
There is no standard parallel programming model for MPSoC, and there are not yet sufficient studies on the various MPSoC architectures.
Parallel Programming Models

|                      | Message-passing                                                          | Shared-address-space                                   |
|----------------------|--------------------------------------------------------------------------|--------------------------------------------------------|
| Memory               | Distributed memory (private memory)                                      | Shared memory (same address space)                     |
| Communication        | Explicit message passing                                                 | Memory access                                          |
| De facto standard    | MPI (1994, Message Passing Interface)                                    | OpenMP (1998)                                          |
| Performance          | Optimize data distribution, data transfer, and synchronization manually  | Complicated issues depend on the OpenMP implementation |
| Programming easiness | Difficult                                                                | Easy                                                   |

• Manual optimization (MPI) vs. easy programming (OpenMP)
OpenMP Overview

A specification standard for representing the programmer's intention with compiler directives (C/C++ and Fortran).
OpenMP is an attractive parallel programming model for MPSoC because of its easy programming: programmers write an OpenMP program by inserting OpenMP directives into a serial program.

```c
#pragma omp parallel for
for (i = 0; i < 1000; i++) {
    data[i] = 0;
}
```
OpenMP execution model: fork/join

```c
#pragma omp parallel for
for (i = 0; i < 1000; i++) {
    data[i] = 0;
}
```

The master thread forks child threads at a parallel region and joins them at its end. With two processors, processor 0 (master thread) executes `for (i = 0; i < 500; i++)` and processor 1 (child thread) executes `for (i = 500; i < 1000; i++)`.
All threads divide the computation workload and execute their own share.
OpenMP programming environment

OpenMP does not define how to implement OpenMP directives on a parallel processing platform.
OpenMP runtime system: implements the OpenMP directives with libraries on a target platform.
OpenMP translator (for the C language): converts an OpenMP program into C code that uses the OpenMP runtime system.
Hybrid execution model

Separating the parallel programming model from the execution model:
Application programmers use OpenMP.
The OpenMP runtime system can use a hybrid execution model, combining the message-passing model and the shared-address-space model, to improve performance.
Easy programming and high performance.
Motivation

The OpenMP runtime system depends on the target platform, and the OpenMP translator is customized to the OpenMP runtime system.
To get high performance on various platforms, it is therefore necessary to research an efficient OpenMP runtime system and OpenMP translator for each target platform.
Terminology

Barrier synchronization
• Every processor (or thread) waits until all processors (or threads) arrive at a synchronization point.
Reduction operation
• Integrates the operations of all processors (or threads) into one operation. For example, three processors each start with sum: 0 and perform sum += a; sum += b; sum += c; the reduction combines these so that every processor ends with sum: a + b + c, equivalent to sum += (a + b + c);
OpenMP implementation on MPSoC configurations
Possible OpenMP implementations on MPSoC configurations

|                        | Distributed memory                            | Shared memory                                           |
|------------------------|-----------------------------------------------|---------------------------------------------------------|
| OS with thread library | Thread programming + SDSM: fault handler      | Thread programming + shared memory (Yoshihiko et al.)   |
| Without OS             | Processor programming + SDSM: message passing | Processor programming + shared memory (Feng Liu et al., ours) |

SDSM (software distributed shared memory) vs. shared memory; thread programming vs. processor programming.
Shared memory + OS with thread library

(the thread programming + shared memory configuration)

No need for a memory consistency protocol.
Similar to thread programming on an SMP machine (e.g., a dual-processor PC).
Yoshihiko Hotta et al. [EWOMP 2004]:
• SMP (symmetric multiprocessor) kernel Linux and POSIX thread library
• OpenMP implementation and translation similar to those on an SMP machine (dual-processor PC)
• Focused on power optimization
Shared memory + no OS

(the processor programming + shared memory configuration)

No need for a memory consistency protocol.
Must make the processors run a program in parallel (load and initiate the processors).
Feng Liu et al. [WOMPAT 2003, ICPP 2003]:
• No operating system
• Their own OpenMP directive extensions for DSP
• OpenMP directive extensions for special hardware on the CT3400
• Harmful barrier synchronization implementation
Our OpenMP implementation and translation on a target multiprocessor system-on-chip platform
CT3400 architecture (Cradle Technologies, Inc.)

Block diagram: four RISC-like processors, a 32 KB instruction cache, 64 KB of on-chip memory, and 256 MB of off-chip memory, connected by local instruction/data buses and a global bus (through a global bus interface), with local and global hardware semaphores.
• 230 MHz processors, hardware semaphores (32 local, 64 global)
• Shared memory; no operating system and no thread library
Initialization

The OpenMP translator extracts the original main function into a function (app_main()) and makes a new main function call it. An initialization procedure (initializer()) loads the program code onto the other processors and initiates them before the application starts.

Original main:

```c
int main(...) { ... }
```

New main on the master node, after OpenMP translation:

```c
int main(...) {
    initializer(...);
    app_main(...);
}

int app_main(...) { ... }
```
Parallelization

The OpenMP translator extracts a parallel region into a function, and all processors execute that function (cf. a thread). The master processor executes the serial region; the other processors wait until the master processor arrives at a parallel region.

Parallel region in the original code:

```c
#pragma omp parallel for
for (i = 0; i < 1000; i++) {
    data[i] = 0;
}
```

After OpenMP translation, the new main calls parallelize(..., parallel_region_0), where:

```c
void parallel_region_0(...) {
    for (i = start; i < end; i++) {
        data[i] = 0;
    }
}
```
Translation of global shared variables

The 'cragcc' C compiler on the CT3400 can process global variables efficiently. The OpenMP translator can translate global shared variables with two memory allocation methods.

Static allocation
• int data[100]; — global data area (0%–31% better)
• The OpenMP translator can inform the OpenMP runtime system that the variables are in the global data area.
Dynamic allocation
• int *data; data = allocate_local(...); — heap area
• The OpenMP runtime system cannot know whether the variables are global variables.
24×24 matrix multiplication (cycles)

| Processors              | 1         | 2         | 4         |
|-------------------------|-----------|-----------|-----------|
| Serial                  | 3,664,513 | N/A       | N/A       |
| Parallel (hand-written) | 3,653,761 | 1,827,537 | 914,127   |
| OpenMP, Static          | 3,674,336 | 1,845,050 | 933,549   |
| OpenMP, Dynamic         | 5,221,225 | 2,622,901 | 1,320,474 |

• Static memory allocation is 31% better than dynamic memory allocation on the CT3400.
• Measured on the cycle-accurate simulator "Inspector" provided by Cradle Technologies, Inc.
Reduction (using a temporary variable)

Uses a single temporary variable: a 4 KB buffer, char temp_var[4096]. Similar to thread programming: each of the four processors calls reduce(&_t_red, ...) and updates the temporary variable under a semaphore, so all update operations are serialized; the result is combined by reduce_0(...), followed by a barrier.
Reduction (using a temporary buffer array)

Uses a temporary buffer array: char buffer[4][4096], one 4 KB element per processor. Each processor calls reduce(&_t_red, ...) and copies (memcpy) its partial result into its own element of the array without a semaphore, so all update operations can be executed in parallel; the elements are then combined by reduce_0(...), followed by a barrier.
EPCC OpenMP micro-benchmark (cycles)

| Processors                        | 1     | 2     | 4      |
|-----------------------------------|-------|-------|--------|
| Reduction, temporary variable     | 1,713 | 8,790 | 14,028 |
| Reduction, temporary buffer array | 1,713 | 7,805 | 12,631 |

• Measured on the cycle-accurate simulator "Inspector" provided by Cradle Technologies, Inc.
• The temporary buffer array method is 10% better than the temporary variable method on the CT3400.
Previous harmful barrier synchronization implementation (example error case)

```c
semaphore_lock(Sem.p);
done_pe++;
semaphore_unlock(Sem.p);
while (done_pe < PES)
    _pe_delay(1);
if (my_peid == 0)
    done_pe = 0;
```

PES: number of processors (here 2); done_pe: counter variable for synchronization; my_peid: processor ID.
Processor 0 increases done_pe to 1 under the semaphore and busy-waits (done_pe: 1, PES: 2).
Processor 1 then increases the counter (done_pe: 2, PES: 2), and processor 0 can exit the busy-waiting loop.
Processor 0 initializes the counter variable (done_pe: 0, PES: 2) before processor 1 checks its value.
Processor 1 now busy-waits on done_pe: 0 (PES: 2) and can never exit the loop: the synchronization fails. The wrong assumption in this implementation is that the last processor to arrive is always processor 0, which initializes the counter of the current barrier.
Our barrier implementation

Introduce a phase variable and toggle the phase of the barrier to distinguish consecutive barriers; initialize the counter of the next barrier.

Previous:

```c
semaphore_lock(Sem.p);
done_pe++;
semaphore_unlock(Sem.p);
while (done_pe < PES)
    _pe_delay(1);
if (my_peid == 0)
    done_pe = 0;
```

Ours:

```c
semaphore_lock(Sem.p);
phase = (phase + 1) % 2;
if (done_pe[phase] + 1 == PES)
    done_pe[(phase + 1) % 2] = 0;
done_pe[phase]++;
semaphore_unlock(Sem.p);
while (done_pe[phase] < PES)
    _pe_delay(1);
```
Our barrier implementation (example)

With two processors, processor 0 executes the barrier code and busy-waits; both processors then observe phase: 1, done_pe[0]: 0, done_pe[1]: 1. The counter variable of the next barrier is initialized while the counter variable of the current barrier is kept intact at the same time.
Conclusions & Future directions

When translating global shared variables, static memory allocation is 31% better than dynamic memory allocation.
For the reduction implementation, the temporary buffer array method is 10% better than the temporary variable method.
We fixed the previous harmful barrier synchronization implementation.
Future directions:
• MPSoC with distributed memory
• MPSoC with heterogeneous processors (e.g., DSP)