TRANSCRIPT
Exploiting Execution Order and Parallelism
from Processing Flow Applying Pipeline-based Programming
Method on Manycore Accelerators
Shinichi Yamagiwa, University of Tsukuba, Japan
Table of contents
1. Research backgrounds
• Flow-model based programming
• Graphical programming on accelerators using flow-models
2. Finding an execution order
3. Parallelism Extraction Algorithm
4. Performance evaluation using manycore accelerators
5. Conclusions
Background – programming on manycore accelerators
• The programmer needs to write programs for both the CPU and the GPU.
• The accelerator is inserted into the CPU's peripheral bus (PCI Express).
• The CPU executes the controlling program: it downloads the kernel program to the accelerator, the kernel program is executed, and the CPU reads back the results.
We need a strategy for mapping/unmapping the kernel programs to the accelerator in a suitable order.
Flow-model based programming: the Caravela platform
(1) Design a flow-model, embedding a kernel program written in DirectX, GLSL, CUDA, or OpenCL.
(2) The flow-model is stored in an XML file.
(3) Design a CPU program that maps the flow-model to the accelerator through the Caravela Library.
(4) The flow-model is executed.
Advantages:
1. The programmer focuses on designing the flow-model.
2. Flow-models are treated like libraries for stream computing.
3. Execution timing is optimized automatically.
Graphical programming on manycore accelerators
• How do we assign manycore accelerators to flow-models and find the execution flow automatically?
• How do we obtain optimized pipeline execution with concurrent execution?
Exploiting the execution order and parallelism from a pipeline flow
• Explicit parallelism: intuitively, these flow-models are executed in parallel, so we assign multiple flow-models to the available accelerators.
• Implicit parallelism: intuitively, we know the execution order, so we can assign an accelerator to each flow-model one by one; moreover, two flow-models can be executed in parallel when their buffers are used independently.
How can we exploit an execution order and the parallelism?
• How can we decide the execution order? (loop detection)
• How can we know which flow-models are concurrently executable? (execution ordering and elimination of buffer collisions)
These questions arise when we consider a continuous pipeline execution.
Research objective
• Graphical programming using flow-models needs:
• Finding a deterministic execution order
• Extracting parallelism: implicit and explicit parallelism
• An automatic pipeline order is defined for optimized pipeline execution.
We propose two algorithms:
(1) Finding a deterministic execution order
(2) The Parallelism Extraction Algorithm
Strategy
• Finding a deterministic execution order
• Finding the first execution flow-model
• Parallelism Extraction Algorithm
1. Finding an execution order
2. Extracting the implicit parallelism
3. Extracting the explicit parallelism
Basic execution condition: when all inputs are ready, the flow-model can be executed.
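The basic execution condition can be sketched as a simple readiness check (a minimal sketch; the function and port names are hypothetical, not from the Caravela implementation):

```python
# A flow-model is executable only when every input buffer holds data.
def is_executable(input_buffers):
    """input_buffers: dict mapping input port name -> data (None if empty)."""
    return all(data is not None for data in input_buffers.values())

# Example: a flow-model waits until both of its inputs are filled.
print(is_executable({"in0": [1, 2], "in1": None}))  # not yet ready
print(is_executable({"in0": [1, 2], "in1": [3]}))   # ready to execute
```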
Finding the first execution flow-model (Yamagiwa and Sousa, IJPEDS, 2008, World Scientific Pub.)
[Step 1] Enumerate all cyclic paths from all nodes:
fm1→fm2→fm4→fm1, fm2→fm4→fm1→fm2, fm4→fm1→fm2→fm4,
fm1→fm3→fm1, fm3→fm1→fm3
[Step 2] Sort the cyclic paths by the number of nodes included in each path.
[Step 3] Reduce the cyclic paths to the minimum set:
fm1→fm3→fm1 and fm1→fm2→fm4→fm1
[Results] In each minimal cycle (fm1→fm3 AND fm1→fm2→fm4), one of the edges must be initialized.
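The three steps can be sketched in Python (a hedged sketch; the example graph matches the slide, but the helper names are my own, not the Caravela implementation):

```python
# Directed graph of flow-models: fm1 -> fm2 -> fm4 -> fm1, and fm1 <-> fm3.
graph = {"fm1": ["fm2", "fm3"], "fm2": ["fm4"], "fm3": ["fm1"], "fm4": ["fm1"]}

def find_cycles(graph):
    """[Step 1] Enumerate simple cycles by DFS from every node,
    then [Step 2] sort them by the number of nodes."""
    cycles = set()
    def dfs(start, node, path):
        for nxt in graph.get(node, []):
            if nxt == start:
                # Canonicalize the rotation so fm2->fm4->fm1 == fm1->fm2->fm4.
                i = path.index(min(path))
                cycles.add(tuple(path[i:] + path[:i]))
            elif nxt not in path:
                dfs(start, nxt, path + [nxt])
    for n in graph:
        dfs(n, n, [n])
    return sorted(cycles, key=len)

def minimum_cycle_set(cycles):
    """[Step 3] Keep the shortest cycles until every cyclic node is covered."""
    covered, minimal = set(), []
    for c in cycles:
        if not set(c) <= covered:
            minimal.append(c)
            covered |= set(c)
    return minimal

cycles = minimum_cycle_set(find_cycles(graph))
# One edge in each remaining minimal cycle must hold initialized data,
# which fixes a deterministic first executable flow-model.
print(cycles)
```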
Parallelism Extraction Algorithm (PEA)
1. Define the execution order by grouping the graph into three flow-model groups (sub-graphs).
2. Number the groups 0, 1 and 2.
3. List the flow-models with the same number in the execution list.
4. Recursively repeat the operations above on the sub-graphs.
Grouping three flow-models and the sub-graphs
• Group sub-graphs of one or more flow-models.
• Organize the graph into three sub-graphs.
Numbering 0, 1 and 2 to the groups
• Number the sub-graphs 0, 1 and 2, starting from the first executable flow-model.
Listing the flow-models with the same number in the execution list
[Figure, frames (1)–(12): flow-models A, B and C numbered 0, 1 and 2 execute as a pipeline; flow-models with the same number run together, with sync and repeat marks for the continuous pipeline execution.]
Parallelism extraction from the three sub-graphs
[Figure: the sub-graphs numbered 0, 1 and 2 are executed as pipeline stages.]
Recursively repeating the previous operations
[Figure: for the chain A→B→C→D→E, the first pass numbers A:0, B:1 and the sub-graph {C, D, E}:2; the recursion numbers C:0', D:1', E:2'. The resulting pipeline steps are A; B; A C; B D; A C E.]
Implementation of the Parallelism Extraction Algorithm
We introduce:
• Execute matrix: ordering information is saved in the columns; parallel flows (flow-models with the same numbers) are saved in the rows.
• Serialize array: marks the serialized pattern at every recursive iteration.
• Batch matrix: the pipeline execution is saved.
Example: straight flow
[Figure: chain A→B→C→D→E. The execute matrix numbers A:0, B:1 and the sub-graph {C, D, E}:2, recursively refined to C:0', D:1', E:2' (0''). The serialize array holds F, F; the batch matrix lists the pipeline steps A; B; A C; B D; A C E.]
Maximum parallelism is 3.
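The batch schedule above can be reproduced by letting a new iteration enter the pipeline every two steps, since groups 0 and 1 each hold one flow-model (a hedged sketch of the resulting schedule, not the batch-matrix implementation itself):

```python
def batch_schedule(stages, steps):
    """Software-pipeline a straight chain: stage s of iteration i fires at
    time s + 2*i, i.e. a new iteration enters the pipe every 2 steps.
    Returns the list of stages active at each step."""
    schedule = []
    for t in range(steps):
        active = [st for s, st in enumerate(stages)
                  if t >= s and (t - s) % 2 == 0]
        schedule.append(active)
    return schedule

for step in batch_schedule(["A", "B", "C", "D", "E"], 5):
    print(" ".join(step))
# A / B / A C / B D / A C E  -> maximum parallelism is 3
```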
Example: flow with feedbacks
[Figure: flow-models A; B, C; D, E; F with feedback edges. The execute matrix numbers the groups 0, 1, 2; the serialize array holds T, so the recursion serializes the pattern. The batch matrix lists B C; D E; F A.]
Maximum parallelism is 2.
Performance Evaluation
• Image filtering: 2D FFT → high/low pass filter → 2D IFFT.
• 13 flow-models are included in the pipeline.
• After IFFT2, the results are generated.
• Using PEA to determine the execution flow and extract the parallelism.
• Executing on CarSh.
[Figure: pipeline of the image filter. Forward 2D FFT: Reorder1 → 1D FFT1 → Transpose1 → Reorder2 → 1D FFT2 → Transpose2; then the Filter (high/low pass, applied to the real and imaginary parts); then the 2D IFFT: Transpose3 → Reorder3 → 1D IFFT1 → Transpose4 → Reorder4 → 1D IFFT2 → Result.]
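The filtering pipeline computes the standard frequency-domain filter; a minimal numpy sketch of the same computation (not the Caravela/CarSh flow-model implementation; numpy's `fft2` performs internally the row/column passes that the pipeline expresses as reorder, 1D FFT and transpose stages):

```python
import numpy as np

def low_pass(image, cutoff):
    """Frequency-domain low-pass filtering: 2D FFT -> circular mask -> 2D IFFT."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))  # DC moved to the center
    h, w = image.shape
    y, x = np.ogrid[:h, :w]
    # Keep only frequencies within `cutoff` of the center (low-pass);
    # inverting the mask would give the high-pass filter instead.
    mask = (y - h // 2) ** 2 + (x - w // 2) ** 2 <= cutoff ** 2
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)))

img = np.random.rand(128, 128)
smooth = low_pass(img, 16)  # high-frequency content removed
```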
CarSh: command-line interface for manycore accelerators (Yamagiwa and Zhang, ICCS 2013)
Processing flow: exec/batch runs execA and execB concurrently, then exec/batch runs execC; the whole flow repeats 3 times. The corresponding CarSh batch, where "&" requests background execution and "sync" is the synchronization barrier:
repeat 3 exec/batch
execA &
execB &
sync
execC
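The semantics of the batch (background execution, sync barrier, repeat) can be mimicked in plain Python (a hypothetical analogue for illustration only; `run` stands in for submitting a flow-model executable and is not a CarSh API):

```python
from concurrent.futures import ThreadPoolExecutor

def run(name, log):
    log.append(name)  # stand-in for executing a flow-model on an accelerator

log = []
with ThreadPoolExecutor() as pool:
    for _ in range(3):                                    # "repeat 3"
        futures = [pool.submit(run, n, log)               # "execA &", "execB &"
                   for n in ("execA", "execB")]
        for f in futures:                                 # "sync" barrier
            f.result()
        run("execC", log)                                 # foreground "execC"
# Each iteration runs execA and execB in the background, waits, then execC.
```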
Applying PEA to the image filtering
[Figure: PEA schedules two chains in parallel — FFT1 → reorder2 → trans2 → trans3 → IFFT1 → reorder4, and reorder1 → trans1 → FFT2 → filter → reorder3 → trans4 — joined by sync points, ending with IFFT2, and repeated for the continuous pipeline. The timer is read at IFFT2; the execution time is calculated as the difference from the previous reading.]
Maximum parallelism is 7.
Performance results
• OpenCL on CPU and GPU.
• We measured the average stage time at every IFFT2, and the speedup with/without parallelization.
• CPU case: 4.9 times faster.
• GPU case: 1.4 times faster.
[Charts: serialized vs. parallelized execution time and speedup for image sizes 128*128 to 1024*1024. Left: GPU (Tesla 2050). Right: CPU (Core i7, Intel OpenCL).]
Conclusions and future direction
• Graphical programming for manycore accelerators: flow-model based programming needs
• finding an execution flow
• parallelism extraction in the pipeline flow
• Parallelism Extraction Algorithm: numbering 0, 1 and 2 to the flow-models.
• We are now implementing it on the GUI:
• Eclipse plug-in for the Caravela platform
• CarSh environment