TRANSCRIPT
Exploiting Execution Order and Parallelism
from Processing Flow Applying Pipeline-based Programming
Method on Manycore Accelerators
Shinichi Yamagiwa, University of Tsukuba, Japan
Table of contents
1. Research backgrounds
• Flow-model based programming
• Graphical programming on accelerators using flow-models
2. Finding an execution order
3. Parallelism Extraction Algorithm
4. Performance evaluation using manycore accelerators
5. Conclusions
Background – programming on manycore accelerators
• The programmer needs to write programs for both the CPU and the GPU.
• The accelerator is inserted into the CPU's peripheral bus (PCI Express).
• The CPU executes the controlling program: it downloads the kernel program to the accelerator, the kernel program is executed, and the CPU reads back the results.
We need a strategy for mapping/unmapping the kernel programs to the accelerator in a suitable order.
Flow-model based programming: the Caravela platform
(1) Design a flow-model, embedding a kernel program written in DirectX, GLSL, CUDA, or OpenCL.
(2) The flow-model is stored in an XML file.
(3) Design a CPU program that maps the flow-model to the accelerator through the Caravela Library.
(4) The flow-model is executed.
Advantages:
1. The programmer focuses on designing the flow-model.
2. Flow-models are treated like libraries for stream computing.
3. Execution timing is optimized automatically.
Graphical programming on manycore accelerators
• How do we assign manycore accelerators to flow-models and find the execution flow automatically?
• How do we obtain optimized pipeline execution with concurrent execution?
Exploiting the execution order and parallelism from a pipeline flow
• Explicit parallelism: intuitively, these flow-models are executed in parallel, so we assign multiple flow-models to the available accelerators.
• Implicit parallelism: intuitively, we know the execution order, so we can assign an accelerator to each flow-model one by one; moreover, two flow-models can be executed in parallel when their buffers are used independently.
How can we exploit an execution order and the parallelism?
• How can we decide the execution order? (loop detection)
• How can we know which flow-models are concurrently executable? (execution ordering and elimination of buffer collisions)
These questions arise when we consider a continuous pipeline execution.
Research objective
• Graphical programming using flow-models needs:
• Finding a deterministic execution order
• Extracting parallelism: implicit and explicit parallelism
• An automatic pipeline order is defined for optimized pipeline execution.
We propose two algorithms:
(1) Finding a deterministic execution order
(2) The Parallelism Extraction Algorithm
Strategy
• Finding a deterministic execution order
• Finding the first execution flow-model
• Parallelism Extraction Algorithm
1. Finding an execution order
2. Extracting the implicit parallelism
3. Extracting the explicit parallelism
Basic execution condition: when all inputs are ready, the flow-model can be executed.
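The basic execution condition can be sketched as a simple readiness check (a minimal sketch; the function and port names are hypothetical, not from the Caravela implementation):

```python
# A flow-model is executable only when every input buffer holds data.
def is_executable(input_buffers):
    """input_buffers: dict mapping input port name -> data (None if empty)."""
    return all(data is not None for data in input_buffers.values())

# Example: a flow-model waits until both of its inputs are filled.
print(is_executable({"in0": [1, 2], "in1": None}))  # not yet ready
print(is_executable({"in0": [1, 2], "in1": [3]}))   # ready to execute
```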
Finding the first execution flow-model (Yamagiwa and Sousa, IJPEDS, 2008, World Scientific Pub.)
[Step 1] Enumerate all cyclic paths from all nodes:
fm1→fm2→fm4→fm1, fm2→fm4→fm1→fm2, fm4→fm1→fm2→fm4,
fm1→fm3→fm1, fm3→fm1→fm3
[Step 2] Sort the cyclic paths by the number of nodes included in each path.
[Step 3] Reduce the cyclic paths to the minimum set:
fm1→fm3→fm1 and fm1→fm2→fm4→fm1
[Results] In each minimal cycle (fm1→fm3 AND fm1→fm2→fm4), one of the edges must be initialized.
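The three steps can be sketched in Python (a hedged sketch; the example graph matches the slide, but the helper names are my own, not the Caravela implementation):

```python
# Directed graph of flow-models: fm1 -> fm2 -> fm4 -> fm1, and fm1 <-> fm3.
graph = {"fm1": ["fm2", "fm3"], "fm2": ["fm4"], "fm3": ["fm1"], "fm4": ["fm1"]}

def find_cycles(graph):
    """[Step 1] Enumerate simple cycles by DFS from every node,
    then [Step 2] sort them by the number of nodes."""
    cycles = set()
    def dfs(start, node, path):
        for nxt in graph.get(node, []):
            if nxt == start:
                # Canonicalize the rotation so fm2->fm4->fm1 == fm1->fm2->fm4.
                i = path.index(min(path))
                cycles.add(tuple(path[i:] + path[:i]))
            elif nxt not in path:
                dfs(start, nxt, path + [nxt])
    for n in graph:
        dfs(n, n, [n])
    return sorted(cycles, key=len)

def minimum_cycle_set(cycles):
    """[Step 3] Keep the shortest cycles until every cyclic node is covered."""
    covered, minimal = set(), []
    for c in cycles:
        if not set(c) <= covered:
            minimal.append(c)
            covered |= set(c)
    return minimal

cycles = minimum_cycle_set(find_cycles(graph))
# One edge in each remaining minimal cycle must hold initialized data,
# which fixes a deterministic first executable flow-model.
print(cycles)
```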
Parallelism Extraction Algorithm (PEA)
1. Define the execution order by grouping the graph into three flow-model groups (sub-graphs).
2. Number the groups 0, 1 and 2.
3. List the flow-models with the same number in the execution list.
4. Recursively repeat the operations above on the sub-graphs.
Grouping three flow-models and the sub-graphs
• Group sub-graphs of one or more flow-models.
• Organize the graph into three sub-graphs.
Numbering 0, 1 and 2 to the groups
• Number the sub-graphs 0, 1 and 2, starting from the first executable flow-model.
Listing the flow-models with the same number in the execution list
[Figure, frames (1)–(12): flow-models A, B and C numbered 0, 1 and 2 execute as a pipeline; flow-models with the same number run together, with sync and repeat marks for the continuous pipeline execution.]
Parallelism extraction from the three sub-graphs
[Figure: the sub-graphs numbered 0, 1 and 2 are executed as pipeline stages.]
Recursively repeating the previous operations
[Figure: for the chain A→B→C→D→E, the first pass numbers A:0, B:1 and the sub-graph {C, D, E}:2; the recursion numbers C:0', D:1', E:2'. The resulting pipeline steps are A; B; A C; B D; A C E.]
Implementation of the Parallelism Extraction Algorithm
We introduce:
• Execute matrix: ordering information is saved in the columns; parallel flows (flow-models with the same numbers) are saved in the rows.
• Serialize array: marks the serialized pattern at every recursive iteration.
• Batch matrix: the pipeline execution is saved.
Example: straight flow
[Figure: chain A→B→C→D→E. The execute matrix numbers A:0, B:1 and the sub-graph {C, D, E}:2, recursively refined to C:0', D:1', E:2' (0''). The serialize array holds F, F; the batch matrix lists the pipeline steps A; B; A C; B D; A C E.]
Maximum parallelism is 3.
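The batch schedule above can be reproduced by letting a new iteration enter the pipeline every two steps, since groups 0 and 1 each hold one flow-model (a hedged sketch of the resulting schedule, not the batch-matrix implementation itself):

```python
def batch_schedule(stages, steps):
    """Software-pipeline a straight chain: stage s of iteration i fires at
    time s + 2*i, i.e. a new iteration enters the pipe every 2 steps.
    Returns the list of stages active at each step."""
    schedule = []
    for t in range(steps):
        active = [st for s, st in enumerate(stages)
                  if t >= s and (t - s) % 2 == 0]
        schedule.append(active)
    return schedule

for step in batch_schedule(["A", "B", "C", "D", "E"], 5):
    print(" ".join(step))
# A / B / A C / B D / A C E  -> maximum parallelism is 3
```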
Example: flow with feedbacks
[Figure: flow-models A; B, C; D, E; F with feedback edges. The execute matrix numbers the groups 0, 1, 2; the serialize array holds T, so the recursion serializes the pattern. The batch matrix lists B C; D E; F A.]
Maximum parallelism is 2.
Performance Evaluation
• Image filtering: 2D FFT → high/low pass filter → 2D IFFT.
• 13 flow-models are included in the pipeline.
• After IFFT2, the results are generated.
• Using PEA to determine the execution flow and extract the parallelism.
• Executing on CarSh.
[Figure: pipeline of the image filter. Forward 2D FFT: Reorder1 → 1D FFT1 → Transpose1 → Reorder2 → 1D FFT2 → Transpose2; then the Filter (high/low pass, applied to the real and imaginary parts); then the 2D IFFT: Transpose3 → Reorder3 → 1D IFFT1 → Transpose4 → Reorder4 → 1D IFFT2 → Result.]
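The filtering pipeline computes the standard frequency-domain filter; a minimal numpy sketch of the same computation (not the Caravela/CarSh flow-model implementation; numpy's `fft2` performs internally the row/column passes that the pipeline expresses as reorder, 1D FFT and transpose stages):

```python
import numpy as np

def low_pass(image, cutoff):
    """Frequency-domain low-pass filtering: 2D FFT -> circular mask -> 2D IFFT."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))  # DC moved to the center
    h, w = image.shape
    y, x = np.ogrid[:h, :w]
    # Keep only frequencies within `cutoff` of the center (low-pass);
    # inverting the mask would give the high-pass filter instead.
    mask = (y - h // 2) ** 2 + (x - w // 2) ** 2 <= cutoff ** 2
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)))

img = np.random.rand(128, 128)
smooth = low_pass(img, 16)  # high-frequency content removed
```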
CarSh: command-line interface for manycore accelerators (Yamagiwa and Zhang, ICCS 2013)
Processing flow: exec/batch runs execA and execB concurrently, then exec/batch runs execC; the whole flow repeats 3 times. The corresponding CarSh batch, where "&" requests background execution and "sync" is the synchronization barrier:
repeat 3 exec/batch
execA &
execB &
sync
execC
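The semantics of the batch (background execution, sync barrier, repeat) can be mimicked in plain Python (a hypothetical analogue for illustration only; `run` stands in for submitting a flow-model executable and is not a CarSh API):

```python
from concurrent.futures import ThreadPoolExecutor

def run(name, log):
    log.append(name)  # stand-in for executing a flow-model on an accelerator

log = []
with ThreadPoolExecutor() as pool:
    for _ in range(3):                                    # "repeat 3"
        futures = [pool.submit(run, n, log)               # "execA &", "execB &"
                   for n in ("execA", "execB")]
        for f in futures:                                 # "sync" barrier
            f.result()
        run("execC", log)                                 # foreground "execC"
# Each iteration runs execA and execB in the background, waits, then execC.
```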
Applying PEA to the image filtering
[Figure: PEA schedules two chains in parallel — FFT1 → reorder2 → trans2 → trans3 → IFFT1 → reorder4, and reorder1 → trans1 → FFT2 → filter → reorder3 → trans4 — joined by sync points, ending with IFFT2, and repeated for the continuous pipeline. The timer is read at IFFT2; the execution time is calculated as the difference from the previous reading.]
Maximum parallelism is 7.
Performance results
• OpenCL on CPU and GPU.
• We measured the average stage time at every IFFT2, and the speedup with/without parallelization.
• CPU case: 4.9 times faster.
• GPU case: 1.4 times faster.
[Charts: serialized vs. parallelized execution time and speedup for image sizes 128*128 to 1024*1024. Left: GPU (Tesla 2050). Right: CPU (Core i7, Intel OpenCL).]
Conclusions and future direction
• Graphical programming for manycore accelerators: flow-model based programming needs
• finding an execution flow
• parallelism extraction in the pipeline flow
• Parallelism Extraction Algorithm: numbering 0, 1 and 2 to the flow-models.
• We are now implementing it on the GUI:
• Eclipse plug-in for the Caravela platform
• CarSh environment