fpga-based accelerator design from a domain-specific … · 2016-09-14 · fpga-based accelerator...
TRANSCRIPT
FPGA-Based Accelerator Design from aDomain-Specific Language
M. Akif Özkan, Oliver Reiche, Frank Hannig, and Jürgen TeichHardware/Software Co-Design, Friedrich-Alexander University Erlangen-NürnbergFPL, August 29, 2016, Lausanne, Switzerland
Motivation
FPGAs High throughput, great energy efficiency
OpenCL Portable and open programming model for heterogeneoussystems, including CPUs, GPUs, DSPs etc.
© Copyright Khronos Group 2015 - Page 3
OpenCL – Portable Heterogeneous Computing•OpenCL = Two APIs and Two Kernel languages- C Platform Layer API to query, select and initialize compute devices- OpenCL C and (soon) OpenCL C++ kernel languages to write parallel code - C Runtime API to build and execute kernels across multiple devices
•One code tree can be executed on CPUs, GPUs, DSPs, FPGA and hardware- Dynamically balance work across available processors
OpenCL Kernel Code
OpenCL Kernel Code
OpenCL Kernel Code
OpenCL Kernel Code
GPU
DSPCPU
CPUFPGA
HW
Kernel code compiled for
devices
Devices
CPU
Host
Runtime APIloads and executes
kernels across devices
https://www.khronos.org/assets/uploads/developers/library/overview/opencl_overview.pdf
Motivation: High Level Synthesis using Altera OpenCL
+ Plug and accelerate: Automatic interface design+ Directives for exploiting data-level parallelism- Efficiency comes with hardware design patterns! Develop Altera OpenCL (AOC) software in Hardware Manner (HwM)
0 100 200 300 400 500 600 700
20
40
60
80
NDRange
SingleCU2
CU4
SIMD2
SIMD4
CU2/SIMD2
Line Buffer
LB CLAMP
LB CLAMP HwM
Throughput [frame/s]
Logi
cU
tiliz
atio
n[%
]
GPU KernelLine Buffer (LB)
Figure: Design points for a 5×5 Gaussian blur filter.
Motivation: High Level Synthesis using Altera OpenCL
+ Plug and accelerate: Automatic interface design+ Directives for exploiting data-level parallelism- Efficiency comes with hardware design patterns! Develop Altera OpenCL (AOC) software in Hardware Manner (HwM)
0 100 200 300 400 500 600 700
20
40
60
80
NDRange
SingleCU2
CU4
SIMD2
SIMD4
CU2/SIMD2
Line Buffer
LB CLAMP
LB CLAMP HwM
Throughput [frame/s]
Logi
cU
tiliz
atio
n[%
]
GPU KernelLine Buffer (LB)
Figure: Design points for a 5×5 Gaussian blur filter.
Motivation: High Level Synthesis using Altera OpenCL
+ Plug and accelerate: Automatic interface design+ Directives for exploiting data-level parallelism- Efficiency comes with hardware design patterns! Develop Altera OpenCL (AOC) software in Hardware Manner (HwM)
0 100 200 300 400 500 600 700
20
40
60
80
NDRange
SingleCU2
CU4
SIMD2
SIMD4
CU2/SIMD2
Line Buffer
LB CLAMP
LB CLAMP HwM
Throughput [frame/s]
Logi
cU
tiliz
atio
n[%
]
GPU KernelLine Buffer (LB)
Figure: Design points for a 5×5 Gaussian blur filter.
Motivation: High Level Synthesis using Altera OpenCL
+ Plug and accelerate: Automatic interface design+ Directives for exploiting data-level parallelism- Efficiency comes with hardware design patterns! Develop Altera OpenCL (AOC) software in Hardware Manner (HwM)
0 100 200 300 400 500 600 700
20
40
60
80
NDRange
SingleCU2
CU4
SIMD2
SIMD4
CU2/SIMD2
Line Buffer
LB CLAMP
LB CLAMP HwM
Throughput [frame/s]
Logi
cU
tiliz
atio
n[%
]
GPU KernelLine Buffer (LB)
Figure: Design points for a 5×5 Gaussian blur filter.
Motivation: High Level Synthesis using Altera OpenCL
+ Plug and accelerate: Automatic interface design+ Directives for exploiting data-level parallelism- Efficiency comes with hardware design patterns! Develop Altera OpenCL (AOC) software in Hardware Manner (HwM)
0 100 200 300 400 500 600 700
20
40
60
80
NDRange
SingleCU2
CU4
SIMD2
SIMD4
CU2/SIMD2
Line Buffer
LB CLAMP
LB CLAMP HwM
Throughput [frame/s]
Logi
cU
tiliz
atio
n[%
]
GPU KernelLine Buffer (LB)
Figure: Design points for a 5×5 Gaussian blur filter.
Outline
Programming MethodologyDeveloping Software in Hardware Manner
Efficient Code Generation through a Domain-Specific Language
Extensions to HIPAcc FrameworkAutomatic Bit-width Reduction
Evaluation and Results
Conclusion
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 0
Programming Methodology
Efficient OpenCL program for AOC
• Use hardware design patterns• Single-item line-buffered kernels for image processing
input output
win
dow
load
store
line
buffe
r
loca
l op
erat
or
f
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 1
Efficient OpenCL program for AOC
• Use hardware design patterns• Apply known hardware design optimizations
• Strength Reduction• Bit-width reduction• Use 1-bit control signals instead of repeated conditions
Listing 1: Source Code
c = a * 16;
Listing 2: Strength Reduction
c = a << 4;
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 1
Efficient OpenCL program for AOC
• Use hardware design patterns• Apply known hardware design optimizations
• Strength Reduction• Bit-width reduction• Use 1-bit control signals instead of repeated conditions
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 1
Efficient OpenCL program for AOC
• Use hardware design patterns• Apply known hardware design optimizations
• Strength Reduction• Bit-width reduction• Use 1-bit control signals instead of repeated conditions
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 1
Efficient OpenCL program for AOC
• Fetch best practices• Remove the else branch by moving its content to preceding area• Nested loops should always have constant bounds• Use temporary registers wherever it is possible• Limit life time of variables using scope operators• Avoid using 2-dimensional arrays• Writing to arrays should be sequential and simple as possible
Listing 1: Source Code
if(condition1){a = 3;
}else{a = 5;
}
Listing 2: Remove else condition
a = 5;if(condition1){
a = 3;}
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 2
Efficient OpenCL program for AOC
• Fetch best practices• Remove the else branch by moving its content to preceding area• Nested loops should always have constant bounds• Use temporary registers wherever it is possible• Limit life time of variables using scope operators• Avoid using 2-dimensional arrays• Writing to arrays should be sequential and simple as possible
Listing 1: Source Code
for (int i = 0; i < 5; ++i){for (int j = 0; j < i; ++j){
/* operations */}
}
Listing 2: Constant loop bounds
for (int i = 0; i < 5; ++i){for (int j = 0; j < 5; ++j){
if (j < i) {/* operations */
}}
}
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 2
Efficient OpenCL program for AOC
• Fetch best practices• Remove the else branch by moving its content to preceding area• Nested loops should always have constant bounds• Use temporary registers wherever it is possible• Limit life time of variables using scope operators• Avoid using 2-dimensional arrays• Writing to arrays should be sequential and simple as possible
Listing 1: Source Code
a = array[i];if (condition){
b = array[i];}c = array[i]*5;
Listing 2: Source Code
char temp = array[i];a = temp;if (condition){
b = temp;}c = temp *5;
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 2
Efficient OpenCL program for AOC
• Fetch best practices• Remove the else branch by moving its content to preceding area• Nested loops should always have constant bounds• Use temporary registers wherever it is possible• Limit life time of variables using scope operators• Avoid using 2-dimensional arrays• Writing to arrays should be sequential and simple as possible
Listing 1: Source Code
/* code portion1 */char temp;/* code_portion2(temp) *//* code_portion3 */
Listing 2: Scope Operators
/* code portion1 */{
char temp;/* code_portion2(temp) */
}/* code_portion3 */
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 2
Efficient OpenCL program for AOC
• Fetch best practices• Remove the else branch by moving its content to preceding area• Nested loops should always have constant bounds• Use temporary registers wherever it is possible• Limit life time of variables using scope operators• Avoid using 2-dimensional arrays• Writing to arrays should be sequential and simple as possible
Listing 1: Source Code
char array [5][5];
Listing 2: Avoid 2 dimensional arrays
char arrayRow1 [5];char arrayRow2 [5];char arrayRow3 [5];char arrayRow4 [5];char arrayRow5 [5];
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 2
Efficient OpenCL program for AOC
• Fetch best practices• Remove the else branch by moving its content to preceding area• Nested loops should always have constant bounds• Use temporary registers wherever it is possible• Limit life time of variables using scope operators• Avoid using 2-dimensional arrays• Writing to arrays should be sequential and simple as possible
Listing 1: Source Code
char array [7];// Load new datafor(int i=0; i<2; i++){
char newdata = newdata[i];array[i+3] = newdata;array[i+5] = newdata;
}// Shift only a portionif(condition){
array [5] = array [6];}
Listing 2: Simplify array access
char array1 [5];char array2 [2];// Load new datafor(int i=0; i<2; i++){
char newdata = newdata[i];array1[i+3] = newdata;array2[i] = newdata;
}// Shift only a portionif(condition){
array2 [0] = array2 [1];}
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 2
Proposed Approach: Develop Software in Hardware Manner
• Write Altera OpenCL code following these two principles:1. Calculate all the conditional cases first, decide at the end2. Exploit temporal locality
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 3
Proposed Approach: Develop Software in Hardware Manner
• Write Altera OpenCL code following these two principles:1. Calculate all the conditional cases first, decide at the end2. Exploit temporal locality
Listing 3: Software Manner
if (condition){calculate_a (...);result = a;
}else{calculate_b (...);result = b;
}
Listing 4: Hardware Manner
calculate_a (...);calculate_b (...);result = b;if (condition){
result = a;}
regreg in[i]compute(x, y, z) x y z
MUX
calculate_a(…)
calculate_b(…)
result
a
b
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 3
Proposed Approach: Develop Software in Hardware Manner
• Write Altera OpenCL code following these two principles:1. Calculate all the conditional cases first, decide at the end2. Exploit temporal locality
Listing 3: Software Manner
for(int i=0; i<SIZE; i++){x = in[i-2];y = in[i-1];z = in[i];out = compute(x, y, z);
}
Listing 4: Hardware Manner
for(int i=0; i<SIZE; i++){x = y;y = z;z = in[i];out = compute(x, y, z);
}
regreg in[i]compute(x, y, z) x y z
MUX
calculate_a(…)
calculate_b(…)
result
a
b
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 3
Writing HLS Code for AOC: CLAMP Boundary Handling (HW)
M M
M M
M N O P M N O P
M N O P M N O P
P P
P P
M M
M M
I I
I I
E E
E E
A A
A A
M N O P
I J K L
E F G H
A B C D
M N O P
I J K L
E F G H
A B C D
M N O P
I J K L
E F G H
A B C D
M N O P
I J K L
E F G H
A B C D
P P
P P
L L
L L
H H
H H
D D
D D
A A
A A
A B C D A B C D
A B C D A B C D
D D
D D
Figure: Clamp boundary condition on an image
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 4
Writing HLS Code for AOC: CLAMP Boundary Handling (HW)
M M
M M
M N O P M N O P
M N O P M N O P
P P
P P
M M
M M
I I
I I
E E
E E
A A
A A
M N O P
I J K L
E F G H
A B C D
M N O P
I J K L
E F G H
A B C D
M N O P
I J K L
E F G H
A B C D
M N O P
I J K L
E F G H
A B C D
P P
P P
L L
L L
H H
H H
D D
D D
A A
A A
A B C D A B C D
A B C D A B C D
D D
D D
BH_NO
BH_TL BH_T BH_TR
BH_L BH_R
BH_BL BH_B BH_BR
Figure: Clamp boundary condition on an image
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 4
Writing HLS Code for AOC: CLAMP Boundary Handling (HW)
reg30'
reg40
reg30
reg20
reg10
reg40'
reg00 M
UX
MUX
MUX
MUX
Line BufferMUX
reg31’
reg41
reg31
reg21
reg11
reg41’
reg01 M
UX
MUX
MUX
MUX
MUX
reg33’
reg43
reg33
reg23
reg13
reg43’
reg03 M
UX
MUX
MUX
MUX
MUX
reg34’
reg44
reg34
reg24
reg14
reg44’
reg04 M
UX
MUX
MUX
MUX
MUX
Line Buffer
Line Buffer
Pixel Stream
Line Buffer
reg32’
reg42
reg32
reg22
reg12
reg42’
reg02 M
UX
MUX
MUX
MUX
[3:4] ROWs SELECTION
5x5 LOCAL OPERATOR
[0:2] ROWs
Requires hardware design expertise!
Best practices on syntactic description matters!
No automatic kernel and accelerator replication!
Code Size increases upto 10x!
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Writing HLS Code for AOC: CLAMP Boundary Handling (HW)
reg30'
reg40
reg30
reg20
reg10
reg40'
reg00 M
UX
MUX
MUX
MUX
Line BufferMUX
reg31’
reg41
reg31
reg21
reg11
reg41’
reg01 M
UX
MUX
MUX
MUX
MUX
reg33’
reg43
reg33
reg23
reg13
reg43’
reg03 M
UX
MUX
MUX
MUX
MUX
reg34’
reg44
reg34
reg24
reg14
reg44’
reg04 M
UX
MUX
MUX
MUX
MUX
Line Buffer
Line Buffer
Pixel Stream
Line Buffer
reg32’
reg42
reg32
reg22
reg12
reg42’
reg02 M
UX
MUX
MUX
MUX
[3:4] ROWs SELECTION
5x5 LOCAL OPERATOR
[0:2] ROWs
Requires hardware design expertise!
Best practices on syntactic description matters!
No automatic kernel and accelerator replication!
Code Size increases upto 10x!
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Writing HLS Code for AOC: CLAMP Boundary Handling (HW)
reg30'
reg40
reg30
reg20
reg10
reg40'
reg00 M
UX
MUX
MUX
MUX
Line BufferMUX
reg31’
reg41
reg31
reg21
reg11
reg41’
reg01 M
UX
MUX
MUX
MUX
MUX
reg33’
reg43
reg33
reg23
reg13
reg43’
reg03 M
UX
MUX
MUX
MUX
MUX
reg34’
reg44
reg34
reg24
reg14
reg44’
reg04 M
UX
MUX
MUX
MUX
MUX
Line Buffer
Line Buffer
Pixel Stream
Line Buffer
reg32’
reg42
reg32
reg22
reg12
reg42’
reg02 M
UX
MUX
MUX
MUX
[3:4] ROWs SELECTION
5x5 LOCAL OPERATOR
[0:2] ROWs
Requires hardware design expertise!
Best practices on syntactic description matters!
No automatic kernel and accelerator replication!
Code Size increases upto 10x!
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Writing HLS Code for AOC: CLAMP Boundary Handling (HW)
reg30'
reg40
reg30
reg20
reg10
reg40'
reg00 M
UX
MUX
MUX
MUX
Line BufferMUX
reg31’
reg41
reg31
reg21
reg11
reg41’
reg01 M
UX
MUX
MUX
MUX
MUX
reg33’
reg43
reg33
reg23
reg13
reg43’
reg03 M
UX
MUX
MUX
MUX
MUX
reg34’
reg44
reg34
reg24
reg14
reg44’
reg04 M
UX
MUX
MUX
MUX
MUX
Line Buffer
Line Buffer
Pixel Stream
Line Buffer
reg32’
reg42
reg32
reg22
reg12
reg42’
reg02 M
UX
MUX
MUX
MUX
[3:4] ROWs SELECTION
5x5 LOCAL OPERATOR
[0:2] ROWs
Requires hardware design expertise!
Best practices on syntactic description matters!
No automatic kernel and accelerator replication!
Code Size increases upto 10x!
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Writing HLS Code for AOC: CLAMP Boundary Handling (HW)
reg30'
reg40
reg30
reg20
reg10
reg40'
reg00 M
UX
MUX
MUX
MUX
Line BufferMUX
reg31’
reg41
reg31
reg21
reg11
reg41’
reg01 M
UX
MUX
MUX
MUX
MUX
reg33’
reg43
reg33
reg23
reg13
reg43’
reg03 M
UX
MUX
MUX
MUX
MUX
reg34’
reg44
reg34
reg24
reg14
reg44’
reg04 M
UX
MUX
MUX
MUX
MUX
Line Buffer
Line Buffer
Pixel Stream
Line Buffer
reg32’
reg42
reg32
reg22
reg12
reg42’
reg02 M
UX
MUX
MUX
MUX
[3:4] ROWs SELECTION
5x5 LOCAL OPERATOR
[0:2] ROWs
Requires hardware design expertise!
Best practices on syntactic description matters!
No automatic kernel and accelerator replication!
Code Size increases upto 10x!
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Efficient Code Generation through aDomain-Specific Language
HIPAcc: The Heterogeneous Image Processing AccelerationFramework
C++
embedded DSL
Source-to-SourceCompiler
Clang/LLVM
Domain
Knowledge
Architecture
Knowledge
CUDA(GPU)
OpenCL(x86/GPU)
C/C++
(x86)
Renderscript(x86/ARM/GPU)
OpenCL(ALTERA FPGA)
Vivado C++
(FPGA)
CUDA/OpenCL/Renderscript Runtime LibraryCUDA/OpenCL/Renderscript Runtime Library AOCL Vivado HLS
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 6
Operators in the Image Processing Domain
We can define three characteristic data operations in the domain:
Point Operators:Output data is determined by single input data
Local Operators:Output data is determined by local region of the inputdata
Global Operators:Output data is determined by all of the input data
input image output image
input image output image
input image output image
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 7
Streaming Pipeline: Harris Corner Detection Example
Transform sequential execution order. . .
dx
dy
sx
sxy
sy
gx
gxy
gy
hcinput outputin
dx
dy
sx
in’
dx’
dy’
sx’
in’’
dx’’
Figure: HIPAcc’s sequential execution for the Harris corner detector
. . . into streaming pipeline of FPGA kernels.
dx
dy
sxy
sy
gxy
gy
hcinput output
sx gx
Figure: Representation of AOC kernels
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 8
Streaming Pipeline: Harris Corner Detection Example
Transform sequential execution order. . .
dx
dy
sx
sxy
sy
gx
gxy
gy
hcinput outputin
dx
dy
sx
in’
dx’
dy’
sx’
in’’
dx’’
Figure: HIPAcc’s sequential execution for the Harris corner detector
. . . into streaming pipeline of FPGA kernels.
dx
dy
sxy
sy
gxy
gy
hcinput output
sx gx
Figure: Representation of AOC kernels
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 8
Example: Laplacian Operator
// coefficients for Laplacian operatorconst int coef [3][3] = { { 0, 1, 0 },
{ 1, -4, 1 },{ 0, 1, 0 } };
// read inputuchar4 *image_bits = readImage ();
Image <uchar4 > in(width , height , inputImage);Image <uchar4 > out(width , height);
// load image datain = image_bits;
// Mask (Stencil) of local operatorMask <int > mask(coef);
// reading from in with mirroring as boundary conditionBoundaryCondition <uchar4 > bound(in, mask , BOUNDARY_MIRROR);Accessor <uchar4 > acc(bound);
// output imageIterationSpace <uchar4 > iter(out);
// define kernelLaplacian filter(iter , acc , mask);
// execute kernelfilter.execute ();
//Write outputuchar4* out = out.data();
InputImage
BoundaryCondition
Accessor
KERNEL
IterationSpace
Mask
HIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling modes
Mask convolution mask
Output image Crop of outputimage
Crop of outputimage with offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Output Image Crop of output image
Crop of output imagewith offset
OutputImage
HIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling modes
Mask convolution mask
C DG H
A B C DE F G H
A BE F
O PK LG HC D
M N O PI J K LE F G HA B C D
M NI JE FA B
O PK L
M N O PI J K L
M NI J
Repeat
M MM M
M N O PM N O P
P PP P
M MI IE EA A
M N O PI J K LE F G HA B C D
P PL LH HD D
A AA A
A B C DA B C D
D DD D
Clamp
I MJ N
M N O PI J K L
P LO K
N MJ IF EB A
M N O PI J K LE F G HA B C D
P OL KH GD C
E AF B
A B C DE F G H
D HC G
Mirror
Q QQ Q
Q Q Q QQ Q Q Q
Q QQ Q
Q QQ QQ QQ Q
M N O PI J K LE F G HA B C D
Q QQ QQ QQ Q
Q QQ Q
Q Q Q QQ Q Q Q
Q QQ Q
Constant
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Repeat Clamp Mirror ConstantHIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling mode
Mask convolution mask
Image and boundary Image crop Image crop withoffset
Image offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Image andBoundary Image crop Image crop
with offset Image offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 9
Example: Laplacian Operator
// coefficients for Laplacian operatorconst int coef [3][3] = { { 0, 1, 0 },
{ 1, -4, 1 },{ 0, 1, 0 } };
// read inputuchar4 *image_bits = readImage ();
Image <uchar4 > in(width , height , inputImage);Image <uchar4 > out(width , height);
// load image datain = image_bits;
// Mask (Stencil) of local operatorMask <int > mask(coef);
// reading from in with mirroring as boundary conditionBoundaryCondition <uchar4 > bound(in, mask , BOUNDARY_MIRROR);Accessor <uchar4 > acc(bound);
// output imageIterationSpace <uchar4 > iter(out);
// define kernelLaplacian filter(iter , acc , mask);
// execute kernelfilter.execute ();
//Write outputuchar4* out = out.data();
InputImage
BoundaryCondition
Accessor
KERNEL
IterationSpace
Mask
HIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling modes
Mask convolution mask
Output image Crop of outputimage
Crop of outputimage with offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Output Image Crop of output image
Crop of output imagewith offset
OutputImage
HIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling modes
Mask convolution mask
C DG H
A B C DE F G H
A BE F
O PK LG HC D
M N O PI J K LE F G HA B C D
M NI JE FA B
O PK L
M N O PI J K L
M NI J
Repeat
M MM M
M N O PM N O P
P PP P
M MI IE EA A
M N O PI J K LE F G HA B C D
P PL LH HD D
A AA A
A B C DA B C D
D DD D
Clamp
I MJ N
M N O PI J K L
P LO K
N MJ IF EB A
M N O PI J K LE F G HA B C D
P OL KH GD C
E AF B
A B C DE F G H
D HC G
Mirror
Q QQ Q
Q Q Q QQ Q Q Q
Q QQ Q
Q QQ QQ QQ Q
M N O PI J K LE F G HA B C D
Q QQ QQ QQ Q
Q QQ Q
Q Q Q QQ Q Q Q
Q QQ Q
Constant
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Repeat Clamp Mirror ConstantHIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling mode
Mask convolution mask
Image and boundary Image crop Image crop withoffset
Image offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Image andBoundary Image crop Image crop
with offset Image offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 9
Example: Laplacian Operator
// coefficients for Laplacian operatorconst int coef [3][3] = { { 0, 1, 0 },
{ 1, -4, 1 },{ 0, 1, 0 } };
// read inputuchar4 *image_bits = readImage ();
Image <uchar4 > in(width , height , inputImage);Image <uchar4 > out(width , height);
// load image datain = image_bits;
// Mask (Stencil) of local operatorMask <int > mask(coef);
// reading from in with mirroring as boundary conditionBoundaryCondition <uchar4 > bound(in, mask , BOUNDARY_MIRROR);Accessor <uchar4 > acc(bound);
// output imageIterationSpace <uchar4 > iter(out);
// define kernelLaplacian filter(iter , acc , mask);
// execute kernelfilter.execute ();
//Write outputuchar4* out = out.data();
InputImage
BoundaryCondition
Accessor
KERNEL
IterationSpace
Mask
HIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling modes
Mask convolution mask
Output image Crop of outputimage
Crop of outputimage with offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Output Image Crop of output image
Crop of output imagewith offset
OutputImage
HIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling modes
Mask convolution mask
C DG H
A B C DE F G H
A BE F
O PK LG HC D
M N O PI J K LE F G HA B C D
M NI JE FA B
O PK L
M N O PI J K L
M NI J
Repeat
M MM M
M N O PM N O P
P PP P
M MI IE EA A
M N O PI J K LE F G HA B C D
P PL LH HD D
A AA A
A B C DA B C D
D DD D
Clamp
I MJ N
M N O PI J K L
P LO K
N MJ IF EB A
M N O PI J K LE F G HA B C D
P OL KH GD C
E AF B
A B C DE F G H
D HC G
Mirror
Q QQ Q
Q Q Q QQ Q Q Q
Q QQ Q
Q QQ QQ QQ Q
M N O PI J K LE F G HA B C D
Q QQ QQ QQ Q
Q QQ Q
Q Q Q QQ Q Q Q
Q QQ Q
Constant
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Repeat Clamp Mirror ConstantHIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling mode
Mask convolution mask
Image and boundary Image crop Image crop withoffset
Image offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Image andBoundary Image crop Image crop
with offset Image offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 9
Example: Laplacian Operator
// coefficients for Laplacian operatorconst int coef [3][3] = { { 0, 1, 0 },
{ 1, -4, 1 },{ 0, 1, 0 } };
// read inputuchar4 *image_bits = readImage ();
Image <uchar4 > in(width , height , inputImage);Image <uchar4 > out(width , height);
// load image datain = image_bits;
// Mask (Stencil) of local operatorMask <int > mask(coef);
// reading from in with mirroring as boundary conditionBoundaryCondition <uchar4 > bound(in, mask , BOUNDARY_MIRROR);Accessor <uchar4 > acc(bound);
// output imageIterationSpace <uchar4 > iter(out);
// define kernelLaplacian filter(iter , acc , mask);
// execute kernelfilter.execute ();
//Write outputuchar4* out = out.data();
InputImage
BoundaryCondition
Accessor
KERNEL
IterationSpace
Mask
HIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling modes
Mask convolution mask
Output image Crop of outputimage
Crop of outputimage with offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Output Image Crop of output image
Crop of output imagewith offset
OutputImage
HIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling modes
Mask convolution mask
C DG H
A B C DE F G H
A BE F
O PK LG HC D
M N O PI J K LE F G HA B C D
M NI JE FA B
O PK L
M N O PI J K L
M NI J
Repeat
M MM M
M N O PM N O P
P PP P
M MI IE EA A
M N O PI J K LE F G HA B C D
P PL LH HD D
A AA A
A B C DA B C D
D DD D
Clamp
I MJ N
M N O PI J K L
P LO K
N MJ IF EB A
M N O PI J K LE F G HA B C D
P OL KH GD C
E AF B
A B C DE F G H
D HC G
Mirror
Q QQ Q
Q Q Q QQ Q Q Q
Q QQ Q
Q QQ QQ QQ Q
M N O PI J K LE F G HA B C D
Q QQ QQ QQ Q
Q QQ Q
Q Q Q QQ Q Q Q
Q QQ Q
Constant
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Repeat Clamp Mirror ConstantHIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling mode
Mask convolution mask
Image and boundary Image crop Image crop withoffset
Image offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Image andBoundary Image crop Image crop
with offset Image offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 9
Example: Laplacian Operator
// coefficients for Laplacian operatorconst int coef [3][3] = { { 0, 1, 0 },
{ 1, -4, 1 },{ 0, 1, 0 } };
// read inputuchar4 *image_bits = readImage ();
Image <uchar4 > in(width , height , inputImage);Image <uchar4 > out(width , height);
// load image datain = image_bits;
// Mask (Stencil) of local operatorMask <int > mask(coef);
// reading from in with mirroring as boundary conditionBoundaryCondition <uchar4 > bound(in, mask , BOUNDARY_MIRROR);Accessor <uchar4 > acc(bound);
// output imageIterationSpace <uchar4 > iter(out);
// define kernelLaplacian filter(iter , acc , mask);
// execute kernelfilter.execute ();
//Write outputuchar4* out = out.data();
InputImage
BoundaryCondition
Accessor
KERNEL
IterationSpace
Mask
HIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling modes
Mask convolution mask
Output image Crop of outputimage
Crop of outputimage with offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Output Image Crop of output image
Crop of output imagewith offset
OutputImage
HIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling modes
Mask convolution mask
C DG H
A B C DE F G H
A BE F
O PK LG HC D
M N O PI J K LE F G HA B C D
M NI JE FA B
O PK L
M N O PI J K L
M NI J
Repeat
M MM M
M N O PM N O P
P PP P
M MI IE EA A
M N O PI J K LE F G HA B C D
P PL LH HD D
A AA A
A B C DA B C D
D DD D
Clamp
I MJ N
M N O PI J K L
P LO K
N MJ IF EB A
M N O PI J K LE F G HA B C D
P OL KH GD C
E AF B
A B C DE F G H
D HC G
Mirror
Q QQ Q
Q Q Q QQ Q Q Q
Q QQ Q
Q QQ QQ QQ Q
M N O PI J K LE F G HA B C D
Q QQ QQ QQ Q
Q QQ Q
Q Q Q QQ Q Q Q
Q QQ Q
Constant
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Repeat Clamp Mirror Constant
HIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling mode
Mask convolution mask
Image and boundary Image crop Image crop withoffset
Image offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Image andBoundary Image crop Image crop
with offset Image offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 9
Example: Laplacian Operator
// coefficients for Laplacian operatorconst int coef [3][3] = { { 0, 1, 0 },
{ 1, -4, 1 },{ 0, 1, 0 } };
// read inputuchar4 *image_bits = readImage ();
Image <uchar4 > in(width , height , inputImage);Image <uchar4 > out(width , height);
// load image datain = image_bits;
// Mask (Stencil) of local operatorMask <int > mask(coef);
// reading from in with mirroring as boundary conditionBoundaryCondition <uchar4 > bound(in, mask , BOUNDARY_MIRROR);Accessor <uchar4 > acc(bound);
// output imageIterationSpace <uchar4 > iter(out);
// define kernelLaplacian filter(iter , acc , mask);
// execute kernelfilter.execute ();
//Write outputuchar4* out = out.data();
InputImage
BoundaryCondition
Accessor
KERNEL
IterationSpace
Mask
HIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling modes
Mask convolution mask
Output image Crop of outputimage
Crop of outputimage with offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Output Image Crop of output image
Crop of output imagewith offset
OutputImage
HIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling modes
Mask convolution mask
C DG H
A B C DE F G H
A BE F
O PK LG HC D
M N O PI J K LE F G HA B C D
M NI JE FA B
O PK L
M N O PI J K L
M NI J
Repeat
M MM M
M N O PM N O P
P PP P
M MI IE EA A
M N O PI J K LE F G HA B C D
P PL LH HD D
A AA A
A B C DA B C D
D DD D
Clamp
I MJ N
M N O PI J K L
P LO K
N MJ IF EB A
M N O PI J K LE F G HA B C D
P OL KH GD C
E AF B
A B C DE F G H
D HC G
Mirror
Q QQ Q
Q Q Q QQ Q Q Q
Q QQ Q
Q QQ QQ QQ Q
M N O PI J K LE F G HA B C D
Q QQ QQ QQ Q
Q QQ Q
Q Q Q QQ Q Q Q
Q QQ Q
Constant
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Repeat Clamp Mirror ConstantHIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling mode
Mask convolution mask
Image and boundary Image crop Image crop withoffset
Image offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Image andBoundary Image crop Image crop
with offset Image offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 9
Example: Laplacian Operator
// coefficients for Laplacian operatorconst int coef [3][3] = { { 0, 1, 0 },
{ 1, -4, 1 },{ 0, 1, 0 } };
// read inputuchar4 *image_bits = readImage ();
Image <uchar4 > in(width , height , inputImage);Image <uchar4 > out(width , height);
// load image datain = image_bits;
// Mask (Stencil) of local operatorMask <int > mask(coef);
// reading from in with mirroring as boundary conditionBoundaryCondition <uchar4 > bound(in, mask , BOUNDARY_MIRROR);Accessor <uchar4 > acc(bound);
// output imageIterationSpace <uchar4 > iter(out);
// define kernelLaplacian filter(iter , acc , mask);
// execute kernelfilter.execute ();
//Write outputuchar4* out = out.data();
InputImage
BoundaryCondition
Accessor
KERNEL
IterationSpace
Mask
HIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling modes
Mask convolution mask
Output image Crop of outputimage
Crop of outputimage with offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Output Image Crop of output image
Crop of output imagewith offset
OutputImage
HIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling modes
Mask convolution mask
C DG H
A B C DE F G H
A BE F
O PK LG HC D
M N O PI J K LE F G HA B C D
M NI JE FA B
O PK L
M N O PI J K L
M NI J
Repeat
M MM M
M N O PM N O P
P PP P
M MI IE EA A
M N O PI J K LE F G HA B C D
P PL LH HD D
A AA A
A B C DA B C D
D DD D
Clamp
I MJ N
M N O PI J K L
P LO K
N MJ IF EB A
M N O PI J K LE F G HA B C D
P OL KH GD C
E AF B
A B C DE F G H
D HC G
Mirror
Q QQ Q
Q Q Q QQ Q Q Q
Q QQ Q
Q QQ QQ QQ Q
M N O PI J K LE F G HA B C D
Q QQ QQ QQ Q
Q QQ Q
Q Q Q QQ Q Q Q
Q QQ Q
Constant
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Repeat Clamp Mirror ConstantHIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling mode
Mask convolution mask
Image and boundary Image crop Image crop withoffset
Image offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Image andBoundary Image crop Image crop
with offset Image offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 9
Example: Laplacian Operator
// coefficients for Laplacian operatorconst int coef [3][3] = { { 0, 1, 0 },
{ 1, -4, 1 },{ 0, 1, 0 } };
// read inputuchar4 *image_bits = readImage ();
Image <uchar4 > in(width , height , inputImage);Image <uchar4 > out(width , height);
// load image datain = image_bits;
// Mask (Stencil) of local operatorMask <int > mask(coef);
// reading from in with mirroring as boundary conditionBoundaryCondition <uchar4 > bound(in, mask , BOUNDARY_MIRROR);Accessor <uchar4 > acc(bound);
// output imageIterationSpace <uchar4 > iter(out);
// define kernelLaplacian filter(iter , acc , mask);
// execute kernelfilter.execute ();
//Write outputuchar4* out = out.data();
InputImage
BoundaryCondition
Accessor
KERNEL
IterationSpace
Mask
HIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling modes
Mask convolution mask
Output image Crop of outputimage
Crop of outputimage with offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Output Image Crop of output image
Crop of output imagewith offset
OutputImage
HIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling modes
Mask convolution mask
C DG H
A B C DE F G H
A BE F
O PK LG HC D
M N O PI J K LE F G HA B C D
M NI JE FA B
O PK L
M N O PI J K L
M NI J
Repeat
M MM M
M N O PM N O P
P PP P
M MI IE EA A
M N O PI J K LE F G HA B C D
P PL LH HD D
A AA A
A B C DA B C D
D DD D
Clamp
I MJ N
M N O PI J K L
P LO K
N MJ IF EB A
M N O PI J K LE F G HA B C D
P OL KH GD C
E AF B
A B C DE F G H
D HC G
Mirror
Q QQ Q
Q Q Q QQ Q Q Q
Q QQ Q
Q QQ QQ QQ Q
M N O PI J K LE F G HA B C D
Q QQ QQ QQ Q
Q QQ Q
Q Q Q QQ Q Q Q
Q QQ Q
Constant
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Repeat Clamp Mirror ConstantHIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling mode
Mask convolution mask
Image and boundary Image crop Image crop withoffset
Image offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Image andBoundary Image crop Image crop
with offset Image offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 9
Example: Laplacian Operator
// coefficients for Laplacian operatorconst int coef [3][3] = { { 0, 1, 0 },
{ 1, -4, 1 },{ 0, 1, 0 } };
// read inputuchar4 *image_bits = readImage ();
Image <uchar4 > in(width , height , inputImage);Image <uchar4 > out(width , height);
// load image datain = image_bits;
// Mask (Stencil) of local operatorMask <int > mask(coef);
// reading from in with mirroring as boundary conditionBoundaryCondition <uchar4 > bound(in, mask , BOUNDARY_MIRROR);Accessor <uchar4 > acc(bound);
// output imageIterationSpace <uchar4 > iter(out);
// define kernelLaplacian filter(iter , acc , mask);
// execute kernelfilter.execute ();
//Write outputuchar4* out = out.data();
InputImage
BoundaryCondition
Accessor
KERNEL
IterationSpace
Mask
HIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling modes
Mask convolution mask
Output image Crop of outputimage
Crop of outputimage with offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Output Image Crop of output image
Crop of output imagewith offset
OutputImage
HIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling modes
Mask convolution mask
C DG H
A B C DE F G H
A BE F
O PK LG HC D
M N O PI J K LE F G HA B C D
M NI JE FA B
O PK L
M N O PI J K L
M NI J
Repeat
M MM M
M N O PM N O P
P PP P
M MI IE EA A
M N O PI J K LE F G HA B C D
P PL LH HD D
A AA A
A B C DA B C D
D DD D
Clamp
I MJ N
M N O PI J K L
P LO K
N MJ IF EB A
M N O PI J K LE F G HA B C D
P OL KH GD C
E AF B
A B C DE F G H
D HC G
Mirror
Q QQ Q
Q Q Q QQ Q Q Q
Q QQ Q
Q QQ QQ QQ Q
M N O PI J K LE F G HA B C D
Q QQ QQ QQ Q
Q QQ Q
Q Q Q QQ Q Q Q
Q QQ Q
Constant
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Repeat Clamp Mirror ConstantHIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling mode
Mask convolution mask
Image and boundary Image crop Image crop withoffset
Image offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Image andBoundary Image crop Image crop
with offset Image offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 9
Example: Laplacian Operator
// coefficients for Laplacian operatorconst int coef [3][3] = { { 0, 1, 0 },
{ 1, -4, 1 },{ 0, 1, 0 } };
// read inputuchar4 *image_bits = readImage ();
Image <uchar4 > in(width , height , inputImage);Image <uchar4 > out(width , height);
// load image datain = image_bits;
// Mask (Stencil) of local operatorMask <int > mask(coef);
// reading from in with mirroring as boundary conditionBoundaryCondition <uchar4 > bound(in, mask , BOUNDARY_MIRROR);Accessor <uchar4 > acc(bound);
// output imageIterationSpace <uchar4 > iter(out);
// define kernelLaplacian filter(iter , acc , mask);
// execute kernelfilter.execute ();
//Write outputuchar4* out = out.data();
InputImage
BoundaryCondition
Accessor
KERNEL
IterationSpace
Mask
HIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling modes
Mask convolution mask
Output image Crop of outputimage
Crop of outputimage with offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Output Image Crop of output image
Crop of output imagewith offset
OutputImage
HIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling modes
Mask convolution mask
C DG H
A B C DE F G H
A BE F
O PK LG HC D
M N O PI J K LE F G HA B C D
M NI JE FA B
O PK L
M N O PI J K L
M NI J
Repeat
M MM M
M N O PM N O P
P PP P
M MI IE EA A
M N O PI J K LE F G HA B C D
P PL LH HD D
A AA A
A B C DA B C D
D DD D
Clamp
I MJ N
M N O PI J K L
P LO K
N MJ IF EB A
M N O PI J K LE F G HA B C D
P OL KH GD C
E AF B
A B C DE F G H
D HC G
Mirror
Q QQ Q
Q Q Q QQ Q Q Q
Q QQ Q
Q QQ QQ QQ Q
M N O PI J K LE F G HA B C D
Q QQ QQ QQ Q
Q QQ Q
Q Q Q QQ Q Q Q
Q QQ Q
Constant
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Repeat Clamp Mirror ConstantHIPAcc: The Heterogeneous Image Processing AccelerationFramework
Domain-specific Extensions
IterationSpace defines ROI of the output image
Accessor input ROI with filtering
BoundaryCondition boundary handling mode
Mask convolution mask
Image and boundary Image crop Image crop withoffset
Image offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5
Image andBoundary Image crop Image crop
with offset Image offset
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 9
Example: Laplacian Operator Kernel
class Laplacian : public Kernel <uchar4 > {private:
Accessor <uchar4 > &input;Mask <int > &mask;
public:Laplacian(IterationSpace <uchar4 > &iter ,
Accessor <uchar4 > &input , Mask <int > &mask): Kernel(iter), input(input), mask(mask) {
addAccessor (& input);}
void kernel () {int4 sum = convolve(mask , HipaccSUM , [&] () -> int4 {
return mask() * convert_int4(input(mask));});
sum = max(sum , 0);sum = min(sum , 255);output () = convert_uchar4(sum);
}};
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 10
Example: Laplacian Operator Kernel
class Laplacian : public Kernel <uchar4 > {private:
Accessor <uchar4 > &input;Mask <int > &mask;
public:Laplacian(IterationSpace <uchar4 > &iter ,
Accessor <uchar4 > &input , Mask <int > &mask): Kernel(iter), input(input), mask(mask) {
addAccessor (&input);}
void kernel () {int4 sum = convolve(mask , HipaccSUM , [&] () -> int4 {
return mask() * convert_int4(input(mask));});
sum = max(sum , 0);sum = min(sum , 255);output () = convert_uchar4(sum);
}};
Convolution call
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 10
Extensions to HIPAcc Framework
Arbitrary Bit Width Hardware Design
- The OpenCL standard contains primitive data types only, e.g., char/short/int
+ AOC supports arbitrary bit width declaration
- AOC arbitrary bit width declaration is masking
- Compound assignment operations need to be expanded
- Unary operations need to be expanded
Utilizing exact bit widths is tedious in Altera OpenCL 14.1!
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 11
Arbitrary Bit Width Hardware Design
- The OpenCL standard contains primitive data types only, e.g., char/short/int
+ AOC supports arbitrary bit width declaration
- AOC arbitrary bit width declaration is masking
x = (RHS) & 0x7;
- Compound assignment operations need to be expanded
- Unary operations need to be expanded
Utilizing exact bit widths is tedious in Altera OpenCL 14.1!
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 11
Arbitrary Bit Width Hardware Design
- The OpenCL standard contains primitive data types only, e.g., char/short/int
+ AOC supports arbitrary bit width declaration
- AOC arbitrary bit width declaration is masking
x = (RHS) & 0x7;
- Compound assignment operations need to be expanded
// x += (RHS) & 0x7;x = ((x & 0x7) + RHS) & 0x7;
- Unary operations need to be expanded
// x++;x = ((x & 0x7) + 1) & 0x7;
Utilizing exact bit widths is tedious in Altera OpenCL 14.1!
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 11
Arbitrary Bit Width Hardware Design
- The OpenCL standard contains primitive data types only, e.g., char/short/int
+ AOC supports arbitrary bit width declaration
- AOC arbitrary bit width declaration is masking
x = (RHS) & 0x7;
- Compound assignment operations need to be expanded
// x += (RHS) & 0x7;x = ((x & 0x7) + RHS) & 0x7;
- Unary operations need to be expanded
// x++;x = ((x & 0x7) + 1) & 0x7;
Utilizing exact bit widths is tedious in Altera OpenCL 14.1!
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 11
Automatic Bit Width Reduction: An extension to HIPAcc
Listing 5: Gaussian blur in HIPAcc with annotated bit widths
#pragma hipacc bw(sum ,12)uint sum = 0;#pragma hipacc bw(x,2)uint x = 0;#pragma hipacc bw(y,2)uint y = 0;for (y = 0; y < size; ++y) {
for (x = 0; x < size; ++x) {sum += mask[y][x] * Input(x-1,y-1);
}}sum /= 16;output () = sum;
↓Listing 6: Gaussian blur in HIPAcc with annotated bit widths
uint sum = 0;uint x = 0;uint y = 0;for (y = ((0) & 3); ((y) & 3) < size; y = ((((y) & 3) + 1) & 3)){
for (x = ((0) & 3); ((x) & 3) < size; x = ((((x) & 3) + 1) & 3)){sum = ((((sum) & 4095) + mask[y][x]
* getWindowAt(Input , 1 + ((x) & 3) - 1, 1 + ((y) & 3) -1)) & 4095);}
}sum = ((((sum) & 4095) / 16) & 4095);return sum;
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 12
Automatic Bit Width Reduction: An extension to HIPAcc
Listing 5: Gaussian blur in HIPAcc with annotated bit widths
#pragma hipacc bw(sum ,12)uint sum = 0;#pragma hipacc bw(x,2)uint x = 0;#pragma hipacc bw(y,2)uint y = 0;for (y = 0; y < size; ++y) {
for (x = 0; x < size; ++x) {sum += mask[y][x] * Input(x-1,y-1);
}}sum /= 16;output () = sum; ↓
Listing 6: Gaussian blur in HIPAcc with annotated bit widths
uint sum = 0;uint x = 0;uint y = 0;for (y = ((0) & 3); ((y) & 3) < size; y = ((((y) & 3) + 1) & 3)){
for (x = ((0) & 3); ((x) & 3) < size; x = ((((x) & 3) + 1) & 3)){sum = ((((sum) & 4095) + mask[y][x]
* getWindowAt(Input , 1 + ((x) & 3) - 1, 1 + ((y) & 3) -1)) & 4095);}
}sum = ((((sum) & 4095) / 16) & 4095);return sum;
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 12
Evaluation and Results
HIPAcc vs Handwritten Examples provided by Altera
Edge-DetectorAltera
Edge-DetectorHIPAcc
Lukas KanadeAltera
Lukas KanadeHIPAcc
0
10
20
30
40
50
13.5515.04
18.05
25.63
19.59 19.21
22.59
27.62
7.03
1.17
18.36
14.45
Har
dwar
eU
tiliz
atio
n[%
]M20K Logic DSP
0
100
200
300
Clo
ckFr
eque
ncy
[MH
z]
Fmax
Figure: HIPAcc vs Altera handwritten examples for 1024×1024 image size.
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 13
FPGA vs GPU
100 101 102 103 104
Gaussian
Laplacian
Sobel
Bilateral
Harris Corner
Optical Flow
Throughput [MPixel/s]
Tegra K1 Tesla K20 Stratix V
Figure: Comparison of throughput for the Nvidia Tegra K1, Nvidia Tesla K20, and Altera Stratix V.
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 14
Conclusion
Conclusion
Advantages of DSL-based Approach
Productivity – compact algorithm description– less error-prone
Performance – efficient target-specific code generation
Portability – flexible target choice– performance portability, not just functional portability
HIPAcc DSL code serves as baseline implementation⇒ Test bench
Develop the algorithm first, decide the target architecture afterwards!
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 15
Conclusion
Advantages of DSL-based Approach
Productivity – compact algorithm description– less error-prone
Performance – efficient target-specific code generation
Portability – flexible target choice– performance portability, not just functional portability
HIPAcc DSL code serves as baseline implementation⇒ Test bench
Develop the algorithm first, decide the target architecture afterwards!
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 15
Questions?Thanks for listening.
Any questions?
C++
embedded DSL
Source-to-SourceCompiler
Clang/LLVM
Domain
Knowledge
Architecture
Knowledge
CUDA(GPU)
OpenCL(x86/GPU)
C/C++
(x86)
Renderscript(x86/ARM/GPU)
OpenCL(ALTERA FPGA)
Vivado C++
(FPGA)
CUDA/OpenCL/Renderscript Runtime LibraryCUDA/OpenCL/Renderscript Runtime Library AOCL Vivado HLS
http://github.com/hipacc/hipacc-fpga
Title FPGA-Based Accelerator Design from a Domain-Specific Language
Speaker M. Akif Özkan, [email protected]
Backup Slides
Additional Results
Boundary Handling: SW Manner vs HW Manner
UNDEF-SW CONST-SW CLAMP-SW MIRROR-SW0
10
20
30
40
50
60
Dataset A
12.85 13.1 13.1 13.1
16.1517.55
41.11 41.42
2.3 2.3
13.79 13.79
Har
dwar
eU
tiliz
atio
n[%
]
M10K Logic DSP
0
20
40
60
80
100
120
140
160
Clo
ckFr
eque
ncy
[MH
z]
Fmax
(a) Single-Item Line-Buffered Kerneldeveloped in Software Manner
UNDEF-SW CONST-SW CLAMP-SW MIRROR-SW0
10
20
30
40
50
60
Dataset A
12.85 13.1 13.1 13.1
16.18 16.71 17.03 17.21
2.3 2.3 2.3 2.3
Har
dwar
eU
tiliz
atio
n[%
]
M10K Logic DSP
0
20
40
60
80
100
120
140
160
Clo
ckFr
eque
ncy
[MH
z]
Fmax
(b) Single-Item Line-Buffered Kerneldeveloped in Hardware Manner
Figure: Boundary conditions for a 5×5 Gaussian blur on a 1024×1024 image.
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 16
Automatic Bit Width Reduction: Results
• AOC is very successful when inner loop trip counts are known during compiletime and the unroll directive has been specified for synthesis
• Great improvement in clock frequency for dynamic loops
Table: Automatic bit width reduction on local operators of size 3×3 with image size 1024×1024.
Filter Type II ALUTs Registers Logic (%) M10K DSP Freq [MHz]
Sobelnormal 2 8552 12433 19.76 84 4 131.25reduced 2 8230 12098 19.11 84 3 152.09
Gaussiannormal 2 8593 12423 19.86 84 4 132.50reduced 2 8439 11843 19.26 83 3 153.11
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 17
Kernel Vectorization
Kernel Vectorization
Vectorization by loop coarsening [1]: unroll the outer loop by factor vi
• replicate only the kernel: sublinear increase of resource usage• better exploitation of bandwidth: nearly linear speedup
[1] Moritz Schmid et al. “Loop Coarsening in C-based High-Level Synthesis”. In: Proceedings of the 26th IEEE International Conference on Application-specificSystems, Architectures and Processors (ASAP). (Toronto, Canada). IEEE, July 27–29, 2015, pp. 166–173
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 18
Kernel Vectorization
1 2 4 8 160
20
40
60
80
100
120
14.57 14.49 14.49 14.65 14.88
22.77 24.4527.91
34.28
48.93
8.9814.45
25.39
47.27
91.02
Vectorization Factor v
Har
dwar
eU
tiliz
atio
n[%
]M20K Logic DSP
0
100
200
300
400
Clo
ckFr
eque
ncy
[MH
z]
Fmax
Figure: Vectorization of a 3×3 bilateral filter with clamping on an image of size 1024×1024.
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 19
Generating the Streaming Pipeline
Generating the Streaming Pipeline
Trace host code and translate it to internal representation:
• model as combination ofprocesses and spaces
• create unique channelobjects for each space
• identify memory reuseand utilize processesaccordingly
• build dependency graph• traverse in depth-first
search starting fromoutput spaces
1× channel2× channels
Kernel
IterSpace
Image
Accessor
Kernel
processprocess
space
process
Accessor Accessor
Kernel Kernel
process1to2
space space
process process
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 20
Generating the Streaming Pipeline
Trace host code and translate it to internal representation:
• model as combination ofprocesses and spaces
• create unique channelobjects for each space
• identify memory reuseand utilize processesaccordingly
• build dependency graph• traverse in depth-first
search starting fromoutput spaces
1× channel2× channels
Kernel
IterSpace
Image
Accessor
Kernel
processprocess
space
process
Accessor Accessor
Kernel Kernel
process1to2
space space
process process
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 20
Generating the Streaming Pipeline
Trace host code and translate it to internal representation:
• model as combination ofprocesses and spaces
• create unique channelobjects for each space
• identify memory reuseand utilize processesaccordingly
• build dependency graph• traverse in depth-first
search starting fromoutput spaces
1× channel2× channels
Kernel
IterSpace
Image
Accessor
Kernel
processprocess
space
process
Accessor Accessor
Kernel Kernel
process1to2
space space
process process
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 20
Generating the Streaming Pipeline
Trace host code and translate it to internal representation:
• model as combination ofprocesses and spaces
• create unique channelobjects for each space
• identify memory reuseand utilize processesaccordingly
• build dependency graph• traverse in depth-first
search starting fromoutput spaces
1× channel
2× channels
Kernel
IterSpace
Image
Accessor
Kernel
processprocess
space
process
Accessor Accessor
Kernel Kernel
process1to2
space space
process process
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 20
Generating the Streaming Pipeline
Trace host code and translate it to internal representation:
• model as combination ofprocesses and spaces
• create unique channelobjects for each space
• identify memory reuseand utilize processesaccordingly
• build dependency graph• traverse in depth-first
search starting fromoutput spaces
1× channel2× channels
Kernel
IterSpace
Image
Accessor
Kernel
processprocess
space
process
Accessor Accessor
Kernel Kernel
process1to2
space space
process process
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 20
Generating the Streaming Pipeline
Trace host code and translate it to internal representation:
• model as combination ofprocesses and spaces
• create unique channelobjects for each space
• identify memory reuseand utilize processesaccordingly
• build dependency graph• traverse in depth-first
search starting fromoutput spaces
1× channel2× channels
Kernel
IterSpace
Image
Accessor
Kernel
processprocess
space
process
Accessor Accessor
Kernel Kernel
process1to2
space space
process process
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 20
Generating the Streaming Pipeline
Trace host code and translate it to internal representation:
• model as combination ofprocesses and spaces
• create unique channelobjects for each space
• identify memory reuseand utilize processesaccordingly
• build dependency graph• traverse in depth-first
search starting fromoutput spaces
1× channel2× channels
Kernel
IterSpace
Image
Accessor
Kernel
processprocess
space
process
Accessor Accessor
Kernel Kernel
process1to2
space space
process process
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 20
Generating the Streaming Pipeline
Trace host code and translate it to internal representation:
• model as combination ofprocesses and spaces
• create unique channelobjects for each space
• identify memory reuseand utilize processesaccordingly
• build dependency graph• traverse in depth-first
search starting fromoutput spaces
1× channel
2× channels
Kernel
IterSpace
Image
Accessor
Kernel
processprocess
space
process
Accessor Accessor
Kernel Kernel
process1to2
space space
process process
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 20
Generating the Streaming Pipeline
Trace host code and translate it to internal representation:
• model as combination ofprocesses and spaces
• create unique channelobjects for each space
• identify memory reuseand utilize processesaccordingly
• build dependency graph• traverse in depth-first
search starting fromoutput spaces
1× channel
2× channels
Kernel
IterSpace
Image
Accessor
Kernel
processprocess
space
process
Accessor Accessor
Kernel Kernel
process1to2
space space
process process
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 20
Altera OpenCL System Level Interfaces
Altera OpenCL System Level Interfaces
Channels Data sharing between kernelsFigure 1-5: Difference in Global Memory Access Pattern as a Result of Channels or PipesImplementation
Global Memory
Kernel 1 Kernel 4Kernel 2 Kernel 3Write
ReadRead
ReadRead
WriteWriteWrite
Global Memory
Kernel 1 Kernel 4Kernel 2 Kernel 3
Read
Write
Channel/Pipe Channel/Pipe
Global Memory Access Pattern Before AOCL Channels or Pipes Implementation
Global Memory Access Pattern After AOCL Channels or Pipes Implementation
Channel/Pipe
For more information on the usage of channels, refer to the Implementing AOCL Channels Extensionsection of the Altera SDK for OpenCL Programming Guide.
For more information on the usage of pipes, refer to the Implementing OpenCL Pipes section of the AlteraSDK for OpenCL Programming Guide.
Related Information
• Implementing AOCL Channels Extension• Implementing OpenCL Pipes
Channels and Pipes CharacteristicsTo implement channels or pipes in your OpenCL kernel program, keep in mind their respectivecharacteristics that are specific to the Altera SDK for OpenCL (AOCL).
Default Behavior
The default behavior of channels is blocking. The default behavior of pipes is nonblocking.
Concurrent Execution of Multiple OpenCL Kernels
You can execute multiple OpenCL kernels concurrently. To enable concurrent execution, modify the hostcode to instantiate multiple command queues. Each concurrently executing kernel is associated with aseparate command queue.Important: Pipe-specific considerations:
The OpenCL pipe modifications outlined in Ensuring Compatibility with Other OpenCLSDKs in the Altera SDK for OpenCL Programming Guide allow you to run your kernel on theAOCL. However, they do not maximize the kernel throughput. The OpenCL Specificationversion 2.0 requires that pipe writes occur before pipe reads so that the kernel is not reading
1-8 Channels and Pipes CharacteristicsOCL003-15.0.0
2015.05.04
Altera Corporation Altera SDK for OpenCL Best Practices Guide
Send Feedback
Altera, SDK. for OpenCL: Best Practices Guide, Altera, 2016.
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 21
Altera OpenCL System Level Interfaces
Channels Data sharing between kernels
Table: Impact of channels on line-buffered single-work item implementations.
Kernel Type II ALUTs Registers LU (%) BRAM DSP Freq [MHz]
Copy Kernel single 1 6821 7812 14.04 43 0 152.89Mean Filter single 1 8248 9415 16.78 51 0 151.52
2 Mean Filter separate 1 12104 15132 25.36 100 0 145.78
2 Mean Filter channel 1 10439 12466 21.47 60 0 154.142 Mean Filter fused 1 9480 11715 19.47 64 0 122.473 Mean Filter channel 1 12829 15451 26.56 67 0 145.573 Mean Filter fused 1 10899 13651 22.24 79 0 125.00
Luma+Mean single 1 8244 9694 16.84 66 2 132.56Luma+Mean channel 1 11281 13653 23.75 64 3 148.75
Luma+Mean fused 1 9635 11846 20.09 72 2 123.53
Harris corner channel 1 43127 54334 89.87 186 9 142.52Harris corner fused 1 33263 43319 70.87 177 9 107.35
- Channels use considerable amount of logic!
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 21
Altera OpenCL System Level Interfaces
Channels Data sharing between kernels
Table: Impact of channels on line-buffered single-work item implementations.
Kernel Type II ALUTs Registers LU (%) BRAM DSP Freq [MHz]
Copy Kernel single 1 6821 7812 14.04 43 0 152.89Mean Filter single 1 8248 9415 16.78 51 0 151.522 Mean Filter separate 1 12104 15132 25.36 100 0 145.78
2 Mean Filter channel 1 10439 12466 21.47 60 0 154.14
2 Mean Filter fused 1 9480 11715 19.47 64 0 122.473 Mean Filter channel 1 12829 15451 26.56 67 0 145.573 Mean Filter fused 1 10899 13651 22.24 79 0 125.00
Luma+Mean single 1 8244 9694 16.84 66 2 132.56Luma+Mean channel 1 11281 13653 23.75 64 3 148.75
Luma+Mean fused 1 9635 11846 20.09 72 2 123.53
Harris corner channel 1 43127 54334 89.87 186 9 142.52Harris corner fused 1 33263 43319 70.87 177 9 107.35
- Channels use considerable amount of logic!
+ Channels are cheaper than kernel IOs.
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 21
Altera OpenCL System Level Interfaces
Channels Data sharing between kernels
Table: Impact of channels on line-buffered single-work item implementations.
Kernel Type II ALUTs Registers LU (%) BRAM DSP Freq [MHz]
Copy Kernel single 1 6821 7812 14.04 43 0 152.89Mean Filter single 1 8248 9415 16.78 51 0 151.522 Mean Filter separate 1 12104 15132 25.36 100 0 145.78
2 Mean Filter channel 1 10439 12466 21.47 60 0 154.142 Mean Filter fused 1 9480 11715 19.47 64 0 122.473 Mean Filter channel 1 12829 15451 26.56 67 0 145.573 Mean Filter fused 1 10899 13651 22.24 79 0 125.00
Luma+Mean single 1 8244 9694 16.84 66 2 132.56Luma+Mean channel 1 11281 13653 23.75 64 3 148.75Luma+Mean fused 1 9635 11846 20.09 72 2 123.53
Harris corner channel 1 43127 54334 89.87 186 9 142.52Harris corner fused 1 33263 43319 70.87 177 9 107.35
- Channels use considerable amount of logic!
+ Channels are cheaper than kernel IOs.
- Smaller the size of kernel body, faster the clock frequency.
M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 21