Design-Space Exploration of Embedded Hardware
Accelerators for Image Processing Applications
by
Onur Can Ulusel
B.S., Sabanci University; Istanbul, Turkey, 2008
Sc.M., Sabanci University; Istanbul, Turkey, 2010
A dissertation submitted in partial fulfillment of the
requirements for the degree of Doctor of Philosophy
in The School of Engineering at Brown University
PROVIDENCE, RHODE ISLAND
May 2016
© Copyright 2016 by Onur Can Ulusel
This dissertation by Onur Can Ulusel is accepted in its present form
by The School of Engineering as satisfying the
dissertation requirement for the degree of Doctor of Philosophy.
Date
Iris Bahar, Ph.D., Advisor
Recommended to the Graduate Council
Date
Sherief Reda, Ph.D., Reader
Date
Benjamin Kimia, Ph.D., Reader
Approved by the Graduate Council
Date
Peter Weber, Dean of the Graduate School
Vitae
Onur Can Ulusel was born in Balıkesir, Turkey on November 14, 1986. He
received his B.Sc. and M.Sc. degrees in Electronics Engineering from Sabanci University
in 2008 and 2010, respectively. He then came to Brown University, Providence, Rhode Island to
pursue a Doctor of Philosophy in Engineering.
His research interests include the exploration of parallel computing techniques for
low-power embedded systems, power reduction techniques for reconfigurable comput-
ing and the development of design methods to accelerate image processing systems.
onur [email protected]
Brown University, RI, USA
Acknowledgements
First and foremost, I thank my advisor Professor Iris Bahar for her patience, encour-
agement and guidance throughout my study. She has been a great mentor to me and
I feel privileged to be her student. I would also like to thank Professor Sherief Reda
who has provided great insight and invaluable feedback throughout my studies.
I am thankful to Professor Benjamin Kimia for agreeing to serve as a member
of my dissertation committee despite the hardship involved. His comments and
questions have made this work better.
I also would like to thank all my friends and colleagues from Barus and Holley
3rd floor. Thanks to them, my stay at Brown has been an even more pleasant experience.
I would like to thank Kumud, Marco, Dimitra, Kapil, Anıl, Fırat, Osman,
Octi, Brandon, Soheil, Chhay, Reza, Xin, Cesare, Monami, Chris P. and Chris H.
Last but not least, I deeply thank my family. The unwavering support and relentless
encouragement of my fiancée Sema have helped me immensely during this journey.
I would also like to express my deepest gratitude to my beloved parents Enis and
Cigdem, and to my sister Melis, who have always believed in me and have always been
there to support me.
Abstract of “Design-Space Exploration of Embedded Hardware Accelerators for Image Processing Applications” by Onur Can Ulusel, Ph.D., Brown University, May 2016
Computer vision applications have gained significant popularity in their use for mo-
bile, battery powered devices. These devices range from every-day smart-phones
to autonomously navigating unmanned aerial vehicles (UAVs). While the image
processing required by these applications may be transferred to the cloud or other
off-device computing engines, because of real-time computing requirements and lim-
ited data transfer capabilities, it is desirable for computation to be handled locally
whenever possible. However, local computation can be quite challenging for mobile
and embedded systems due to the highly computationally intensive nature of computer
vision algorithms, and it requires careful consideration of the target design
constraints and possible design parameters.
In this dissertation work, we first implement two real-time image processing ac-
celerators as test cases to be used for fast design space exploration: one for image
deblurring and one for block matching. For these designs, we identify both algorithmic
and hardware parameters that optimize these accelerators and demonstrate the
performance, power and accuracy trade-offs in our target applications on FPGAs.
For the second part of this dissertation, we present a power and performance eval-
uation of several low cost feature detection and description algorithms implemented
on various embedded systems platforms (embedded CPUs, GPUs and FPGAs). We
present a streamlined FPGA implementation for feature detection which includes a
pre-processing stage to eliminate unnecessary computation and a computation flow
which makes maximum utilization of pixel proximity and avoids down-time after the
initial loading of image pixels. In addition, we present a combined FPGA implemen-
tation on low-cost Zynq SoC FPGAs which pipelines feature detection with feature
description, realizing increased efficiencies in performance, power dissipation and
energy consumption compared to other embedded platforms. We show that despite the
high-level parallelization offered by embedded GPU platforms such as the NVIDIA Jetson
TK1, the computation of multiple kernels is heavily bounded by the kernel scheduler and
memory bottlenecks, reducing the GPUs’ effectiveness, whereas customizing FPGAs at
multiple layers can handle multiple kernels much more efficiently.
Contents
Vitae iv
Acknowledgments v
1 Introduction 1
  1.1 Performance, Power and Accuracy Trade-offs in FPGA-based Accelerators 4
  1.2 Hardware Acceleration on Low-power Embedded Platforms 6
  1.3 Thesis Contributions 8

2 Background and Previous Work 10
  2.1 Design Space Exploration 11
  2.2 Image Processing Applications 17
    2.2.1 Image Deblurring 17
    2.2.2 Block Matching 20
    2.2.3 Feature Detection 24
    2.2.4 Feature Description 25
    2.2.5 Hardware Acceleration of Image Processing Kernels 30

3 Performance, Power and Accuracy Trade-offs in FPGA-based Accelerators 34
  3.1 Modeling and Optimization Methodology 38
  3.2 Image Processing Applications 39
    3.2.1 Image Deblurring 40
    3.2.2 Block-Matching 46
  3.3 Experimental Results 49
    3.3.1 Modeling Results 51
  3.4 Summary and Discussion 62

4 Hardware Acceleration on Low-power Embedded Platforms 64
  4.1 Selection of Feature Detection and Description Algorithms 65
  4.2 Platform Implementations 68
    4.2.1 FPGA Architecture 69
    4.2.2 GPU Architecture 74
  4.3 Results 75
  4.4 Summary and Discussion 80

5 Summary of Dissertation and Possible Future Extensions 82
  5.1 Summary of Results 83
  5.2 Future Work 84
List of Tables
3.1 Data flow of block-matching PEs 48
4.1 Instruction number comparison between GPU and FPGA implementations 77
4.2 Resource utilization on the Zynq FPGA 79
List of Figures
2.1 Pareto efficiency in a design space [72]. 11
2.2 Design space exploration framework proposed by Palermo et al. [51]. 12
2.3 Technology mapping: (a) an original netlist (b) is segmented into possible coverings and (c) mapped into LUTs [42]. 16
2.4 Unmanned air vehicle system to be used with our deblur accelerator. 18
2.5 (a) Example of a blurred image taken by aerial photography and (b) deblurred image using the Landweber algorithm. 20
2.6 The search patterns for (a) three step search, (b) diamond search, and (c) hexagonal search [59]. 21
2.7 Computation of motion vectors of a given image block in a reference frame using the block matching algorithm [55]. 22
2.8 Application of the block matching algorithm over two consecutive frames and the resulting motion vectors. 23
2.9 The Bresenham circle is used to determine if interest point p is a corner feature. Figures taken from [57]. 25
2.10 Various sampling patterns used for the BRIEF descriptor. Figures taken from [12]. 28
2.11 The sampling pattern proposed for BRISK with N = 60 points. The blue circles correspond to the points of interest detected by the feature detector algorithm and the surrounding red circles represent the standard deviation of the Gaussian smoothing kernel applied over the interest points. Figure taken from [39]. 29
2.12 Illustration of (a) FREAK sampling patterns and (b) the human retina. The receptive cells in the retina are clustered into four areas with different densities, which is replicated in the FREAK sampling pattern. In (a), each circle represents an image block that requires smoothing with its corresponding Gaussian kernel. Figure taken from [3]. 30
3.1 Illustration of the idea of using regression-based modeling for design space exploration and finding important designs based on objectives and constraints. Each star on the graph on the right represents a design variant and the dashed line represents the Pareto frontier. Designs shown in dashed yellow boxes represent optimal designs given by the optimization framework while the ones in blue represent the training set. 36
3.2 Top-level block diagram for the deblur architecture. 40
3.3 Architecture of a single row of the pixel array. 43
3.4 Comparison of DSP pipeline depths of (a) 6 and (b) 3. 44
3.5 Time-division multiplexing for a factor of 2. 45
3.6 Top-level block diagram for the block-matching architecture. 46
3.7 Power measurement setup using an external digital multimeter. 51
3.8 Error percentage of the power model over explored design space percentage. 52
3.9 Sensitivity of different parameters on the power estimation. 54
3.10 Comparison of mean error percentage using different model fits for the power estimation, area and arithmetic accuracy models for the image deblur algorithm. 55
3.11 Comparison of mean error percentage using different model fits for the power estimation, area, arithmetic accuracy and throughput models for the block-matching algorithm. 58
3.12 Trade-off between power and arithmetic inaccuracy of the image deblurring system. 58
3.13 Trade-off between area and power of the image deblurring system. 59
3.14 Trade-off between arithmetic inaccuracy and area of the block matching system. 60
4.1 The precision/recall rate and the run-time comparison of feature descriptors on an Intel i7 CPU. 65
4.2 Flowchart for feature detection and description. 67
4.3 Top-level block diagram for the FPGA implementation with FAST feature detection and BRIEF/BRISK/FREAK feature description. 70
4.4 Issue stall reasons for FAST and BRIEF implementations on GPU. 76
4.5 Run-time and power results for FAST feature detection and BRIEF/BRISK/FREAK feature description algorithms over various embedded systems. 78
Chapter 1
Introduction
Visualization and communication technology is growing at a rapid rate. The availability
of cameras and large numbers of sensors in mobile devices has made significant changes
in our expectations in all aspects of our lives, from healthcare to education and from
defense to entertainment. Some of the technological enablers for
this change have been the rising trends in semiconductor technology, with an increase
in the number of transistors on chips in alignment with Moore’s Law [47], and computer
architecture, which has managed to keep the power density of these chips relatively
constant in accordance with Dennard Scaling [22]. However, despite the ever-increasing
expectations of the end users and the designers, the technological advancements in
semiconductor and computer architecture technology can no longer sustain this de-
mand alone. Similarly, simply increasing clock frequencies to speed up digital designs
is becoming increasingly difficult [36, 45, 71]. As we try to transfer state-of-the-art
computer vision algorithms designed for high-performance desktop computers onto more
modestly performing, energy-efficient mobile platforms, designers are expected to
increase throughput per Watt in order to achieve maximum performance
while still meeting the low power budgets [26, 6].
In the past decade, computer vision applications have gained significant popular-
ity in their use for mobile, battery powered devices. These devices range from every-
day smart-phones to autonomously navigating unmanned aerial vehicles (UAVs).
While the image processing required by these applications may be transferred to
the cloud or other off-device computing engines, because of real-time computing
requirements and limited data transfer capabilities, it is desirable for computation
to be handled locally whenever possible. However, local computation can be quite
challenging for mobile and embedded systems due to the highly computationally
intensive nature of computer vision algorithms. Even a typical digital camera
capturing VGA-resolution (640×480) video at a rate of 30 frames per second requires
processing of 27 million pixels per second [30]. In addition, the limited size, weight, and battery
lifetime of these systems provide further constraints.
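The pixel-rate figure above can be checked with a line of arithmetic; note that the quoted 27 million pixels per second follows only if the three RGB color channels are counted as separate samples, which is our assumption here since reference [30] is not reproduced:

```python
# Throughput required for VGA video at 30 frames per second.
width, height, fps = 640, 480, 30
pixels_per_second = width * height * fps        # 9,216,000 pixels/s (single channel)
rgb_samples_per_second = pixels_per_second * 3  # 27,648,000 samples/s with 3 channels

print(f"{pixels_per_second:,} pixels/s, {rgb_samples_per_second:,} RGB samples/s")
```

At roughly 27.6 million samples per second, even a per-pixel kernel of only a few operations demands hundreds of millions of operations per second, which motivates the hardware acceleration discussed throughout this thesis.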
Reconfigurable logic such as Field Programmable Gate Arrays (FPGAs), as well as
Graphics Processing Unit (GPU)-based embedded system solutions, have been especially
sought after for real-time computer vision applications because of their high throughput
and computation capabilities. These systems are ideal as inexpensive prototyping
platforms for implementing high-throughput solutions, where iterative refinement
and validation of a design implementation can be performed until the desired perfor-
mance goals are achieved. Examples of such platforms are described in [36, 45, 71].
However, adding more hardware resources to solve the throughput problem may not
always lead to a feasible real-time solution.
In this thesis, we explore the design space of various computer vision algorithms
for embedded systems, namely FPGA and GPU based systems. We analyze the
impact of algorithmic and design-level implementation decisions on various metrics
such as throughput, power, design area and arithmetic accuracy. We observe how
different design decisions lend themselves better to certain embedded system plat-
forms and how we can generalize these techniques for efficient acceleration of other
computer vision algorithms. These generalized techniques can then be used to for-
mulate regression-based mathematical models in order to speed up the design space
exploration process by discovering optimal custom-designed solutions for specific
computer vision applications.
With this thesis work, we aim to help designers make more educated choices
during the early design phase for any given implementation. We will first try to
answer how regression based fast design space exploration models can be applied
specifically to computer vision algorithms and what types of algorithmic and archi-
tectural design decisions should be made to help with different design constraints.
Then we will expand our design space exploration to guide designers to select the
optimal embedded system platform based on algorithmic characteristics of desired
applications.
Chapter 2 provides background and related work on design space exploration
and on the acceleration of selected computer vision algorithms. Several comparisons
of embedded systems and various related works on their design spaces are presented.
In the following sections we discuss a number of computer vision algorithms and
implementations as well as the techniques we have developed to efficiently accelerate
computer vision algorithms for embedded systems. We discuss how we can formulate
our findings to speed up the design space exploration for computer vision algorithms.
1.1 Performance, Power and Accuracy Trade-offs
in FPGA-based Accelerators
The ease of use and reconfigurability of FPGAs make them an attractive platform
for accelerating algorithms. Therefore, FPGA-based accelerators are widely used
in real-time image processing. With the level of customization provided via pro-
grammable logic elements, lookup tables (LUTs), Block RAMs (BRAMs), and digital
signal processor (DSP) blocks, FPGAs can achieve high throughput and computation
capabilities and provide a faster time-to-prototyping cycle compared with Applica-
tion Specific Integrated Circuits (ASICs). Due to the real-time requirements for our
high-throughput test cases, we have elected to demonstrate the performance, power
and accuracy trade-offs in our target applications on FPGAs.
We observe that FPGA-based accelerators, especially those that can be used for
image processing, offer many algorithmic and hardware design parameters which,
when properly chosen, can lead to outcomes with the desired throughput, power,
design area and arithmetic accuracy. However, compared with standard cell-based
ASICs, LUT-based logic implementation is inefficient in terms of power consumption,
and programmable switches consume more power because of their large output
capacitances. As low power has become an important design metric, designers should
now consider the impact of their design decisions not only on speed and area, but
also on power consumption throughout the entire design process [1, 21].
In Chapter 3, we discuss the exploration of algorithmic and design level decisions
we have applied for FPGA based hardware acceleration. We propose techniques
for fast design exploration and multi-objective optimization to quickly identify both
algorithmic and hardware parameters that optimize these accelerators. We also show
the regression-based modeling we have applied to our design decisions to accelerate
the design space exploration process over the given parameter space of the algorithms.
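As a concrete illustration of the idea (a sketch only: the parameters, features, and the synthetic stand-in for power measurement below are hypothetical, not the models or parameter spaces of Chapter 3), a regression model can be trained on a small sample of the design space and then used to predict a metric for every remaining configuration without synthesizing it:

```python
import itertools
import random

import numpy as np

# Hypothetical design parameters (illustrative, not the Chapter 3 parameter space).
pipeline_depths = [1, 2, 3, 4, 6]
bit_widths = [8, 12, 16, 24, 32]
space = list(itertools.product(pipeline_depths, bit_widths))

def measure_power(depth, bits):
    """Stand-in for synthesizing a design and measuring its power (mW)."""
    return 5.0 * depth + 0.8 * bits + 0.1 * depth * bits

# Train a least-squares model on a small random sample of the space...
random.seed(0)
train = random.sample(space, 8)
X = np.array([[d, b, d * b] for d, b in train])   # features incl. interaction term
y = np.array([measure_power(d, b) for d, b in train])
coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(X)), X]), y, rcond=None)

# ...then predict power for any configuration without "synthesizing" it.
def predict_power(depth, bits):
    return coef @ [1.0, depth, bits, depth * bits]
```

The saving comes from evaluating only the training sample on real hardware; the fitted model then ranks the remainder of the space essentially for free.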
To demonstrate the effectiveness of our methodology, we have selected image de-
blurring and block-matching algorithms as our test cases for hardware acceleration
in Chapter 3. Both operations are fundamental components of many image process-
ing applications. Image deblurring is the process of restoring blurred images where
image blur is a form of bandwidth reduction caused by the imperfect nature of the
image capturing process. It can be caused by relative motion between the camera and
the original scene, or by an optical system that is out of focus. Even slight camera
shake under low-light conditions may cause image blur, as can atmospheric turbulence
in aerial photography [9, 75]. Therefore, restoration of blurred images is an essential
initial step in many image processing applications.
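The Landweber algorithm referenced in Figure 2.5 is one classical iterative deblurring scheme; as background, a minimal 1-D sketch is shown below (the blur kernel, step size, and iteration count are illustrative choices, not the parameters of our accelerator):

```python
import numpy as np

def landweber_deblur(blurred, kernel, step=1.0, iters=200):
    """Landweber iteration: x <- x + step * H^T (y - H x), with H the blur operator."""
    x = np.zeros_like(blurred)
    for _ in range(iters):
        residual = blurred - np.convolve(x, kernel, mode="same")  # y - Hx
        # The adjoint H^T of a convolution is correlation, i.e. convolution with
        # the flipped kernel; step must satisfy step < 2 / ||H||^2 to converge.
        x = x + step * np.convolve(residual, kernel[::-1], mode="same")
    return x

# Illustrative example: blur a step edge with a 3-tap box filter, then restore it.
kernel = np.ones(3) / 3.0
signal = np.concatenate([np.zeros(16), np.ones(16)])
blurred = np.convolve(signal, kernel, mode="same")
restored = landweber_deblur(blurred, kernel)
```

Each iteration is just a pair of convolutions over the image, which is what makes this class of algorithm attractive for a streaming hardware datapath.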
Our secondary benchmark, block-matching, is a sliding-window operation performed
over video sequences and is commonly used in motion estimation and video compression
applications such as the H.264 and MPEG-4 standards. Block-matching is used to reduce
the bit-rate in video compression systems by exploiting the temporal redundancy between
successive frames, and it is used to enhance the quality of displayed images in video
enhancement systems by extracting the true motion information. Finding matching blocks
in successive frames allows the information to be compressed as motion vectors and
pixel intensity differences instead of sending the raw image data [55].
We will use both image deblurring and block-matching designs to analyze the
effectiveness of algorithmic and hardware-level design choices and show the impact
of these choices on our regression based-models.
This work has allowed us to generate methodological formulations to predict the
impact of various design choices on the desired design metrics such as area, through-
put and power. Using such fast design space exploration techniques, custom designs
can be adjusted to target specific design constraints without designers enumerating
all permutations of the design space explicitly. However, the methods presented require
extensive knowledge of the potential design parameters and the target application
domain, and they also limit themselves to the domain of FPGAs. In our next work, we
have expanded into a different set of image processing applications that share major
components in order to explore the design space of various embedded systems.
1.2 Hardware Acceleration on Low-power Embed-
ded Platforms
Moving beyond the core image processing operations of image deblurring and block-
matching, we turn to the core kernels of the next part of this thesis: feature detection
and description. Feature detection and feature description are key building blocks of many
computer vision algorithms, including image retrieval, biometric identification, visual
odometry [50], object detection, tracking, motion estimation and 3D reconstruction.
Efficient feature extraction and description are crucial due to the real-time require-
ments of such applications over a constant stream of input data. High-speed com-
putation typically comes at a cost of high power dissipation, yet embedded systems
are often highly power constrained, making discovery of power-aware solutions es-
pecially critical for these systems. Therefore a computationally efficient means of
detection and analysis of image features is a critical first step in the development of
energy-efficient, single-chip solutions for these applications.
In Chapter 4, we introduce our comparative study of embedded platforms and
show how the potential of different embedded platforms can be maximized through
application-specific customization. We present a power and performance evaluation of
several low cost feature detection and description algorithms implemented on various
embedded systems. We evaluate these algorithms in terms of run-time performance,
power dissipation and energy consumption. In particular, we compare embedded
CPU-based, GPU-accelerated, and FPGA-accelerated embedded platforms and ex-
plore the implications of various architectural features for the acceleration of these
fundamental computer vision algorithms. We show that FPGAs in particular of-
fer attractive solutions for both performance and power and describe several design
techniques utilized to accelerate feature extraction and description algorithms on
low-cost Zynq system on chip (SoC) FPGAs.
In our analysis, we customize off-the-shelf implementations of our algorithms of
interest to target both embedded CPUs and GPUs. Our FPGA-accelerated implementations
of feature detection and description, in contrast, take advantage of the highly
customizable logic fabric to realize significant improvements in both run-time and power
dissipation compared to other embedded solutions.
In addition, we discuss the design techniques applied to obtain high-throughput
solutions and hardware-specific power reductions. We conclude that, due to its extra
customization and flexibility, our FPGA-accelerated implementation is a promising
way forward for the development of low-power, energy-efficient platforms capable
of providing real-time performance for complex computer vision based applications
such as autonomous navigation.
Under this research we provide a comprehensive comparison between embedded
CPU, GPU and FPGA implementations of feature detection and description algo-
rithms, evaluating their power and performance trade-offs. We propose a streamlined
FPGA implementation for feature detection which includes a pre-processing stage to
eliminate unnecessary computation and a kernel that uses a zig-zag pattern for image
masks which makes maximum utilization of pixel proximity and avoids down-time
after the initial loading of image pixels. In addition, we propose a combined FPGA
implementation which pipelines feature detection with feature description, realiz-
ing increased efficiencies in performance, power dissipation and energy consumption
compared to other embedded platforms.
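As background on the descriptor side of this pipeline, a BRIEF-style binary descriptor simply compares pairs of (ideally pre-smoothed) pixel intensities around a keypoint and packs the outcomes into a bit string matched by Hamming distance; the patch size and random sampling pattern below are illustrative assumptions, not our implementation's parameters:

```python
import numpy as np

def brief_descriptor(image, y, x, pairs):
    """BRIEF-style descriptor: one bit per intensity comparison around (y, x).
    (Real BRIEF smooths the patch with a Gaussian before sampling; omitted here.)"""
    bits = [1 if image[y + dy1, x + dx1] < image[y + dy2, x + dx2] else 0
            for (dy1, dx1), (dy2, dx2) in pairs]
    return np.packbits(bits)

def hamming_distance(d1, d2):
    """Matching cost between two packed binary descriptors."""
    return int(np.unpackbits(d1 ^ d2).sum())

# Illustrative usage: 256 random point pairs inside a 31x31 patch.
rng = np.random.default_rng(1)
pairs = [(tuple(rng.integers(-15, 16, 2)), tuple(rng.integers(-15, 16, 2)))
         for _ in range(256)]
img = rng.integers(0, 256, (100, 100), dtype=np.uint8)
d = brief_descriptor(img, 50, 50, pairs)  # 256 bits packed into 32 bytes
```

The bit-level nature of both the descriptor and its Hamming matching cost is precisely the kind of computation that maps well onto FPGA logic.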
1.3 Thesis Contributions
To summarize, in this thesis we will investigate different hardware accelerator plat-
forms specifically targeted for real-time image processing applications. We will ex-
plore various options for algorithmic as well as architectural design decisions and
use them to train design space exploration models that designers can use to create
optimal accelerators under a range of constraints. We will explore different embed-
ded systems and accelerate various image processing algorithms in order to demonstrate
the applicability of algorithms to specific platforms. We will use the design techniques
we have developed to present streamlined FPGA implementations utilizing deep
pipelining, continuous filter flow, and pre-computation steps.
Each design presents a vast number of parameters from which to select. Enu-
merating each permutation of these parameters is simply not possible. This thesis
work provides the necessary tools for a designer to make educated design decisions
during the early phase of the design process. We will show the interdependence of
design parameters and present constraint-specific guidelines for our selections. Our
analysis covers power-driven as well as performance-driven constraints, unlike much
other design space exploration work found in the literature.
This thesis is organized as follows. In Chapter 2, we will present the necessary
background and related work on the field of design space exploration. We will describe
the image processing applications used in this work in detail and also discuss previous
work in the literature on hardware acceleration of such image processing applications.
In Chapter 3, we discuss the exploration of algorithmic and design-level decisions we
have applied to the image deblurring and block-matching algorithms for
FPGA-based hardware accelerators. We also show the regression-based modeling we
have applied to our design decisions to accelerate the design space exploration process
on the given parameter space of the algorithms. Chapter 4 presents our comparative
study of embedded platforms and how each algorithm maps to various embedded
systems differently based on the underlying architecture of the platform as well as
the characteristics of the algorithms themselves. Finally, Chapter 5 presents our
conclusions and potential future projects that can build upon the presented work.
Chapter 2
Background and Previous Work
In this chapter, we will discuss the design-space exploration for hardware accelerated
computer vision algorithms and the current trends and prospects of available systems.
We will present some fundamentals of design-space exploration and review various
techniques proposed in prior literature. We will start out by describing some of the
metrics that are important from the design point of view. We will then address
specific methodologies to obtain designs that are considered optimal in terms of
these design metrics. We will investigate specific cases of optimization done so far
for hardware accelerators, with greater focus on analytical modeling of the design
space as well as inexact circuits and approximate computing as a means for obtaining
low area/power circuit alternatives.
Figure 2.1: Pareto efficiency in a design space [72].
2.1 Design Space Exploration
Previous work on accelerating design space exploration mainly follows two different
approaches: reducing the number of configurations to be evaluated and design space
evaluation via modeling. Some of the publications that follow the former approach
include the work by So et al. [66] where design space exploration options are auto-
matically explored by their own FPGA synthesis compiler. They suggest that a key
step to fast design space exploration is to automate it using a high-level program-
ming paradigm coupled with compiler technology oriented towards FPGA designs.
Their proposed compiler tool analyzes a given design and makes pre-defined
transformations, such as loop unrolling and array renaming, to automatically optimize
it according to compiler-defined criteria. Their automated tool can find the optimal
design by searching through only 0.3% of the design space, yet their proposed
optimization criterion is driven solely by minimizing the execution time of the given
algorithm while staying under the area budget, and it cannot be changed based on
designer input.
Figure 2.2: Design space exploration framework proposed by Palermo et al. [51].
Palermo et al. [51, 2] propose finding approximate Pareto points over the de-
sign space as a means for efficient design space exploration. Pareto efficiency is a
commonly used term in optimization that represents a state where it is impossible
to improve quality of a design objective without making at least one other design
objective worse off. All design permutations that are in a Pareto efficiency state are
considered to be Pareto points and the collection of all the Pareto points is called
the Pareto curve. A simple illustration of Pareto efficiency is given in Figure 2.1,
where the red points represent Pareto points for a design space with two objectives.
The design space exploration framework proposed by Palermo et al. is given
in Figure 2.2. The System Description Modules are the inputs to their Design Space
Evaluation Modules, providing information on the target design space and the
application domain. The Optimizer module selects a set of candidate optimal points to
be evaluated in terms of the evaluation functions. Each selected point is mapped to a
target architecture and then evaluated using the executable model. The results are
evaluated by the Optimizer to estimate the Pareto curve using various heuristic
algorithms such as Random Search Pareto [76] and Pareto Simulated Annealing [19].
Despite performing a multi-objective exploration, they present a limited design space
mainly composed of transformations applied to specifically modified source code.
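The Pareto-point definition above translates directly into a small filter over a set of evaluated designs; a sketch (assuming two objectives, both to be minimized, e.g. power and latency, with made-up numbers):

```python
def pareto_points(designs):
    """Return the designs not dominated by any other design.
    A design dominates another if it is no worse in every objective
    and strictly better in at least one (all objectives minimized)."""
    frontier = []
    for d in designs:
        dominated = any(
            all(o <= p for o, p in zip(other, d)) and any(o < p for o, p in zip(other, d))
            for other in designs if other != d
        )
        if not dominated:
            frontier.append(d)
    return frontier

# Illustrative (power mW, latency ms) design points:
designs = [(120, 5), (100, 8), (90, 12), (110, 6), (95, 12)]
print(pareto_points(designs))  # (95, 12) is dominated by (90, 12); the rest form the curve
```

The heuristics cited above exist because this exhaustive check presupposes that every design has already been evaluated, which is exactly what design space exploration tries to avoid.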
The work by Sheldon and Vahid [62] uses the Design of Experiments paradigm
to generate Pareto points which are of most interest to the designer. Design of Ex-
periments (DoE) [46] is a statistical paradigm whose objective is to design a small
set of experiments that provides maximum information on how the experimental
parameters influence the experimental output and interact with one another. There
are three statistics that can be learned through DoE: (1) the positive or negative impact
of a given parameter on the output, (2) whether a parameter is beneficial to the
output, and (3) how each parameter interacts with the others. Sheldon and Vahid
propose a DoE-based Pareto point generation scheme using a multi-phase approach. The
first phase automatically generates a parameter interdependency graph, which is a
weighted graph whose edges show the dependencies between the parameters. Each
parameter is initially assumed to be independent and for each potential dependency,
tests are evaluated where each one of the parameters is first changed individually
and then together with the rest of the parameters. The accuracy of these estimations
is used to compute the pairwise edge errors and update the edge values in the
generated graph. Then the second phase of the algorithm generates Pareto points
from the weighted parameter interdependency graph starting from the node pairs
with the highest edge value. The DoE approach presented by Sheldon and Vahid is
effective for a small number of parameters; however, due to the parameter dependency
generation phase, it has quadratic time complexity (O(n²), where n is the number
of parameters) and is therefore inherently slow.
Similarly, the work by Givargis and Vahid [25, 24] proposes to find all Pareto-
optimal configurations of parameterized SoC architectures using pre-identified
interdependencies among the design parameters, captured in parameter interdependency
graphs. They simulate the design space of embedded SoC architectures
over a parameterized SoC platform called Platune, built around a MIPS processor. Various
components of their system are configurable such as the size of caches or width
of the busses. They explore the various voltage levels of the MIPS processor as a
design parameter as well and explore the design space in terms of power along with
performance metrics. Each of the design parameters is searched exhaustively and
local Pareto points are identified. Using the pre-identified interdependencies among
the design parameters, these local Pareto points are merged to generate the system
level Pareto curve. This is one of the earliest works targeting embedded system
design space exploration and focuses on the design space of the target platforms
rather than the application domain. In addition, it relies heavily on the designer
input to generate the interdependencies of the parameter space. Our aim, in this
thesis, is to fully consider the applications themselves as part of the optimization
process.
Instead of the previously mentioned exhaustive search approaches, several ran-
domized search approaches have been proposed to find the Pareto curve of a given
design space [52, 5, 61]. These works perform design space exploration inspired
from genetic algorithms and iterate through the design space by evolving the design
parameter permutations. Each parameter permutation can be mapped as a chromosome
whose genes define the parameters of the system. The design space is explored via
mutation and crossover operators, where mutation refers to the random modification
of a parameter and crossover is the random exchange of parameters between two
chromosomes (i.e., parameter permutations). Although these genetic algorithms can
explore a design space with minimal designer input, they have very long run times.
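The mutation and crossover operators described above can be sketched as follows; the parameter encoding, rates, and function names are illustrative assumptions, not taken from the cited works:

```python
import random

def mutate(chromosome, param_choices, rate=0.1, rng=random):
    """Randomly replace each gene (design parameter) with another legal value."""
    return [
        rng.choice(param_choices[i]) if rng.random() < rate else gene
        for i, gene in enumerate(chromosome)
    ]

def crossover(parent_a, parent_b, rng=random):
    """Exchange parameter values between two chromosomes at a random cut point."""
    cut = rng.randint(1, len(parent_a) - 1)
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]
```

Each iteration of such a search evaluates the mutated and crossed-over permutations, keeps the fittest, and repeats, which is what makes the run times long.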
The use of analytical models based on design parameters to evaluate design metrics for
a large design space has been discussed in other literature [64, 20, 29, 37]. The
works by Smith et al. and Das et al. present technology mapping models that
relate architectural parameters to the speed of FPGAs, although both the architectural
parameters and the metric defining the speed of the FPGA differ between the two works. The process of
technology mapping is an FPGA specific problem concerning the mapping of a given
circuit netlist into lookup-tables (LUTs) as illustrated in Figure 2.3. Smith et al. [64]
present models that estimate the average post-placement pre-routing wire length of
an implementation using architectural parameters such as number and positioning of
logic blocks and pins, whereas Das et al. [20] present models that estimate the depth
of a circuit using architectural parameters such as lookup-table size, cluster size, and
number of inputs per cluster. Jiang et al. [29] use a least squares regression analysis
to estimate the power and area consumptions of specific computation units of an
implementation such as logical operators (e.g., AND/OR) and arithmetic operators
(e.g., multiply/add). Input bit widths are used as the sole design parameters for
their proposed area model, while average input transition density and average input
spatial correlation are used to generate power models. Analytical models can lead to
very fast and accurate design space exploration; however, the selection of the parameters
and the identification of the fitting model based on parameter interactions is a crucial step,
since it determines the limits of the design space that can be explored. In this
thesis work we will expand the number of parameters and design constraints that
can be used in analytical models and explore the design space with consideration to
parameter sensitivity.
Prior work has also been done on optimizing certain design metrics after co-
exploration, especially for throughput or power. For instance, Irturk et al. [27]
propose a tool that generates a variety of architectures specifically for matrix inver-
sion and find the optimum parameters for area and throughput constraints. The
approach of Chen et al. [16] aims to minimize power dissipation for an FPGA
implementation through careful allocation of functional units and registers. Other
related work by Sing and Yajun optimizes the FPGA architecture for performance
Figure 2.3: Technology mapping: (a) an original netlist (b) is segmented into a possible covering and (c) mapped into LUTs [42]
and power by allowing the designer to specify various parameters for the routing
architectures [63]. Tsoi and Luk [68] conduct power profiling and optimization for
heterogeneous multi-core systems (CPUs, GPUs and FPGAs) using on-board power
measurements. All these works specifically target either architectural or algorithmic
design parameters available in a system, and yet still need to explore a large set
of the design space for each platform to be able to perform their optimization via
interpolation of the measured data. In this thesis we expand the design space to
include architectural, algorithmic and target platform level parameters while at the
same time reducing the design space that needs to be sampled with the help of L1
regularization.
Compared to these previous techniques, our methodology for design space exploration
is novel in multiple ways. We propose an approach for model generation
using L1 regularization with traditional least squares regression. Using L1 regulariza-
tion leads to more accurate models that are identified in an entirely automated way.
We perform accelerator optimization by using the developed mathematical models
directly in numerical multi-objective optimization formulations. Also we combine
both algorithmic and hardware design considerations in the exploration and opti-
mization framework. We exploit the reconfigurability of FPGA platforms and tie it
with mathematical analysis for a swift, accurate and more importantly, optimizable
accelerator implementation under various objectives and constraints (e.g., power,
area, throughput, and arithmetic accuracy).
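The combination of least squares regression with an L1 penalty can be illustrated with a minimal coordinate-descent sketch; the data layout, penalty handling, and iteration count here are simplified illustrative assumptions, not our actual framework (described in later chapters). The soft-thresholding step is what drives the weights of uninformative parameters to exactly zero, pruning the design space that must be sampled:

```python
def lasso_fit(X, y, lam, iters=200):
    """Coordinate descent for min ||y - Xw||^2 + lam * ||w||_1 (sketch)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        for j in range(d):
            # residual with feature j excluded from the current model
            r = [y[i] - sum(X[i][k] * w[k] for k in range(d) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            z = sum(X[i][j] ** 2 for i in range(n))
            # soft-thresholding: weak parameters are set exactly to zero
            if rho < -lam / 2:
                w[j] = (rho + lam / 2) / z
            elif rho > lam / 2:
                w[j] = (rho - lam / 2) / z
            else:
                w[j] = 0.0
    return w
```

On data where only the first feature matters, the fit recovers its weight while the noise feature's weight is driven to zero.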
2.2 Image Processing Applications
In order to establish a preliminary understanding of the image processing algorithms
used in this thesis work, we first present in this section an overview of these algo-
rithms. Our implementations on image deblurring and block matching algorithms
will be discussed in detail in Chapter 3 to present our design space exploration
methodology based on algorithmic and architectural design parameters of these algo-
rithms. Then in Chapter 4 we will discuss our implementations on feature detection
and description, evaluating their performance and power trade-offs for low-power
embedded systems.
2.2.1 Image Deblurring
Image deblurring is performed by a filtering operation over a given image; filtering is
one of the fundamental operations of image processing applications [9]. Our target
accelerator is designed to be deployed within a real-life image processing system
mounted on an unmanned aerial vehicle (UAV) system for surveillance as shown in
Figure 2.4. As the UAV platform moves, the sensor tracks a point on the ground
so that the center pixel Sc will stay fixed on ground sample Gc. For pixels away
from the center, the pose and position change of the sensor means that periphery
pixel S0 is composed of a number of ground samples G0 to Gn.

Figure 2.4: Unmanned air vehicle system to be used with our deblur accelerator.

The deblurring
accelerator will be used to offset blur effects created during image capture mainly
caused by the shaking camera during aerial transportation. The real-life setting
of the accelerator has put tight requirements on its throughput, power, area, and
arithmetic accuracy, which motivated the need for our proposed modeling and multi-
objective optimization methodology.
Image blur can be modelled by the convolution of an unblurred image and a blur
kernel such as:
Ib(x, y) = ∑_{dx,dy} I0(x + dx, y + dy) H(dx, dy),
where Ib(x, y) and I0(x, y) are the blurred and original pixel intensities at coordinate
(x, y), and H(dx, dy) is the blur kernel value. The kernel is uniquely determined
by the motion blur, which is a 2D vector. In our application, different parts of the image
have different amounts and directions of blur. However, the blur in a given
region can be treated as uniform. In addition, although the kernels of different
pixels are different, locally they can be approximated as the same.
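The blur model above can be sketched as a direct 2-D filtering pass over the 'valid' image region where the kernel fits entirely inside the image; plain Python lists are used purely for illustration, whereas a real accelerator would implement this as a multiply-accumulate pipeline:

```python
def convolve2d(img, kernel):
    """Apply Ib(x, y) = sum over (dx, dy) of I0(x + dx, y + dy) * H(dx, dy),
    following the indexing of the blur equation, over the 'valid' region."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(img) - kh + 1, len(img[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for y in range(oh):
        for x in range(ow):
            out[y][x] = sum(img[y + dy][x + dx] * kernel[dy][dx]
                            for dy in range(kh) for dx in range(kw))
    return out
```

Each output pixel is one weighted sum over the kernel footprint, which is exactly the multiply-and-add workload the accelerator parallelizes.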
One of the most commonly used solutions to image blur is the iterative Landweber
method [9, 41] which is represented as:
I^(0) = Ib,    I^(n+1) = I^(n) + α H^T ∗ (Ib − H ∗ I^(n)),

where (Ib − H ∗ I^(n)) is the residue denoting the error of the deblurred image and α is
the step size. An example of using the Landweber method over images captured by aerial
photography can be seen in Figure 2.5 where Figure 2.5 (a) shows the blur effects
caused by the image capturing process and Figure 2.5 (b) shows the deblurred image.
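A one-dimensional sketch of the Landweber iteration above follows; the zero-padded filtering helper, step size, and iteration count are our own illustrative assumptions:

```python
def landweber_deblur(blurred, kernel, alpha=0.5, iters=500):
    """1-D sketch of the Landweber iteration:
    I(0) = Ib;  I(n+1) = I(n) + alpha * H^T * (Ib - H * I(n))."""
    def apply_h(sig, k):
        # centered 'same'-size filtering with zero padding outside the signal
        r = len(k) // 2
        return [sum(sig[i + j - r] * k[j]
                    for j in range(len(k)) if 0 <= i + j - r < len(sig))
                for i in range(len(sig))]
    estimate = list(blurred)          # I(0) = Ib
    adjoint = kernel[::-1]            # H^T: filter with the flipped kernel
    for _ in range(iters):
        residue = [b - e for b, e in zip(blurred, apply_h(estimate, kernel))]
        update = apply_h(residue, adjoint)
        estimate = [e + alpha * u for e, u in zip(estimate, update)]
    return estimate
```

Starting from a blurred impulse, the iteration gradually restores the original spike, with convergence slowest in the frequency components that the kernel attenuates most.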
The images are displayed in various zoom levels for ease of visualization. Due to the
high throughput requirements of our application and the need for heavy parallelization,
instead of iterating with the blur kernel H, we use an estimated deblur kernel K
directly such that:
I = Ib ∗K.
Figure 2.5: (a) Example of a blurred image taken by aerial photography and (b) deblurred image using the Landweber algorithm

This concept has been proven to work for small blurs, as presented in [79]. The
generation of the deblur kernel has been done in collaboration with Object Video
and our accelerator uses pre-determined deblur kernels with size up to 13×7 or 7×13
and input (Ib) and output (I0) images with 12-bit pixels. Each kernel computation
requires 91 multiply-and-add operations.
2.2.2 Block Matching
Block-matching is a sliding window operation performed over video sequences and is
commonly used in video compression applications such as MPEG-4 and H.264 [55].
Block matching partitions a given frame into non-overlapping N × N rectangular
blocks and tries to find the block from the reference frame in a given search range
that best matches the current block.
Among the block matching algorithms, the full search algorithm finds the refer-
ence block that best matches the current block among all possible locations by ex-
haustively comparing each candidate block. As a result, full search achieves the best
performance among block matching algorithms at the cost of having the highest
computational complexity. Various fast search algorithms have been developed for
block matching that reduce the number of reference block comparisons instead of
performing an exhaustive search. Some such algorithms use window strides larger than 1
and switch to a stride of 1 only at their final comparison step [40, 78, 77, 7].

Figure 2.6: The search patterns for (a) Three Step Search, (b) Diamond Search, and (c) Hexagonal Search [59]

Three Step
Search [40] initially evaluates the reference frame exhaustively using block strides of
4. In the next step, the neighborhood of the best matching block is re-evaluated using block
strides of 2, and the final step re-evaluates around the best matching block using strides of
1. The Diamond Search [78] and Hexagon-Based Search [77] algorithms both search
over a reference frame using strides of 2; however, instead of using exhaustive search,
they search over a diamond or hexagonal pattern. This allows the algorithms to
move towards a region of best potential blocks adaptively; however they might get
stuck in local minima during the search. The search patterns for these algorithms
are illustrated in Figure 2.6. Despite the potential savings in computation time, full
search has remained a popular candidate for hardware acceleration because of its
regular dataflow and good compression performance [59].
For our design, we perform full search block-matching over a search window in a
reference frame to determine the best match for a block in a current frame. As shown
in Figure 2.7, the location of a block in a frame is given using the (x, y) coordinates
of the top-left corner of the block. The search window in the reference frame is the
[-p, p] region around the location of the current block in the current frame.

Figure 2.7: Computation of motion vectors of a given image block in a reference frame using the Block Matching algorithm [55].
The most commonly used matching criteria are the mean square error (MSE),
sum of square error (SSE), or sum of absolute difference (SAD). The SAD approach
provides a fairly good match at a lower computational requirement due to the lack of a
multiplier, and because of this SAD is the most commonly used criterion for block matching [48].
The SAD value for a current block in the current frame and a candidate block in
the reference frame is calculated by accumulating the absolute differences of corre-
sponding pixels in the two blocks as shown in:
SAD_{Bm×n}(d) = ∑_{x=1}^{m} ∑_{y=1}^{n} |c(x, y) − r(x + dx, y + dy)|,    (2.1)
where Bm×n is a block of size m×n, d = (dx, dy) is the motion vector, and c and r are the
current and reference frames respectively. SAD is an extremely fast metric due to
its simplicity. It is very effective for a wide motion search of many different blocks.
SAD is also easily parallelizable since it analyzes each pixel separately, making it
easily implementable with hardware and software coders [34].

Figure 2.8: Application of the Block Matching algorithm over two consecutive frames and the resulting motion vectors
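Full search with the SAD criterion of Equation (2.1) can be sketched as follows; the frame representation, coordinate conventions, and tie-breaking are illustrative assumptions:

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def full_search(cur, ref, x, y, n, p):
    """Exhaustive [-p, p] search for the n x n block at (x, y) in cur.
    Returns the motion vector (u, v) minimizing the SAD."""
    block = [row[x:x + n] for row in cur[y:y + n]]
    best, best_mv = None, (0, 0)
    for v in range(-p, p + 1):
        for u in range(-p, p + 1):
            cx, cy = x + u, y + v
            if cx < 0 or cy < 0 or cx + n > len(ref[0]) or cy + n > len(ref):
                continue  # candidate block falls outside the reference frame
            cand = [row[cx:cx + n] for row in ref[cy:cy + n]]
            cost = sad(block, cand)
            if best is None or cost < best:
                best, best_mv = cost, (u, v)
    return best_mv
```

The two nested loops over (u, v) are what make full search exhaustive; the fast algorithms above prune exactly this candidate set.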
Since a motion vector expresses the relative motion of the current block in the
reference frame, motion vectors are specified in relative coordinates. If the location
of the best matching block in the reference frame is (x+ u, y + v), then the motion
vector is expressed as (u, v). Motion estimation is performed on the luminance (Y )
component of a YUV image and the resulting motion vectors are also used for the
chrominance (U and V ) components. An example of block matching showing two
consecutive frames of a video sequence as the current and reference frames and the
resulting motion vector for each image block is given in Figure 2.8.
For a typical block size of 16 × 16 and a reference window size ±p = 16, the
full-search block matching algorithm requires 16 × 16 = 256 absolute difference
operations per block comparison and a total of 256 block comparisons per block. Given that an
adder tree is used to compute the SAD from the generated absolute differences,
(256 × 2 + 255) × 256 = 196,352 adder/subtractor operations are required per block.
Therefore a wide VGA resolution (480 × 800) image would require approximately
300 million adder/subtractor operations to be processed.
2.2.3 Feature Detection
Feature detection is a low-level processing operation for identifying pixels of interest
in an image which correspond to some elements of a scene that can be reliably
located in different views of the same scene. Corners and edges are typical examples
of features.
Previous work on feature detection algorithms primarily involves studies that
attempt to detect the highest number of valid features with the least amount of
computational effort. The Scale-Invariant Feature Transform (SIFT) algorithm [43]
is the most prominent algorithm used for feature detection and has been used as
a baseline for most feature detection algorithms since it was first published over
10 years ago. As a more efficient alternative to SIFT, both in terms of speed and
computational complexity, the Features from Accelerated Segment Test (FAST) al-
gorithm was proposed [57]. FAST uses corner-based feature detection, as opposed
to the Difference of Gaussian (DoG) approach used by SIFT and its faster successor,
the Speeded Up Robust Features (SURF) [8] algorithm. It should be noted that the
use of corner based features had been previously proposed in other widely-accepted
algorithms, such as Harris Corner detection and Smallest Univalue Segment Assimilating
Nucleus (SUSAN) corner detection [65]. However, FAST was the first corner
based feature detector to greatly reduce the computation requirements of feature de-
tectors, achieving a speedup of 169× over SIFT and 89× over SURF algorithms [44].
Analysis done by Canclini et al. [13] on low-complexity feature detectors demonstrates
definitively the strength of corner-based feature detectors over DoG-based detectors.
Corner-based feature detectors are derived from the idea of finding rapid changes
in direction on image edges to determine a unique region of interest in the image. In order to
identify whether a pixel p with an intensity value Ip is a corner, the FAST detector
analyzes a 16-pixel Bresenham circle surrounding p. The Bresenham circle is an
approximation of a circle around the center pixel, as shown in Figure 2.9. A positive
detection is identified if n points of this circle form a contiguous segment that is
either brighter than Ip + T or darker than Ip − T, for a pre-defined threshold T.
Algorithm 1 shows the pseudo code for FAST.

Figure 2.9: The Bresenham circle is used to determine if interest point p is a corner feature. Figures taken from [57].
Algorithm 1: FAST Feature Detection
1  For each pixel p in an image, let the intensity of the pixel be Ip;
2  Define a threshold intensity value T;
3  Define a 16-pixel Bresenham circle of radius 3 centered around p, where each
   pixel corresponds to I1, I2, ..., I16;
4  for each i ∈ {1, ..., 16} do
5      if Ii...i+12 + T < Ip ∨ Ii...i+12 − T > Ip then
6          p is a corner;
7      end
8  end
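The segment test of Algorithm 1 can be sketched as follows; the offsets of the radius-3 Bresenham circle are the standard set, while the explicit wrap-around scan and the contiguity length parameter are our own illustrative choices:

```python
# Offsets (dx, dy) of the 16-pixel Bresenham circle of radius 3,
# listed clockwise starting from the topmost pixel.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def is_fast_corner(img, x, y, t, n=12):
    """FAST segment test: n contiguous circle pixels all brighter than
    Ip + t or all darker than Ip - t (contiguity wraps around the ring)."""
    ip = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    for start in range(16):
        seg = [ring[(start + k) % 16] for k in range(n)]
        if all(v > ip + t for v in seg) or all(v < ip - t for v in seg):
            return True
    return False
```

Because every pixel runs the same independent test, this kernel has the data-parallel, branch-light structure discussed later as a good fit for streaming hardware.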
2.2.4 Feature Description
Feature extraction involves computing a unique and identifying descriptor from the
pixels in the region around each point of interest. Descriptors are used to uniquely
identify each feature and match regions of features between two or more images.
The SIFT and SURF algorithms also generate descriptors for the features they
detect. These descriptors are represented by Histograms of Gradients (HoG).
As an alternative to storing the feature descriptors using Histograms of Gra-
dients (HoG) as in the case of SIFT and SURF, various binary feature descriptor
algorithms have also been proposed, such as BRIEF [12], BRISK [39] and FREAK [3]
to further improve the computation efficiency of the algorithms. The use of binary
feature descriptors has decreased the feature matching computation requirements
significantly by enabling the use of the Hamming distance to measure the similarity of any
given two features. As a result, BRIEF feature descriptors can be computed 118×
faster than SIFT and 31× faster than SURF descriptors [44]. Using the Hamming
distance as a distance measure between two binary strings, matching between two
patch descriptions can be done using a single instruction, as the Hamming distance
equals the bit count of the XOR of the two binary strings.
Binary descriptors are composed of three main components: a sampling pattern, orientation
compensation, and sampling pairs. A region centered around a detected corner
needs to be described as a binary string. Given a sampling pattern, pick N pairs of
points on the pattern, determine whether the first or the second element
of each pair has the greater intensity, and define the pair as binary 1 or 0 accordingly.
The resulting N-bit vector is the feature descriptor for the said point, to be
used for feature matching.
BRIEF was the first binary descriptor published. It has a simple sampling pattern and does
not offer orientation invariance for the detected features. The pseudo-code for the
algorithm is given in Algorithm 2.

Algorithm 2: BRIEF Feature Description
1  For each interest point p in an image, define a region of interest S × S
   centered around p;
2  Apply a Gaussian smoothing filter over the region to reduce the camera noise;
3  Use any of the sampling patterns given in Figure 2.10 to generate a pair of
   arrays Xi and Yi, where i ∈ {0, ..., N};
4  for each i ∈ {0, ..., N} do
5      if Xi > Yi then Di = 1;
6      else Di = 0;
7      end
8  end

The descriptors are computed using N neighboring
pixel pairs around the given feature location, denoted as Xi and Yi. The resulting N-bit
descriptor vector is computed by one-to-one comparisons of Xi and Yi and denoted
as Di. Due to the use of raw pixel intensities for each pixel, a smoothing filter needs
to be applied to the image as a pre-processing step. The BRIEF algorithm presents
five different methods to select the vectors X and Y, as visualized in Figure 2.10 and
described as follows:
I Xi and Yi are randomly and uniformly sampled around a pre-defined region
S × S centered around interest point p.
II Xi and Yi are randomly sampled using a Gaussian distribution of distances
to interest point p, such that points close to the center are more likely to be
selected as a sample pair.
III Xi is first randomly sampled using a Gaussian distribution of distances to
interest point p, then Yi is randomly sampled using a Gaussian distribution of
distances to the pairing on Xi.
IV Xi and Yi are randomly sampled from discrete locations of a coarse polar grid.
V For each i, Xi is (0, 0) and Yi takes all possible values on a coarse polar grid.
Figure 2.10: Various sampling patterns used for BRIEF descriptor. Figures taken from [12].
Once the N -bit descriptor vectors are identified, the number of different bits
between any two feature vectors is calculated as the difference between features.
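Sampling method I together with Hamming-distance matching can be sketched as follows; the patch size, pair count, and the fixed random seed (standing in for a fixed sampling pattern shared by all features) are illustrative assumptions:

```python
import random

def brief_descriptor(img, px, py, s=9, n=128, seed=7):
    """Method I sketch: X_i, Y_i drawn uniformly from the S x S patch
    around interest point p; bit i is 1 iff I(X_i) > I(Y_i)."""
    rng = random.Random(seed)  # shared seed => same pattern for every feature
    bits = 0
    for i in range(n):
        x1, y1, x2, y2 = (rng.randint(-(s // 2), s // 2) for _ in range(4))
        if img[py + y1][px + x1] > img[py + y2][px + x2]:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Descriptor distance: bit count of the XOR of the two bit strings."""
    return bin(a ^ b).count("1")
```

In practice the smoothing filter from Algorithm 2 would be applied before sampling; it is omitted here to keep the sketch short.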
BRISK differs from BRIEF by the custom sampling pattern applied to compute
the binary descriptor vector. Each sampling point corresponds to a concentric ring
as shown in Figure 2.11, where red circles represent the standard deviation of the
Gaussian smoothing kernel applied over the interest points.
Unlike BRIEF, BRISK is an orientation invariant feature descriptor, meaning
that it estimates the orientation of the interest point from the selected sampling
pairs and rotates the sampling pattern to neutralize the effect of rotation. This is
done by distinguishing the sampling pairs as short pairs and long pairs, where long
pairs are used to determine orientation and short pairs are used for the intensity
comparisons that build the descriptor, as in the BRIEF algorithm. Short pairs
are pairs of sampling points whose distance is below a certain threshold, and
long pairs are pairs of sampling points whose distance is above a different
threshold.
Figure 2.11: The sampling pattern proposed for BRISK with N = 60 points. The blue circles correspond to the points of interest detected by the feature detector algorithm and the surrounding red circles represent the standard deviation of the Gaussian smoothing kernel applied over the interest points. Figure taken from [39].
The orientation of the sampling pattern is estimated by summing all the local
gradients of the long pairs and computing the ratio of the y component of the local
gradients over the x component of the local gradients. Short pairs are used, as in
BRIEF, to generate the N-bit descriptor vector, where the distance between two descriptor
vectors is computed by the use of the XOR operation between the vectors.
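The long-pair orientation estimate can be sketched as follows; the normalized local-gradient form follows the description above, while the function name and the example pairs are illustrative assumptions:

```python
import math

def pattern_orientation(img, pairs):
    """Estimate pattern orientation from the long pairs: accumulate local
    gradients g = (I(p2) - I(p1)) * (p2 - p1) / |p2 - p1|^2, then take the
    angle of the summed y component over the summed x component."""
    gx = gy = 0.0
    for (x1, y1), (x2, y2) in pairs:
        dx, dy = x2 - x1, y2 - y1
        norm = float(dx * dx + dy * dy)
        diff = (img[y2][x2] - img[y1][x1]) / norm
        gx += diff * dx
        gy += diff * dy
    return math.atan2(gy, gx)
```

On a purely horizontal intensity ramp, a symmetric set of long pairs yields an orientation of zero, as expected.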
Similar to BRISK, the FREAK algorithm also uses a custom sampling pattern
that is based on the model of a human retina. As shown in Figure 2.12, the suggested
sampling pattern corresponds to the distribution of receptive regions over the retina.
This results in a higher density of points near the center (feature coordinate), with
exponentially decreasing density as one moves away from the center.
Unlike the other two descriptors, FREAK sampling pairs are learned by maxi-
mizing the variance of the pairs and selecting uncorrelated pairs. This process results
in 4 sampling patterns of 128 pairs each. All 512 sampling pairs need to be evaluated
in order to generate the descriptor; however, descriptor matching for FREAK
descriptors can be applied over each pair sequentially, and if the distance between
two pairs is larger than a given threshold, the subsequent pairings do not need to be
further evaluated.

Figure 2.12: Illustration of (a) FREAK sampling patterns and (b) the human retina. The receptive cells in the retina are clustered into four areas with different densities, which is replicated in the FREAK sampling pattern. In (a), each circle represents an image block that requires smoothing with its corresponding Gaussian kernel. Figure taken from [3].
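The early-termination matching of FREAK descriptors can be sketched as follows; descriptors are modeled as 512-bit integers, the chunk size mirrors the four groups of 128 pairs, and the threshold is an illustrative assumption:

```python
def cascade_match(desc_a, desc_b, threshold, chunk=128):
    """Compare two 512-bit descriptors 128 bits at a time; stop early once
    the accumulated Hamming distance exceeds the threshold (no match)."""
    dist = 0
    for i in range(0, 512, chunk):
        mask = ((1 << chunk) - 1) << i
        dist += bin((desc_a ^ desc_b) & mask).count("1")
        if dist > threshold:
            return False  # reject without evaluating the remaining chunks
    return True
```

Because most candidate pairs are rejected in the first chunk, the average matching cost is well below a full 512-bit comparison.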
The orientation computation of FREAK descriptors is very similar to that of BRISK.
The only difference is that FREAK uses a pre-defined set of orientation pairs instead
of the long pairs used by BRISK.
2.2.5 Hardware Acceleration of Image Processing Kernels
As mentioned earlier in Chapter 1, multimedia applications running on portable
devices have gained huge popularity in the 21st century. This increased demand also
comes with higher expectations in application accessibility and quality. Therefore
hardware acceleration is an important tool for computer architects to create high-
throughput computer vision systems that can sustain high image resolutions in real-
time. Thankfully, computer vision kernels are highly data parallel and map well
to streaming architectures. In this section we will discuss some of the hardware
accelerated computer vision systems proposed in literature.
Different computational architectures present different opportunities to accelerate
feature detection and descriptor generation. This is particularly true of feature
detectors where every pixel is examined in order to determine if it meets the criteria
of a feature. These detectors tend to have few data dependencies, limited branching,
and few control dependencies. As noted by Che et al. [15], kernels with these
characteristics are strong candidates for GPU acceleration.
Feature descriptor generation, however, requires computation across non-contiguous
image sub-regions. This results in more irregular memory access patterns compared
to feature detectors. In addition, feature description algorithms tend to have higher
complexity and require pre-computation with less data parallelism, but which can
benefit from design techniques such as pipelining. This class of algorithms is a
stronger candidate for acceleration via FPGAs as compared to GPUs.
There has been a significant amount of research into hardware acceleration of
feature detection algorithms. For instance, Bouris et al. [10] and Svab et al. [67]
both presented FPGA implementations for the SURF algorithm. Both works emphasize
the power efficiency gained over GPU implementations; however, they fall
short of comparing FPGA run times to GPU versions [17]. Indeed, GPU
implementations in literature, such as [73] and [53], show the performance poten-
tial of feature detection working on high-end GPUs. Hongtao et al. [73] presents
a Harris-Hessian corner detector optimized for CUDA GPUs that can reach up to
20× faster computation speeds than CPUs. They also compare their implementa-
tion with SURF and report 1.25× speed-up. Phull et al. [53] also proposes a novel
low cost corner based detector algorithm optimized for CUDA GPUs that can run
14× faster than CPU implementations of the algorithms and 2× faster than Harris
corner detector implementations for GPUs. However, these works do not discuss the
power and energy repercussions of using such high-end GPUs, which dissipate power
in the range of 200W, and the challenges of running these algorithms on tightly
constrained embedded platforms. We emphasize that for embedded platforms, both
runtime performance and power need to be optimized.
On the other hand, a fully customized ASIC implementation of SURF in a 28nm
CMOS process was recently presented [28], where a very low power dissipation of 2.8mW
was reported. Despite the high computation requirement of the SURF algorithm
due to the use of the DoG approach, this work demonstrates the gains that can be
achieved by hardware-oriented design optimizations.
Other publications in the literature also suggest the use of FPGAs
for feature detection applications. Schaeferling and Kiefer [60] present a SoC feature
detection system that uses a Xilinx Virtex 5 FPGA to accelerate the SURF feature
detector algorithm. They propose Flex-SURF+, a customizable accelerator IP that
can be easily configured for algorithmic parameter changes in their design. Wal et
al. [70] present a distributed feature detector implementation for low-cost
Zynq FPGAs. The proposed distributed feature detector first finds all features with
strength above a certain threshold, and then divides the image into multiple small
tiles. The best features from each tile are then returned irrespective of the relative
strength of features between tiles. Their approach does not return the strongest
features over an image, but rather a well-distributed set of features. In addition, an FPGA
implementation for FAST feature detection was proposed by Kraft et al. [33], where
a baseline FAST implementation is presented. In the work by Rublee et al. [58],
the FAST feature detector is extended by adding the rotational BRIEF feature de-
scriptor to form the Oriented FAST Rotated BRIEF (ORB) feature detection and
description algorithm. Lee et al. [38] and Kulkarni et al. [35] each presented com-
plete hardware implementations for ORB, which emphasized run-time performance
gains over their hardware accelerated SIFT and SURF counterparts. The work done
by Fularz et al. in [23] goes a bit further and implements a hardware accelerated
architecture for real-time image feature detection (FAST), description (BRIEF) and
matching (Hamming distance) on an FPGA. Their performance measurements, how-
ever, are limited to runtime in terms of frames per second and resource utilization
(LUTs, FFs, and BRAMs) in a single embedded environment. In one of the more recent
publications on hardware-accelerated feature detection, Chang et al. [14] present an
implementation of Harris corner detection on an IBM POWER8 system integrated
with Altera FPGAs. While all of these works offer promising results, they all ignore
the power aspect of the presented systems and do not provide any analysis of power
dissipation.
As this body of work shows, acceleration of image processing algorithms
is an emerging topic of interest with a wide range of application domains
and varying design constraints. In order to create constraint-optimal accelerators,
a very large design space needs to be considered. In the following chapters, we
first present configurable accelerator designs for image processing that can be used
to explore this design space, and derive mathematical formulations to speed
up the design space exploration. We then present multiple implementations using
the know-how we have established for efficient accelerator design and explore the
design space of various embedded platforms for such algorithms.
Chapter 3
Performance, Power and Accuracy
Trade-offs in FPGA-based
Accelerators
Chapters 1 and 2 discussed the main motivations for design space exploration of
hardware accelerated systems for real-time image processing applications. We have
presented the current trends in the literature for design space exploration and also defined
the image processing algorithms we will be using as test cases for acceleration in this
study. Before looking into various embedded systems as a design space, in this
chapter we discuss our regression based technique for fast design space exploration
and multi-objective optimization for FPGA-based hardware accelerators.
FPGA-based accelerators are becoming widely used in real-time image processing,
for applications in scientific research, smart camera technologies, and the automotive
industry [31], among others. FPGAs are ideal as inexpensive prototyping platforms
to implement high-throughput solutions. Their reconfigurability allows for iterative
refinement and validation of a design implementation until desired goals are achieved.
With programmable logic elements, registers, lookup tables (LUTs), Block RAMs
(BRAMs), DSP blocks, and digital clock managers, FPGAs, by themselves or as
parts of a heterogeneous system, have the capability of parallelizing algorithms across
various hardware modules, making them well suited for tasks that require
high throughput.
Many of these high-performance platforms are also used in highly resource con-
strained environments where reduced power consumption becomes imperative. As
such, care must be taken to increase parallelism while at the same time minimiz-
ing energy consumption. Indeed, simply adding more hardware resources (whether
through FPGA logic or other computing fabrics) to solve the throughput problem
will not necessarily lead to a feasible solution for power/energy constrained systems.
Fortunately, we have observed that FPGA-based accelerators, especially those
used for image processing, offer many algorithmic and hardware design parameters
which, when properly chosen, can lead to outcomes with the desired design
metrics of throughput, power, design area, and arithmetic accuracy.
While having flexibility in both algorithmic and hardware design parameters
will increase the possibility of creating hardware accelerators that meet all design
constraints, it raises the question of how one should go about discovering an optimal
design among all possible design choices. Indeed, even the choice of a relatively few
parameters can lead to hundreds or even thousands of designs. Therefore, effective
design space exploration techniques are critical for efficiently navigating the large
design space. To speed up design exploration, we propose to sample the large design
space and then use regression models and statistical inference from the samples to
create mathematical models that estimate the target metrics over the entire design
Figure 3.1: Illustration of the idea of using regression-based modeling for design space
exploration and finding important designs based on objectives and constraints. Each
star on the graph on the right represents a design variant and the dashed line represents
the Pareto frontier. Designs shown in dashed yellow boxes represent optimal designs
given by the optimization framework, while the ones in blue represent the training set.
space.
Our approach aims to identify both algorithmic and hardware parameters that
optimize hardware accelerators. This information is used to run regression analysis
and train mathematical models within a non-linear optimization framework in order
to identify the optimal algorithm and design parameters under various objectives and
constraints. To automate and improve the model generation process, we propose the
use of L1-regularized least squares regression techniques. We implement two real-
time image processing accelerators as test cases: one for image deblurring and one
for block matching.
This work has been done in collaboration with my colleague Kumud Nepal. Im-
plementation of parameterized hardware accelerators and selection of algorithmic and
hardware parameters that optimize our accelerators will be the main focus of this
thesis. We will then discuss how these parameters were applied for our L1-regularized
design space exploration. More details on the modeling and multi-objective optimiza-
tion methodologies can be found in our previously published work [69] and Kumud
Nepal’s PhD Thesis [49].
A simplified illustration of our problem statement and proposed solution is pre-
sented in Figure 3.1, where we have a system with three design parameters: x, y
and z. The total design space for this system consists of all combinations of these
three design parameters. Each design dissipates power differently and has a certain
arithmetic accuracy. Assume we are interested in figuring out which design gives
us the best trade-off between power and accuracy, i.e. we want to achieve a system
that dissipates as little power as possible while its accuracy is still maintained at an
acceptable level. To find the optimal design variants, the results from the predicted
accuracy and power metrics from regression modeling are fed into an optimization
framework. The optimization framework presents a subset of those variants that
create a Pareto frontier (dashed green line on the graph), where the frontier points
do not dominate each other in both accuracy and power, but dominate other non-
frontier points. The Pareto frontier points represent the optimal trade-off between
arithmetic accuracy and power. It is up to the designer to pick from these opti-
mal designs (marked as dashed orange boxes in the design space) depending on the
allowed accuracy and/or power budget.
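The frontier-selection step described above can be sketched in a few lines of Python; the design tuples below are hypothetical illustrations, not measurements from our accelerators:

```python
def pareto_frontier(designs):
    """Return the designs that are not dominated in both power and error
    (lower is better for both metrics). A design is dominated if some other
    design is no worse in both metrics and strictly better in at least one."""
    frontier = []
    for name, p, e in designs:
        dominated = any(
            (p2 <= p and e2 <= e) and (p2 < p or e2 < e)
            for _, p2, e2 in designs
        )
        if not dominated:
            frontier.append((name, p, e))
    return frontier

# Hypothetical design variants: (name, power in mW, mean error %).
variants = [
    ("D1", 120, 0.5), ("D2", 100, 0.9), ("D3", 140, 0.3),
    ("D4", 110, 0.8), ("D5", 100, 1.2),
]
print(pareto_frontier(variants))   # D5 is dominated by D2 and drops out
```

A designer would then pick one frontier point depending on the allowed power or accuracy budget.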
3.1 Modeling and Optimization Methodology
A hardware accelerator design implementation of an algorithm can have tens of al-
gorithmic and physical design parameters, with potentially a large range of possible
values for each parameter. The combinations of these parameters create a design
space that grows exponentially as a function of the number of parameters. As a
result, explicit enumeration of every possible design choice is impractical, as it entails
creating, compiling, and programming the design for every combination choice in the
register-transfer level (RTL) flow. Nevertheless, designers are interested in exploring
this design space in order to identify the optimal values for these design parame-
ters that meet the target metrics such as throughput, power, area, and arithmetic
accuracy.
To evaluate the design metrics for any combination of parameters, the models
are queried with the values of the parameters. Following Design of Experiments [46]
techniques, it is important to sample only a small portion of the design space, but in a
uniform way, to capture the essential features of the design. We do this by selecting design combi-
nations randomly from within the design space. We incorporate possible minimum
and maximum configurations of each of the parameters in our training samples so
that we consider the full range of the design space. In this way, the models out-
put predictions that span the range of possible designs such that our optimization
framework can identify the configurations that lead to optimal designs.
These sample combinations are implemented in the design and the resultant
metrics characterized from real measurements (e.g., the deblur accelerator) and/or
from synthesis tool results (e.g., block matching accelerator). The characterized
results are then used as a training set to generate the scalable models.
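The sampling strategy above (random combinations plus the all-minimum and all-maximum corner configurations) can be sketched as follows; the parameter names and ranges are illustrative stand-ins, not the exact spaces used in our experiments:

```python
import random

def training_samples(param_ranges, n_random, seed=0):
    """Pick training configurations: the all-min and all-max corners of the
    design space, plus n_random distinct uniformly drawn combinations."""
    rng = random.Random(seed)
    names = list(param_ranges)
    corner_min = {n: min(param_ranges[n]) for n in names}
    corner_max = {n: max(param_ranges[n]) for n in names}
    samples = [corner_min, corner_max]
    while len(samples) < n_random + 2:
        cand = {n: rng.choice(param_ranges[n]) for n in names}
        if cand not in samples:          # keep the training set duplicate-free
            samples.append(cand)
    return samples

# Illustrative parameter space (values loosely echo the deblur design).
space = {
    "kernel_bits": list(range(8, 19)),   # 8..18 bits
    "tdm_factor": [1, 2, 4],
    "pipeline_depth": [3, 5, 7, 11],
}
picks = training_samples(space, n_random=6)
```

Each picked configuration would then be synthesized and characterized to form the training set.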
Once the best model representing the design objective is obtained, we are able
to estimate each of our design metrics (e.g., power, area, arithmetic accuracy) for
a given set of design parameters. These models can also be used with non-linear
optimization formulations with an objective and under certain model constraints,
giving designers the ability to target their design with a focus on a desired design
variable. Such multi-objective optimization problems can be solved using standard
non-linear optimizing techniques, as presented in the work by Byrd et al. [11].
To demonstrate the effectiveness of our methodology, we focus on algorithms
relevant for image processing as test cases, due to their suitability for and wide
adoption in FPGA-based acceleration.
3.2 Image Processing Applications
To use as test cases for our methodology, we have implemented two image processing
applications for FPGAs: image deblurring and block matching. We have identified
several design parameters, both algorithmic and architectural, and created parame-
terized architectures for each system. These designs were realized on and optimized
for the Xilinx Virtex 6 FPGA, which impacted our design options, especially for
architecture parameters such as the frequency limits of the dedicated DSP blocks.
Image deblurring and block matching algorithms use sliding window operations
at their core. For both of our implementations, we operate over still images stored
in initialized BRAMs. The image pixel intensity data is transferred to the image
deblurring and block matching designs row by row as a stream.
Figure 3.2: Top-level block diagram for the deblur architecture.
3.2.1 Image Deblurring
Images are produced to store and display useful information. However, captured
images almost always represent flawed replicas of the original scene due to the
imperfect nature of the capturing process. Image blur is one of the major degradations
introduced during this process, and image deblurring is therefore an essential component
of many image processing applications. Image deblurring is performed by a filtering
operation over the image. The accelerator is deployed within a system requiring
real-time image processing (for instance, mounted on an unmanned aerial vehicle
(UAV) used for surveillance). The real-time processing requirements of the accelerator
in this environment put tight constraints on its throughput, power, and
area, especially since the accuracy of the image deblurring algorithm itself must be
kept within acceptable margins. The need to meet all the requirements motivated
our proposed modeling and multi-objective optimization methodology.
Our implementation was targeted for an ultra high-throughput application where
each frame is divided into 368 sections, each a smaller image with pixel resolution
2592×1944. With a required rate of 10 frames per second, 18.5 GigaPixels per second
must be processed. Each input pixel is 12 bits. The kernel size can vary throughout
the computation of a single frame from 3×3 to 13×7 and each kernel data is 18 bits.
This system was targeted to be run on a highly parallelized platform with 20 FPGAs
where we implement the prototype system running on each FPGA. In order to
accommodate the required processing constraints, we have designed our accelerator
to process 8 pixels per cycle at a frequency of 125MHz, providing a processing rate
of 1 GigaPixel per second per FPGA and a theoretical 20 GigaPixels per second of
throughput over 20 FPGAs, which meets the target constraint.
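The throughput budget quoted above can be checked with a few lines of arithmetic:

```python
# Sanity-check the throughput numbers from the text.
sections = 368
px_per_section = 2592 * 1944
fps = 10
required = sections * px_per_section * fps   # pixels/second to be processed
per_fpga = 8 * 125_000_000                   # 8 pixels/cycle at 125 MHz
total = per_fpga * 20                        # 20 FPGAs in parallel

print(required / 1e9)   # ≈ 18.5 GigaPixels/s required
print(total / 1e9)      # 20.0 GigaPixels/s theoretical capacity
```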
The high-throughput requirement of our application required us to create a highly
parallelized architecture, which was enabled by the sliding window structure of
the image deblurring algorithm. We have used the dedicated DSP blocks on the
FPGA to perform the multiply-add operations required for each kernel multiplica-
tion.
The block diagram of the deblurring hardware is given in Figure 3.2. The image
pixels and kernel values stored on BRAMs of a Xilinx Virtex 6 FPGA board are fed
into the deblur system as video streams to be stored in line buffers. The image values
and the kernel values are then processed in the processing elements (PEs) and the
output is stored in yet another set of BRAMs. Dedicated DSP units on the FPGA
are used as PEs. The kernel control module reads in the kernel values and updates
the kernel buffer accordingly and the control module reads in the video stream data
and feeds the data samples into the pixel array.
The deblurring algorithm uses masks that have a maximum size of 13×7 which
requires being able to access 13 rows of data at a given time. The input data in
provides data from a row of the video frame; however, the previous 12 rows need to
be stored as line buffers. The most outdated line buffer is updated with the input
row information at the same time. This architecture deblurs 8 pixels/cycle running
at 125 MHz on our FPGA board.
The block diagram for a single row of our pixel array is given in Figure 3.3. Each
BRAM is connected to a 12-by-1 multiplexer that feeds the incoming pixel stream
to a single line of the register array. The multiplexers are controlled by the modulo
12 row counter to accommodate for the changing tag of each BRAM as new row
data is acquired. The boundary conditions in our system are handled using the border
replication method [54], which requires extending the image boundaries with a copy
of the closest pixel intensity. Extension of boundaries on the y-axis of an image is
performed by the use of the boundary select and boundary address signals shown in
Figure 3.3. The boundary address always corresponds to the BRAM
holding either the first or last row of an image and is selected when the current deblur
mask goes over the image boundaries on the y-axis. Replication of boundaries on
the x-axis of the image is handled by broadcasting the first pixel of each image row
to all registers during computation. To be able to sustain a constant 8-pixel
throughput in our register array, we utilize 2 delay registers and 6 temporary registers
along with 20 (13 + 7) registers directly connected to PEs. The delay and temporary
registers are used to make sure 2 cycles of 8 pixel inputs provide enough data to fill
up our PEs for computation at the beginning of a row of pixels.
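A minimal software reference model of this datapath, 2-D sliding-window filtering with border replication via clamped indices, is sketched below; the streaming line buffers and PE scheduling are abstracted away, but the arithmetic matches:

```python
def filter2d_replicate(img, kernel):
    """Sliding-window filter with border replication: out-of-range pixel
    reads are clamped to the nearest valid index, mirroring the hardware's
    boundary handling. img and kernel are lists of rows."""
    H, W = len(img), len(img[0])
    kh, kw = len(kernel), len(kernel[0])
    oy, ox = kh // 2, kw // 2
    clamp = lambda v, lo, hi: max(lo, min(v, hi))
    out = [[0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            acc = 0
            for ky in range(kh):
                for kx in range(kw):
                    sy = clamp(y + ky - oy, 0, H - 1)
                    sx = clamp(x + kx - ox, 0, W - 1)
                    acc += img[sy][sx] * kernel[ky][kx]
            out[y][x] = acc
    return out

# A 3x3 identity kernel leaves the image unchanged.
img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ident = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
assert filter2d_replicate(img, ident) == img
```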
Our image deblurring accelerator has a number of algorithmic and hardware
design parameters, the values of which determine its final metrics (e.g., power, de-
sign area, and arithmetic accuracy). Parameters that are chosen by the designer
are expected to have an impact on the constraint metrics and thus their selection
Figure 3.3: Architecture of a single row of the pixel array. A non-uniform 13×13
register array is used, allowing easy transition between 13×7 and 7×13 kernels.
requires an understanding of the inherent nature of the algorithm and the design
(e.g., parameters that affect the number of critical resources in the hardware or the
accuracy of the algorithm in the software). However, the designer does not need to
understand exactly how these parameters may affect the design constraints. Our
goal is to simplify the parameter selection process by enhancing the least squares
based modeling methodology by an L1 regularization process. L1 regularization, as
we have previously mentioned, suppresses irrelevant parameters and the interaction terms
between them, selecting only those that have an impact on the constraint metrics,
thereby highlighting design choices that may not have been obvious to the designer.
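A minimal sketch of L1-regularized least squares via proximal gradient descent (ISTA with soft-thresholding) illustrates how irrelevant parameters are driven to exactly zero. The toy data and step size are purely illustrative; the actual modeling in this work was done in MATLAB:

```python
def lasso_ista(X, y, lam, lr=0.01, iters=5000):
    """L1-regularized least squares via proximal gradient (ISTA):
    minimize 0.5*||X w - y||^2 + lam*||w||_1.  The soft-threshold step
    drives weights of irrelevant features to exactly zero."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    soft = lambda v, t: (v - t) if v > t else (v + t) if v < -t else 0.0
    for _ in range(iters):
        # Gradient of the least-squares term: X^T (X w - y).
        resid = [sum(X[i][j] * w[j] for j in range(d)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(d)]
        w = [soft(w[j] - lr * grad[j], lr * lam) for j in range(d)]
    return w

# Toy data: the response depends on feature 0 only (y = 2*x0); feature 1 is
# irrelevant, so L1 regularization should zero out its weight entirely.
X = [[1.0, 0.3], [2.0, -0.1], [3.0, 0.2], [4.0, -0.3], [5.0, 0.1]]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
w = lasso_ista(X, y, lam=0.5)
print(w)   # w[0] close to 2, w[1] exactly 0
```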
The parameters selected for the image deblurring test case are as follows:
Kernel Bit-Width (algorithm parameter): The fixed point kernel bit-width is
explored in the range of 8 to 18 bits as a design choice. Different bit-width selections
do not have any effect on the area and throughput of the design due to the fixed
width allocated for DSP inputs; however, both the power and accuracy of the design
vary with different bit-widths.
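The accuracy cost of narrower kernel bit-widths can be modeled by rounding weights to a fixed-point grid; the fractional-bit split below is a generic assumption for illustration, not the exact DSP input format of the device:

```python
def quantize(value, bits, frac_bits):
    """Round `value` onto a signed fixed-point grid with `bits` total bits,
    `frac_bits` of them fractional, saturating at the representable range."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    q = max(lo, min(hi, round(value * scale)))
    return q / scale

# The same kernel weight at 18-bit vs 8-bit precision (17 / 7 fractional bits).
w = 0.123456789
w18 = quantize(w, 18, 17)
w8 = quantize(w, 8, 7)
print(abs(w - w18), abs(w - w8))   # the 8-bit error is far larger
```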
Kernel Size (algorithm parameter): The kernel size used in the design can be
dynamically changed to be of any size up to 13×7. Any kernels smaller than 13×7
will be padded with zeros so that the kernel input for corresponding DSP blocks is
Figure 3.4: Comparison of DSP pipeline depths of (a) 6 and (b) 3.
equal to zero, and any switching activity due to changing pixel values does not propagate
through the processing elements. Thus both the accuracy and power of our design
vary with changing kernel size, but area remains unaffected.
DSP Pipeline Depth (design parameter): The architecture needs to account
for the maximum possible kernel size of 13×7; therefore, a total of 13 × 7 = 91
multiplications need to be performed. However this processing element array can
be implemented using DSP pipelines of varying depths. As the depth of each DSP
pipeline decreases, the number of required DSP groups increases to perform the same
number of computations. Smaller pipeline depths require fewer delay registers
for synchronization but use extra DSP slices for the addition of computed partial
sums, as illustrated in Figure 3.4. We use the average DSP pipeline depth as a design
variable.
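A simplified model of this trade-off, ignoring the extra adder DSPs and delay registers mentioned above, computes the number of pipeline groups and the average depth for a given maximum depth:

```python
import math

def dsp_groups(total_mults, max_depth):
    """Split `total_mults` multiply-accumulate operations into DSP pipelines
    of at most `max_depth` stages; the average depth is the multiplier count
    divided by the group count (a first-order model only)."""
    groups = math.ceil(total_mults / max_depth)
    return groups, total_mults / groups

# The 13x7 kernel needs 91 multiplications; try a few candidate depths.
for depth in (4, 7, 13):
    g, avg = dsp_groups(91, depth)
    print(depth, g, round(avg, 1))
```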
Time-Division Multiplexing (design parameter): The provision that DSP
blocks are allowed to run at potentially faster clock frequencies than the rest of the
Figure 3.5: Time-division multiplexing for a factor of 2.
FPGA system enables time-division multiplexing of image data to be employed [74].
For a time-multiplexing factor of n, n sets of pixels need to be available every system
cycle and the DSP blocks process n sets sequentially at an increased frequency. The
kernel inputs to the DSP blocks are constant and do not need to change in order to
compute different sets of image masks. Figure 3.5 illustrates the DSP usage for the
case when the time-division multiplexing factor is 2. This design choice can either
lead to an increase in system throughput by a factor of 2 or a decrease in the number
of DSPs used for computation by half. For our work, we prefer decreasing the number of
DSPs rather than increasing throughput when applying time-division multiplexing.
Our experiments allowed this multiplexing parameter to be 1, 2, or 4. Higher levels
of time-division multiplexing are not used due to achievable frequency limitations
within the DSPs.
Figure 3.6: Top-level block diagram for the block-matching architecture.
3.2.2 Block-Matching
We have chosen the block matching algorithm to further test our design space
exploration methodology. As detailed in Chapter 2, the highly computation-intensive
and data-parallel nature of block matching proved to be a very good match for
hardware acceleration. We have selected CIF (352×288) size images as our
benchmark [4] and kept throughput as a design constraint dependent on the design
parameters.
The block diagram of the block-matching hardware is given in Figure 3.6. Each
block from the current frame is loaded into the processing elements (PEs). Meanwhile
the search window is loaded into a buffer to be read by corresponding PEs. Each PE
calculates the absolute difference between a pixel in the current block and a pixel in
the search window block. The SAD of a search location is calculated by adding the
absolute differences calculated by PEs using an adder tree.
The implemented hardware traverses the search locations in the search window
row by row in a zigzag pattern. This allows continuous computation of SADs for
each block of the search window no matter the direction of the search by utilizing the
overlap of search blocks within a given search window. For each new search location,
16 new pixels are required. This zigzag flow is enabled by the use of PEs that are
capable of shifting data up, left and right, and also by the use of 16 8-bit temporary
registers that store the values for an up shift of the search block. These temporary
registers are also filled using the corresponding values of the search window.
For a block matching algorithm using block sizes of N×N and search range [−p, p],
the dataflow of the PEs using zig-zag flow is shown in Table 3.1. Sx,y is a search
window pixel and current block pixels are not shown in the table since they are
not modified (e.g. PE0,0 stores C0,0) throughout the operation of the current search
block. The search window buffer feeds new pixel information to the first column of
PEs every cycle. For a block size of N×N, N cycles are required to fully fill up the
PEs as the pixel information is propagated within each row of PEs. An extra BRAM
is used to perform a reverse shift in the PE array, after all the search locations
in a column are searched in N + 2p cycles. The extra BRAM is connected to the
temporary registers and by the time the search window needs to move in the opposite
direction, the temporary registers have the necessary pixel data to shift through the
PEs. This enables the zig-zag search flow to proceed uninterrupted during vertical
shift of the search window.
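A software reference model of the underlying SAD search, exhaustive over [−p, p] and without the zig-zag scheduling (which only changes the visit order, not the result), can be sketched as follows; the toy frames are illustrative:

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-size pixel blocks."""
    return sum(abs(a - b) for ra, rb in zip(block_a, block_b)
               for a, b in zip(ra, rb))

def best_match(cur, ref, bx, by, N, p):
    """Exhaustively search ref within [-p, p] of (bx, by) for the NxN block
    minimizing SAD against the current block; returns (sad, dx, dy)."""
    cur_blk = [row[bx:bx + N] for row in cur[by:by + N]]
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            y, x = by + dy, bx + dx
            if 0 <= y <= len(ref) - N and 0 <= x <= len(ref[0]) - N:
                cand = [row[x:x + N] for row in ref[y:y + N]]
                s = sad(cur_blk, cand)
                if best is None or s < best[0]:
                    best = (s, dx, dy)
    return best

# Toy frames: the reference frame is the current frame shifted right by one
# pixel (left edge replicated), so the true motion vector is (dx=1, dy=0).
cur = [[x + 10 * y for x in range(8)] for y in range(8)]
ref = [[row[0]] + row[:-1] for row in cur]
print(best_match(cur, ref, 2, 2, 4, 2))   # → (0, 1, 0): zero SAD at dx=1
```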
Similar to the image deblurring implementation, we have identified a number of
algorithmic and hardware design parameters that create our design space.
Pixel Truncation (algorithm parameter): The SAD calculation is approximated
by the pixel truncation parameter. By truncating the least significant bits (varied
Table 3.1: Data flow of block-matching PEs

Clock    First row of PEs          Last row of PEs           Temp column
cycle    PE1,N     ...  PE1,1      PEN,N     ...  PEN,1      FFN        ...  FF1
0        S1,1           nop        SN,0           nop        SN+1,0          nop
1        S1,2           nop        SN,1           nop        SN+1,1          nop
...
N        S1,N           S1,1       SN,N           SN,1       SN+1,N          SN+1,1
N+1      S1,N+1         S1,2       SN,N+1         SN,2       SN+1,N+1        SN+1,2
N+2      S1,N+2         S1,3       SN,N+2         SN,3       SN+1,N+2        SN+1,3
...
N+2p     S1,N+2p        S1,2p      SN,N+2p        SN,2p      SN+1,N+2p       SN+1,2p
N+2p+1   S2,N+2p        S2,2p      SN+1,N+2p      SN+1,2p    nop             nop
...
N+4p     S2,N           S2,1       SN+1,N         SN+1,1     SN+2,N          SN+2,1
from 0 to 5 bits) of the pixels in the image blocks, a reduction in hardware
area and power dissipation is achieved in both the SAD computation (PEs and adder
tree) and the comparison blocks. However, this comes at a cost in arithmetic accuracy.
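A small model of the truncation, dropping the t least significant bits of each pixel before taking the absolute difference (illustrative values, 8-bit pixels assumed):

```python
def sad_truncated(block_a, block_b, t):
    """SAD with the t least-significant bits of each pixel dropped,
    modeling the narrower absolute-difference datapath."""
    return sum(abs((a >> t) - (b >> t))
               for ra, rb in zip(block_a, block_b)
               for a, b in zip(ra, rb))

a = [[200, 13], [7, 129]]
b = [[198, 14], [9, 120]]
print(sad_truncated(a, b, 0))  # exact SAD: 14
print(sad_truncated(a, b, 3))  # 3-bit truncation: coarser but cheaper
```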
Search Window Size (algorithm parameter): We define a search window of
size ±p×p if the maximum and minimum distances from the current block are bounded
by ±p on both the x and y axes. By increasing the search window size, we allow the
design to find a potential best-matching block further away from the original block;
however, this comes at the cost of decreased throughput due to the increased number
of search locations.
Block Size (algorithm parameter): The size of non-overlapping N ×N rectan-
gular blocks (N) is defined as an algorithm parameter. Larger block sizes require
evaluation of a larger area per comparison, but reduce the number of blocks to
be evaluated per frame. The granularity of the comparison directly affects the accuracy
of the system and also provides a trade-off between area and power dissipation.
Number of Processing Elements (design parameter): We vary the number of
PEs used to compute the SAD for a single N×N block. Since the number of absolute
difference computations per block is fixed at N × N , we can have a maximum of
N ×N PEs. By reducing the number of PEs by a power of 2, we end up requiring
more cycles to compute the SAD of a given block, thus allowing a trade-off between
the area and throughput of the implementation.
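A first-order model of the resulting cycle count per SAD, ignoring adder-tree latency and pipeline fill:

```python
def sad_cycles(N, num_pe):
    """Approximate cycles to accumulate one SAD when num_pe processing
    elements share the N*N absolute-difference operations (a first-order
    model only; it ignores adder-tree latency and pipeline fill)."""
    assert N * N % num_pe == 0
    return N * N // num_pe

# Halving the PE count doubles the cycles per block: an area/throughput knob.
for pe in (64, 32, 16):
    print(pe, sad_cycles(8, pe))
```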
3.3 Experimental Results
Our hardware accelerator prototypes use a 40 nm Xilinx XC6VLX240T FPGA with
240,000 logic elements and 768 DSP blocks. Xilinx ISE Design Suite 12.4 is used
for physical synthesis and Mentor Graphics Modelsim 10.1b is used for functional
and timing simulations of the design. MATLAB is used for regression and optimiza-
tion. To evaluate our accelerator performance for the image deblurring system, we
use a number of sample images that are captured from the aerial vehicle platform.
For evaluation of the block matching architecture, results are computed using the
Foreman sequence [4]. The design metrics are estimated as follows:
• Throughput: For the deblur design, throughput is measured in terms of the
number of pixels deblurred per cycle. In all our design variants, we ensure the
design meets an operating frequency of 125 MHz and 8 pixels/cycle deblur. We
relax this limitation for the block-matching algorithm and make throughput
a variable that depends on design parameters. Throughput for this particular
example is measured in terms of frame rate — how many frames we can perform
block matching on, per second.
• Area: Since DSPs are the most critical resource, area for the deblur example
is measured by the number of DSP blocks used by the accelerator. For the
block matching architecture, since DSPs are not used, we measure the area
metric using total number of LUTs. Each logic element in the FPGA used is
composed of 8 registers and 4 LUTs. For purposes of uniformity, we convert
the number of registers to equivalent LUTs and use the total number of LUTs
as our measurement for area of the design.
• Accuracy: To estimate the inaccuracy of a particular deblur accelerator vari-
ant, we compute the mean square error (MSE) between sample image data and
its deblurred result from the accelerator, i.e., the average of the squared differences
between the reference image pixels and those produced by the accelerator variant.
In the case of the block-matching algorithm, we use the MSE between the reference
block and the current block relative
to the results obtained from the base implementation: 32×32 window size, no
pixel truncation and 64 PEs.
• Power: To estimate the power dissipation of the accelerator, we followed two
approaches. The first approach executes Modelsim on the routed design to
obtain signal activity values and then feeds these values to the Xilinx XPower
tool to estimate power. The second approach measures the incremental power
consumption of our prototype board directly using an external digital multime-
ter (e.g., Agilent 34410) where the incremental power is the difference between
the reset state power and the execution state power of the design. Our setup
is displayed in Figure 3.7. The first approach, which estimates power from
timing simulation information, computes the power dissipated by the archi-
tecture only, while the second approach gives more representative results as it
accounts for all the additional system power (e.g., FPGA and memory) that
Figure 3.7: Power measurement setup using external digital multimeter
is associated with the computations of our accelerator. The incremental sys-
tem power is the real cost that the end user incurs. For variety, we use the
XPower results for our block-matching architecture analysis and the board
measurements for the deblur architecture. Consistency within each example ensures
that the validity and accuracy of our methodology are not compromised.
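As a concrete illustration, the MSE used for the accuracy metric above can be computed as in the following Python sketch (the function name and array-based interface are ours, for illustration only; they are not part of the dissertation's tool flow):

```python
import numpy as np

def mse(reference, result):
    """Mean squared error: average of the squared pixel differences
    between a reference image and an accelerator's output."""
    ref = np.asarray(reference, dtype=np.float64)
    out = np.asarray(result, dtype=np.float64)
    return float(np.mean((ref - out) ** 2))
```

For the block-matching accelerator, the same computation is applied between the reference block and the current block, relative to the base implementation described above.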
Experimental results for each of our test cases are discussed hereafter:
3.3.1 Modeling Results
Image Deblurring
For the image deblurring accelerator, we implement the design parameters as men-
tioned in Section 2.2.1. We use factors of 1, 2, and 4 for time-division multiplexing
which correspond to a DSP clock frequency of 125, 250 and 500 MHz. We also make
four different choices of average DSP pipeline depths between 3.3 and 11.5. These
depths are calculated by dividing the total number of DSPs used in all pipeline blocks
by the number of blocks used. For kernel bit-width, we vary the parameter from 8
bits to 18 bits. We also pick four random kernel sizes between 5×3 and 13×7. The
Figure 3.8: Error percentage of power model over explored design space percentage, comparing linear, interaction, quadratic, pure-quadratic, and L1-regularized model fits.
combinations of parameters create a design space with 3 × 8 × 11 × 45 = 11,880
possible design points that potentially lead to different accelerator variants.
Full physical synthesis (which includes placement and routing) of an accelerator
variant takes about two hours on our quad-core based system, which puts limitations
on the ability to execute a brute-force exploration of all accelerator variants. This
motivates the need for fast design space exploration and optimization. To obtain
our samples, we fully synthesize and implement 50 deblur accelerator variants with
different parameter permutations; i.e., we only sample 50/11880 = 0.42% of the entire
design space. These design points are selected randomly across the entire design
space with the condition that the minimum and maximum configuration for each
parameter is used at least once. This guarantees that any data point estimated by
our predictor lies within the space covered by the training set, and that the training
set is representative of the entire design space.
We first analyze the precision of different regression models in estimating power,
area, arithmetic accuracy, and throughput against the measurements we obtained
from our samples. We show the comparison of the precision of the models as an
illustration of how our approach stands against traditional regression models (linear,
quadratic, etc.). All these models without L1 regularization have similar run times
(i.e., less than 0.2 seconds). The additional step of finding the right λ parameter
that minimizes the prediction error in the L1 regularization methodology involves
searching over a range of candidate values before a specific value is chosen.
In our approach, we varied λ from 0.0001 to 1000 and randomly picked 25 values
within this range. Of these 25 values, we chose the one that leads to the lowest
error. This whole process took about 25 minutes; however, it should be noted that
this is a one-time overhead required for accurate model generation. Once the model
is derived, the process of physically implementing and synthesizing a design can be
eliminated — thus allowing for what would take two hours to be completed in less
than a second.
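The λ-selection procedure described above can be sketched as follows. This is an illustrative Python/NumPy version of L1-regularized least squares (coordinate descent with soft-thresholding) plus a search over candidate λ values; the dissertation's actual models were built in MATLAB, and all function names here are hypothetical:

```python
import numpy as np

def lasso_fit(X, y, lam, iters=500):
    """L1-regularized least squares via coordinate descent (a sketch)."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(iters):
        for j in range(d):
            # residual with feature j's current contribution removed
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r
            # soft-thresholding step: suppresses weak coefficients to zero
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return w

def pick_lambda(X_train, y_train, X_query, y_query, lambdas):
    """Try each candidate lambda; keep the one with the lowest
    mean absolute error on the query set."""
    best_lam, best_err = None, None
    for lam in lambdas:
        w = lasso_fit(X_train, y_train, lam)
        err = np.mean(np.abs(X_query @ w - y_query))
        if best_err is None or err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```

In this sketch the 25 candidate λ values would simply be drawn at random from the [0.0001, 1000] range before calling `pick_lambda`.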
To train and evaluate the aforementioned model, we split our samples into two
subsets: a training subset is used to learn the model parameters, and a query subset
is used to validate the closeness of the model to the true measurements by taking the
average absolute error between the model predictions and the actual measurements.
To evaluate the quality of the results, we follow the repeated random sub-sampling
validation methodology [32] and repeat our training and query set selection 100
times so that any training bias is eliminated. For the purpose of this particular
implementation, we randomly chose 35 samples as training and the remaining 15 as
query for each iteration. Evaluation of predicted values from the model is validated
by averaging over these 100 runs.
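The repeated random sub-sampling validation loop above can be sketched as follows (illustrative Python; `fit` and `error` stand in for the regression fit and the average-absolute-error computation, respectively):

```python
import random

def sub_sampling_validation(samples, train_size, runs, fit, error):
    """Repeated random sub-sampling validation: split the samples into
    random training/query subsets `runs` times and average the query
    error, so that any single training-set bias is averaged out."""
    total = 0.0
    for _ in range(runs):
        shuffled = samples[:]
        random.shuffle(shuffled)
        train, query = shuffled[:train_size], shuffled[train_size:]
        total += error(fit(train), query)
    return total / runs
```

With 50 samples, `train_size=35`, and `runs=100`, this mirrors the setup used for the deblur models.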
To gain further insights into the effectiveness of various models, we evaluate the
relation between the mean error generated by the different models as a function of
the training subset size. For example, we plot the results for the power model used in
Figure 3.9: Sensitivity of different parameters (time-division multiplexing factor, average DSP pipeline depth, kernel bit-width, and kernel size) over the power estimation.
the deblur example in Figure 3.8. The plot shows that the model obtained using L1-
regularization requires a slightly larger training set to stabilize due to the presence of
higher order terms. However, it performs better and stabilizes after only
0.3% of the design space is explored for both accelerator setups.
The model coefficients obtained from L1-regularization reveal the impact and sig-
nificance of different algorithm-design variables on the final outcome of the design.
However, because of the presence of quadratic terms and interactions, numerical
comparisons are not straightforward. To carry out an accurate evaluation, we con-
sider the sensitivity of the L1 model to different variables using MATLAB’s response
surface modeling toolbox. We plot the results, again as given for the power model
for the deblur example in Figure 3.9. The solid line represents estimated power for
each parameter variation, given that all other parameters are kept constant. The
dotted lines show the 95% prediction bands for the power value estimated, showing
that if the prediction was repeated with different samples, the estimated value would
lie within the range specified 95% of the time. It can be inferred from the picture
that average DSP pipeline depth has the highest sensitivity on power estimation,
and the power values vary the most for this parameter. The trend observed for
Figure 3.10: Comparison of mean error percentage using different model fits for the power, area, and arithmetic accuracy models for the image deblur algorithm.

Model fit        Power    Area     Accuracy
linear           17.56%   16.23%    9.22%
interaction      22.55%   19.69%    9.83%
quadratic        14.76%    2.52%   11.08%
pure quadratic   14.25%    3.22%   10.72%
L1-regularized    7.48%    2.38%    9.22%
DSP pipeline depth reflects the trade-off obtained by varying the parameter:
smaller pipeline depths require larger DSP groups to perform the same number of
computations with fewer delay registers for synchronization. Therefore both small
and large DSP pipeline depths benefit from this trade-off from different ends in terms
of power. The time-division multiplexing factor affects power the most after aver-
age DSP pipeline depth, followed by kernel bit-width and size, which have similar
impacts as time-division multiplexing. Time-division multiplexing has a quadratic
relationship with respect to power which is caused by the trade-off between the num-
ber of DSPs used and their running frequencies. Lower values require more DSPs,
while larger time-division multiplexing factors require fewer DSPs running at higher
frequencies. As expected, kernel bit-width has a linear interaction with power, since
larger complexity in pixel arithmetic always results in higher power dissipation. Ker-
nel size also has a quadratic relationship with power and power saturates for very
large kernel sizes. The sensitivity results also align with the individual trends we
observed in our measurements.
The estimation accuracy of the different models used for the power, area and arith-
metic accuracy metrics is given in Figure 3.10 for training sizes of 36, 9, and 23 samples.
The results show that the models obtained from L1 regularization outperform other
models.
Block Matching
Similarly, for the block-matching accelerator, we use multiple search window sizes of
±4×4 to ±32×32. The number of lower significant bits truncated from pixel values
for computation defines our second design parameter. We vary this from 0 to 5. For
the block size (N ×N), we use 4× 4, 8× 8 and 16× 16 as three different sizes; and
for the number of PEs, we use N × N , N×N2
and N×N4
. As in the deblur test case,
combination of these parameters create a design space with 29 × 6 × 3 × 3 = 1566
possible variants of the accelerator.
Likewise, we only synthesize 18 variants, and hence sample a mere 18/1566 = 1.1%
of the block-matching design space, and use sets of training and query data points to
model regression behaviors and cross-validate their performance.
Predictably, as seen in Figure 3.11, the model obtained from L1-regularization
is superior to every other model we tried, predicting values very close to those
obtained from the XPower-generated data for all metrics (power, area, arithmetic
accuracy and throughput).
The L1-regularized model coefficients provide insight into the interaction of de-
fined parameters with the constraint metrics. In this example, the parameters and
interactions that were dominant in our model for each metric are as follows:
• Throughput: As expected, the two parameters that are dominant for this
metric are window size and the number of PEs. The terms that involve only the
number of PEs have the most impact among all parameters, with the quadratic
term of number of PEs having a larger weight than any other term. These are
followed by the pair-wise interaction terms between the number of PEs and
the window size.
• Area: The area model is dominated by the number of PEs and the pair-wise
interactions involving it. It is interesting to observe that the contribution of
pixel truncation and window size is mainly through their pair-wise interactions
with number of PEs.
• Accuracy: The quadratic term for the search window parameter and its high-
order interaction with pixel truncation are the main contributors to the ac-
curacy model. Similar to the area model, it is surprising to see that pixel
truncation does not impact accuracy on its own but rather mainly through its
interaction with the window size.
• Power: The only parameter that influences power alone is pixel truncation
and the dominant terms observed are all interaction terms. The term with
the largest weight is the pair-wise interaction between window size and pixel
truncation followed by the pair-wise interaction between pixel truncation and
number of PEs.
It is observed that the non-suppressed terms that appear for each model match
our expectations. However, certain interaction terms have a much larger impact on
the models than may be intuitively obvious to the designer. This demonstrates the
usefulness of our methodology at automatically detecting the combined effects of
parameters on design constraints without relying on user input.
Figure 3.11: Comparison of mean error percentage using different model fits for the power, area, arithmetic accuracy, and throughput models for the block-matching algorithm.

Model fit        Power    Area     Accuracy  Throughput
linear           16.82%   5.11%    7.29%     12.47%
interaction      20.25%   3.97%    7.41%     13.72%
quadratic        16.75%   4.50%    7.63%      8.41%
pure quadratic   11.28%   4.72%    8.70%     11.19%
L1-regularized    9.85%   3.56%    5.55%      3.80%
Figure 3.12: Trade-off between power and arithmetic inaccuracy of the image deblurring system. Annotated design points include timemux 4 / bit-width 18 / kernel 13x13, timemux 1 / bit-width 14 / kernel 13x13, and timemux 1 / bit-width 8 / kernel 13x13.
Case Study Results
The mathematical models obtained through L1-regularization enable us to create a
numerical optimization framework to optimize the accelerator designs with respect
to certain selected metrics while imposing constraints on other metrics. These chosen
metrics and their values are selected to optimize the design within the specifications
Figure 3.13: Trade-off between area and power of the image deblurring system. The annotated Pareto points correspond to pipeline depths of 5.3, 6.3, and 7.0.
of its target deployment. We have verified the optimization methodology presented
by Kumud Nepal [49] by applying the parameters detected by the optimization
framework on our target accelerators.
Figure 3.12 shows the data points from our design space for the image deblurring
example. Each data point corresponds to a set of parameters of a design yielding
the least amount of power for a given arithmetic inaccuracy constraint. The data
points marked with red filling are the Pareto points of our design space as there is
no point on the design space that can improve either the arithmetic inaccuracy or
the power dissipation without making the other design metric worse. We observe
that for the image deblurring algorithm, arithmetic inaccuracy is heavily dependent
on the kernel bit-width parameter, whereas higher time-multiplexing results in lower
power dissipation when the full bit-width of the 18-bit kernels is used.
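Pareto points like those marked in Figure 3.12 can be extracted with a simple dominance check. The following Python function is an illustrative sketch (the function name is ours, and both metrics are assumed to be minimized, e.g., power and arithmetic inaccuracy):

```python
def pareto_points(points):
    """Return the points of a 2-D design space that are not dominated
    by any other point, where both coordinates are metrics to minimize.

    A point p is dominated if some other point q is no worse in both
    metrics (q[0] <= p[0] and q[1] <= p[1]) and differs from p."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return front
```

Each surviving point is one for which no other design improves one metric without worsening the other, matching the definition used above.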
Similarly, Figure 3.13 shows the data points from our design space that reflect
the trade-off between area and power dissipation of the image deblurring design in
Figure 3.14: Trade-off between arithmetic inaccuracy and area of the block matching system. The annotated Pareto points correspond to pixel truncation values of 0, 4, and 5, each with a ±4×4 search window and 64 PEs.
terms of the pipeline depth parameter. The data points with red fillings represent
the Pareto points on our design space and each Pareto point corresponds to a dif-
ferent pipeline depth value for our system. Having this behavior represented in our
prediction modeling framework, a designer can identify the area vs. power trade-
off that is most suitable for a given application domain and use the corresponding
pipeline depth as a parameter.
We observe the trade-off between area and arithmetic inaccuracy for our block
matching design in Figure 3.14. The sole parameter that has control over this trade-
off is pixel truncation, since the data points that result in minimum area for a
given arithmetic inaccuracy constraint always share the same values for the rest of
the parameters. Each Pareto point marked with red filling on the plot corresponds to
a different pixel truncation parameter; therefore a designer need only adjust the
value of pixel truncation to generate a design with the required area vs. arithmetic
inaccuracy trade-off.
Significance of Results
Compared to previous regression techniques proposed in the literature, L1 regular-
ization enables automatic discovery of the exact mathematical dependency between
the variables and the desired model outcome with no need for guesswork from the
designers. This exact dependency results in improved correspondence between the
results of the model and the actual measurements.
These benefits of automated model generation, and the resulting correlation of our
models to the measurements, enable us to query the model directly for non-sampled
design points, obtaining a dramatic speed-up in design space exploration.
Since full synthesis plus place and route for our deblur design takes two hours,
and our model achieves its least error using only 35 samples (0.3%) of the full
11,880-point design space, our L1-based model achieves approximately a 340×
speedup in design exploration with estimation errors of 7.48%, 2.38% and 9.22%
for power, area, and accuracy respectively. With the block matching algorithm, we
use only 18 samples from the entire design space for training, so our speedup is
approximately 90×. We report speedup for both examples as the ratio of the time it
would have taken to implement and synthesize the entire design space to the time it
takes to implement and synthesize only a few sample points for training and query.
The non-linear optimization framework implemented in MATLAB after the models
are generated for each metric takes about 0.1 seconds to run and is almost negligible
in comparison to the runtime taken for synthesis of the designs.
Our optimization framework presents us with the ideal parameters to adjust in
order to obtain the trade-offs between two design constraints. Using this information,
we are able to identify the range of the design metrics most crucial to our target
application.
3.4 Summary and Discussion
In this chapter we explored, via two different architecture setups, techniques for fast
design space exploration for FPGA-based accelerators. Reconfigurability of FPGAs
enabled us to implement a small fraction of a large design space and apply regression
analysis to obtain analytical information and formulate scalable models to predict
values for various design metrics such as power, arithmetic accuracy, performance,
and area. We proposed automatic techniques to devise the best model using re-
gression analysis such as the L1-regularized least squares estimation. We created
a case study for an image deblurring accelerator. For the accelerator design, the
proposed models predict the implementation metrics within 8% of measured power
values, within 10% of the output arithmetic accuracy, and within 3% of the actual
FPGA resources used. We also studied a block-matching accelerator as a second test case
and introduced system throughput as an additional variable dependent on the de-
sign parameters. This second accelerator design confirmed our findings about the
benefits of the L1-based modeling methodology. Our predictions were again fairly
close to the actual measurements of the metrics — within 10%, 4%, 6%, and 4% for
power, area, arithmetic accuracy and throughput respectively. With these accurate
models in hand, we are able to immensely expedite the design space exploration
process: a 340× speedup for the image deblurring test case and a 90× speedup for the
block matching test case were achieved. We are also able to come up with numerical
optimization formulations that give directly the optimal design parameters under
various objectives and constraints.
Despite the huge speedups we have gained in terms of design space acceleration
and constraint driven optimization, the very long synthesis and placement times
to gather our samples were the limiting factors in generation of our models. More
importantly, the time to prototype these designs is fairly long, mainly due to the
waveform-simulation-based verification and debugging of these implementations. During
the course of this work, we aimed to put our experience in the domain of power-aware
implementation of image processing algorithms on FPGAs to the test by entering the
first iteration of the now-annual low-power image recognition challenge (LPIRC). Due
to the tight schedule of the competition, we had to forgo our FPGA implementation and
proceed to the competition with a software solution. We observed that none of the
FPGA-focused teams managed to produce a working system by the time of the
competition, and the competition was heavily dominated by groups using embedded
GPUs, specifically the NVIDIA Jetson TK1. The GPUs performed very well in terms of
power efficiency compared with low-power CPU-based systems, and provided an
easy-to-implement solution compared with hardware accelerators.
Given our experience participating in the low power image recognition challenge,
going forward, we have decided to explore the design space of other low-power em-
bedded platforms, mainly the embedded GPUs, in order to better understand their
impact on area, runtime, and power dissipation of various algorithms.
Chapter 4
Hardware Acceleration on
Low-power Embedded Platforms
In the previous chapter, we presented an approach for design space exploration using
analytical models. Our design space is composed of design configurations that use
both algorithmic level design parameters (e.g., input bit-width and kernel size) and
hardware level design parameters (e.g., time-division multiplexing and DSP pipeline
depth) for FPGA based accelerators.
In this chapter, we present a comparative study of feature detection and descrip-
tion algorithms across various embedded platforms. We evaluate these algorithms in
terms of run-time performance, power dissipation and energy consumption. In par-
ticular, we compare embedded CPU-based, GPU-accelerated, and FPGA-accelerated
embedded platforms and explore the implications of various architectural features
for the acceleration of these fundamental computer vision algorithms.
Feature detection and description algorithms form the basis for the majority of
Figure 4.1: The precision/recall rate and the run-time comparison of feature descriptors (SIFT, SURF, BRIEF, BRISK, FREAK) on an Intel i7 CPU.
present-day computation-intensive computer vision applications such as 3D mapping,
object detection and tracking, and motion and camera pose estimation. This section
provides an overview of these algorithms, a discussion of their amenability to hard-
ware acceleration on three different low-power embedded systems (ARM CPU, GPU
and FPGA), and an overview of the metrics utilized to characterize the performance
of hardware-accelerated detection and description algorithms.
4.1 Selection of Feature detection and description
algorithms
Feature descriptors based on Histogram of Gradients (HoG) such as SIFT and SURF
require computing the gradient of the image in the region of each feature, which is
a very costly process. The SURF algorithm speeds up this process via the use of
integral images; however, it is still not efficient enough to be used for real-time
embedded applications.
As shown in Figure 4.1, SIFT feature descriptors perform best out of commonly
used feature descriptors in terms of precision and recall rates, where precision is the
number of relevant detected features over the total number of detected features and
recall is the number of relevant detected features over the total number of relevant
features. However, when comparing the run time of HoG-based and binary feature
descriptors, we observe that SIFT (as a HoG-based descriptor) has a computation
time 2 orders of magnitude greater than that of binary descriptors (540ms vs. 3.5ms
for BRISK, the slowest binary descriptor). The SURF algorithm, which is also HoG-
based, improves computation time through the use of integral images; however, it
is still an order of magnitude slower than the binary descriptor algorithms, which is
insufficient for real-time use.
The flowchart of the feature detection and description framework is given in
Figure 4.2. The program starts by reading the input frame from the memory. The
FAST feature detection algorithm is then applied over each pixel (p) of the frame
as a sliding window operation, as detailed in Chapter 2. Despite the non-standard
mask of the FAST algorithm (the Bresenham circle, in which the immediate neighbors
of the center pixel are not used for computation), spatial locality can still be
exploited to parallelize the algorithm. The high spatial locality of sliding window
operations lends itself to parallelization: a pixel is more likely to be accessed
when its neighboring pixel has already been accessed, so each memory read for a
group of pixels is likely to fetch multiple values for immediate use.
Both instruction-level and thread-level parallelization techniques can be applied over
the algorithm. The pixels over the Bresenham circle are transferred into 16 element
arrays where instruction-level parallelism is utilized to compare the values over the
circle with the center pixel. Once the comparison binary array is generated, existence
of a continuous string of 12 bits is checked. If this check returns true, then the center
Figure 4.2: Flowchart for feature detection and description.
pixel is declared a corner feature. This process is repeated for each pixel of the input
frame and can be parallelized over multiple computation units or threads. Once a
certain p is found to be a corner, N sampling pairs (Xi, Yi), i = 1, . . . , N,
surrounding p are evaluated to describe the feature using an N-bit descriptor
vector D. Each bit of vector D is set to 1 (true) or 0 (false) based on the
evaluation of Xi > Yi.
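The detection and description steps above can be sketched in Python (illustrative only; the evaluated implementations are in Verilog and OpenCV, and the threshold value here is a hypothetical parameter):

```python
def is_fast_corner(circle, center, threshold=20, run_length=12):
    """FAST-style test: does the 16-pixel Bresenham ring contain a
    contiguous run of `run_length` pixels all brighter (or all darker)
    than the center by `threshold`? (Threshold value is hypothetical.)"""
    for cmp in (lambda v: v > center + threshold,
                lambda v: v < center - threshold):
        bits = [cmp(v) for v in circle]
        doubled = bits + bits          # doubling handles ring wrap-around
        run = 0
        for b in doubled:
            run = run + 1 if b else 0
            if run >= run_length:
                return True
    return False

def binary_descriptor(pairs):
    """N-bit descriptor: bit i is 1 iff X_i > Y_i for sampling pair i."""
    return [1 if x > y else 0 for x, y in pairs]
```

In the actual framework, detection runs as a sliding window over the frame and can be parallelized across threads, while the comparison of the 16 ring pixels maps onto instruction-level parallelism.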
4.2 Platform Implementations
We have targeted three distinct low-power embedded platforms for evaluating the
various feature descriptor and detector algorithms. For the low-power GPU and
embedded CPU platforms, we used the Jetson TK1 development kit. The Jetson
board has a 28nm Tegra K1 SoC with an integrated Kepler GPU with 192 CUDA
cores that run at 950MHz and a quad-core ARM Cortex A15 CPU that runs at
2.5GHz. The system has 2GB on board memory. For the FPGA design we use a
MicroZED development board featuring a 28nm Zynq 7020 SoC, which integrates
an Artix-7 FPGA with a dual-core ARM Cortex A9 CPU, and a 1GB DDR3. This
platform is logically divided into a Processing System (PS) side containing the ARM
CPUs and the Programmable Logic (PL) side with the FPGA and associated support
logic. The DDR3 in the FPGA is used to store input image data, output coordinates,
and feature descriptor vectors. A 32-bit AXI Central Interconnect module is used
to interface our custom IP module with the DDR3 via the ARM AMBA AXI4-Lite
protocol standard at an effective frequency of 111MHz.
The Zynq FPGA uses a bare-metal configuration, and one of two available ARM
Cortex-A9 CPUs was used for debugging and initialization purposes. The Zynq
FPGA monitors the address space of the DDR3 via the Memory Interface to read
and verify its contents while our design is running. Our custom IP module and
AXI interconnect module are located on the PL side. The memory interface which
directly connects the ARM CPUs to the DDR3 is located in the PS side. The
different image processing algorithms were simulated and implemented on Vivado
2015.2 IDE and run on the Zynq MicroZED board using Xilinx SDK. The GPU
and embedded CPU implementations were tested using the Ubuntu Linux for Tegra
distribution and OpenCV version 2.4.10. All of our implementations were tested
using an 800 × 480 Wide VGA (WVGA) image resolution, which is commonly used in
high-quality hand-held devices and in CMOS image sensors used in robotics.
Both the Zynq and Tegra SoCs utilize the same critical feature size (28nm), thus
making the architectures the primary differentiator in our experimental evaluation.
4.2.1 FPGA Architecture
The block diagram of the FPGA hardware is given in Figure 4.3. We rely on the
AXI4-Lite protocol to transfer data between our custom IP and the DDR3 where
our image input data and output memory reside. Depending on the algorithm under
evaluation, the custom IP module contains the Verilog implementation of FAST,
the integrated versions of FAST+BRIEF, FAST+BRISK or FAST+FREAK. In our
design, the Zynq processing system has the dual role of initiating the main system
clock and monitoring the contents of the DDR3. The custom IP module behaves as a
master and initiates memory mapped reads and writes to the DDR3, which behaves
as a slave in accordance to the AXI4-Lite protocol.
To initiate a memory read, our custom IP waits for the DDR3 to assert a ready
signal. This signal is not asserted every clock cycle; hence, to guarantee a continuous
flow of input data from the DDR3 to the custom IP, our block diagram, shown in
Figure 4.3, uses a 10-line word buffer to deliver buffered pixel data at a rate of
1 byte per clock cycle as a continuous stream input to FAST, FAST+BRIEF or
FAST+BRISK. The output coordinates and/or descriptor vectors produced by the
algorithms are then written to the DDR3 one word at a time.
FAST feature detection uses the Bresenham circle mask to traverse an image
Figure 4.3: Top-level block diagram for FPGA implementation with FAST feature detection and BRIEF/BRISK/FREAK feature description.
frame. Even though each mask operation uses 16 pixels, the whole mask size is
7×7, and the whole mask region must be provided to the FAST feature detection
computational logic to sustain a single pixel per cycle throughput, requiring a high
memory bandwidth. However, on an FPGA implementation, available logic elements
can be freely traded for additional storage or computation, allowing us to create line
buffers to reuse overlapping pixels for subsequent masks and effectively increasing
the bandwidth of our computational logic.
Each of the line buffers is used to store the contents of a single row of the input
image data using address-accessible 1-D register arrays. For a mask size of N ×N,
N line buffers are utilized. Each subsequent row of the input image overwrites the
contents of the oldest line buffer.
The image corresponding to the 7×7 mask size of the Bresenham circle is stored
in a 7×7 register array made up of shift registers. With each pixel read from the input
image, the bottom row of the register array is updated with the new data, shifting
all the other pixel values horizontally. Meanwhile all other rows of the register
array are updated reading the corresponding pixel value from the corresponding line
buffer. The matching of the line buffers with the row of the register array is done
by utilizing pointers that keep shifting whenever a new line buffer is being written.
This architecture enables us to reduce the memory bandwidth requirement of the
FPGA design to one pixel per cycle despite the size of the computational mask.
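A behavioral model of this line-buffer scheme, written in Python rather than the Verilog of the actual design, might look like the following (the class name and the zero-filled initialization are illustrative):

```python
from collections import deque

class LineBufferWindow:
    """Behavioral sketch of an N x N sliding window fed one pixel per
    cycle, backed by N line buffers that each hold one image row."""
    def __init__(self, n, width):
        self.n, self.width = n, width
        # each line buffer stores one full image row; oldest row is
        # recycled on every row change, as in the FPGA design
        self.rows = deque([[0] * width for _ in range(n)], maxlen=n)
        self.col = 0

    def push(self, pixel):
        """Feed the next raster-order pixel; return the current window
        columns (valid once n full rows have been streamed in)."""
        self.rows[-1][self.col] = pixel       # bottom row gets new data
        window = [row[max(0, self.col - self.n + 1):self.col + 1]
                  for row in self.rows]
        self.col += 1
        if self.col == self.width:            # row change: new line buffer
            self.col = 0
            self.rows.append([0] * self.width)
        return window
```

Despite feeding only one pixel per call, the model exposes a full N-wide window each cycle, mirroring how the line buffers raise the effective bandwidth of the computational logic.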
For the reading of the line buffers, we have applied a zigzag access pattern similar
to the implementation used for our block matching hardware presented in Chapter 3.
This approach allows continuous computation of the mask by utilizing the overlap of
pixels in a given image block even through row changes. The filtering starts from the
top left pixel location of the input I and proceeds to the following pixel on the same
row until the last possible location of this row is filtered. Then, the filter operation
continues with the last pixel location of the next row and proceeds with the previous
pixel until the first search location of this new row is parsed. Only as many pixels
as the width of the filter are required by the computation units in each cycle to
calculate the result of filtering regardless of its position in the image. This zigzag
flow is enabled by the use of computation units that are capable of shifting data up,
left and right, and also by the use of temporary registers that store the values for an
up shift of the image block.
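The zigzag traversal can be sketched as a coordinate generator. This is a simplified model (the name `zigzag_positions` is hypothetical, and the hardware shifts register contents rather than recomputing them): even rows are scanned left to right, odd rows right to left, so consecutive mask positions always differ by a single pixel, even across row changes.

```python
def zigzag_positions(height, width, mask):
    """Yield top-left mask coordinates in zigzag (boustrophedon) order."""
    rows = height - mask + 1
    cols = width - mask + 1
    for r in range(rows):
        # reverse the scan direction on every other row
        span = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        for c in span:
            yield (r, c)
```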
For each valid state of the 7×7 register array, the pixels that correspond to the
Bresenham circle around the center pixel are evaluated. An initial comparison of 4
pixels intersecting the axis of the circle (labeled as 1, 5, 9 and 13 in Figure 2.9) is
performed to cut down the total number of comparisons, as described in [56]. If this
pre-computation returns false, then there is no need to further compare the remain-
ing 12 pixel points of the Bresenham circle since there is no chance of obtaining a
continuous 12-pixel ring around the center. The reduction in the number of
comparisons does not affect the delay of the computation, since the pixels must be
registered in either scenario to sustain constant throughput; however, eliminating
redundant comparison operations and bit propagation reduces the dynamic power
dissipation of the circuit.
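The four-pixel pre-test can be sketched as follows, assuming the segment test of [56]: a corner requires 12 contiguous circle pixels all brighter or all darker than the center by a threshold t, so at least three of the four axis pixels must pass the same test. The function and argument names are hypothetical:

```python
def fast_pretest(center, axis_pixels, t):
    """Quick rejection test for FAST: for a 12-contiguous-pixel corner,
    at least three of the four axis pixels (positions 1, 5, 9, 13 on the
    Bresenham circle) must be brighter than center + t or darker than
    center - t; otherwise the remaining 12 comparisons can be skipped."""
    brighter = sum(p > center + t for p in axis_pixels)
    darker = sum(p < center - t for p in axis_pixels)
    return brighter >= 3 or darker >= 3
```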
The feature description in our Zynq FPGA system has been implemented and
evaluated with three different feature descriptor algorithms, BRIEF, BRISK and
FREAK. All three algorithms share the same basic framework presented in Fig-
ure 4.2. A region centered around a detected corner p needs to be described as a
binary string. Given a sampling pattern, the feature descriptor generates N sampling
pairs for each detected feature, compares the two elements of each pair, and encodes
each comparison as a binary 1 or 0. The resulting N-bit vector D is the feature
descriptor for that point, to be used for feature matching. Here N is equal to 512
for all of our descriptor implementations.
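The descriptor construction above reduces to pairwise intensity comparisons, and matching reduces to Hamming distance between bit vectors. A minimal software sketch (names are hypothetical; the actual sampling patterns are algorithm-specific):

```python
def binary_descriptor(patch, pairs):
    """Build a binary descriptor: bit i is 1 if the intensity at the first
    point of pair i exceeds that at the second point. `pairs` is a fixed
    list of ((y1, x1), (y2, x2)) coordinates within the sampling window."""
    bits = 0
    for (y1, x1), (y2, x2) in pairs:
        bits = (bits << 1) | (patch[y1][x1] > patch[y2][x2])
    return bits

def hamming_distance(a, b):
    """Descriptor matching cost: number of differing bits."""
    return bin(a ^ b).count("1")
```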
The descriptor algorithms are implemented as additional pipeline stages in the
FAST feature detection implementation. The number of line buffers used in FAST
is increased to cover 31 rows of input image data, and these buffers are utilized for both
the detection and description phases. Once the current pixel coordinate is computed
to be a feature, pairing samples are generated from a 31 × 31 sampling window
around its location, where the sampling window is stored in a 31×31 register array
and traversed in a zigzag manner using the technique presented with the block
matching design in Chapter 3.
The sampling pairs for descriptor implementations are fixed according to the
algorithm descriptions. Therefore, the pairs can be compared to generate the 512-bit
binary feature descriptor vector directly from the 31×31 register array. This register
array is updated with each new data read; however, feature description computation
only takes place after a pixel coordinate has already been identified as a feature.
Unlike BRIEF, the BRISK algorithm is a rotation-invariant feature descriptor:
it estimates the orientation of each detected feature from selected sampling pairs
and rotates the sampling pattern to neutralize the effect of rotation. The sampling
points are divided into short pairs and long pairs based on their distance from each
other; long pairs are used to determine orientation, and short pairs are used for the
intensity comparisons that build the descriptor. For BRIEF, a preliminary smoothing
operation is applied over the image before the sampling pairs are chosen and
compared; here, the smoothing is applied over the 31×31 sampling window
for each detected feature.
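A simplified sketch of the long-pair orientation estimate follows. This is illustrative only; BRISK additionally Gaussian-smooths each sample point, and the names below are hypothetical:

```python
import math

def estimate_orientation(patch, long_pairs):
    """Average the local intensity gradient over the long sampling pairs;
    the sampling pattern would then be rotated by -angle before the short
    pairs are compared. Simplified sketch of the BRISK-style estimate."""
    gx = gy = 0.0
    for (y1, x1), (y2, x2) in long_pairs:
        dy, dx = y1 - y2, x1 - x2
        norm = math.hypot(dx, dy)
        # intensity difference normalized by squared pair distance
        g = (patch[y1][x1] - patch[y2][x2]) / (norm * norm)
        gx += g * dx
        gy += g * dy
    return math.atan2(gy, gx)
```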
4.2.2 GPU Architecture
The Tegra K1 SoC combines four ARM Cortex A15 CPU cores with a Kepler class
GPU with 192 CUDA cores on the same die. The Kepler GPU has a separate 64kB
on-chip L1 cache and a 128kB on-chip L2 cache, while the CPUs and GPU share
2GB of off-chip memory. GPU programming with the Tegra K1 is dominated by two
factors common to all GPUs. The first is the highly parallel SIMD nature of GPU
programming. The second is the unique memory model.
The highly parallel nature of GPU programming makes large amounts of branching
and control logic expensive to implement. GPU threads on the K1 are organized into
blocks that can access global GPU memory or a memory space shared only within
the thread block. Shared memory on the K1 is implemented with the 64kB L1 cache
so most memory accesses for the kernels under study will be to the 2GB off-chip
global memory space.
We implement our GPU kernels using version 2.2.10 of the OpenCV computer
vision library and NVIDIA’s CUDA API. The implementations of FAST feature
detector and BRIEF and BRISK feature descriptors for both the CPU and GPU
OpenCV library are publicly available and highly optimized by the community to be
used for performance critical applications. We have excluded the analysis of the
FREAK descriptor on the GPU due to the lack of an optimized GPU implementation.
The FAST feature detector kernel can be mapped almost entirely to the Kepler
GPU. This kernel contains a modest amount of branches and no loops. However,
the feature descriptor implementations of BRISK, and BRIEF have large sections of
initialization code which does not map well to a GPU architecture because of the
non-uniform access pattern to the image data. These code sections are executed on
a single ARM core, limiting the achievable speedup. It is possible that some initial-
ization code can be mapped to other CPU cores, thus achieving a small additional
speedup. However, the additional speedup was judged to be too modest to warrant
inclusion in our analysis.
4.3 Results
The run-time, power, and energy comparisons of the FAST feature detector with
and without the BRIEF/BRISK feature description algorithms on various embedded
systems are given in Figure 4.5. As mentioned earlier, both the CPU and GPU
implementations were run on the Jetson TK1 development board. The Jetson contains
a 192-core NVIDIA Kepler GK20a GPU, which can process up to 300 gigaflops, and
a quad-core ARM Cortex A15 CPU running at 2.3GHz. The FPGA implementation
results were taken from the MicroZED development board running at 111 MHz. For
all platforms, we measured power by intercepting the current between the
power supply and the system with a 1 mΩ shunt resistor. The voltage across the
resistor is measured with a multimeter to calculate the total power of each system.
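The power computation reduces to Ohm's law. A minimal sketch with illustrative (not measured) values:

```python
def system_power(v_shunt, v_supply, r_shunt=0.001):
    """Power computation for the shunt-resistor setup described above:
    the current drawn is I = V_shunt / R_shunt, and total system power
    is P = V_supply * I. The example values below are illustrative."""
    current = v_shunt / r_shunt
    return v_supply * current

# e.g. 2.3 mV across the 1 mOhm shunt at a 5 V supply -> 2.3 A, 11.5 W
```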
As a sliding window operation, FAST has a highly predictable data access pattern
traversing each input image frame. On the other hand, feature description algorithms
traverse a list of detected features and access the sampling window around
each feature coordinate, resulting in an irregular access pattern. This irregularity
impacts the run-time of the CPU the most, whereas the GPU architecture can handle
[Chart: percentage of GPU issue stalls by cause (Execution Dependency, Pipe Busy, Memory Throttle, Not Selected) for the FAST feature detection and BRIEF feature description kernels.]

Figure 4.4: Issue Stall Reasons for FAST and BRIEF implementations on GPU
irregular accesses to 2-D data relatively well. For the FPGA implementation, the
flexibility to trade-off resources for performance allows us to fully pipeline detection
and description computation, completely eliminating the need for additional memory
accesses thereby retaining the same throughput and thus very similar run-times.
We have used the NVidia Visual Profiler [18] to identify the performance bottle-
necks for our GPU implementations and analyze the underlying reasons for execution
stalls. Figure 4.4 displays the source of execution stalls for both our FAST (detec-
tion) and BRIEF (description) implementations. The dominant causes of execution
stalls include Execution Dependency, Pipe Busy, Memory Throttle and Not Selected.
Execution dependency stalls occur when an instruction is waiting for at least one of
its inputs to be computed by earlier instructions. Pipe busy stalls are observed when
the computation unit required for the instruction is not available. Memory throttle
stalls occur when more memory requests are issued than the load/store unit can
accommodate in time. Not selected stalls occur when the warp scheduler gives
priority to another kernel over the computation kernel and does not allocate the
required resources.
We observe that the FAST feature detection algorithm is heavily stalled due to
Table 4.1: Instruction count comparison between GPU and FPGA implementations

                    Load/Store   Floating Point and Integer    Control
FAST FPGA               384000                      2688000   12288000
FAST+BRIEF FPGA         384000                      3456000   17280000
FAST GPU                304629                     46310713      12000
FAST+BRIEF GPU         2560475                     64262713    1113277
execution dependencies compared to the description computation. On the other hand
the description computation is heavily stalled due to memory throttle. This shows
us that the FAST feature detection algorithm could benefit greatly from further
instruction level parallelism whereas the BRIEF feature descriptor algorithm would
benefit from better data management.
The total instruction count for our GPU and FPGA implementations are given
in Table 4.1. As before, the GPU instruction counts were taken from the NVIDIA
visual profiler tool. The corresponding instruction counts for the FPGA were esti-
mated from the design architecture and system level simulation. We estimate the
Load/Store instruction count as the number of accesses to the DDR3 and disregard
the accesses to the internal register buffers. The number of integer instructions is
estimated based on the total number of cycle operations the architecture takes when
all the pipeline stages are unrolled: each cycle of operation that a single feature
candidate or descriptor goes through is counted as an instruction. There can be
multiple instructions in flight in different pipeline stages, giving an instruction count
metric comparable to that of a CPU or GPU. The control instructions were estimated
via a code-level analysis of branch instructions. As with the cycle-based estimate,
they are obtained by multiplying the number of branches per mask operation by the
total number of mask operations applied over an image; each pipeline stage is thus
treated as a separate MIMD instruction.
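The estimation method above amounts to simple products per frame. The sketch below uses hypothetical parameter names; for instance, 384,000 mask operations with one DDR3 access, 7 unrolled pipeline stages, and 32 branches per mask would reproduce the FAST FPGA row of Table 4.1, but those per-mask values are inferred here, not stated in the text:

```python
def estimate_fpga_instructions(masks_per_frame, loads_per_mask,
                               pipeline_stages, branches_per_mask):
    """Sketch of the FPGA instruction-count estimate described above:
    loads count external DDR3 accesses only; the integer count treats each
    unrolled pipeline stage a candidate passes through as one instruction;
    control counts branches per mask times the number of mask operations."""
    return {
        "load_store": masks_per_frame * loads_per_mask,
        "integer": masks_per_frame * pipeline_stages,
        "control": masks_per_frame * branches_per_mask,
    }
```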
[Figure 4.5 charts: energy (mJ), run-time (ms), and power (W) for FAST alone and FAST+BRIEF/BRISK/FREAK on an Intel i7 CPU, the embedded ARM CPU and Tegra GPU on Jetson, and the Zynq FPGA.]

Figure 4.5: Run-time, power, and energy results for FAST feature detection and
BRIEF/BRISK/FREAK feature description algorithms over various embedded systems.

We observe that the feature detection kernel (i.e., FAST) corresponds to the
majority of the floating point and integer instructions, making up over 70% of the total
instructions issued. On the other hand, the number of load/store instructions is
significantly higher for the feature description kernel (i.e., BRIEF), specifically for
the GPU. For the FPGA implementation, the image is read from the external mem-
ory only once for both detection and description computations into our line buffers,
drastically reducing the number of memory accesses. Thereafter the corresponding
pixel data is propagated through our computation logic until it is discarded.
As seen in Figure 4.5, the power consumption of the feature description algorithms
is very similar for our embedded CPU and GPU implementations, whereas FAST
feature detection consumes significantly less power on the GPU. The sliding window
operation of the detection part is efficiently distributed to the low-power cores of the
GPU, whereas feature description suffers from the bottleneck of irregular memory
accesses in terms of power consumption.
The resource utilization of our FPGA implementations is reported in Table 4.2.
The additional description logic in the FPGA implementation is readily available
to compute the descriptor at each pixel coordinate; however, unnecessary bit
propagation is eliminated by registering the inputs in order to minimize the power
cost of the extra circuitry.
Table 4.2: Resource utilization on the Zynq FPGA

                 LUTs     FFs   BRAMs
FAST             4564    1551       8
FAST+BRIEF      14398    2093      11
FAST+BRISK      25575    7115      11
FAST+FREAK      28684    7935      11
The run-time performances of the GPU and FPGA are more comparable. However,
the highly customizable architecture of the FPGA lends itself much better to the
optimization of the FAST+BRIEF and FAST+BRISK implementations. When FAST
feature detection is pipelined with description, our results show that the GPU lags
behind in performance due to a lack of MIMD capabilities. Combined with the lower
power dissipation overhead of the FPGA boards, we see a clear advantage of
FPGAs over the other platforms in terms of energy consumption, with measurements
of 705mJ, 174mJ, and 16mJ for feature detection and description of WGA
size frames on embedded CPUs, GPUs, and FPGAs respectively, giving the FPGA
a 98% advantage over the CPU implementation and a 90% advantage over the GPU
implementation.
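The quoted percentage advantages follow directly from these per-frame energy measurements:

```python
# advantage = 1 - E_fpga / E_other, using the per-frame energies above
e_cpu, e_gpu, e_fpga = 705.0, 174.0, 16.0   # mJ per frame (from the text)

adv_cpu = 1 - e_fpga / e_cpu   # ~0.977, the quoted 98% advantage over the CPU
adv_gpu = 1 - e_fpga / e_gpu   # ~0.908, the quoted 90% advantage over the GPU
```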
4.4 Summary and Discussion
In this chapter we presented a comparative analysis of the FAST feature detection
algorithm along with BRIEF and BRISK feature description algorithms on vari-
ous embedded systems (embedded CPUs, GPUs and FPGAs) in terms of run-time
performance, power, and energy. We determined that the utilization of hardware-
oriented and power-aware design decisions such as deep pipelining, continuous filter
flow, and pre-computation steps allows high-throughput FPGA implementations to
outperform state-of-the-art embedded CPUs and GPUs in terms of both power and
performance. We show that despite the high level of parallelization GPUs provide,
computation of multiple kernels is highly bounded by the kernel scheduler and memory
bottlenecks, whereas the layered customization of FPGAs can tackle the operation of
multiple kernels much more efficiently. We have shown that the initial profiling of
GPU implementations can allow the designers to identify bottlenecks in a design and
deduce whether these bottlenecks can yield performance gains with the custom hardware
programmability of FPGAs. This analysis constitutes a first step toward high perfor-
mance computer vision based embedded systems. Future work will build upon these
results by integrating real-time image sensor data and adding additional hardware
accelerated kernels such as those necessary for autonomous navigation and mapping
applications.
Chapter 5
Summary of Dissertation and
Possible Future Extensions
With the rising complexity of image processing and computer vision applications,
more pressure is being placed on designing architectures that can effectively deal
with their high throughput requirements. Additionally, the use of such systems in
highly resource-constrained mobile environments makes the trade-offs between design
constraints such as area and power even more impactful.
In this thesis, we have investigated different hardware accelerator platforms
specifically targeted for real-time image processing applications and explored how
smart algorithmic and architectural choices can lead to optimal designs that meet
specific constraints in area, power, or performance. We explored different embedded
systems and accelerated various image processing algorithms in order to demonstrate
the impact of various design choices.
5.1 Summary of Results
In Chapter 3 we explored techniques for fast design space exploration and multi-
objective design optimization for FPGA-based accelerators using two different image
processing applications. We utilized the reconfigurability of FPGAs to implement a small
fraction of a large design space and applied regression analysis to obtain analytical
information and formulate scalable models to predict various design metrics such as
power, arithmetic accuracy, performance, and area.
As our first case study, we implemented an image deblurring accelerator for
FPGAs. For the accelerator design, the proposed models predict the implementation
metrics within 8% of measured power values, within 10% of the output arithmetic
accuracy, and within 3% of actual FPGA resources used. We have used a full-
search block matching algorithm as a secondary test case and introduced system
throughput as an additional variable dependent on the design parameters. This
second accelerator design confirmed our findings about the benefits of the L1-based
modeling methodology. Our predictions were within 10%, 4%, 6%, and 4% of the
measured power, area, arithmetic accuracy, and throughput, respectively. Using
these predictions, we accelerated the design space exploration process by 340× for
the image deblurring test case and by 90× for the block matching test case. We
have also explored finding the optimal design parameters under various objectives
and constraints.
The work presented in Chapter 3 led us to expand our design parameters beyond
the algorithmic and architectural design choices and to treat the embedded system
itself as a design parameter. In Chapter 4, we presented a
comprehensive comparison between embedded CPU, GPU and FPGA implementa-
tions of FAST feature detection and BRIEF, BRISK, and FREAK feature description
algorithms, evaluating their power and performance trade-offs while exploring the
architectural advantages and limitations of the acceleration platforms. We deter-
mined that the utilization of hardware-oriented and power-aware design decisions
such as deep pipelining, continuous filter flow, and pre-computation steps allows high-
throughput FPGA implementations to outperform state-of-the-art embedded CPUs
and GPUs in terms of both power and performance. We showed that despite the
high level of parallelization GPUs provide, computation of multiple kernels is highly
bounded by kernel scheduler and memory bottlenecks, whereas the layered customization
of FPGAs can tackle the operation of multiple kernels much more efficiently. We
have shown that the initial profiling of GPU implementations can allow the design-
ers to identify bottlenecks in a design and deduce whether these bottlenecks can be
reduced with the custom hardware programmability of FPGAs.
5.2 Future Work
The work presented in this dissertation can be extended in a few directions. Our
exploration and optimization methodology presented in Chapter 3 can be extended
to ASICs, which would open up a wider range of architectural design decisions.
Our methodology should work equally well or even better for ASICs, as they enable
greater customization of the hardware designs. By combining the modeling
presented in Chapter 3, and the embedded system exploration in Chapter 4, a more
methodological approach to select embedded systems can be explored. The embed-
ded system can be used as a tertiary design choice in addition to the algorithmic
and architectural design options presented, which would also enable a wider range
of architectural design decisions to be explored for a wider range of embedded sys-
tems. Another direction would be to combine the approaches from Chapters 3
and 4 by training our regression-based design space exploration methodology
on the software-driven and instruction-level analysis presented in Chapter 4
and making predictions for hardware accelerators across a range of design decisions.
We have made heavy use of the designer’s knowledge of the algorithm and the
various architectural platforms to perform our design space explorations. In the
future, standardizing the architectural blocks to the image processing domain and
building a larger variety of algorithms using these standardized blocks would enable
our approach to make even earlier predictions over a wider variety of parameters.
With a heavier use of standardized blocks, an automated parameter selection process
could be devised to further reduce the need for user input in our L1 regularization
methodology. Reducing the dependency on the designer’s own knowledge before
making design level decisions would also accelerate the prototyping of such algorithms
for real-time use.
Finally, our work presented in Chapter 3 can also be expanded to devise separate
regression models for different parts of the design. For instance, if the control and
data sections of a design each were mapped to separate regression models, this may
lead to higher precision for predicting the various design metrics.
The analysis presented in Chapter 4 constitutes a first step toward high perfor-
mance computer vision based embedded systems. Future work could build upon
these results by considering additional hardware accelerated kernels such as those
necessary for autonomous navigation and mapping applications and exploring a large
system to be used on multiple embedded platforms combining the advantages pro-
vided by them.
Bibliography
[1] Nabil Abdelli, A-M Fouilliart, Nathalie Julien, and Eric Senn. High-level power estimation of FPGA. In 2007 IEEE International Symposium on Industrial Electronics, pages 925–930. Institute of Electrical & Electronics Engineers (IEEE), June 2007.
[2] Giovanni Agosta, Gianluca Palermo, and Cristina Silvano. Multi-objective co-exploration of source code transformations and design space architectures for low-power embedded systems. In Proceedings of the 2004 ACM Symposium on Applied Computing, SAC '04, pages 891–896, New York, NY, USA, 2004. ACM.
[3] Alexandre Alahi, Raphael Ortiz, and Pierre Vandergheynst. FREAK: Fast retina keypoint. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 510–517. IEEE, 2012.
[4] Arizona State University. YUV test sequences. http://trace.eas.asu.edu/yuv/index.html.
[5] Giuseppe Ascia, Vincenzo Catania, and Maurizio Palesi. A multiobjective genetic approach for system-level exploration in parameterized systems-on-a-chip. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 24(4):635–645, April 2005.
[6] Luna Backes, Alejandro Rico, and Bjorn Franke. Experiences in speeding up computer vision applications on mobile computing platforms. In Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), 2015 International Conference on, pages 1–8. Institute of Electrical & Electronics Engineers (IEEE), July 2015.
[7] Xuan-Quang Banh and Yap-Peng Tan. Adaptive dual-cross search algorithm for block-matching motion estimation. IEEE Transactions on Consumer Electronics, 50(2):766–775, May 2004.
[8] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. In Computer Vision–ECCV, pages 404–417. Springer, 2006.
[9] Jan Biemond, Reginald L. Lagendijk, and Russell M. Mersereau. Iterative methods for image deblurring. Proceedings of the IEEE, 78(5):856–883, May 1990.
[10] Dimitris Bouris, Antonis Nikitakis, and Ioannis Papaefstathiou. Fast and efficient FPGA-based feature detection employing the SURF algorithm. In Field-Programmable Custom Computing Machines (FCCM), pages 3–10. Institute of Electrical & Electronics Engineers (IEEE), May 2010.
[11] H. Richard Byrd, Charles Jean Gilbert, and Jorge Nocedal. A trust region method based on interior point techniques for nonlinear programming. Mathematical Programming, 89(1):149–185, 2000.
[12] Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua. BRIEF: Binary robust independent elementary features. Computer Vision–ECCV 2010, pages 778–792, 2010.
[13] Antonio Canclini, Matteo Cesana, Alessandro Redondi, Marco Tagliasacchi, Joao Ascenso, and Rodrigo Cilla. Evaluation of low-complexity visual feature detectors and descriptors. In Digital Signal Processing (DSP), 2013 18th International Conference on, pages 1–7, July 2013.
[14] Hua-Yu Chang, Iris Hui-Ru Jiang, H. Peter Hofstee, Damir Jamsek, and Gi-Joon Nam. Feature detection for image analytics via FPGA acceleration. IBM Journal of Research and Development, 59(2/3):8:1–8:10, March 2015.
[15] Shuai Che, Jie Li, Jeremy W. Sheaffer, Kevin Skadron, and John Lach. Accelerating compute-intensive applications with GPUs and FPGAs. In Application Specific Processors, 2008. SASP 2008. Symposium on, pages 101–107. Institute of Electrical & Electronics Engineers (IEEE), June 2008.
[16] Deming Chen, Jason Cong, Yiping Fan, and Zhiru Zhang. High-level power estimation and low-power design space exploration for FPGAs. In Proceedings of the 2007 Asia and South Pacific Design Automation Conference, ASP-DAC '07, pages 529–534, Washington, DC, USA, 2007. IEEE Computer Society.
[17] Nico Cornelis and Luc Van Gool. Fast scale invariant feature detection and matching on programmable graphics hardware. In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW '08. IEEE Computer Society Conference on, pages 1–8. Institute of Electrical & Electronics Engineers (IEEE), June 2008.
[18] NVIDIA Corporation. NVIDIA visual profiler. https://developer.nvidia.com/nvidia-visual-profiler.
[19] Piotr Czyzak and Andrzej Jaszkiewicz. Pareto simulated annealing - a metaheuristic technique for multiple-objective combinatorial optimization. Journal of Multi-Criteria Decision Analysis, 7(1):34–47, 1998.
[20] Joydip Das, Steven J. E. Wilton, Philip Leong, and Wayne Luk. Modeling post-techmapping and post-clustering FPGA circuit depth. In Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on, pages 205–211, Aug. 31–Sept. 2, 2009.
[21] Vijay Degalahal and Tim Tuan. Methodology for high level estimation of FPGA power consumption. In Proceedings of the 2005 Asia and South Pacific Design Automation Conference, ASP-DAC '05, pages 657–660, New York, NY, USA, 2005. ACM.
[22] Robert H. Dennard, Fritz H. Gaensslen, Hwa-Nien Yu, V. Leo Rideout, Ernest Bassous, and Andre R. Leblanc. Design of ion-implanted MOSFET's with very small physical dimensions. Proceedings of the IEEE, 87(4):668–678, April 1999.
[23] Michal Fularz, Marek Kraft, Adam Schmidt, and Andrzej Kasinski. A high performance FPGA based image feature detector and matcher based on the FAST and BRIEF algorithms. International Journal of Advanced Robotic Systems, 12(141), 2015.
[24] Tony Givargis and Frank Vahid. Platune: a tuning framework for system-on-a-chip platforms. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 21(11):1317–1327, Nov 2002.
[25] Tony Givargis, Frank Vahid, and Jorg Henkel. System-level exploration for Pareto-optimal configurations in parameterized systems-on-a-chip. In Computer Aided Design, 2001. ICCAD 2001. IEEE/ACM International Conference on, pages 25–30, 2001.
[26] Mark D. Hill. 21st century computer architecture: A community white paper. Technical report, The ACM Special Interest Group on Computer Architecture, 2012.
[27] Ali Irturk, Bridget Benson, Shahnam Mirzaei, and Ryan Kastner. GUSTO: An automatic generation and optimization tool for matrix inversion architectures. ACM Trans. Embed. Comput. Syst., 9:32:1–32:21, April 2010.
[28] Dongsuk Jeon, M.B. Henry, Yejoong Kim, Inhee Lee, Zhengya Zhang, D. Blaauw, and D. Sylvester. An energy efficient full-frame feature extraction accelerator with shift-latch FIFO in 28 nm CMOS. Solid-State Circuits, IEEE Journal of, 49(5):1271–1284, May 2014.
[29] Tianyi Jiang, Xiaoyong Tang, and Prith Banerjee. Macro-models for high level area and power estimation on FPGAs. In Proceedings of the 14th ACM Great Lakes Symposium on VLSI, GLSVLSI '04, pages 162–165, New York, NY, USA, 2004. ACM.
[30] Nasser Kehtarnavaz and Mark Noel Gamadia. Real-time Image and Video Processing: From Research to Reality. Morgan & Claypool Publishers, 2006.
[31] Branislav Kisacanin, Shuvra S. Bhattacharyya, and Sek Chai. Embedded Computer Vision. Springer London, 2009.
[32] Ron Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI'95, pages 1137–1143, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.
[33] Marek Kraft, Adam Schmidt, and Andrzej J Kasinski. High-speed image feature detection using FPGA implementation of FAST algorithm. VISAPP (1), 8:174–9, 2008.
[34] Murali E. Krishnan, E. Gangadharan, and Nirmal P. Kumar. H.264 motion estimation and applications. Technical report, InTech, 2012.
[35] A. V. Kulkarni, J. S. Jagtap, and V. K. Harpale. Object recognition with ORB and its implementation on FPGA. International Journal of Advanced Computer Research, 3(3):164–169, 2013.
[36] Tadahiro Kuroda. CMOS design challenges to power wall. In Microprocesses and Nanotechnology Conference, 2001 International, pages 6–7, Oct 2001.
[37] Benjamin C. Lee and David M. Brooks. Accurate and efficient regression modeling for microarchitectural performance and power prediction. SIGOPS Oper. Syst. Rev., 40:185–194, October 2006.
[38] Kwang-Yeob Lee. A design of an optimized ORB accelerator for real-time feature detection. International Journal of Control & Automation, 7(3), 2014.
[39] Stefan Leutenegger, Margarita Chli, and Roland Y Siegwart. BRISK: Binary robust invariant scalable keypoints. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2548–2555. IEEE, 2011.
[40] Reoxiang Li, Bing Zeng, and M. L. Liou. A new three-step search algorithm for block motion estimation. IEEE Transactions on Circuits and Systems for Video Technology, 4(4):438–442, Aug 1994.
[41] Lei Liang and Yuanchang Xu. Adaptive Landweber method to deblur images. IEEE Signal Processing Letters, 10(5):129–132, May 2003.
[42] Alexander Ling, Dhirendra Pratap Singh, and Stephen D. Brown. FPGA technology mapping: a study of optimality. In Proceedings. 42nd Design Automation Conference, 2005, pages 427–432, June 2005.
[43] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, Nov 2004.
[44] Ondrej Miksik and Krystian Mikolajczyk. Evaluation of local detectors and descriptors for fast feature matching. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 2681–2684, Nov 2012.
[45] Akihiko Miyoshi, Charles Lefurgy, Eric Van Hensbergen, Ram Rajamony, and Raj Rajkumar. Critical power slope: Understanding the runtime effects of frequency scaling. In Proceedings of the 16th International Conference on Supercomputing, ICS '02, pages 35–44, New York, NY, USA, 2002. ACM.
[46] Douglas C. Montgomery. Design and Analysis of Experiments. Wiley, 2012.
[47] Gordon E. Moore. Cramming more components onto integrated circuits. Proceedings of the IEEE, 86(1):82–85, Jan 1998.
[48] D. Nagamalai, E. Renault, and M. Dhanuskodi. Advances in Digital Image Processing and Information Technology: First International Conference on Digital Image Processing and Pattern Recognition, DPPR 2011, Tirunelveli, Tamil Nadu, India, September 23-25, 2011, Proceedings. Communications in Computer and Information Science. Springer Berlin Heidelberg, 2011.
[49] Kumud Nepal. New Directions for Design-Space Exploration of Low-Power Hardware Accelerators. PhD thesis, Brown University, 2015.
[50] David Nister, Oleg Naroditsky, and James Bergen. Visual odometry. In Com-puter Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the2004 IEEE Computer Society Conference on, volume 1, pages I–652–I–659 Vol.1,June 2004.
[51] Gianluca Palermo, Cristina Silvano, and Vittorio Zaccaria. Multi-objective design space exploration of embedded systems. J. Embedded Comput., 1(3):305–316, August 2005.
[52] Giuseppe Ascia, Vincenzo Catania, and Maurizio Palesi. A framework for design space exploration of parameterized VLSI systems. In Proceedings of the 2002 Asia and South Pacific Design Automation Conference, ASP-DAC '02, pages 245–250, Washington, DC, USA, 2002. IEEE Computer Society.
[53] Rajat Phull, Pradip Mainali, Qiong Yang, Patrice Rondao Alface, and Henk Sips. Low complexity corner detector using CUDA for multimedia applications. MMEDIA 2011, 2011.
[54] Ab Al-Hadi Ab Rahman, R. Thavot, M. Mattavelli, and P. Faure. Hardware and software synthesis of image filters from CAL dataflow specification. In Ph.D. Research in Microelectronics and Electronics (PRIME), 2010 Conference on, pages 1–4, July 2010.
[55] Iain E. Richardson. H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia. Wiley, 2003.
[56] Edward Rosten and Tom Drummond. Fusing points and lines for high performance tracking. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 2, pages 1508–1515, Oct 2005.
[57] Edward Rosten and Tom Drummond. Machine learning for high-speed corner detection. In Proceedings of the 9th European Conference on Computer Vision - Volume Part I, ECCV'06, pages 430–443, Berlin, Heidelberg, 2006. Springer-Verlag.
[58] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In Computer Vision (ICCV), 2011 IEEE International Conference on, ICCV '11, pages 2564–2571, Washington, DC, USA, Nov 2011. IEEE Computer Society.
[59] Maria Santamaria and Maria Trujillo. A comparison of block-matching motion estimation algorithms. In Computing Congress (CCC), 2012 7th Colombian, pages 1–6, Oct 2012. IEEE.
[60] Michael Schaeferling and Gundolf Kiefer. Object recognition on a chip: A complete SURF-based system on a single FPGA. In Reconfigurable Computing and FPGAs (ReConFig), 2011 International Conference on, pages 49–54, Nov 2011. IEEE.
[61] Benjamin Carrion Schafer and Kazutoshi Wakabayashi. Machine learning predictive modelling high-level synthesis design space exploration. Computers Digital Techniques, IET, 6(3):153–159, May 2012.
[62] David Sheldon and Frank Vahid. Making good points: application-specific pareto-point generation for design space exploration using statistical methods. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA '09, pages 123–132, New York, NY, USA, 2009. ACM.
[63] Lee Chee Sing and Ha Yajun. Design space exploration for arbitrary FPGA architectures. In Proceedings of the Second International Conference on Embedded Software and Systems, ICESS '05, pages 269–275, Washington, DC, USA, 2005. IEEE Computer Society.
[64] Alastair M. Smith, Steven J.E. Wilton, and Joydip Das. Wirelength modeling for homogeneous and heterogeneous FPGA architectural development. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA '09, pages 181–190, New York, NY, USA, 2009. ACM.
[65] Stephen M. Smith and J. Michael Brady. SUSAN - a new approach to low level image processing. International Journal of Computer Vision, 23(1):45–78, 1997.
[66] Byoungro So, Mary W. Hall, and Pedro C. Diniz. A compiler approach to fast hardware design space exploration in FPGA-based systems. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, PLDI '02, pages 165–176, New York, NY, USA, 2002. ACM.
[67] Jan Svab, Tomas Krajnik, Jan Faigl, and Libor Preucil. FPGA based speeded up robust features. In Technologies for Practical Robot Applications, 2009. TePRA 2009. IEEE International Conference on, pages 35–41, Nov 2009. IEEE.
[68] Kuen Hung Tsoi and Wayne Luk. Power profiling and optimization for heterogeneous multi-core systems. SIGARCH Comput. Archit. News, 39:8–13, December 2011.
[69] Onur Ulusel, Kumud Nepal, R. Iris Bahar, and Sherief Reda. Fast design exploration for performance, power and accuracy tradeoffs in FPGA-based accelerators. ACM Trans. Reconfigurable Technol. Syst., 7(1):4:1–4:22, February 2014.
[70] Gooitzen van der Wal, David Zhang, Indu Kandaswamy, James Marakowitz, Kevin Kaighn, Joe Zhang, and Sek Chai. FPGA acceleration for feature based processing applications. In 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 42–47, June 2015. IEEE.
[71] Rick Weber, Akila Gothandaraman, Robert J. Hinde, and Gregory D. Peterson. Comparing hardware accelerators in scientific applications: A case study. IEEE Transactions on Parallel and Distributed Systems, 22(1):58–68, Jan 2011.
[72] Wikipedia. Pareto efficiency, 2016. [Online; accessed 31-May-2016].
[73] Hongtao Xie, Ke Gao, Yongdong Zhang, Jintao Li, and Yizhi Liu. GPU-based fast scale invariant interest point detector. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pages 2494–2497, March 2010.
[74] Xilinx. ML605 Hardware User Guide, 2011.
[75] Lu Yuan, Jian Sun, Long Quan, and Heung-Yeung Shum. Image deblurring with blurred/noisy image pairs. ACM Trans. Graph., 26(3), July 2007.
[76] Anatoly A. Zhigljavsky. Theory of Global Random Search, volume 65 of Mathematics and its Applications. Springer Netherlands, 1 edition, 1991.
[77] Ce Zhu, Xiao Lin, and Lap-Pui Chau. Hexagon-based search pattern for fast block motion estimation. IEEE Transactions on Circuits and Systems for Video Technology, 12(5):349–355, May 2002.
[78] Shan Zhu and Kai-Kuang Ma. A new diamond search algorithm for fast block-matching motion estimation. IEEE Transactions on Image Processing, 9(2):287–290, Feb 2000.
[79] Xiang Zhu and Peyman Milanfar. Restoration for weakly blurred and strongly noisy images. In Applications of Computer Vision (WACV), 2011 IEEE Workshop on, pages 103–109, Jan 2011. IEEE.