EECS 222A: SYSTEM-ON-CHIP DESCRIPTION AND MODELING

Modeling of a Canny Edge Detector System-on-Chip for a Digital Camera

Vivekanand Veeracholan 17292864

06/13/2012

Upload: surya-vaitheeswaran

Post on 25-Oct-2014

51 views

Category:

Documents


2 download

TRANSCRIPT



ABSTRACT

In image processing, when we want to extract an object of interest from the rest of the image data, the first stage is detecting the edges of all the objects in the image and then filtering the objects by the required features. Edge detection is therefore a necessary step in image processing, and it is normally performed on a computer. In this project we designed a customized system for edge detection to be embedded in digital cameras. The algorithm we chose is the Canny edge detector, which is simple and easy to implement. We designed the entire system in the SpecC language, and simulation gave good results.


CONTENTS

1. Introduction
   a. System Level Modeling
   b. System Level Description Languages
2. Case Study on a Canny Edge Detector SoC
   a. Canny Application Reference C Code
   b. System Level Model in SpecC
   c. Estimation, Optimization and Refinement using SCE
3. Conclusion
4. References


1. INTRODUCTION

In designing a system we must make many decisions, and the two main ones, which affect all the others, are the choice of the model of computation and the selection of the description language. These two choices determine the design flow and the tools required, and they depend on the nature of the system being developed. The following subsections discuss system level modeling and system level description languages.

a. SYSTEM LEVEL MODELING

At the beginning of any project, the only thing in hand is a statement of what the black box we are going to design must do. These requirements lead to a functional description of the system: the specification model, which is the first model of the project and the highest level of abstraction. During the design process the level of abstraction decreases as more detail is added, until we reach a level that can be synthesized. The following pyramid shows the relation between the level of abstraction, the number of components, and the accuracy of the design.

At the highest level of abstraction the number of components we work with is very small compared to the lowest level, which is the result of adding more and more detail and requirements to the design. Accuracy also improves as we move toward lower levels of abstraction, because we can individually specify the behavior of the components that perform the most basic operations. For example, at the transistor level of abstraction we know the width and length of each transistor, which lets us predict the exact timing of the gates and eventually the timing of the entire system.

The following figure illustrates the models produced as we move toward lower levels of abstraction. We start with the requirements and the specification model, a purely functional description. We then add detail: introducing the different processing elements gives the architecture model, deciding on the on-chip communication network gives the communication model, and choosing the implementation technology and the RTL gives the implementation model. In systems with a large number of tasks, the scheduling of those tasks plays an important role in the efficiency of the system.


Figure 3 gives an overall view of the SoC design flow.

b. System Level Description Languages

Once we have the computation model of the system, we must choose a language that captures that computational model correctly and lets us achieve the desired result. Over the years a number of languages have helped, and still help, with this process.

The goals and requirements of these languages are:

Formality

Executability

Synthesizability

Modularity

Completeness

Orthogonality

Simplicity


A few of these languages are:

C

o Good for functional representation; cannot be used for hardware-level modeling.

C++

o Same as C, with the additional feature of exception handling.

Java

o Like C++, with features for concurrency and synchronization.

VHDL

o A hardware description language with almost all the features required to synthesize hardware with structural hierarchy.

Verilog

o Another hardware description language, comparable to VHDL.

SpecC

o Well suited to capturing a system; it has the features that are missing from the languages above.

SystemC

o More a library for C++ than a language; it also has all the features of SpecC.

The following figure shows the capability of each language in the context of system level modeling.


2. Case Study on a Canny Edge Detector SoC

To understand the SoC design flow, we decided to design an SoC implementing the Canny edge detection algorithm for a digital camera and to simulate and analyze it. Edge detection is a very important part of image processing algorithms related to object recognition and machine vision.

Canny Edge Detector

The Canny edge detector is an edge detection operator that uses a multi-stage algorithm to detect a wide range of edges in images. The algorithm is close to optimal: it has good detection, localization, and response. Its main stages are:

Noise reduction using Gaussian smoothing

Finding the intensity gradient of the image

Non-maximum suppression

Tracing edges through the image with hysteresis thresholding

a. Canny Application Reference C Code & Porting to SpecC

To start, we downloaded existing code from the Internet, written by Mike Heath of the University of South Florida. The code is spread across three source files, "canny_edge.c", "hysteresis.c", and "pgm_io.c", and supports both ".pgm" and ".ppm" images. All the memory it needs is allocated dynamically rather than statically.

To make the code run under SpecC we had to change a few things because of limitations of the SpecC compiler: it is not as relaxed as GCC about variable declarations and constant assignments, and it has no NULL keyword. Since our goal is to synthesize the design in hardware, dynamic memory allocation makes no sense, so we removed all dynamic allocation from the code and replaced it with fixed-size arrays. This introduced a new limitation on the input image size, which was restricted to 320x240 pixels. After these basic changes the code compiled and produced the expected result.

b. System Level Model in SpecC

Once these initial changes were made and the reference code ran under the SpecC compiler, we had to start modeling it as a system. The code was therefore modeled with the structural hierarchy shown below; this is the test bench used for the entire project. The program is split into three main behaviors, 'Stimulus', 'Platform', and 'Monitor'. The Platform behavior represents the actual chip, while Stimulus and Monitor represent the image input interface, such as the CMOS sensor of a digital camera, and the image output interface, such as the LCD.

Page 9: Canny Report

6

Structural hierarchy of the model:

behavior Main
 |------ Monitor monitor
 |------ Platform platform
 |        |------ DUT canny
 |        |------ DataIn din
 |        |------ DataOut dout
 |        |------ c_img_queue q1
 |        \------ c_img_queue q2
 |------ Stimulus stimulus
 |------ c_img_queue q1
 \------ c_img_queue q2

Stimulus: This behavior implements Read_pgm() to read the image and sends it to the Platform behavior through its port P. The communication channel between Stimulus and Platform is a simple queue, q1.

(Figure: test bench block diagram. Stimulus, with Read_pgm() and P.Send(img), feeds queue q1 into Platform; inside Platform, DataIn (P1.Read(img), P2.Send(img)) feeds the DUT running Gaussian(), Canny() and Hysteresis(), whose output passes through DataOut (P1.Read(img), P2.Send(img)); queue q2 then feeds Monitor, which performs P.Read(img), Write_pgm(img) and Exit().)


Platform: The Platform has its own Data_in and Data_out interfaces to communicate with the other behaviors instead of talking to Stimulus and Monitor directly. These modules are included to make future modifications easier: if we intend to change the interface between Stimulus or Monitor and the Platform, we need not disturb the entire code; we simply modify Data_in or Data_out. Data_in is the interface between Platform and Stimulus, and Data_out is the interface between Platform and Monitor. DUT is the main behavior implementing the full functionality of the Canny application; all the functions related to edge detection live in the DUT behavior.

Monitor: This behavior reads the processed image from the Platform and writes it to a file using the Write_pgm() function. The interface between Platform and Monitor is also a simple queue, q2.

IMPROVING THE HIERARCHY

A single behavior holding all the edge detection functions leads to an inflexible design that cannot be modified later. To make the model more flexible, the Canny behavior was broken into smaller behaviors, one per function. The new structural hierarchy is as follows:

behavior Main
 |------ Monitor monitor
 |------ Platform platform
 |        |------ DUT canny
 |        |        |------ Apply_Hysteresis apply_hysteresis
 |        |        |------ Derivative_X_Y derivative_x_y
 |        |        |------ Gaussian_Smooth gaussian_smooth
 |        |        |------ Magnitude_X_Y magnitude_x_y
 |        |        \------ Non_Max_Supp non_max_supp
 |        |------ DataIn din
 |        |------ DataOut dout
 |        |------ c_img_queue q1
 |        \------ c_img_queue q2
 |------ Stimulus stimulus
 |------ c_img_queue q1
 \------ c_img_queue q2


c. Estimation, Optimization and Refinement using SCE.

With the initial hierarchy we simulated the system and obtained the execution time distribution shown in the following graph.

The distribution clearly shows that the Gaussian_Smooth function dominates the computation time. Gaussian_Smooth has two main parts, blurring in X and blurring in Y, and each part is internally data-independent, so both can be parallelized. Four instances were created for BlurX and four for BlurY. Before parallelization this part took 400 ms; afterwards it takes 100 ms.

Architectural Refinement: With the optimized model in hand, the next step is to decide which hardware units to allocate to the behaviors. The SCE tool offers different processors such as the ARM7TDMI, Motorola cores, and the ARM9, as well as various DSPs and many custom hardware options. The code has no DSP requirement, so the ARM7TDMI was chosen for the main control. For the Blur functions there were two options: share one unit between BlurX and BlurY, which is possible because BlurY executes after BlurX, or allocate individual units. This is a designer's tradeoff between performance and chip cost; in this project the decision was made in favor of individual custom hardware units for the BlurX and BlurY behaviors. Data_in and Data_out are allocated virtual hardware. The following table shows the mapping of behaviors to processing elements.

Behavior             Processing Element
Canny                ARM7TDMI
BlurX                Custom Hardware
BlurY                Custom Hardware
Data_in, Data_out    Virtual Hardware


Scheduling Refinement: Once the processing elements are decided, the next step is to schedule the tasks that share the same hardware unit. Fortunately, no scheduling is needed in this project: the functions run sequentially, and the only parallel part, the Gaussian smoothing, has its own custom hardware units.

Network Refinement: After scheduling, the communication channels between the individual hardware units have to be defined. In this project the hardware units are one ARM core, the virtual hardware units, and eight custom hardware units. The ARM core comes with the AMBA bus architecture, so any communication to and from the ARM can use AMBA. The communication between the custom hardware units then needs to be finalized; since no complex protocol is required, a simple double handshake protocol was selected. Each BlurX unit has two ports: one on the AMBA bus (input from the ARM) and one on a double handshake bus (output to the BlurY units). Each BlurY unit has five ports: one on the AMBA bus (output to the ARM core) and four input ports for the double handshake busses from the BlurX units.

Transaction Level Model: With network refinement complete, the transaction level model was created. After all these refinements we obtained an accurate simulation result of 501 ms for a single image.

3. CONCLUSION

Starting with just the requirements and a sample C source code, the system was developed step by step through all the levels of abstraction. Although the final RTL refinement required for synthesis was not done, due to the short duration of the project, the results obtained are satisfactory. The individual execution times of the processing elements are:

ARM core - 501.9 ms
BlurX HW - 47.9 ms
BlurY HW - 51.3 ms

ISSUES AND RELATED FUTURE WORK

As the execution times show, the system takes nearly 600 ms to process one picture, which means we can effectively process only 1 to 1.5 frames per second, while real-time video requires a minimum frame rate of about 24 frames per second. This makes the current system impractical. To improve performance we can raise the frequency of the ARM core; the maximum supported frequency is 500 MHz, which gives a speedup of about 5x, i.e. 7.5 frames per second, still not good enough. After parallelizing the Gaussian smoothing, the heaviest remaining computation is Non_Max_Supp. This function is not only computationally intensive but also involves many floating-point operations, so switching to fixed-point computation can yield a further improvement in performance.


The next issue is that all the simulation results obtained are only estimates, not real measurements; the results after synthesis may vary, which would further lower the effective processing power. Is 320x240 an acceptable resolution these days? No: most digital cameras now capture video at a minimum resolution of 1280x720, and our design cannot process even a single image at that resolution. We therefore need to parallelize the approach completely, using as many cores as possible, as graphics cards do. With a proper selection of hardware units and optimization of the code, with some tradeoff in the accuracy of the results, it is possible to address these issues and make the system work for real-time video processing.

4. REFERENCES

1. ftp://figment.csee.usf.edu/pub/Edge_Comparison/source_code/canny.src
2. http://en.wikipedia.org/wiki/Canny_edge_detector
3. http://www.cecs.uci.edu/~doemer/publications/SpecC_LRM_20.pdf
4. http://www.cecs.uci.edu/~cad/sce.html