ECE 734 Final Project Report

Acceleration of motion estimation by edge detection algorithm using PLX subword parallel ISA

ECE 734 Project Final Report

Submitted by Sanghamitra Roy and Dongkeun Oh

Table of contents

1 Introduction and motivation
2 Overview of edge detection algorithm
3 Structure of Canny's algorithm and its hardware implementation
4 PLX subword parallel architecture: Overview
5 Algorithm for hardware implementation
6 Initial approach for hardware implementation
7 Memory efficient hardware implementation
8 Experimental results
9 Conclusion and future work
10 References
11 Appendix

1. Introduction and motivation

With the rapid increase in the amount of multimedia information on the internet,

there has been a remarkable rise in the demand for video-driven applications such as

teleconferencing, videophones, and image-based multimedia services. Thus, the amount of

video information to be transmitted in the network has increased, although the

transmission rate in the network has not increased at the same rate. Hence, low bit-rate

video coding techniques have become necessary to ease these bottlenecks.

The low bit-rate video coding algorithms can be divided into two categories. The

first category consists of block-based algorithms such as H.261, H.263, MPEG-1, and

MPEG-2. These algorithms are easy to implement and maintain a relatively good image

quality at low bit rates. However, at very low bit rates, less than 28.8kbps, blocking and

mosquito artifacts become visible and the reconstructed image quality becomes degraded.

This is the reason why this strategy is not employed in MPEG-4. The second category is

object or segmentation based coding. Many techniques for object based coding at very

low bit rates have already been proposed. Object based coding achieves high

compression rate by subdividing an image into a number of arbitrarily shaped objects and

the background, and by performing motion estimation of objects. The greatest advantage

of this method is the ability to perform accurate motion estimation of moving objects and

utilize the available bit rate efficiently, by focusing on moving objects. Therefore, the

quality of images produced by this method varies dramatically depending on the quality

of object segmentation.

The object oriented approach supports high quality resolution for each individual

object. The accurate motion representation of the object is the key to good motion

compensation for coding purposes as well as for image format conversion. However,

most of the object-based coding approaches are computationally expensive. Object

segmentation and recognition are also primary steps in computer vision. Object

recognition is used in many areas such as traffic monitoring and robot vision. While a

single image provides a snapshot of a scene, different frames of a video taken over time

represent the dynamics in the scene, making it possible to capture the motion in the

sequence. Recognition of a moving object must run in real time, which

requires high performance image processors.

Edge detection or object segmentation is the crucial part of object recognition.

Edge features, which are recognized as an important aspect of human visual perception,

are commonly used in shape analysis. Decomposition of images into two regions of low-

frequency blocks and blocks containing visually important features such as edges or lines

requires analysis of visual continuity of the image.

The objective of this project is to improve the computational power of an image

processor by accelerating the edge detection algorithm. We propose to enhance the

performance of the edge detection algorithm using sub-word parallelism, and implement

this algorithm using PLX subword parallel ISA.

2. Overview of Edge Detection algorithm

An edge in an image corresponds to a discontinuity in the intensity surface of the

underlying scene, that is, a jump in intensity from one pixel to the next. Edge detection

significantly reduces the amount of data and filters out useless information, while

preserving the important structural properties in an image. There are many ways to

perform edge detection. However, the majority of different methods may be grouped into

two categories, gradient and Laplacian. The gradient method detects the edges by looking

for the maximum and minimum in the first derivative of the image. The Laplacian

method searches for zero crossings in the second derivative of the image to find edges.

An edge has the one-dimensional shape of a ramp. Calculating the derivative of the

image can highlight its location. Suppose we have the following signal, with an edge

shown by the jump in intensity below in figure 1:

Figure 1: Intensity profile of pixels in 1D line

If we take the gradient of this signal (which, in one dimension, is just the first

derivative with respect to t) we get the following as shown in figure 2:

Figure 2: 1st Derivative of pixel intensity

The derivative shows a maximum located at the center of the edge in the original

signal. This method of locating an edge is characteristic of the gradient family of edge

detection filters. A pixel location is declared an edge location if the value of the gradient

exceeds some threshold. As seen above, pixels on an edge have larger gradient

values than the pixels surrounding them. So once a threshold is set, we can compare the gradient

value to the threshold value and detect an edge whenever the threshold is exceeded.

Furthermore, when the first derivative is at a maximum, the second derivative is zero. As

a result, another alternative to finding the location of an edge is to locate the zeros in the

second derivative. This method is known as the Laplacian and the second derivative of

the signal is shown in figure 3:

Figure 3: 2nd Derivative of pixel intensity
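The two one-dimensional criteria described above can be sketched in C as follows. This is a minimal illustration, not code from the project: the function names, the central-difference approximation, and the sample ramp are all assumptions chosen to mirror figures 1 to 3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* First derivative at sample i: central difference inside the signal,
 * one-sided difference at the two end points. */
int derivative_at(const int *f, size_t n, size_t i)
{
    assert(n >= 2 && i < n);
    if (i == 0)     return f[1] - f[0];
    if (i == n - 1) return f[n - 1] - f[n - 2];
    return f[i + 1] - f[i - 1];
}

/* Gradient method: index of the largest |derivative| that exceeds the
 * threshold, or -1 if no sample exceeds it. */
int edge_by_gradient(const int *f, size_t n, int threshold)
{
    int best = -1, best_mag = threshold;
    for (size_t i = 0; i < n; i++) {
        int mag = abs(derivative_at(f, n, i));
        if (mag > best_mag) { best_mag = mag; best = (int)i; }
    }
    return best;
}

/* Laplacian method: position where the second difference changes sign
 * (midpoint between the last value of one sign and the first of the
 * other), or -1 if there is no sign change. */
int edge_by_zero_crossing(const int *f, size_t n)
{
    int prev_sign = 0;
    size_t prev_idx = 0;
    for (size_t i = 1; i + 1 < n; i++) {
        int s = f[i + 1] - 2 * f[i] + f[i - 1];
        int sign = (s > 0) - (s < 0);
        if (sign == 0) continue;
        if (prev_sign != 0 && sign != prev_sign)
            return (int)((prev_idx + i) / 2);
        prev_sign = sign;
        prev_idx = i;
    }
    return -1;
}
```

For a smooth ramp such as {10, 10, 10, 12, 18, 30, 42, 48, 50, 50, 50}, both functions locate the edge at the centre of the ramp, which is the behaviour the figures illustrate.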

Based on this one-dimensional analysis, the theory can be carried over to two dimensions

as long as there is an accurate approximation of the derivatives of a two-dimensional

image.

The gradient of an image f(x,y) at location (x,y) is defined as the vector

$$\nabla f = \begin{bmatrix} G_x \\ G_y \end{bmatrix} = \begin{bmatrix} \partial f / \partial x \\ \partial f / \partial y \end{bmatrix} \tag{1}$$

It is well known from vector analysis that the gradient vector points in the direction of

maximum rate of change of f at coordinates (x,y).

The magnitude of this vector

$$|\nabla f| = \sqrt{G_x^2 + G_y^2} \tag{2}$$

is an important quantity in the edge detection process, which provides the maximum rate

of increase of f(x,y) per unit distance in the direction of $\nabla f$. The direction of the gradient

vector

$$\alpha(x,y) = \tan^{-1}\left(\frac{G_y}{G_x}\right) \tag{3}$$

represents the direction angle of the vector at (x,y).

Computation of the gradient of an image is based on obtaining the partial derivatives

$\partial f / \partial x$ and $\partial f / \partial y$ at every pixel location. Let us consider the 3*3 area shown in Figure

4.

z1 z2 z3
z4 z5 z6
z7 z8 z9

Figure 4: 3*3 region of an image

Figure 5: Masks based on (a) Prewitt (b) Sobel operator

At a pixel z5 of a 3*3 image, the first order derivative by Prewitt operator is given by

(4)

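Computing the magnitude with equation (2) is expensive because of the squares and the square root, and thus we use an

approximation based on the absolute values of the two partial derivatives $G_x = \partial f / \partial x$ and $G_y = \partial f / \partial y$:

$$|\nabla f| \approx |G_x| + |G_y| \tag{5}$$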
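As an illustration of the absolute-value approximation, the sketch below applies Prewitt-style differences to a 3*3 neighbourhood stored as z1..z9 in row-major order. The function names are hypothetical, and the axis convention (x along a row, y down a column, matching del_x and del_y in the C code later in this report) is an assumption; the masks of figure 5 may use the opposite assignment.

```c
#include <assert.h>
#include <stdlib.h>

/* z[0..8] holds z1..z9 of the 3x3 region in row-major order.
 * Assumed convention: gx is the horizontal difference (right column
 * minus left column), gy the vertical one (bottom row minus top row). */
void prewitt_3x3(const int z[9], int *gx, int *gy)
{
    assert(z != NULL && gx != NULL && gy != NULL);
    *gx = (z[2] + z[5] + z[8]) - (z[0] + z[3] + z[6]);
    *gy = (z[6] + z[7] + z[8]) - (z[0] + z[1] + z[2]);
}

/* Magnitude approximation |gx| + |gy|: no multiplication or square
 * root, only additions and absolute values. */
int gradient_magnitude_approx(int gx, int gy)
{
    return abs(gx) + abs(gy);
}
```

For a vertical step edge the approximation equals the exact magnitude, because one of the two components is zero. For a diagonal edge with gx = gy = 20 it gives 40, whereas the exact value is about 28.3, so the approximation overestimates diagonal edges by up to a factor of about 1.41.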

3. Structure of Canny’s algorithm and its hardware implementation

Structure of Canny's algorithm

We select Canny's algorithm for the implementation of the edge detection process

because it is regarded as a “standard method” of edge detection. The base program

source file is available at [8]; we modify this program to produce sample data and binary files

for PLX simulation and simulate the whole edge detection process. Some sample outputs

of edge detection using our C code are shown in figure 7.

The program consists of 5 main function modules. Their function names are

gaussian_smooth, derivative_x_y, magnitude_x_y, non_max_supp, and apply_hysteresis.

In the first stage, it performs linear filtering with a Gaussian kernel to smooth the noise in

the image. Here, the pixel color data is converted into grayscale value. In the second

stage, computation of the edge strength and direction for each pixel in the smoothed

image is performed. This is done by differentiating the image in two orthogonal

directions and computing the partial derivatives. The third stage calculates the gradient

magnitude as the root sum of squares of the derivatives. The gradient direction is

computed using the arctangent of the ratio of the derivatives. In our implementation we use the sum of the absolute

values of the two derivatives to approximate the magnitude.

Figure 6: Derivative masks for Canny’s algorithm

Figure 7: Original images and edge detected images using our C code

Figure 8: Block diagram of Canny's algorithm program using AQtime profiler

In the fourth stage, candidate edge pixels are identified as the pixels that survive a

thinning process called non-maximal suppression. In this process, the edge strength of

each candidate edge pixel is set to zero if its edge strength is not larger than the edge

strength of the two adjacent pixels in the gradient direction. Thresholding is then done on

the thinned edge magnitude image using hysteresis. In the fifth stage of hysteresis, two

edge strength thresholds are used. All candidate edge pixels below the lower threshold

are labeled as non-edges and all pixels above the lower threshold that can be connected to

any pixel above the higher threshold through a chain of edge pixels are labeled as edge

pixels.
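The hysteresis stage described above can be sketched as a seeded region growing step. This is a minimal illustration written for this explanation, not the apply_hysteresis routine of the base program: the function name, the 8-connected neighbourhood, and the strict comparisons against the thresholds are assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* mag:  rows*cols edge strengths after non-maximal suppression.
 * edge: output map, 1 for edge pixels and 0 for non-edges.
 * Returns 0 on success, -1 if the work stack cannot be allocated. */
int hysteresis(const int *mag, unsigned char *edge,
               int rows, int cols, int low, int high)
{
    int n = rows * cols, top = 0;
    int *stack;

    assert(rows > 0 && cols > 0 && low <= high);
    stack = malloc((size_t)n * sizeof *stack);
    if (stack == NULL) return -1;
    memset(edge, 0, (size_t)n);

    /* Seed with every pixel above the higher threshold. */
    for (int p = 0; p < n; p++) {
        if (mag[p] > high) { edge[p] = 1; stack[top++] = p; }
    }

    /* Grow each seed through neighbours above the lower threshold.
     * A pixel is marked before it is pushed, so it is pushed at most
     * once and the stack never holds more than n entries. */
    while (top > 0) {
        int p = stack[--top], r = p / cols, c = p % cols;
        for (int dr = -1; dr <= 1; dr++) {
            for (int dc = -1; dc <= 1; dc++) {
                int rr = r + dr, cc = c + dc, q;
                if (rr < 0 || rr >= rows || cc < 0 || cc >= cols) continue;
                q = rr * cols + cc;
                if (!edge[q] && mag[q] > low) { edge[q] = 1; stack[top++] = q; }
            }
        }
    }
    free(stack);
    return 0;
}
```

A pixel that lies between the two thresholds is kept only if a chain of such pixels links it to a strong pixel; an isolated pixel of the same strength is discarded.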

Hardware implementation and Project purpose

The acceleration of this algorithm can be done in various ways. Upon inspection

of the various routines in the C code implementation, we find that the derivative_x_y

function module is a good candidate for being converted into PLX ISA using subword

parallelism. Before proceeding to the hardware implementation of the derivative part of

(x,y) pixels, we briefly review the overall run time of the whole edge detection algorithm.

Figure 9: Snapshot of the profile of Canny’s program produced by AQtime

Figure 10: Run time weight of each program module in Canny’s Edge detection program

Figure 10 shows the runtime profile of Canny's program produced by the AQtime profiler, as

the percentage of run time spent in each module of the algorithm. From

figures 9 and 10, it can be observed that the Gaussian smoothing stage takes almost half of

the execution time of the program. This is because it uses many multiplication and

division operations for each pixel. Gaussian smoothing could therefore be accelerated

with high performance multiplication and division modules. However, the PLX ISA

does not directly support division, and so its hardware implementation goes

beyond the scope of this project. The second stage of Canny’s algorithm, i.e., the

computation of the derivatives, is a good candidate for hardware acceleration using

subword parallelism because its loop structure is symmetric and it can be easily

parallelized into PLX subword ISA. Also this derivative calculation step is common to all

edge detection algorithms. So speeding up this step can benefit other edge detection

schemes too. All the other three function modules can be accelerated by using the same

approach. Thus, we will concentrate on the hardware implementation of this part and we

will leave implementation of the other parts for future work.

4. PLX subword parallel architecture: Overview

We give a very brief overview of the PLX architecture in this section. We avoid

going into too much detail as this is already a part of our course. PLX is a small subword

parallel instruction set architecture developed at Princeton University. It uses SIMD type

of instructions for parallel operation and faster performance. PLX supports 1,2,4 or 8 byte

sub-words. It has 32 general purpose registers which may be 32, 64 or 128 bit wide. This

wordsize scalability allows design tradeoffs between performance and cost. PLX also

supports predicated execution. Reading or writing data from the memory requires using

aligned memory address (4/8/16 byte boundaries).
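To make the idea of subword parallelism concrete, the following C sketch emulates a lane-wise subtraction of four 2-byte subwords held in one 64-bit word, which is the kind of operation the psub.2 instruction in our Appendix code performs. This is plain C, not PLX; the helper names, the lane numbering, and the wrap-around (modulo 2^16) behaviour per lane are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Most significant bit of each of the four 16-bit lanes. */
#define LANE_MSBS UINT64_C(0x8000800080008000)

/* Pack four 16-bit values into one 64-bit word, lane 0 in the low bits. */
uint64_t pack4(int16_t a0, int16_t a1, int16_t a2, int16_t a3)
{
    return  (uint64_t)(uint16_t)a0         | ((uint64_t)(uint16_t)a1 << 16) |
           ((uint64_t)(uint16_t)a2 << 32)  | ((uint64_t)(uint16_t)a3 << 48);
}

/* Extract lane i (0..3) as a signed value; the final cast relies on the
 * usual two's complement behaviour of mainstream compilers. */
int16_t lane(uint64_t w, int i)
{
    assert(i >= 0 && i < 4);
    return (int16_t)(uint16_t)(w >> (16 * i));
}

/* Lane-wise a - b with a single 64-bit subtraction. Setting the top bit
 * of every lane of a and clearing it in b guarantees that no borrow
 * crosses a lane boundary; the final XOR restores the correct top bits. */
uint64_t psub16(uint64_t a, uint64_t b)
{
    uint64_t diff = (a | LANE_MSBS) - (b & ~LANE_MSBS);
    return diff ^ ((a ^ ~b) & LANE_MSBS);
}
```

One call to psub16 replaces four scalar subtractions, which is where the speedup of a subword parallel implementation of the derivative routine comes from.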

5. Algorithm for hardware implementation

We picked the partial derivative estimation algorithm for hardware

implementation and optimization using the PLX instruction set architecture. This

algorithm calculates the partial x-derivative and partial y-derivative for all pixel values of

the image. The algorithm contains basic addition and subtraction operations using the

derivative mask of figure 6. The C code snippet for this algorithm is shown below.

/* Computing x-derivative */
for(r = 0; r < rows; r++){
    pos = r * cols;
    del_x[pos] = s[pos + 1] - s[pos];
    pos++;
    for(c = 1; c < (cols - 1); c++, pos++) {
        del_x[pos] = s[pos + 1] - s[pos - 1];
    }
    del_x[pos] = s[pos] - s[pos - 1];
}

/* Computing y-derivative */
for(c = 0; c < cols; c++){
    pos = c;
    del_y[pos] = s[pos + cols] - s[pos];
    pos += cols;
    for(r = 1; r < (rows - 1); r++, pos += cols) {
        del_y[pos] = s[pos + cols] - s[pos - cols];
    }
    del_y[pos] = s[pos] - s[pos - cols];
}
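For readers who want to run the routine on a small example, the loops above can be wrapped in a self-contained function as sketched below. The function name derivative_x_y_sketch and its parameter list are hypothetical; only the loop bodies follow the snippet.

```c
#include <assert.h>

/* s:     smoothed image, rows*cols values in row-major order.
 * del_x: horizontal differences, del_y: vertical differences.
 * Interior pixels use a central difference; the first and last pixel of
 * every row (for del_x) and column (for del_y) use a one-sided one. */
void derivative_x_y_sketch(const short *s, int rows, int cols,
                           short *del_x, short *del_y)
{
    int r, c, pos;

    assert(rows >= 2 && cols >= 2);

    for (r = 0; r < rows; r++) {
        pos = r * cols;
        del_x[pos] = (short)(s[pos + 1] - s[pos]);
        pos++;
        for (c = 1; c < cols - 1; c++, pos++)
            del_x[pos] = (short)(s[pos + 1] - s[pos - 1]);
        del_x[pos] = (short)(s[pos] - s[pos - 1]);
    }

    for (c = 0; c < cols; c++) {
        pos = c;
        del_y[pos] = (short)(s[pos + cols] - s[pos]);
        pos += cols;
        for (r = 1; r < rows - 1; r++, pos += cols)
            del_y[pos] = (short)(s[pos + cols] - s[pos - cols]);
        del_y[pos] = (short)(s[pos] - s[pos - cols]);
    }
}
```

On the 3*3 image {1, 2, 4, 7, 11, 16, 22, 29, 37}, the centre pixel gets del_x = 16 - 7 = 9 and del_y = 29 - 2 = 27.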

As can be seen, the algorithm is mostly symmetric, except for the first and last

calculations of every row and column. We have the following objectives for

implementing this algorithm using the PLX subword parallel architecture:

i) Explore parallelism in the algorithm by loop unrolling

ii) Minimize memory accesses to reduce the execution time bottleneck

iii) Design memory access and data fetching to maximize cache hits

iv) Optimize the PLX code for faster performance

v) Verify accuracy of the PLX implementation against the C code

6. Initial approach for hardware implementation

Our algorithms have been implemented for 100*100 pixel images having a total

of 10000 pixels. At first we implement the above algorithm by unfolding the symmetric

inner loops as shown below:

/* Loop unfolded x-derivative */
for(r = 0; r < 100; r++){
    pos = r * 100;
    del_x[pos] = s[pos + 1] - s[pos];
    pos++;
    for(c = 1; c < 25; c++, pos += 4) {
        del_x[pos] = s[pos + 1] - s[pos - 1];
        del_x[pos + 1] = s[pos + 2] - s[pos];
        del_x[pos + 2] = s[pos + 3] - s[pos + 1];
        del_x[pos + 3] = s[pos + 4] - s[pos + 2];
    }
    ……
    del_x[pos] = s[pos] - s[pos - 1];
}

/* Loop unfolded y-derivative */
for(c = 0; c < 100; c++){
    pos = c;
    del_y[pos] = s[pos + 100] - s[pos];
    pos += 100;
    for(r = 1; r < 25; r++, pos += 400) {
        del_y[pos] = s[pos + 100] - s[pos - 100];
        del_y[pos + 100] = s[pos + 200] - s[pos];
        del_y[pos + 200] = s[pos + 300] - s[pos + 100];
        del_y[pos + 300] = s[pos + 400] - s[pos + 200];
    }
    ……
    del_y[pos] = s[pos] - s[pos - 100];
}

The data is stored as 2 bytes for each pixel, in a sequential row-wise order in the

memory. The arrangement of the data in memory is shown in figure 11. So, pixels 0-99

are stored sequentially as 2 bytes each, followed by pixels 100-199 and so on. As

observed from the code, the first and last derivatives of each row and column are

calculated using different masks than the intermediate derivatives. So, in this initial

implementation we use subword parallel operations for intermediate pixels {1, 2, 3, 4},

{5, 6, 7, 8} …. for the rows and {100, 200, 300, 400}, {101, 201, 301, 401} …. for the

columns. Thus, if we want to load pixels 1,2,3,4 in a register as shown in figure 11, we

need two loads from the memory as these pixels are not aligned with the 8 byte address

boundaries. Also for loading the column-wise data {100, 200, 300, 400} in a single

register, we face a problem as the data is arranged in row-wise order in the memory.
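The alignment issue can be checked with a few lines of arithmetic. The sketch below counts how many aligned 8-byte words a run of consecutive pixels touches, assuming 2 bytes per pixel and an image that starts at an 8-byte aligned address; the function and constant names are hypothetical.

```c
#include <assert.h>

enum { BYTES_PER_PIXEL = 2, WORD_BYTES = 8 };

/* Number of aligned 8-byte words touched by `count` consecutive pixels
 * starting at pixel index `first` (row-major order, image base address
 * assumed to be 8-byte aligned). */
int aligned_loads_needed(int first, int count)
{
    int first_byte, last_byte;

    assert(first >= 0 && count > 0);
    first_byte = first * BYTES_PER_PIXEL;
    last_byte  = (first + count) * BYTES_PER_PIXEL - 1;
    return last_byte / WORD_BYTES - first_byte / WORD_BYTES + 1;
}
```

Pixels 1 to 4 occupy bytes 2 to 9 and therefore straddle two words, while pixels 0 to 3 or 100 to 103 fit in a single word (a row of 100 pixels is 200 bytes, a multiple of 8, so every row starts aligned). The column group {100, 200, 300, 400} is worse still: each pixel lies in a different row, so four separate loads are needed before the values can be combined in one register.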

Figure 11: Memory mapping for initial algorithm

The initial implementation has helped us to learn the basics of coding an

algorithm in PLX. We also verify our results with the C code by interfacing our PLX

algorithm with the C code. We calculate the speedup of our PLX implementation with

respect to the Intel x86 architecture. The following are the major issues we face during

our initial implementation.

i) Interfacing data with C code: we use short integer representation of

pixels in C, each of which requires 2 bytes. We use fread/fwrite to

read/write binary data from the C code to the PLX code.

ii) Loops are implemented using predicated jump instructions in PLX.

iii) Load alignment problem: In PLX, data must be loaded from an address that is a

multiple of 4 or 8 bytes to avoid a trap. As we have seen, our data is

slightly misaligned with this requirement.

The initial implementation gives a 2X speedup in terms of the number of machine

cycles, with respect to the C code. But given the potential of subword parallelism, there is

scope for a lot of improvement. We can design better memory management schemes for

our algorithm to improve the efficiency of our algorithm and cache performance. In the

next section we describe our final hardware implementation optimized for better memory

management and faster performance.

7. Memory efficient hardware implementation

In our new implementation, we perform both the x and y-derivative calculation in

the same loop. Thus we merge the two nested loops and then perform loop unfolding.

Also we perform 16*2 = 32 derivative calculations in a single iteration. The unfolded

algorithm code is shown below. Note that we have used vector representation to simplify

our code. Thus x[a: b: c] means {x[a], x[a + b], x[a + 2b], ….., x[c]}. And the notation

x[a: d] means {x[a], x[a + 1], x[a + 2], ….., x[d]}.

C code optimized for subword parallel hardware implementation

for(r = 0; r < 25; r++){
    pos = r * 100 * 4;
    for(c = 0; c < 25; c++, pos += 4) {
        if(c == 0) {
            del_x[pos: 100: pos + 300] = s[pos + 1: 100: pos + 301] - s[pos: 100: pos + 300];
        } else {
            del_x[pos: 100: pos + 300] = s[pos + 1: 100: pos + 301] - s[pos - 1: 100: pos + 299];
        }
        del_x[pos + 1: 100: pos + 301] = s[pos + 2: 100: pos + 302] - s[pos: 100: pos + 300];
        del_x[pos + 2: 100: pos + 302] = s[pos + 3: 100: pos + 303] - s[pos + 1: 100: pos + 301];
        del_x[pos + 3: 100: pos + 303] = s[pos + 4: 100: pos + 304] - s[pos + 2: 100: pos + 302];

        if(r == 0) {
            del_y[pos: pos + 3] = s[pos + 100: pos + 103] - s[pos: pos + 3];
        } else {
            del_y[pos: pos + 3] = s[pos + 100: pos + 103] - s[pos - 100: pos - 97];
        }
        del_y[pos + 100: pos + 103] = s[pos + 200: pos + 203] - s[pos: pos + 3];
        del_y[pos + 200: pos + 203] = s[pos + 300: pos + 303] - s[pos + 100: pos + 103];
        del_y[pos + 300: pos + 303] = s[pos + 400: pos + 403] - s[pos + 200: pos + 203];
    }
}
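The merged, block-wise traversal can be checked against the original two-pass loops with ordinary scalar C. The sketch below is not our PLX code: it uses a hypothetical 8*8 image instead of 100*100, hypothetical function names, and handles the image borders by substituting the pixel itself for the missing neighbour, which gives the same one-sided differences as the reference loops.

```c
#include <assert.h>

enum { N = 8 };  /* hypothetical image side, a multiple of 4 */

/* Reference: one pass over rows for dx, one pass over columns for dy. */
void derivative_ref(const short *s, short *dx, short *dy)
{
    int r, c, pos;
    for (r = 0; r < N; r++) {
        pos = r * N;
        dx[pos] = (short)(s[pos + 1] - s[pos]);
        pos++;
        for (c = 1; c < N - 1; c++, pos++)
            dx[pos] = (short)(s[pos + 1] - s[pos - 1]);
        dx[pos] = (short)(s[pos] - s[pos - 1]);
    }
    for (c = 0; c < N; c++) {
        pos = c;
        dy[pos] = (short)(s[pos + N] - s[pos]);
        pos += N;
        for (r = 1; r < N - 1; r++, pos += N)
            dy[pos] = (short)(s[pos + N] - s[pos - N]);
        dy[pos] = (short)(s[pos] - s[pos - N]);
    }
}

/* Blocked: both derivatives of one 4x4 block are produced together. */
void derivative_blocked(const short *s, short *dx, short *dy)
{
    assert(N % 4 == 0);
    for (int br = 0; br < N; br += 4)
        for (int bc = 0; bc < N; bc += 4)
            for (int i = br; i < br + 4; i++)
                for (int j = bc; j < bc + 4; j++) {
                    int pos   = i * N + j;
                    int left  = (j == 0)     ? pos : pos - 1;
                    int right = (j == N - 1) ? pos : pos + 1;
                    int up    = (i == 0)     ? pos : pos - N;
                    int down  = (i == N - 1) ? pos : pos + N;
                    dx[pos] = (short)(s[right] - s[left]);
                    dy[pos] = (short)(s[down] - s[up]);
                }
}

/* Number of output values on which the two versions disagree. */
int count_mismatches(const short *s)
{
    short ax[N * N], ay[N * N], bx[N * N], by[N * N];
    int i, bad = 0;
    derivative_ref(s, ax, ay);
    derivative_blocked(s, bx, by);
    for (i = 0; i < N * N; i++)
        bad += (ax[i] != bx[i]) + (ay[i] != by[i]);
    return bad;
}
```

Replacing a missing neighbour by the pixel itself is the scalar counterpart of the predicated register copies used for the first and last block rows in the optimized PLX code of the Appendix.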

Using the above code we can perform a better memory management for our

algorithm. Thus, as shown in figure 12, we can load 4x4 blocks of pixel data from the

memory in four registers. We can calculate the y-derivative for those 16 pixels using

subword parallel operations. Then we perform matrix transpose operation in PLX to get

the data in column-wise order. So now, we can perform the x-derivative calculation for

all 16 pixels. As the boundary pixels use a different derivative mask, we parallelize

them along with the intermediate pixels, by using predicated execution. For details of this

implementation, please refer to our optimized PLX code in the Appendix. Since the

memory load is always from 8 byte aligned addresses, we can minimize memory access

time and have a better cache performance. For some boundary data needed between two

iterations, we perform local data communication, instead of reloading the data from the

memory.

Figure 12: Memory mapping for optimized algorithm

8. Experimental results

Simulation Verification

Our first task is to verify the results of the PLX code with the results of the C

code. To do this, we dump the C results for derivative of pixels in binary format, and read

the data within PLX code and verify with the PLX generated data. In both our initial and

final implementation, the results of the PLX implementation exactly matched the results

of the C implementation. In figure 13, we show a snapshot of the PLX results and the

corresponding C results for y-derivative calculation. {-72, -48, -15, 11} are 8 bytes of

data for y-derivative. In hex they are {FFB8, FFD0, FFF1, 000B} which matches exactly

with the plx result in register R19 (in reverse order).

Figure 13: Result verification snapshot for y derivative calculation
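The decimal to hexadecimal correspondence quoted above can be reproduced with a short C helper that exposes the two's complement bit pattern of a 16-bit derivative. The function names are hypothetical and serve only this check.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Two's complement bit pattern of a signed 16-bit value. Conversion to
 * an unsigned type is defined modulo 2^16, so this is portable. */
uint16_t to_bits(int16_t v)
{
    return (uint16_t)v;
}

/* Print four derivatives as 4-digit hexadecimal subwords. */
void print_hex4(const int16_t v[4])
{
    assert(v != NULL);
    printf("%04X %04X %04X %04X\n",
           (unsigned)to_bits(v[0]), (unsigned)to_bits(v[1]),
           (unsigned)to_bits(v[2]), (unsigned)to_bits(v[3]));
}
```

Calling print_hex4 on {-72, -48, -15, 11} gives the four subwords FFB8, FFD0, FFF1 and 000B, the values compared against register R19.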

Calculation of performance speedup

Next we measure the performance speedup of our optimized algorithm. We

calculate our speedup against two different baselines. First we profile the C code, using

the AQtime 4.91 tool for Visual Studio .NET. Using this tool we can generate Intel x86

assembly instructions, and calculate the machine cycles for our algorithm. Next we

write sequential code for our algorithm in PLX without using subword parallelism. The

sequential PLX code has been written to represent the sequential C code, so that we can compare its

performance with that of our optimized code using the PLX timing simulator. This

code has also been provided in the Appendix. Table 2 shows the machine cycles required

for the Intel x86, sequential PLX and the optimized PLX codes. Our optimized PLX code

shows a speedup of 7.06X over the x86 code and 8.13X over the sequential PLX code.

Thus we could significantly speed up the partial derivative routine using memory

efficient subword parallel implementation. We also measured the speedup of the whole

Canny algorithm using our optimized PLX derivative routine. We obtained a 3.8% speedup for

the entire algorithm (including all 5 routines) by optimizing the derivative calculation

routine using subword parallel implementation.

Routine Name       % Time (cycles)   Time (cycles)   Time with Children (cycles)
magnitude_x_y      17.01%            1578363         2732808
derrivative_x_y    4.28%             397269          849484
gaussian_smooth    46.58%            4321857         5093960
apply_hysteresis   17.72%            1644230         1727338
non_max_supp       14.40%            1336618         1336618
calloc             0.00%             191             276331
SUM                                  9278337

Table 1: Performance of each function module in C code

Platform          Intel x86   sequential PLX   subword parallel PLX
machine cycles    397269      457881           56313
lines of code     768         63               75

Table 2: Performance comparison for the derivative function module

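$$\text{Speedup} = \frac{\text{machine cycles of the baseline code}}{\text{machine cycles of the subword parallel PLX code}}$$

With the values of Table 2, the ratio is 397269 / 56313 for the Intel x86 baseline and 457881 / 56313 for the sequential PLX baseline.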
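The speedup figures can be recomputed from the cycle counts in Tables 1 and 2, as in the sketch below. The function names are hypothetical, and the whole-program estimate assumes that the cycle counts of the other routines are unchanged and that a PLX cycle can be substituted directly for an x86 cycle, which is a simplification.

```c
#include <assert.h>

/* Ratio of baseline cycles to optimized cycles. */
double speedup(long baseline_cycles, long optimized_cycles)
{
    assert(baseline_cycles > 0 && optimized_cycles > 0);
    return (double)baseline_cycles / (double)optimized_cycles;
}

/* Whole-program speedup when a single routine drops from
 * routine_before to routine_after cycles and the rest is unchanged. */
double overall_speedup(long total_cycles, long routine_before,
                       long routine_after)
{
    long new_total = total_cycles - routine_before + routine_after;
    assert(new_total > 0);
    return (double)total_cycles / (double)new_total;
}
```

With these inputs, speedup(397269, 56313) evaluates to about 7.05 (marginally below the 7.06X quoted in the text, presumably a rounding difference), speedup(457881, 56313) to about 8.13, and overall_speedup(9278337, 397269, 56313) to about 1.038, which corresponds to the 3.8% improvement reported for the entire algorithm. The small overall gain is expected, since the derivative routine accounts for only 4.28% of the total run time.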

9. Conclusion and future work

With the advancement of technology, there has been an increasing demand for

high performance multimedia algorithms. Very low bit rate video coding techniques use

object recognition for image compression. Edge detection is a crucial part of object

recognition, and thus we optimize Canny’s edge detection algorithm using the PLX

subword parallel ISA. We choose the partial derivative calculation stage of this algorithm

which is a crucial stage in every edge detection algorithm, and develop an efficient

implementation for this stage in PLX. We initially unfold the symmetric part of the

loop code, and find that symmetric unrolling yields only a small speedup (2X) because of the memory

misalignment problem. Then we restructure our nested loops and unroll them to improve our

memory management for subword parallel implementation. We verify our results with

the C code implementation and achieve a 7-8X performance speedup over the C

implementation. The techniques in this project have been specifically used to optimize

the second stage of Canny’s algorithm. We believe that if we can generalize this method

of memory efficient PLX implementation to arbitrary nested loops, then the generalized

method can be applied to develop a full PLX implementation of Canny’s edge detection

algorithm or of other edge detection algorithms. We leave this part for future work.

We could also improve the overall performance of the edge detection process by

employing a more sophisticated pattern matching algorithm.

10. References

[1] WanCheol Kim et al., “Efficient Tracking of a Moving Object using Optimal Representative Blocks”, Proceedings of the 2003 IEEE/ASME.

[2] Shyi-Chyi Cheng, “Visual Pattern Matching in Motion Estimation for Object-Based Very Low Bit-Rate Coding Using Moment-Preserving Edge Detection”, IEEE Transactions on Multimedia, Vol. 7, No. 2, April 2005.

[3] P. Yahampath et al., “Detection of Moving Objects in Facial Image Sequences”, IEEE, 1998.

[4] Ruby B. Lee and A. Murat Fiskiran, “PLX: A Fully Subword-Parallel Instruction Set Architecture for Fast Scalable Multimedia Processing”, Proceedings of the 2002 IEEE International Conference on Multimedia and Expo (ICME 2002), pp. 117-120, August 2002.

[5] Vishvjit S. Nalwa and Thomas O. Binford, “On Detecting Edges”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-8, No. 6, Nov. 1986.

[6] John Canny, “A Computational Approach to Edge Detection”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-8, No. 6, Nov. 1986.

[7] Sheu-Chih Cheng and Hsueh-Ming Hang, “A Comparison of Block-Matching Algorithms Mapped to Systolic-Array Implementation”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 7, No. 5, Oct. 1997.

[8] M. Heath et al., “Edge Detection Comparison”, http://marathon.csee.usf.edu/edge/edge_detection.html (C code).

[9] M. Heath et al., “A Robust Visual Method for Assessing the Relative Performance of Edge-Detection Algorithms”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 12, Dec. 1997, pp. 1338-1359.

11. Appendix

Optimized PLX code

// Optimized PLX code using efficient memory mapping techniques
// for reduced memory accesses and faster performance

///////consts
#ifndef CPUDATAWIDTH
#define CPUDATAWIDTH 64 //64-bit
#endif

#define stop trap 0FFFFh

#define SRAMstart 0x10000 // start reading pixel values from this location
#define DERxstart 0x30000 // start storing x derivatives from this location
#define DERystart 0x50000 // start storing y derivatives from this location
#define ROWS 0x64 // 100 rows
#define COLS 0x64 // 100 columns
#define TWOCOLS 0xC8 // 200
#define EIGHTCOLS 0x320 //800
#define SIXCOLS 0x258 //600

////////variables
#define R R10 // row counter
#define C R11 // column counter
#define POS R12
#define COL R13
#define DELX R14
#define DELY R19
#define START R17
#define DXSTART R18
#define DYSTART R20
#define Rtemp1 R21
#define Rtemp2 R6
#define Rtemp3 R7
#define Rtemp4 R15
#define L0 R8
#define L1 R1
#define L2 R2
#define L3 R3
#define L4 R4
#define L5 R5
#define M1 R22
#define M2 R23
#define M3 R24
#define M4 R25

////////

mov macro Rd,Rs
    ori Rd,Rs,0
endm

trans4x4 macro L11,L22,L33,L44
    mix.2.l Rtemp1,L11,L22
    mix.2.r Rtemp2,L11,L22
    mix.2.l Rtemp3,L33,L44
    mix.2.r Rtemp4,L33,L44
    mix.4.l L44,Rtemp1,Rtemp3
    mix.4.l L33,Rtemp2,Rtemp4
    mix.4.r L22,Rtemp1,Rtemp3
    mix.4.r L11,Rtemp2,Rtemp4
endm

main proc

    loadi.z.0 COL, COLS
    loadi.z.0 START, SRAMstart & 0xFFFF
    loadi.k.1 START, SRAMstart >> 16
    loadi.z.0 DXSTART, DERxstart & 0xFFFF
    loadi.k.1 DXSTART, DERxstart >> 16
    loadi.z.0 DYSTART, DERystart & 0xFFFF
    loadi.k.1 DYSTART, DERystart >> 16

    loadi.z.0 R,0
    loadi.z.0 POS, 0x0000

rloop:
    loadi.z.0 C,0

cloop:
    // load data in 4X4 blocks to reduce memory accesses

    // calculation of y derivative
    cmpi.eq R, 0x00, P1, P2
    loadx.8 L1, START, POS

    P1 ori L0, L1, 0
    P2 subi Rtemp1, POS, TWOCOLS
    P2 loadx.8 L0, START, Rtemp1

    addi Rtemp1, POS, TWOCOLS
    loadx.8 L2, START, Rtemp1
    addi Rtemp1, Rtemp1, TWOCOLS
    loadx.8 L3, START, Rtemp1
    addi Rtemp1, Rtemp1, TWOCOLS
    loadx.8 L4, START, Rtemp1
    addi Rtemp1, Rtemp1, TWOCOLS
    cmpi.eq R, 0x18, P1, P2

    P2 loadx.8 L5, START, Rtemp1
    P1 ori L5, L4, 0

    psub.2 DELY, L2, L0
    padd.8 Rtemp2, DYSTART, POS
    store.8 DELY, Rtemp2, 0
    addi Rtemp2, Rtemp2, TWOCOLS
    psub.2 DELY, L3, L1
    store.8 DELY, Rtemp2, 0
    addi Rtemp2, Rtemp2, TWOCOLS
    psub.2 DELY, L4, L2
    store.8 DELY, Rtemp2, 0
    addi Rtemp2, Rtemp2, TWOCOLS
    psub.2 DELY, L5, L3
    store.8 DELY, Rtemp2, 0

    // transpose data using matrix transpose operation
    // for calculation of x derivative

    trans4x4 L1, L2, L3, L4
    psub.2 M1, L2, L1
    psub.2 M2, L3, L1
    psub.2 M3, L4, L2
    psub.2 M4, L4, L3

    trans4x4 M1, M2, M3, M4

    padd.8 Rtemp2, DXSTART, POS
    store.8 M4, Rtemp2, 0
    addi Rtemp2, Rtemp2, TWOCOLS
    store.8 M3, Rtemp2, 0
    addi Rtemp2, Rtemp2, TWOCOLS
    store.8 M2, Rtemp2, 0
    addi Rtemp2, Rtemp2, TWOCOLS
    store.8 M1, Rtemp2, 0

    addi POS, POS, 0x8
    addi C, C, 0x1
    cmpi.eq C, 0x19, P1, P2 // C = 1: 24, C < 25 (0x19) (c=1:96)
    P2 jmp cloop

    addi R, R, 0x1
    loadi.z.0 Rtemp1, EIGHTCOLS
    pmul.even POS, R, Rtemp1 // pos = r * cols * 8

    cmpi.eq R, 0x19, P1, P2 // R = 0:24, R < 25 (100 rows in 4 blocks each)
    P2 jmp rloop

    stop

Sequential PLX code

// sequential PLX code written without exploiting
// subword parallelism to represent the C code section
// for performance comparison purposes

///////consts
#ifndef CPUDATAWIDTH
#define CPUDATAWIDTH 64 //64-bit
#endif

#define stop trap 0FFFFh

#define SRAMstart 0x10000 // start reading pixel values from this location
#define DERxstart 0x50000 // start storing x derivatives from this location
#define DERystart 0x90000 // start storing y derivatives from this location
#define ROWS 0x64 // 100 rows
#define COLS 0x64 // 100 columns
#define TWOCOLS 0xC8 // 200

////////variables
#define R R10 // row counter
#define C R11 // column counter
#define POS R12
#define FOUR R9
#define COL R13
#define DELX R14
#define DELY R19
#define FIRST R15
#define SECOND R16
#define SECOND1 R4
#define START R17
#define DXSTART R18
#define DYSTART R20
#define Rtemp1 R1
#define Rtemp2 R2
#define Rtemp3 R3

////////

mov macro Rd,Rs
    ori Rd,Rs,0
endm

main proc

    loadi.z.0 FOUR, 0x4
    loadi.z.0 COL, COLS
    loadi.z.0 START, SRAMstart & 0xFFFF
    loadi.k.1 START, SRAMstart >> 16
    loadi.z.0 DXSTART, DERxstart & 0xFFFF
    loadi.k.1 DXSTART, DERxstart >> 16
    loadi.z.0 DYSTART, DERystart & 0xFFFF
    loadi.k.1 DYSTART, DERystart >> 16

    // starting derivative calculation in the x direction

    loadi.z.0 R,0

rloop:
    pmul.even POS, R, COL // pos = r * cols * 4
    pmul.even POS, POS, FOUR
    loadx.4 SECOND, START, POS
    addi POS, POS, 0x4
    loadx.4 FIRST, START, POS
    psub.2 DELX, FIRST, SECOND
    store.2.update DELX, DXSTART, 0x2
    addi POS, POS, 0x4

    loadi.z.0 C, 0x1

cloop:
    cmpi.eq C, 0x1, P1, P2
    P2 ori SECOND, SECOND1, 0

    mov SECOND1, FIRST
    loadx.4 FIRST, START, POS
    addi POS, POS, 0x4

    psub.2 DELX, FIRST, SECOND
    store.2.update DELX, DXSTART, 0x2

    addi C, C, 0x1
    cmpi.eq C, 0x62, P1, P2 // C = 1: 98, C < 99
    P2 jmp cloop

    psub.2 DELX, FIRST, SECOND1
    store.2.update DELX, DXSTART, 0x2

    addi R, R, 0x1
    cmpi.eq R, 0x64, P1, P2 // R = 0:99, R < 100 (0x64)
    P2 jmp rloop

    // starting derivative calculation in the y direction
dery:
    loadi.z.0 C,0

cyloop:
    mov POS, C
    pmul.even POS, POS, FOUR
    loadx.4 SECOND, START, POS
    padd.2 Rtemp2, DYSTART, POS
    addi POS, POS, 0x190
    loadx.4 FIRST, START, POS
    psub.2 DELY, FIRST, SECOND
    store.4 DELY, Rtemp2, 0
    addi POS, POS, 0x190

    loadi.z.0 R, 0x1

ryloop:
    cmpi.eq R, 0x1, P1, P2
    P2 ori SECOND, SECOND1, 0

    mov SECOND1, FIRST
    loadx.4 FIRST, START, POS
    psub.2 DELY, FIRST, SECOND
    subi Rtemp2, POS, 0x190
    padd.2 Rtemp2, Rtemp2, DYSTART
    store.4 DELY, Rtemp2, 0
    addi POS, POS, 0x190

    addi R, R, 0x1
    cmpi.eq R, 0x62, P1, P2 // R = 1: 98 , R < 99
    P2 jmp ryloop

    psub.2 DELY, FIRST, SECOND1
    addi Rtemp2, Rtemp2, 0x190
    store.4 DELY, Rtemp2, 0 // R = 99

    addi C, C, 0x1
    cmpi.eq C, 0x64, P1, P2 // C = 0:99 , C < 100 (0x64)
    P2 jmp cyloop

    stop
