ECE 734 Final Project Report
Acceleration of motion estimation by
edge detection algorithm using PLX
subword parallel ISA
Submitted by Sanghamitra Roy and Dongkeun Oh
Table of contents
1 Introduction and motivation
2 Overview of edge detection algorithm
3 Structure of Canny’s algorithm and its hardware implementation
4 PLX subword parallel architecture: Overview
5 Algorithm for hardware implementation
6 Initial approach for hardware implementation
7 Memory efficient hardware implementation
8 Experimental results
9 Conclusion and future work
10 References
11 Appendix
1. Introduction and motivation
With the rapid increase in the amount of multimedia information over the internet,
there has been a remarkable rise in the demand for video-driven applications such as
teleconferencing, videophone, and image-based multimedia services. Thus, the amount of
video information to be transmitted over the network has increased, although the
transmission rate of the network has not increased at the same pace. Hence, low bit-rate
video coding techniques have become necessary to ease these bottlenecks.
The low bit-rate video coding algorithms can be divided into two categories. The
first category consists of block-based algorithms such as H.261, H.263, MPEG-1, and
MPEG-2. These algorithms are easy to implement and maintain a relatively good image
quality at low bit rates. However, at very low bit rates, less than 28.8kbps, blocking and
mosquito artifacts become visible and the reconstructed image quality becomes degraded.
This is the reason why this strategy is not employed in MPEG-4. The second category is
object or segmentation based coding. Many techniques for object based coding at very
low bit rates have already been proposed. Object based coding achieves high
compression rate by subdividing an image into a number of arbitrarily shaped objects and
the background, and by performing motion estimation of objects. The greatest advantage
of this method is the ability to perform accurate motion estimation of moving objects and
utilize the available bit rate efficiently, by focusing on moving objects. Therefore, the
quality of images produced by this method varies dramatically depending on the quality
of object segmentation.
The object oriented approach supports high quality resolution for each individual
object. The accurate motion representation of the object is the key to good motion
compensation for coding purposes as well as for image format conversion. However,
most of the object-based coding approaches are computationally expensive. Object
segmentation and recognition are also primary steps of computer vision. Object
recognition is used in many areas such as traffic monitoring and robot vision. While a
single image provides a snapshot of a scene, different frames of a video taken over time
represent the dynamics in the scene, making it possible to capture the motion in the
sequence. Recognition of a moving object must be performed in real time, which
requires high performance image processors.
Edge detection, or object segmentation, is a crucial part of object recognition.
Edge features, which are recognized as an important aspect of human visual perception,
are commonly used in shape analysis. Decomposing an image into low-frequency
blocks and blocks containing visually important features such as edges or lines
requires analysis of the visual continuity of the image.
The objective of this project is to improve the computational power of an image
processor by accelerating the edge detection algorithm. We propose to enhance the
performance of the edge detection algorithm using sub-word parallelism, and implement
this algorithm using PLX subword parallel ISA.
2. Overview of Edge Detection algorithm
An edge in an image corresponds to a discontinuity in the intensity surface of the
underlying scene: a jump in intensity from one pixel to the next. Edge detection
significantly reduces the amount of data and filters out useless information, while
preserving the important structural properties of an image. There are many ways to
perform edge detection. However, the majority of different methods may be grouped into
two categories, gradient and Laplacian. The gradient method detects the edges by looking
for the maximum and minimum in the first derivative of the image. The Laplacian
method searches for zero crossings in the second derivative of the image to find edges.
An edge has the one-dimensional shape of a ramp. Calculating the derivative of the
image can highlight its location. Suppose we have the following signal, with an edge
shown by the jump in intensity below in figure 1:
Figure 1: Intensity profile of pixels in 1D line
If we take the gradient of this signal (which, in one dimension, is just the first
derivative with respect to t) we get the following as shown in figure 2:
Figure 2: 1st Derivative of pixel intensity
The derivative shows a maximum located at the center of the edge in the original
signal. This method of locating an edge is characteristic of the gradient family of edge
detection filters. A pixel location is declared an edge location if the value of the gradient
exceeds some threshold, because pixels on an edge have larger gradient
values than those surrounding them. So once a threshold is set, we can compare the gradient
value to the threshold value and detect an edge whenever the threshold is exceeded.
Furthermore, when the first derivative is at a maximum, the second derivative is zero. As
a result, another alternative to finding the location of an edge is to locate the zeros in the
second derivative. This method is known as the Laplacian and the second derivative of
the signal is shown in figure 3:
Figure 3: 2nd Derivative of pixel intensity
Based on this one-dimensional analysis, the theory can be carried over to two dimensions
as long as there is an accurate approximation for the derivative of a
two-dimensional image.
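The one-dimensional discussion above can be sketched in C as follows. This is only an illustrative sketch: the function names, the central-difference stencils, and the handling of the end samples (set to zero) are assumptions made for the example and are not taken from the project code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* First derivative by central difference: d1[i] = s[i+1] - s[i-1].
   The two end samples are simply set to 0 in this sketch. */
void first_difference(const int *s, int *d1, size_t n)
{
    assert(n >= 3);
    d1[0] = 0;
    d1[n - 1] = 0;
    for (size_t i = 1; i + 1 < n; i++)
        d1[i] = s[i + 1] - s[i - 1];
}

/* Second derivative: d2[i] = s[i+1] - 2*s[i] + s[i-1], ends set to 0. */
void second_difference(const int *s, int *d2, size_t n)
{
    assert(n >= 3);
    d2[0] = 0;
    d2[n - 1] = 0;
    for (size_t i = 1; i + 1 < n; i++)
        d2[i] = s[i + 1] - 2 * s[i] + s[i - 1];
}

/* Gradient-style detection: index of the largest |d1[i]| that exceeds
   the threshold, or -1 if no sample exceeds it. */
long strongest_edge(const int *d1, size_t n, int threshold)
{
    long best = -1;
    int best_mag = threshold;
    for (size_t i = 0; i < n; i++) {
        int mag = abs(d1[i]);
        if (mag > best_mag) {
            best_mag = mag;
            best = (long)i;
        }
    }
    return best;
}
```

For a ramp such as {10, 10, 10, 20, 50, 80, 90, 90, 90}, the first difference peaks at the middle of the ramp, and the second difference changes sign at the same position, which is the zero crossing used by the Laplacian approach.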
The gradient of an image f(x,y) at location (x,y) is defined as the vector
$$\nabla f = \begin{bmatrix} G_x \\ G_y \end{bmatrix} = \begin{bmatrix} \partial f / \partial x \\ \partial f / \partial y \end{bmatrix} \qquad (1)$$
It is well known from vector analysis that the gradient vector points in the direction of
maximum rate of change of f at coordinates (x,y).
The magnitude of this vector
$$|\nabla f| = \sqrt{G_x^2 + G_y^2} \qquad (2)$$
is an important quantity in the edge detection process which provides the maximum rate
of increase of f(x,y) per unit distance in the direction of $\nabla f$. The direction of the gradient
vector
$$\alpha(x,y) = \tan^{-1}\left( G_y / G_x \right) \qquad (3)$$
represents the direction angle of the vector at (x,y).
Computation of the gradient of an image is based on obtaining the partial derivatives
$\partial f / \partial x$ and $\partial f / \partial y$ at every pixel location. Let us consider the 3*3 area shown in Figure
4.
z1 z2 z3
z4 z5 z6
z7 z8 z9
Figure 4: 3*3 region of an image
Figure 5: Masks based on (a) Prewitt (b) Sobel operator
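At pixel z5 of the 3*3 region, the first order derivatives $G_x = \partial f / \partial x$ and $G_y = \partial f / \partial y$ given by the Prewitt operator are
$$G_x = (z_7 + z_8 + z_9) - (z_1 + z_2 + z_3), \qquad G_y = (z_3 + z_6 + z_9) - (z_1 + z_4 + z_7) \qquad (4)$$
The magnitude in equation (2) is computationally expensive, and thus we use an
approximation based on absolute values:
$$|\nabla f| \approx |G_x| + |G_y| \qquad (5)$$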
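As a small worked example, the sketch below evaluates a Prewitt gradient at the center pixel z5 of a 3*3 region and approximates its magnitude by the sum of absolute values. It assumes the common textbook form of the Prewitt masks (bottom row minus top row for one derivative, right column minus left column for the other); the struct and function names are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Gradient components and approximate magnitude at the center pixel. */
typedef struct {
    int gx;
    int gy;
    int mag;
} prewitt_t;

/* z holds the 3*3 region in row-major order: z[0] = z1, ..., z[8] = z9.
   Assumed mask convention:
     gx = (z7 + z8 + z9) - (z1 + z2 + z3)
     gy = (z3 + z6 + z9) - (z1 + z4 + z7) */
prewitt_t prewitt_at_center(const int *z)
{
    assert(z != NULL);
    prewitt_t g;
    g.gx = (z[6] + z[7] + z[8]) - (z[0] + z[1] + z[2]);
    g.gy = (z[2] + z[5] + z[8]) - (z[0] + z[3] + z[6]);
    /* Sum of absolute values instead of the square root of squares. */
    g.mag = abs(g.gx) + abs(g.gy);
    return g;
}
```

For a region whose top two rows are 10 and whose bottom row is 50, this gives gx = 120, gy = 0, and an approximate magnitude of 120, without any multiplication or square root.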
3. Structure of Canny’s algorithm and its hardware implementation
Structure of Canny’s algorithm
We select Canny’s algorithm for the implementation of the edge detection process
because it is considered a “standard method” of edge detection. The base program
source file is available at [8]; we modify this program to produce sample data and a binary file
for PLX simulation and simulate the whole edge detection process. Some sample outputs
of edge detection using our C code are shown in figure 7.
The program consists of 5 main function modules. Their function names are
gaussian_smooth, derivative_x_y, magnitude_x_y, non_max_supp, and apply_hysteresis.
In the first stage, it performs linear filtering with a Gaussian kernel to smooth the noise in
the image. Here, the pixel color data is converted into grayscale value. In the second
stage, computation of the edge strength and direction for each pixel in the smoothed
image is performed. This is done by differentiating the image in two orthogonal
directions and computing the partial derivatives. The third stage calculates the gradient
magnitude, nominally the root sum of squares of the derivatives, and the gradient direction,
computed as the arctangent of the ratio of the derivatives. In our implementation we
approximate the magnitude by the sum of the absolute values of the two derivatives.
Figure 6: Derivative masks for Canny’s algorithm
Figure 7: Original images and edge detected images using our C code
Figure 8: Block diagram of Canny’s algorithm program using AQtime profiler
In the fourth stage, candidate edge pixels are identified as the pixels that survive a
thinning process called non-maximal suppression. In this process, the edge strength of
each candidate edge pixel is set to zero if its edge strength is not larger than the edge
strength of the two adjacent pixels in the gradient direction. Thresholding is then done on
the thinned edge magnitude image using hysteresis. In the fifth stage of hysteresis, two
edge strength thresholds are used. All candidate edge pixels below the lower threshold
are labeled as non-edges and all pixels above the lower threshold that can be connected to
any pixel above the higher threshold through a chain of edge pixels are labeled as edge
pixels.
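The hysteresis stage described above can be sketched as a seeded region growing. The code below is a minimal illustration and not the apply_hysteresis routine of the base program: the function name, the use of 8-connectivity, the strict comparisons against both thresholds, and the explicit work stack are all assumptions made for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* mag:  rows*cols thinned edge strengths in row-major order.
   edge: output map, 1 = edge pixel, 0 = non-edge pixel.
   Returns 0 on success, -1 if the work stack cannot be allocated. */
int hysteresis(const int *mag, unsigned char *edge,
               int rows, int cols, int low, int high)
{
    assert(rows > 0 && cols > 0 && low <= high);
    int n = rows * cols;
    int *stack = malloc((size_t)n * sizeof *stack);
    if (stack == NULL)
        return -1;
    int top = 0;
    memset(edge, 0, (size_t)n);

    /* Seed with every pixel above the higher threshold. */
    for (int i = 0; i < n; i++) {
        if (mag[i] > high) {
            edge[i] = 1;
            stack[top++] = i;
        }
    }

    /* Grow through 8-connected neighbours above the lower threshold.
       Each pixel is pushed at most once, so n stack slots suffice. */
    while (top > 0) {
        int p = stack[--top];
        int r = p / cols;
        int c = p % cols;
        for (int dr = -1; dr <= 1; dr++) {
            for (int dc = -1; dc <= 1; dc++) {
                int rr = r + dr;
                int cc = c + dc;
                if (rr < 0 || rr >= rows || cc < 0 || cc >= cols)
                    continue;
                int q = rr * cols + cc;
                if (!edge[q] && mag[q] > low) {
                    edge[q] = 1;
                    stack[top++] = q;
                }
            }
        }
    }
    free(stack);
    return 0;
}
```

A pixel that exceeds only the lower threshold is kept when a chain of such pixels links it to a pixel above the higher threshold, and is discarded otherwise.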
Hardware implementation and Project purpose
The acceleration of this algorithm can be done in various ways. Upon inspection
of the various routines in the C code implementation, we find that the derivative_x_y
function module is a good candidate for being converted into PLX ISA using subword
parallelism. Before proceeding to the hardware implementation of the derivative part of
(x,y) pixels, we briefly review the overall run time of the whole edge detection algorithm.
Figure 9: Snapshot of the profile of Canny’s program produced by AQtime
Figure 10: Run time weight of each program module in Canny’s Edge detection program
Figure 10 shows the runtime profile of Canny’s program produced by the AQtime profiler. It
shows each module's share of the total run time of the algorithm. From
figures 9 and 10, it can be observed that the Gaussian smoothing stage occupies about half of
the execution time of the program. This is because it uses many multiplication and
division operations for each pixel. Therefore, Gaussian smoothing
could be accelerated by using high performance multiplication and division modules. The PLX ISA
does not directly support division, so that hardware implementation goes
beyond the scope of this project. The second stage of Canny’s algorithm, i.e., the
computation of the derivatives, is a good candidate for hardware acceleration using
subword parallelism because its loop structure is symmetric and it can be easily
parallelized with the PLX subword ISA. This derivative calculation step is also common to all
edge detection algorithms, so speeding it up can benefit other edge detection
schemes too. The other three function modules could be accelerated using the same
approach. Thus, we concentrate on the hardware implementation of this part and
leave implementation of the other parts for future work.
4. PLX subword parallel architecture: Overview
We give a very brief overview of the PLX architecture in this section. We avoid
going into too much detail as this is already a part of our course. PLX is a small subword
parallel instruction set architecture developed at Princeton University. It uses SIMD-type
instructions for parallel operation and faster performance. PLX supports 1, 2, 4 or 8 byte
subwords. It has 32 general purpose registers, which may be 32, 64 or 128 bits wide. This
word size scalability allows design tradeoffs between performance and cost. PLX also
supports predicated execution. Reading or writing data from the memory requires
aligned memory addresses (4/8/16 byte boundaries).
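To make the idea of subword parallelism concrete, the sketch below emulates in portable C what a packed subtract on four 2-byte subwords of a 64-bit register does. It is an illustration only and does not model PLX encodings or any saturation options; the function names and the choice of lane 0 as the least significant subword are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four 16-bit values into one 64-bit word.
   Assumption: lane 0 occupies the least significant 16 bits. */
uint64_t pack4(const int16_t *v)
{
    uint64_t w = 0;
    for (int i = 0; i < 4; i++)
        w |= (uint64_t)(uint16_t)v[i] << (16 * i);
    return w;
}

/* Lane-wise modular subtraction a - b on four independent 16-bit
   subwords. Borrows never cross a lane boundary. */
uint64_t psub16x4(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t la = (uint16_t)(a >> (16 * i));
        uint16_t lb = (uint16_t)(b >> (16 * i));
        uint16_t d = (uint16_t)(la - lb);
        r |= (uint64_t)d << (16 * i);
    }
    return r;
}

/* Extract lane i as a signed 16-bit value (relies on the usual
   two's complement conversion, which is implementation-defined
   before C23 but universal in practice). */
int16_t lane(uint64_t w, int i)
{
    assert(i >= 0 && i < 4);
    return (int16_t)(uint16_t)(w >> (16 * i));
}
```

A hardware packed subtract produces all four lane results in a single instruction, whereas the loop here computes them one after another; the point of the sketch is only the lane-wise semantics.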
5. Algorithm for hardware implementation
We picked the partial derivative estimation algorithm for hardware
implementation and optimization using the PLX instruction set architecture. This
algorithm calculates the partial x-derivative and partial y-derivative for all pixel values of
the image. The algorithm contains basic addition and subtraction operations using the
derivative mask of figure 6. The C code snippet for this algorithm is shown below.
Computing x-derivative
for(r=0; r < rows; r++){
    pos = r * cols;
    del_x[pos] = s[pos + 1] - s[pos];
    pos++;
    for(c = 1; c < (cols - 1); c++, pos++) {
        del_x[pos] = s[pos + 1] - s[pos - 1];
    }
    del_x[pos] = s[pos] - s[pos - 1];
}
Computing y-derivative
for(c=0; c < cols; c++){
    pos = c;
    del_y[pos] = s[pos + cols] - s[pos];
    pos += cols;
    for(r = 1; r < (rows - 1); r++, pos += cols) {
        del_y[pos] = s[pos + cols] - s[pos - cols];
    }
    del_y[pos] = s[pos] - s[pos - cols];
}
As can be seen, the algorithm is mostly symmetric, except for the first and last
calculations of every row and column. We have the following objectives for
implementing this algorithm using the PLX subword parallel architecture:
i) Explore parallelism in the algorithm by loop unrolling
ii) Minimize memory accesses to reduce execution time bottlenecks
iii) Design memory access and data fetching to maximize cache hits
iv) Optimize the PLX code for faster performance
v) Verify accuracy of the PLX implementation with the C code
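For reference when checking accuracy, the two loops of the derivative snippet can be wrapped into a self-contained function. This is a sketch under stated assumptions: the function signature, the use of size_t indices, and the requirement of at least two rows and two columns are choices made for this example, while the arithmetic follows the snippet.

```c
#include <assert.h>
#include <stddef.h>

/* s:     rows*cols smoothed pixels in row-major order.
   del_x: output, partial x-derivative for every pixel.
   del_y: output, partial y-derivative for every pixel. */
void derivative_x_y(const short *s, size_t rows, size_t cols,
                    short *del_x, short *del_y)
{
    assert(rows >= 2 && cols >= 2);

    /* x-derivative: one-sided difference at both ends of each row,
       central difference for the pixels in between. */
    for (size_t r = 0; r < rows; r++) {
        size_t pos = r * cols;
        del_x[pos] = (short)(s[pos + 1] - s[pos]);
        pos++;
        for (size_t c = 1; c < cols - 1; c++, pos++)
            del_x[pos] = (short)(s[pos + 1] - s[pos - 1]);
        del_x[pos] = (short)(s[pos] - s[pos - 1]);
    }

    /* y-derivative: same scheme applied down each column. */
    for (size_t c = 0; c < cols; c++) {
        size_t pos = c;
        del_y[pos] = (short)(s[pos + cols] - s[pos]);
        pos += cols;
        for (size_t r = 1; r < rows - 1; r++, pos += cols)
            del_y[pos] = (short)(s[pos + cols] - s[pos - cols]);
        del_y[pos] = (short)(s[pos] - s[pos - cols]);
    }
}
```

Running such a reference on a small image and comparing its output element by element with another implementation is one simple way to carry out objective v).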
6. Initial approach for hardware implementation
Our algorithms have been implemented for 100*100 pixel images having a total
of 10000 pixels. At first we implement the above algorithm by unfolding the symmetric
inner loops as shown below:
Loop unfolded x-derivative
for(r=0; r < 100; r++){
    pos = r * 100;
    del_x[pos] = s[pos + 1] - s[pos];
    pos++;
    for(c = 1; c < 25; c++, pos += 4) {
        del_x[pos] = s[pos + 1] - s[pos - 1];
        del_x[pos + 1] = s[pos + 2] - s[pos];
        del_x[pos + 2] = s[pos + 3] - s[pos + 1];
        del_x[pos + 3] = s[pos + 4] - s[pos + 2];
    }
    ……
    del_x[pos] = s[pos] - s[pos - 1];
}
Loop unfolded y-derivative
for(c=0; c < 100; c++){
    pos = c;
    del_y[pos] = s[pos + 100] - s[pos];
    pos += 100;
    for(r = 1; r < 25; r++, pos += 400) {
        del_y[pos] = s[pos + 100] - s[pos - 100];
        del_y[pos + 100] = s[pos + 200] - s[pos];
        del_y[pos + 200] = s[pos + 300] - s[pos + 100];
        del_y[pos + 300] = s[pos + 400] - s[pos + 200];
    }
    ……
    del_y[pos] = s[pos] - s[pos - 100];
}
The data is stored as 2 bytes for each pixel, in a sequential row-wise order in the
memory. The arrangement of the data in memory is shown in figure 11. So, pixels 0-99
are stored sequentially as 2 bytes each, followed by pixels 100-199 and so on. As
observed from the code, the first and last derivatives of each row and column are
calculated using different masks than the intermediate derivatives. So, in this initial
implementation we use subword parallel operations for intermediate pixels {1, 2, 3, 4},
{5, 6, 7, 8} …. for the rows and {100, 200, 300, 400}, {101, 201, 301, 401} …. for the
columns. Thus, if we want to load pixels 1,2,3,4 in a register as shown in figure 11, we
need two loads from the memory as these pixels are not aligned with the 8 byte address
boundaries. Also for loading the column-wise data {100, 200, 300, 400} in a single
register, we face a problem as the data is arranged in row-wise order in the memory.
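The alignment problem for the row-wise groups can be checked with a little address arithmetic. The sketch below assumes 2 bytes per pixel and 8-byte loads, as in the text, and additionally assumes that pixel 0 sits at an 8-byte aligned address; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { BYTES_PER_PIXEL = 2, LOAD_BYTES = 8 };

/* Number of aligned 8-byte loads needed to fetch `count` consecutive
   pixels starting at pixel index `first`, assuming pixel 0 is stored
   at an 8-byte aligned address. */
size_t aligned_loads_needed(size_t first, size_t count)
{
    assert(count > 0);
    size_t first_byte = first * BYTES_PER_PIXEL;
    size_t last_byte = (first + count) * BYTES_PER_PIXEL - 1;
    /* Count the aligned 8-byte words touched by the byte range. */
    return last_byte / LOAD_BYTES - first_byte / LOAD_BYTES + 1;
}
```

Pixels {1, 2, 3, 4} occupy bytes 2 to 9 and therefore straddle two aligned words, whereas pixels {0, 1, 2, 3} or {4, 5, 6, 7} fit in exactly one.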
Figure 11: Memory mapping for initial algorithm
The initial implementation has helped us to learn the basics of coding an
algorithm in PLX. We also verify our results with the C code by interfacing our PLX
algorithm with the C code. We calculate the speedup of our PLX implementation with
respect to the Intel x86 architecture. The following are the major issues we face during
our initial implementation.
i) Interfacing data with C code: we use short integer representation of
pixels in C, each of which requires 2 bytes. We use fread/fwrite to
read/write binary data from the C code to the PLX code.
ii) Loops are implemented using predicated jump instruction in PLX
iii) Load alignment problem: In PLX, data must be loaded from an address that is a
multiple of 4 or 8 bytes to avoid a trap. As we have seen, our data is
slightly misaligned with respect to this requirement.
The initial implementation gives a 2X speedup in terms of the number of machine
cycles, with respect to the C code. But given the potential of subword parallelism, there is
scope for a lot of improvement. We can design better memory management schemes for
our algorithm to improve the efficiency of our algorithm and cache performance. In the
next section we describe our final hardware implementation optimized for better memory
management and faster performance.
7. Memory efficient hardware implementation
In our new implementation, we perform both the x and y-derivative calculation in
the same loop. Thus we merge the two nested loops and then perform loop unfolding.
Also we perform 16*2 = 32 derivative calculations in a single iteration. The unfolded
algorithm code is shown below. Note that we have used vector representation to simplify
our code. Thus x[a: b: c] means {x[a], x[a + b], x[a + 2b], ….., x[c]}. And the notation
x[a: d] means {x[a], x[a + 1], x[a + 2], ….., x[d]}.
C code optimized for subword parallel hardware implementation
for(r=0; r < 25; r++){
    pos = r * 100*4;
    for(c = 0; c < 25; c++, pos += 4) {
        if(c == 0) {
            del_x[pos: 100: pos + 300] = s[pos + 1: 100: pos + 301] - s[pos: 100: pos + 300];
        } else {
            del_x[pos: 100: pos + 300] = s[pos + 1: 100: pos + 301] - s[pos - 1: 100: pos + 299];
        }
        del_x[pos + 1: 100: pos + 301] = s[pos + 2: 100: pos + 302] - s[pos: 100: pos + 300];
        del_x[pos + 2: 100: pos + 302] = s[pos + 3: 100: pos + 303] - s[pos + 1: 100: pos + 301];
        del_x[pos + 3: 100: pos + 303] = s[pos + 4: 100: pos + 304] - s[pos + 2: 100: pos + 302];
        if(r == 0) {
            del_y[pos: pos + 3] = s[pos + 100: pos + 103] - s[pos: pos + 3];
        } else {
            del_y[pos: pos + 3] = s[pos + 100: pos + 103] - s[pos - 100: pos - 97];
        }
        del_y[pos + 100: pos + 103] = s[pos + 200: pos + 203] - s[pos: pos + 3];
        del_y[pos + 200: pos + 203] = s[pos + 300: pos + 303] - s[pos + 100: pos + 103];
        del_y[pos + 300: pos + 303] = s[pos + 400: pos + 403] - s[pos + 200: pos + 203];
    }
}
Using the above code we can achieve better memory management for our
algorithm. As shown in figure 12, we can load 4x4 blocks of pixel data from the
memory in four registers. We can calculate the y-derivative for those 16 pixels using
subword parallel operations. Then we perform matrix transpose operation in PLX to get
the data in column-wise order. So now, we can perform the x-derivative calculation for
all the 16 pixels. As the boundary pixels have different derivative mask, we parallelize
them along with the intermediate pixels, by using predicated execution. For details of this
implementation, please refer to our optimized PLX code in the Appendix. Since the
memory load is always from 8 byte aligned addresses, we can minimize memory access
time and have a better cache performance. For some boundary data needed between two
iterations, we perform local data communication, instead of reloading the data from the
memory.
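The role of the transpose can be illustrated with a scalar C sketch. Once a 4x4 block is transposed, each row of the transposed block holds one column of the original block, so the x-derivative becomes a subtraction between whole rows, which is the form a packed subtract works on. The sketch covers only the two interior columns of one block and ignores the boundary handling and inter-block communication described above; all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* out = transpose of in, for one 4x4 block of 16-bit pixels. */
void transpose4x4(int16_t in[4][4], int16_t out[4][4])
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            out[c][r] = in[r][c];
}

/* d = a - b over four lanes; stands in for one packed subtract. */
void sub_rows(const int16_t *a, const int16_t *b, int16_t *d)
{
    for (int i = 0; i < 4; i++)
        d[i] = (int16_t)(a[i] - b[i]);
}

/* Central-difference x-derivative of the two interior columns of a
   4x4 block. dx[0][r] is the value at (row r, column 1) and
   dx[1][r] is the value at (row r, column 2). */
void dx_interior_columns(int16_t block[4][4], int16_t dx[2][4])
{
    assert(block != NULL && dx != NULL);
    int16_t t[4][4];
    transpose4x4(block, t);
    sub_rows(t[2], t[0], dx[0]); /* column 2 minus column 0 */
    sub_rows(t[3], t[1], dx[1]); /* column 3 minus column 1 */
}
```

Each call to sub_rows produces the derivative for four pixels of one column at once, so the results come out in column order and would need a second transpose before being stored row-wise.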
Figure 12: Memory mapping for optimized algorithm
8. Experimental results
Simulation Verification
Our first task is to verify the results of the PLX code with the results of the C
code. To do this, we dump the C results for derivative of pixels in binary format, and read
the data within PLX code and verify with the PLX generated data. In both our initial and
final implementation, the results of the PLX implementation exactly matched the results
of the C implementation. In figure 13, we show a snapshot of the PLX results and the
corresponding C results for y-derivative calculation. {-72, -48, -15, 11} are 8 bytes of
data for y-derivative. In hex they are {FFB8, FFD0, FFF1, 000B} which matches exactly
with the plx result in register R19 (in reverse order).
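The hex values quoted above follow from the 16-bit two's complement representation, which can be reproduced with a few lines of C. The sketch assumes that the first value in memory lands in the least significant subword of the 64-bit register, which would explain why the register reads in reverse order; that lane ordering and the function names are assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16-bit two's complement bit pattern of a signed value. */
uint16_t subword_bits(int16_t v)
{
    return (uint16_t)v;
}

/* 64-bit register image of four 16-bit values, with v[0] placed in
   the least significant subword (assumed lane order). */
uint64_t register_image(const int16_t *v)
{
    assert(v != NULL);
    uint64_t w = 0;
    for (int i = 0; i < 4; i++)
        w |= (uint64_t)subword_bits(v[i]) << (16 * i);
    return w;
}
```

With the values {-72, -48, -15, 11}, the register image reads 000B FFF1 FFD0 FFB8 from the most significant subword down, i.e. the four hex values in reverse order.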
Figure 13: Result verification snapshot for y derivative calculation
Calculation of performance speedup
Next we measure the performance speedup of our optimized algorithm. We
calculate the speedup against two different baselines. First, we profile the C code using
the AQtime 4.91 tool for Visual Studio .NET. Using this tool we can generate Intel x86
assembly instructions and count the machine cycles for our algorithm. Second, we
write a sequential version of our algorithm in PLX without using subword parallelism. The
sequential PLX code is written to mirror the sequential C code, so that it can be
compared with our optimized code using the PLX timing simulator. This
code is also provided in the Appendix. Table 2 shows the machine cycles required
for the Intel x86, sequential PLX and optimized PLX codes. From Table 2, our optimized PLX code
shows a speedup of about 7.05X (397269/56313) over the x86 code and 8.13X (457881/56313) over the sequential PLX code.
Thus we significantly speed up the partial derivative routine using a memory
efficient subword parallel implementation. We also measured the speedup of the whole
Canny’s algorithm using our PLX optimized derivative routine. We obtained a 3.8% speedup for
the entire algorithm (including all 5 routines) by optimizing the derivative calculation
routine using subword parallel implementation.
Routine Name        % Time (cycle)   Time (cycle)   Time with Children
magnitude_x_y       17.01%           1578363        2732808
derivative_x_y      4.28%            397269         849484
gaussian_smooth     46.58%           4321857        5093960
apply_hysteresis    17.72%           1644230        1727338
non_max_supp        14.40%           1336618        1336618
calloc              0.00%            191            276331
SUM                                  9278337
Table 1: Performance of each function module in C code
Platform          Intel x86   sequential PLX   subword parallel PLX
machine cycles    397269      457881           56313
lines of code     768         63               75
Table 2: Performance comparison for the derivative function module
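$$\text{Speedup} = \frac{\text{machine cycles of the baseline code}}{\text{machine cycles of the subword parallel PLX code}}$$
With the values of Table 2, this gives 397269 / 56313 ≈ 7.05X for the Intel x86 baseline and 457881 / 56313 ≈ 8.13X for the sequential PLX baseline.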
9. Conclusion and future work
With the advancement of technology, there has been an increasing demand for
high performance multimedia algorithms. Very low bit rate video coding techniques use
object recognition for image compression. Edge detection is a crucial part of object
recognition, and thus we optimize Canny’s edge detection algorithm using the PLX
subword parallel ISA. We choose the partial derivative calculation stage of this algorithm,
which is a crucial stage in every edge detection algorithm, and develop an efficient
implementation of this stage in PLX. We initially unfold the symmetric part of the
loop code and find that symmetric unrolling yields only a small speedup because of the memory
misalignment problem. We then restructure our nested loops and unroll them to improve
memory management for the subword parallel implementation. We verify our results against
the C code implementation and achieve a 7-8X performance speedup over the sequential
implementations. The techniques in this project have been used specifically to optimize
the second stage of Canny’s algorithm. We believe that if this method can be generalized
into a memory efficient PLX implementation for arbitrary nested loops, then the generalized
algorithm can be applied to develop a full PLX implementation of Canny’s edge detection
algorithm or of other edge detection algorithms. We leave this part for future work.
We could also improve the overall performance of the edge detection process by
employing a more sophisticated pattern matching algorithm.
10. References
[1] WanCheol Kim, et al., “Efficient tracking of a Moving Object using Optimal Representative Blocks”, Proceedings of the 2003 IEEE/ASME.
[2] Shyi-Chyi Cheng, “Visual Pattern Matching in Motion Estimation for Object-Based Very Low Bit-Rate Coding Using Moment-Preserving Edge Detection”, IEEE Transactions on Multimedia, Vol. 7, No. 2, April 2005.
[3] P. Yahampath, et al., “Detection of Moving Objects in Facial Image Sequences”, IEEE, 1998.
[4] Ruby B. Lee and A. Murat Fiskiran, “PLX: A Fully Subword-Parallel Instruction Set Architecture for Fast Scalable Multimedia Processing”, Proceedings of the 2002 IEEE International Conference on Multimedia and Expo (ICME 2002), pp. 117-120, August 2002.
[5] Vishvjit S. Nalwa and Thomas O. Binford, “On Detecting Edges”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-8, No. 6, Nov. 1986.
[6] John Canny, “A computational approach to edge detection”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-8, No. 6, Nov. 1986.
[7] Sheu-Chih Cheng and Hsueh-Ming Hang, “A Comparison of Block-Matching Algorithms Mapped to Systolic-Array Implementation”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 7, No. 5, Oct. 1997.
[8] M. Heath, et al., “Edge Detection Comparison”, http://marathon.csee.usf.edu/edge/edge_detection.html (C code).
[9] M. Heath, et al., “A Robust Visual Method for Assessing the Relative Performance of Edge-Detection Algorithms”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 12, Dec. 1997, pp. 1338-1359.
11. Appendix
Optimized PLX code
// Optimized PLX code using efficient memory mapping techniques
// for reduced memory accesses and faster performance
/////// consts
#ifndef CPUDATAWIDTH
#define CPUDATAWIDTH 64 //64-bit
#endif
#define stop trap 0FFFFh
#define SRAMstart 0x10000 // start reading pixel values from this location
#define DERxstart 0x30000 // start storing x derivatives from this location
#define DERystart 0x50000 // start storing y derivatives from this location
#define ROWS 0x64 // 100 rows
#define COLS 0x64 // 100 columns
#define TWOCOLS 0xC8 // 200
#define EIGHTCOLS 0x320 //800
#define SIXCOLS 0x258 //600
//////// variables
#define R R10 // row counter
#define C R11 // column counter
#define POS R12
#define COL R13
#define DELX R14
#define DELY R19
#define START R17
#define DXSTART R18
#define DYSTART R20
#define Rtemp1 R21
#define Rtemp2 R6
#define Rtemp3 R7
#define Rtemp4 R15
#define L0 R8
#define L1 R1
#define L2 R2
#define L3 R3
#define L4 R4
#define L5 R5
#define M1 R22
#define M2 R23
#define M3 R24
#define M4 R25
////////
mov macro Rd,Rs
    ori Rd,Rs,0
endm
trans4x4 macro L11,L22,L33,L44
    mix.2.l Rtemp1,L11,L22
    mix.2.r Rtemp2,L11,L22
    mix.2.l Rtemp3,L33,L44
    mix.2.r Rtemp4,L33,L44
    mix.4.l L44,Rtemp1,Rtemp3
    mix.4.l L33,Rtemp2,Rtemp4
    mix.4.r L22,Rtemp1,Rtemp3
    mix.4.r L11,Rtemp2,Rtemp4
endm
main proc
    loadi.z.0 COL, COLS
    loadi.z.0 START, SRAMstart & 0xFFFF
    loadi.k.1 START, SRAMstart >> 16
    loadi.z.0 DXSTART, DERxstart & 0xFFFF
    loadi.k.1 DXSTART, DERxstart >> 16
    loadi.z.0 DYSTART, DERystart & 0xFFFF
    loadi.k.1 DYSTART, DERystart >> 16
    loadi.z.0 R,0
    loadi.z.0 POS, 0x0000
rloop:
    loadi.z.0 C,0
cloop:
    // load data in 4X4 blocks to reduce memory accesses
    // calculation of y derivative
    cmpi.eq R, 0x00, P1, P2
    loadx.8 L1, START, POS
    P1 ori L0, L1, 0
    P2 subi Rtemp1, POS, TWOCOLS
    P2 loadx.8 L0, START, Rtemp1
    addi Rtemp1, POS, TWOCOLS
    loadx.8 L2, START, Rtemp1
    addi Rtemp1, Rtemp1, TWOCOLS
    loadx.8 L3, START, Rtemp1
    addi Rtemp1, Rtemp1, TWOCOLS
    loadx.8 L4, START, Rtemp1
    addi Rtemp1, Rtemp1, TWOCOLS
    cmpi.eq R, 0x18, P1, P2
    P2 loadx.8 L5, START, Rtemp1
    P1 ori L5, L4, 0
    psub.2 DELY, L2, L0
    padd.8 Rtemp2, DYSTART, POS
    store.8 DELY, Rtemp2, 0
    addi Rtemp2, Rtemp2, TWOCOLS
    psub.2 DELY, L3, L1
    store.8 DELY, Rtemp2, 0
    addi Rtemp2, Rtemp2, TWOCOLS
    psub.2 DELY, L4, L2
    store.8 DELY, Rtemp2, 0
    addi Rtemp2, Rtemp2, TWOCOLS
    psub.2 DELY, L5, L3
    store.8 DELY, Rtemp2, 0
    // transpose data using matrix transpose operation
    // for calculation of x derivative
    trans4x4 L1, L2, L3, L4
    psub.2 M1, L2, L1
    psub.2 M2, L3, L1
    psub.2 M3, L4, L2
    psub.2 M4, L4, L3
    trans4x4 M1, M2, M3, M4
    padd.8 Rtemp2, DXSTART, POS
    store.8 M4, Rtemp2, 0
    addi Rtemp2, Rtemp2, TWOCOLS
    store.8 M3, Rtemp2, 0
    addi Rtemp2, Rtemp2, TWOCOLS
    store.8 M2, Rtemp2, 0
    addi Rtemp2, Rtemp2, TWOCOLS
    store.8 M1, Rtemp2, 0
    addi POS, POS, 0x8
    addi C, C, 0x1
    cmpi.eq C, 0x19, P1, P2 // C = 1: 24, C < 25 (0x19) (c=1:96)
    P2 jmp cloop
    addi R, R, 0x1
    loadi.z.0 Rtemp1, EIGHTCOLS
    pmul.even POS, R, Rtemp1 // pos = r * cols * 8
    cmpi.eq R, 0x19, P1, P2 // R = 0:24, R < 25 (100 rows in 4 blocks each)
    P2 jmp rloop
    stop
Sequential PLX code
// sequential PLX code written without exploiting
// subword parallelism to represent the C code section
// for performance comparison purposes
/////// consts
#ifndef CPUDATAWIDTH
#define CPUDATAWIDTH 64 //64-bit
#endif
#define stop trap 0FFFFh
#define SRAMstart 0x10000 // start reading pixel values from this location
#define DERxstart 0x50000 // start storing x derivatives from this location
#define DERystart 0x90000 // start storing y derivatives from this location
#define ROWS 0x64 // 100 rows
#define COLS 0x64 // 100 columns
#define TWOCOLS 0xC8 // 200
//////// variables
#define R R10 // row counter
#define C R11 // column counter
#define POS R12
#define FOUR R9
#define COL R13
#define DELX R14
#define DELY R19
#define FIRST R15
#define SECOND R16
#define SECOND1 R4
#define START R17
#define DXSTART R18
#define DYSTART R20
#define Rtemp1 R1
#define Rtemp2 R2
#define Rtemp3 R3
////////
mov macro Rd,Rs
    ori Rd,Rs,0
endm
main proc
    loadi.z.0 FOUR, 0x4
    loadi.z.0 COL, COLS
    loadi.z.0 START, SRAMstart & 0xFFFF
    loadi.k.1 START, SRAMstart >> 16
    loadi.z.0 DXSTART, DERxstart & 0xFFFF
    loadi.k.1 DXSTART, DERxstart >> 16
    loadi.z.0 DYSTART, DERystart & 0xFFFF
    loadi.k.1 DYSTART, DERystart >> 16
    // starting derivative calculation in the x direction
    loadi.z.0 R,0
rloop:
    pmul.even POS, R, COL // pos = r * cols * 4
    pmul.even POS, POS, FOUR
    loadx.4 SECOND, START, POS
    addi POS, POS, 0x4
    loadx.4 FIRST, START, POS
    psub.2 DELX, FIRST, SECOND
    store.2.update DELX, DXSTART, 0x2
    addi POS, POS, 0x4
    loadi.z.0 C, 0x1
cloop:
    cmpi.eq C, 0x1, P1, P2
    P2 ori SECOND, SECOND1, 0
    mov SECOND1, FIRST
    loadx.4 FIRST, START, POS
    addi POS, POS, 0x4
    psub.2 DELX, FIRST, SECOND
    store.2.update DELX, DXSTART, 0x2
    addi C, C, 0x1
    cmpi.eq C, 0x62, P1, P2 // C = 1: 98, C < 99
    P2 jmp cloop
    psub.2 DELX, FIRST, SECOND1
    store.2.update DELX, DXSTART, 0x2
    addi R, R, 0x1
    cmpi.eq R, 0x64, P1, P2 // R = 0:99, R < 100 (0x64)
    P2 jmp rloop
    // starting derivative calculation in the y direction
dery:
    loadi.z.0 C,0
cyloop:
    mov POS, C
    pmul.even POS, POS, FOUR
    loadx.4 SECOND, START, POS
    padd.2 Rtemp2, DYSTART, POS
    addi POS, POS, 0x190
    loadx.4 FIRST, START, POS
    psub.2 DELY, FIRST, SECOND
    store.4 DELY, Rtemp2, 0
    addi POS, POS, 0x190
    loadi.z.0 R, 0x1
ryloop:
    cmpi.eq R, 0x1, P1, P2
    P2 ori SECOND, SECOND1, 0
    mov SECOND1, FIRST
    loadx.4 FIRST, START, POS
    psub.2 DELY, FIRST, SECOND
    subi Rtemp2, POS, 0x190
    padd.2 Rtemp2, Rtemp2, DYSTART
    store.4 DELY, Rtemp2, 0
    addi POS, POS, 0x190
    addi R, R, 0x1
    cmpi.eq R, 0x62, P1, P2 // R = 1: 98 , R < 99
    P2 jmp ryloop
    psub.2 DELY, FIRST, SECOND1
    addi Rtemp2, Rtemp2, 0x190