Accelerating matrix product on reconfigurable hardware for image processing applications
F. Bensaali, A. Amira and A. Bouridane
Abstract: Matrix multiplication is very important in many types of applications including image and signal processing. The suitability of reconfigurable hardware devices, in the form of field programmable gate arrays (FPGAs), is investigated as a low-cost solution for implementing two matrix multipliers for 3-D affine transformations and colour space conversion. A first solution based on processing large matrix multiplication, for large 3-D models, and for the evaluation of the Celoxica fixed-point library and Xilinx CoreGen performance has been reported. A novel architecture for efficient implementation of a colour space converter (CSC) based on distributed arithmetic (DA) principles has been presented. The two multipliers have been developed and implemented on the RC1000-PP Celoxica board-based development platform. Results show that the FPGA-based first parallel multiplier can achieve the performance of a graphics card when performing 3-D affine transformations, while the second multiplier, which is fully pipelined and platform-independent, has a low latency (8 cycles) and is capable of a sustained data rate of over 234 mega-conversions per second.
1 Introduction
Matrix algorithms are commonly used in the areas of graph theory, numerical algorithms, digital control and signal processing. A close examination of these algorithms reveals that many of the fundamental actions involve matrix operations, such as matrix multiplication, which requires enormous computing power. Moreover, medical imaging, 3-D image manipulation, edge detection for object recognition and other applications, involve large matrix multiplication [1]. Multiplication of this type of matrices requires a lot of computation time as its complexity is $O(N^3)$, where N is the dimension of a square matrix. Because most current applications require higher computational throughputs, many researchers have tried to improve the performance of matrix multiplication. Even with improvements such as Strassen’s algorithm-based partitioning for sequential matrix multiplication [2], performance is limited. For this reason, parallel approaches, which have complexity $O(N^3/p)$, when using p parallel processors, have been examined for decades [1]. As part of an ongoing research project to develop a hardware accelerator for image and signal processing algorithms based on matrix computations at Queen’s University of Belfast [3–5], this paper proposes:
• A parallel matrix multiplier for 3-D affine transformations with the evaluation of the performance of the Xilinx CoreGen [6] and Celoxica fixed-point library [7] for the implementation.
• A novel architecture for colour space conversion based on matrix–vector multiplication using DA principles, which is a bit-level rearrangement of a multiply accumulate to hide the multiplications. DA distributes arithmetic operations rather than grouping them as multipliers do. Conventional DA, called ROM-based DA, decomposes the variable input of the inner product to bit-level in order to generate precomputed data. ROM-based DA uses a ROM table to store the precomputed data, which makes it regular and efficient in the use of the silicon area in a VLSI implementation. The advantage of a DA-based ROM approach is its efficiency of implementation. The basic operations required are a sequence of ROMs, addition, subtraction and shift operations of the input data sequence [8]. Examples for the use of DA can be found in [8–10].
The target hardware for the implementation and verification of the proposed architectures is a Celoxica RC1000-PP PCI-based FPGA development board equipped with a Xilinx XCV2000E Virtex FPGA [7, 11]. The structure of this paper can be split into three parts: the first part is concerned with the 3-D affine transformations, the second part is concerned with the colour space conversion, while a general conclusion is given in the third part.
2 3-D affine transformations
Computer graphics algorithms are generally computationally expensive, which is why considerable effort goes into accelerating them by any reasonable means. The traditional sources of speedup are faster processors, parallelism or dedicated hardware. Recent advances in digital circuit technology, especially the rapid development of FPGAs, offer an alternative route to acceleration. Several researchers have attempted to implement such algorithms on FPGAs. In [12] various techniques for improving cost effectiveness of graphics applications have been described. Methods for exploiting custom data formats and datapath widths, and for optimising graphics operations, such as texture mapping and hidden-surface removal, have been studied. Customised architectures have been implemented on the Xilinx 4000 and Virtex FPGAs in [12] using Handel-C, a C-like language supporting parallelism and flexible data size. Singh and Bellec [13] have shown that FPGAs can implement simple and complex graphics algorithms with a performance level that places them comfortably between custom graphics chips and general processors with specialised graphics instruction sets. A new method of hardware texture mapping in which texture images are synthesised using FPGAs was presented in [14]. The conclusion from this work was that, using FPGAs, procedural textures can be synthesised at high speed with low hardware cost. It is the aim of this work to use FPGAs as a low-cost accelerator to develop and implement a matrix multiplier for 3-D affine transformations using Handel-C and to evaluate the performance of the Xilinx CoreGen and Celoxica fixed-point library for the implementation.

The authors are with the School of Computer Science, Queen’s University of Belfast, Belfast BT7 1NN, UK
© IEE, 2005
IEE Proceedings online no. 20040838
doi:10.1049/ip-cds:20040838
Paper first received 9th December 2003 and in revised form 7th June 2004. Originally published online: 3rd June 2005
236 IEE Proc.-Circuits Devices Syst., Vol. 152, No. 3, June 2005
2.1 Review
In computer graphics the most popular method for representing an object is the polygon mesh model. In the simplest case, a polygon mesh is a structure that consists of polygons represented by a list of (x, y, z) co-ordinates that are the polygon vertices. Thus the information we store to describe an object is finally a list of points or vertices [15] (Fig. 1).
3-D affine transformations are the transformations that involve rotation, scaling, shear and translation. A matrix can represent an affine transformation and a set of affine transformations can be combined into a single overall affine transformation. Technically, it can be said that an affine transformation is made up of any combination of linear transformations (rotation, scaling and shear) followed by translation (technically, translation is not a linear transformation) [15]. A set of vertices or three-dimensional points belonging to an object can be transformed into another set of points by a linear transformation. Matrix notation is used in computer graphics to describe such transformations. Using matrix notation, a vertex V is transformed to V* (* denotes the transformed vertex) under translation, scaling and rotation, which are the most commonly used transformations in computer graphics, as

$$ V^{*} = D + V, \qquad V^{*} = S \times V, \qquad V^{*} = R \times V \tag{1} $$

where D is a translation vector, and S and R are the scaling and rotation matrices [15, 16].
A uniform representation of all transformations in matrix notation is necessary for implementing these transformations in hardware. As it is not possible to describe the translation in matrix notation in Cartesian co-ordinates, homogeneous co-ordinates have to be used. It is, however, very easy to transform Cartesian into homogeneous co-ordinates and vice versa. In a homogeneous system a vertex V(x, y, z) is represented as V(X, Y, Z, w) for any scale factor w ≠ 0. The three-dimensional Cartesian co-ordinate representation is then

$$ x = X/w, \qquad y = Y/w, \qquad z = Z/w \tag{2} $$

In computer graphics w is always taken to be one and the matrix representation of a point is $(x\ y\ z\ 1)^T$. Translation can now be treated as a matrix multiplication operation, like the other two transformations, and becomes
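
$$
\begin{pmatrix} x^{*} \\ y^{*} \\ z^{*} \\ 1 \end{pmatrix}
=
\begin{pmatrix} 1 & 0 & 0 & T_x \\ 0 & 1 & 0 & T_y \\ 0 & 0 & 1 & T_z \\ 0 & 0 & 0 & 1 \end{pmatrix}
\times
\begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}
= T \times
\begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}
\tag{3}
$$
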
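Therefore in homogeneous co-ordinates it is possible to describe any transformation in a matrix notation:

$$
\begin{pmatrix} x^{*} \\ y^{*} \\ z^{*} \\ 1 \end{pmatrix}
=
\begin{pmatrix} A & D & G & J \\ B & E & H & K \\ C & F & I & L \\ 0 & 0 & 0 & 1 \end{pmatrix}
\times
\begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}
\tag{4}
$$
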
This universal matrix for transformations can be divided into four function blocks:

$$
\begin{pmatrix}
\text{scaling and rotation} & \text{translation} \\
\text{part of the homogeneous representation} & 1
\end{pmatrix}
\tag{5}
$$

The matrix representations for the two other most commonly used transformations are as follows:
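• Scaling:

$$ V^{*} = S \times V \tag{6} $$

$$
S = \begin{pmatrix} S_x & 0 & 0 & 0 \\ 0 & S_y & 0 & 0 \\ 0 & 0 & S_z & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\tag{7}
$$

Here $S_x$, $S_y$ and $S_z$ are the scaling factors. For a uniform scaling $S_x = S_y = S_z$.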
• Rotation:

To rotate an object in a three-dimensional space, an axis of rotation needs to be specified. This can have any spatial orientation in a three-dimensional space, but it is easier to consider rotations that are parallel to one of the co-ordinate axes. The transformation matrices for rotation about the X, Y and Z-axes, respectively, are

$$
R_x = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\theta & \sin\theta & 0 \\ 0 & -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\quad
R_y = \begin{pmatrix} \cos\theta & 0 & -\sin\theta & 0 \\ 0 & 1 & 0 & 0 \\ \sin\theta & 0 & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\quad
R_z = \begin{pmatrix} \cos\theta & \sin\theta & 0 & 0 \\ -\sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\tag{8}
$$

Fig. 1 Data structure for object representation (a cube with vertices 0 to 7; polygon faces stored as vertex lists such as 0321 and 3762; vertices stored as (x0, y0, z0) to (x7, y7, z7))
It is worth noting that a sequence of transformations can be represented by one matrix $T = T_1 \times T_2 \times \ldots \times T_N$.
Figure 2 shows different examples of a cube containing eight vertices when applying different transformations.
2.2 3-D affine transformations-based large matrix multiplication
Consider an object represented with N vertices. The new position (NP) of the object when applying a transformation can be calculated as follows:

$$ NP = T \times OP \tag{9} $$

where T is the transformation matrix, OP is a (4, N) matrix containing the old vertex positions and NP is a (4, N) matrix containing the new vertex positions:

$$
\begin{pmatrix}
x^{*}_0 & x^{*}_1 & \ldots & x^{*}_{N-1} \\
y^{*}_0 & y^{*}_1 & \ldots & y^{*}_{N-1} \\
z^{*}_0 & z^{*}_1 & \ldots & z^{*}_{N-1} \\
1 & 1 & \ldots & 1
\end{pmatrix}
=
\begin{pmatrix} A & D & G & J \\ B & E & H & K \\ C & F & I & L \\ 0 & 0 & 0 & 1 \end{pmatrix}
\times
\begin{pmatrix}
x_0 & x_1 & \ldots & x_{N-1} \\
y_0 & y_1 & \ldots & y_{N-1} \\
z_0 & z_1 & \ldots & z_{N-1} \\
1 & 1 & \ldots & 1
\end{pmatrix}
\tag{10}
$$

Figure 3 shows two 3-D objects, a foot skeleton (N = 2154) and a face (N = 5597).
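The product NP = T × OP in (9) can be illustrated in software with a minimal sketch. The helper names (`mat_mul`, `translation`, `scaling`, `transform_vertices`) and the two sample vertices are assumptions made for illustration only; they are not part of the hardware design. With the column-vector convention of (3) and (4), the rightmost factor of a composite matrix is the one applied first.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def translation(tx, ty, tz):
    """Homogeneous translation matrix, as in equation (3)."""
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]

def scaling(sx, sy, sz):
    """Homogeneous scaling matrix, as in equation (7)."""
    return [[sx, 0, 0, 0], [0, sy, 0, 0], [0, 0, sz, 0], [0, 0, 0, 1]]

def transform_vertices(t, vertices):
    """Apply NP = T x OP to a list of (x, y, z) vertices."""
    # OP is 4 x N: one homogeneous column (x, y, z, 1) per vertex.
    op = [[v[r] for v in vertices] for r in range(3)] + [[1] * len(vertices)]
    np_matrix = mat_mul(t, op)
    # Drop the homogeneous row and return one tuple per vertex.
    return [tuple(np_matrix[r][c] for r in range(3))
            for c in range(len(vertices))]

# Hypothetical example: scale by 2, then translate by (1, 0, 0).
T = mat_mul(translation(1, 0, 0), scaling(2, 2, 2))
new_positions = transform_vertices(T, [(1, 1, 1), (0, 0, 0)])
print(new_positions)  # [(3, 2, 2), (1, 0, 0)]
```

Because the whole sequence is folded into one 4 × 4 matrix first, the per-vertex cost stays at a single matrix–vector product regardless of how many transformations are combined.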
2.3 FPGA implementation
2.3.1 Implementation approach: Handel-C is a high-level language that is at the heart of a hardware compilation system known as the Celoxica Development Kit (DK) [7], which is designed to compile programs written in a C-like high-level language into synchronous hardware. One of the advantages of using hardware is the ability to exploit parallelism directly. Because standard C is a sequential language, Handel-C has additional constructs to support the parallelisation of code, and to allow fine control over what hardware is generated.
DK produces a Netlist file, which is used during the place and route stage to generate the image or bitstream file [7] (Fig. 4).

Fig. 2 3-D transformation examples (original position, z-axis rotation, x-axis scaling, y-axis translation)
Fig. 3 Examples of 3-D objects
a Foot skeleton contains 2154 vertices
b Face contains 5597 vertices
Fig. 4 Hardware–software tools used for the implementation
a Handel-C design flow
b Schematic view of FPGA/banks part in the RC1000-PP board
The RC1000-PP co-processor board used is a standard PCI bus card equipped with a large FPGA chip. It has 8 Mbytes of SRAM directly connected to the FPGA in four 32-bit wide memory banks. All are accessible by the FPGA and any device on the PCI bus. Different methods of data transfer from the host PC or the environment to the FPGA are available as follows:
• Bulk transfer of data between the FPGA and the PCI bus is performed through the memory banks 0–3.
• Streams of bytes are most conveniently communicated through the unidirectional 8-bit control and status ports (Fig. 4).
The RC1000-PP board is supported with a macro library that simplifies the process of initialising and talking to the hardware. This library comprises a set of driver functions with the following functionality:
• initialisation and selection of a board
• handling of FPGA configuration files
• data transfer between PC and the RC1000-PP board
• functions to help with error checking and debugging.
These library functions can be included in a C or C++ program that runs on the host PC and performs data transfer via the PCI bus [7].
Figure 5 shows the proposed parallel matrix multiplier (PMM), which can be used to perform the matrix multiplication described in Section 2.2. The multiplier has been implemented using Handel-C and compiled using DK version 2 (DK2) on the RC1000-PP board. In order to perform this multiplication the four external memories have been exploited.
Fig. 5 Proposed parallel fixed-point matrix multiplier for 3-D affine transformations (PE: processor element; SP: storage processor; SE: storage element)

Since the vertex co-ordinates are real numbers, floating-point or fixed-point representations can be used. Celoxica provides two libraries (floating-point and fixed-point), which allow different widths to be specified, thus allowing designers to use the minimum number of bits to represent data and consequently generate smaller hardware. It can be seen from [17], when using the two libraries, that floating-point representation has large resource requirements and consequently lower performance. If the range of real number values that must be represented is small, or can be scaled in order to make it smaller, fixed-point arithmetic is one way of providing cheap fast non-integer support. Fixed-point arithmetic is appropriate for our application because the range of the values is small. The proposed architecture consists of p identical PEs, where p should be a multiple of four. Each PE comprises a fixed-point multiply accumulator (MAC) and a register for final result storage. The MAC has been implemented using two approaches:
(i) MAC-based Celoxica fixed-point library: This is a device-independent hardware library that allows the width of the fractional and integer part of the number to be defined and provides macros to execute arithmetic operations [7].
(ii) MAC-based Xilinx CoreGen: Xilinx’s CoreGen utility contains many designs that can often save time for a programmer, and it is possible to integrate Xilinx CoreGen blocks with a program in Handel-C using the interface declaration [6]. Two components have been used: a parallel signed integer multiplier and a parallel signed integer adder, which are suitable for the Xilinx XCV2000E-6 Virtex-E FPGA (Fig. 6).
In both cases, the vertex co-ordinates are represented with 22 bits (14 bits for the fractional part, seven bits for the integer part, one sign bit). The input transform matrix T is partitioned into four row-wise blocks, which gives one row per block. Each block is stored in one of the four available banks. The matrix OP is partitioned into four column-wise blocks and, like the blocks of T, each is stored in one of the banks. Because different elements stored in the same SRAM cannot be accessed simultaneously, four buffers (TBuf1, TBuf2, TBuf3, TBuf4) storing the four rows of T have been used to avoid a memory conflict. Data are transferred from the banks to the buffers. Columns of the matrix OP are transferred from SRAMs to the PEs in parallel. Each PE computes one element of the output matrix NP. The four storage processors (SPs), which have access to the PE registers, are used to transfer the final results to the banks and operate as an interface between the p PEs and the four memory banks. Each group of four PEs works as a matrix–vector multiplier. Therefore, four PEs are used to compute the new position of a vertex. The entire computation of the matrix NP can be carried out in [2 × (4 × N)/p + BI + N/NB] clock cycles, where:
• 2 is the number of clock cycles needed by the multiply accumulator for one accumulation
• N is the number of object vertices
• p is the number of PEs used
• BI = 4 is the number of clock cycles needed for buffer initialisation
• NB = 4 is the number of memory banks available
• N/NB is the number of clock cycles needed for final result storage.
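The clock-cycle expression above can be evaluated directly. The following is a minimal sketch; the function names are assumptions for illustration, and N is used as given, without rounding to a multiple of four. The example values (N = 2154 for the foot skeleton, p = 4 PEs, a 22 MHz clock) are taken from the text and Table 1; dividing a cycle count by a frequency in MHz gives a time in microseconds.

```python
def np_clock_cycles(n, p, bi=4, nb=4, mac_cycles=2):
    """Clock cycles to compute NP: 2 x (4 x N)/p + BI + N/NB."""
    return mac_cycles * (4 * n) / p + bi + n / nb

def computation_time_us(n, p, clock_mhz):
    """Cycle count divided by the clock frequency in MHz, in microseconds."""
    return np_clock_cycles(n, p) / clock_mhz

cycles = np_clock_cycles(2154, 4)
time_us = computation_time_us(2154, 4, 22)
print(cycles)             # 4850.5
print(round(time_us, 2))  # 220.48
```

For this configuration the result is close to the 220.47 listed for four PEs with the Celoxica library in Table 2.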
It is worth noting that the number of vertices N is always rounded up to a multiple of four. Therefore, the last partition of the matrix OP should be padded with, at most, three columns of zeros (e.g. for N = 2154 two columns of zeros should be appended to the matrix OP; in this case each partition has a size of 539 × 4). Since the number of clock cycles needed in order to perform a transformation is controlled by the slowest processor element, adding columns of zeros will not affect it.
Table 1 illustrates the performance obtained for the proposed architecture when using the two different approaches for the MAC implementation.
The buffers in our multiplier have been implemented using look-up tables (LUTs). The multiplier can perform transformations on an object with a number of vertices N of up to $2^{18}$. The implementation using the MAC-based Xilinx CoreGen shows better performance when compared with the one based on the Celoxica library due to the suitability of the cores used for the FPGA chip available on our board.
There exist expensive 3-D cards, which support the manipulation of co-ordinate transformations. Our PC (Pentium 4 CPU 2.00 GHz) is equipped with an ATI RADEON FSC 32 MB graphics card, which belongs to this category of cards. This card delivers immersive, realistic colour and 3-D graphics at the fastest possible frame rate. The RADEON Charisma Engine, which takes care of the geometry processing, and the Pixel Tapestry Architecture, which is the rendering engine, support full transformation, clipping and lighting for improvement in 3-D details. With full support for DirectX and OpenGL, it accelerates all today’s top 3-D games.
Fig. 6 Fixed-point MAC (signed integer multiplier, logical shift right by 14, signed integer adder)

Table 1: Area/speed implementation report for the proposed parallel fixed-point matrix multiplier for two different approaches

MAC used                       Number of PEs   Area (%)   Speed (MHz)
Celoxica fixed-point library   4               17         22
                               8               35         20
                               16              75         17
Xilinx CoreGen                 4               10         35
                               8               20         30
                               16              42         24

Table 2 shows the performances obtained in terms of minimum period and computation time for the RADEON graphics card and our FPGA implementation with different numbers of PEs when performing a transformation on the foot skeleton object, which contains 2154 vertices.

Table 2: Computation time comparison of the proposed structure with the RADEON FSC 32 MB graphics card

Platform                           MAC used                       Number of PEs   Minimum period (µs)   Computation time (µs)
FPGA                               Celoxica fixed-point library   4               1/22                  220.47
                                                                  8               1/20                  134.825
                                                                  16              1/17                  31.905
                                   Xilinx CoreGen                 4               1/35                  138.585
                                                                  8               1/30                  89.883
                                                                  16              1/24                  22.600
RADEON FSC 32 MB graphics card     –                              –               –                     14.806
It can be seen from Table 2 that an improvement in the MAC will give a better result in terms of computation time. Based on the number of clock cycles obtained for our design, a 37 MHz clock frequency (minimum period = 1/37 µs) with p = 16 would give a computation time lower than that obtained with the graphics card, and so would be enough to outperform it. The performance of the matrix multiplier, dedicated for 3-D affine transformations, demonstrates that the FPGA can be used as an effective low-cost solution. Although the RADEON FSC 32 MB card is approximately twice as fast, a lower-cost alternative would be preferable for an application which does not require the additional performance.
2.3.2 Proposed environment for 3-D affine transformations on FPGA: Figure 7 shows a general view of the entire proposed system. The environment consists of a host application (GUI), a 3-D object database, the open graphics library (OpenGL) [18] and the single FPGA chip coprocessor based on the RC1000-PP development board.
• 3-D object database: contains the 3-D model files (.OBJ, .3DS, .ASE)
• OpenGL: is the specification of a powerful set of more than 350 graphics routines for 2-D and 3-D graphics processing. OpenGL includes facilities for:
  - defining and rendering 3-D primitives such as points, lines, polygons, spheres and cones
  - viewing 2-D projections of 3-D scenes
  - manipulating co-ordinate transformations
  - lighting: light sources, material properties
  - texture mapping.
• Coprocessor: performs the 3-D affine transformations.
• Host application (GUI): implemented using Borland C++, gives the user the ability to select a 3-D model from the 3-D object database, and display it on the available 3-D viewer. The user can apply different algorithms on the object, such as texture, lighting and antialiasing, which involve calls to the OpenGL functions. Since C++ does not support fixed-point formats, a floating-point to fixed-point converter has been implemented. The vertex co-ordinates are converted from floating-point to fixed-point before performing the DMA transfer. The inverse operation is applied to the result in order to reconstruct the transformed 3-D model.
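The conversion between floating-point and the 22-bit fixed-point format (one sign bit, seven integer bits, 14 fractional bits), together with a multiply-accumulate step in that format, can be sketched as follows. This is an illustrative software model under stated assumptions, not the Handel-C or CoreGen implementation: the function names are hypothetical, rounding to nearest on conversion is an assumption, and Python's arithmetic right shift stands in for the shift stage of the MAC.

```python
FRAC_BITS = 14   # fractional bits, from the text
WORD_BITS = 22   # total width: 1 sign + 7 integer + 14 fractional

def to_fixed(value):
    """Convert a float to a signed 22-bit fixed-point integer."""
    raw = int(round(value * (1 << FRAC_BITS)))
    lo, hi = -(1 << (WORD_BITS - 1)), (1 << (WORD_BITS - 1)) - 1
    if not lo <= raw <= hi:
        raise OverflowError(f"{value} does not fit in {WORD_BITS} bits")
    return raw

def to_float(raw):
    """Inverse conversion, used to reconstruct the transformed model."""
    return raw / (1 << FRAC_BITS)

def fixed_mac(acc, a, b):
    """acc + a*b: the product carries 28 fractional bits, so shift by 14."""
    return acc + ((a * b) >> FRAC_BITS)

def transform_row(row, vertex):
    """One element of NP: a row of T times a homogeneous vertex."""
    acc = 0
    for t, o in zip(row, vertex):
        acc = fixed_mac(acc, to_fixed(t), to_fixed(o))
    return to_float(acc)

# Hypothetical row of T and vertex (x, y, z, 1).
result = transform_row([2.0, 0.0, 0.0, 1.5], [1.25, 3.0, -4.0, 1.0])
print(result)  # 4.0
```

The example values are exactly representable with 14 fractional bits, so the fixed-point result equals the floating-point one; in general the result is accurate only to the resolution of the format.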
3 Colour space conversion
Colour is a visual sensation produced by the light in the visible region of the spectrum incident on the retina. Since the human visual system has three types of colour photoreceptor cone cells, three components are necessary and sufficient to describe a colour [19].
Colour spaces (also called colour models or colour systems) provide a standard method of defining and representing colours. There are many existing colour spaces and most of them represent each colour as a point in a 3-D co-ordinate system. Each colour space is optimised for a well defined application area [20].
The three most popular colour models are RGB (used in computer graphics); YIQ, YUV and YCrCb (used in video systems) and CMYK (used in colour printing). All of the colour spaces can be derived from the RGB information supplied by devices such as cameras and scanners.
Processing an image in the RGB colour space, with a set of RGB values for each pixel, is not the most efficient method. To speed up some processing steps, many broadcast, video and imaging standards use luminance and colour difference video signals, such as YCrCb, thus making a mechanism for converting between formats necessary [21]. Several cores for RGB to YCrCb conversion can be found on the market, which have been designed for FPGA implementation, such as the cores proposed by Amphion Ltd [21], CAST Inc [22] and ALMA Tech [23].
It is the aim of this work to propose a novel architecture for RGB to YCrCb colour space conversion based on matrix–vector multiplication using DA ROM accumulator principles. In the rest of this Section, the gamma-corrected RGB values are denoted R′G′B′.
3.1 Converting from R′G′B′ to Y′CrCb
Decomposing an R′G′B′ colour image into one luminance image and two chrominance images is the method that has been used in most commercial applications, such as face detection [24, 25], as well as the JPEG and MPEG imaging standards [26, 27].
The calculation of Y′CrCb colour components from R′G′B′ components consumes up to 40% of the processing power in a highly optimised decoder [27]. Accelerating this operation would be useful for the acceleration of the whole process.
Fig. 7 Proposed system for 3-D affine transformations on FPGA (graphical user interface, OpenGL and 3-D objects database on the host; transform matrix T and old vertex position matrix OP sent by DMA to banks 0–3 of the XCV2000E; new vertex position matrix NP returned)
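A colour in the R′G′B′ colour space is converted to the Y′CrCb colour space using the following equation [23]:

$$
\begin{pmatrix} Y' \\ Cr \\ Cb \end{pmatrix}
=
\begin{pmatrix}
0.257 & 0.504 & 0.098 & 16 \\
0.439 & -0.368 & -0.071 & 128 \\
-0.148 & -0.291 & 0.439 & 128
\end{pmatrix}
\times
\begin{pmatrix} R' \\ G' \\ B' \\ 1 \end{pmatrix}
\tag{11}
$$

while the inverse conversion can be carried out using the following equation [23]:

$$
\begin{pmatrix} R' \\ G' \\ B' \end{pmatrix}
=
\begin{pmatrix}
1.164 & 1.596 & 0.0 & -222.912 \\
1.164 & -0.813 & -0.392 & 135.616 \\
1.164 & 0.0 & 2.017 & -276.8
\end{pmatrix}
\times
\begin{pmatrix} Y' \\ Cr \\ Cb \\ 1 \end{pmatrix}
\tag{12}
$$
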
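As a floating-point reference for (11), the conversion of a single pixel can be sketched as below. The function names and the final round-and-clamp step to eight bits are assumptions made for illustration; only the matrix coefficients come from the equation.

```python
# Coefficients of equation (11); the last column holds the offsets.
RGB_TO_YCRCB = [
    [0.257, 0.504, 0.098, 16.0],     # Y'
    [0.439, -0.368, -0.071, 128.0],  # Cr
    [-0.148, -0.291, 0.439, 128.0],  # Cb
]

def convert_pixel(matrix, pixel):
    """Multiply a 3 x 4 matrix by the homogeneous pixel (c0, c1, c2, 1)."""
    vec = list(pixel) + [1]
    return tuple(sum(a * v for a, v in zip(row, vec)) for row in matrix)

def clamp8(x):
    """Add 0.5, truncate, and clamp to the 8-bit range 0..255."""
    return max(0, min(255, int(x + 0.5)))

white = tuple(clamp8(c) for c in convert_pixel(RGB_TO_YCRCB, (255, 255, 255)))
black = tuple(clamp8(c) for c in convert_pixel(RGB_TO_YCRCB, (0, 0, 0)))
print(white, black)  # (235, 128, 128) (16, 128, 128)
```

White and black map to the ends of the nominal luminance range (235 and 16) with both chrominance components at the mid-point 128, which is a quick sanity check on the coefficients.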
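3.1.1 Mathematical background: Consider an L × M image (L image height, M image width) (Fig. 8). Let us represent each image pixel by $b_{ijk}$ ($0 \le i \le L-1$, $0 \le j \le M-1$, $0 \le k \le 2$), where

$$
\begin{cases}
b_{ij0} = R'_{ij} & \text{the red component of the pixel in row } i \text{ and column } j \\
b_{ij1} = G'_{ij} & \text{the green component of the pixel in row } i \text{ and column } j \\
b_{ij2} = B'_{ij} & \text{the blue component of the pixel in row } i \text{ and column } j
\end{cases}
\tag{13}
$$

The image can be converted using the following mathematical formula:

$$
\begin{pmatrix}
\begin{pmatrix} c_{000} \\ c_{001} \\ c_{002} \end{pmatrix} & \cdots & \begin{pmatrix} c_{0(M-1)0} \\ c_{0(M-1)1} \\ c_{0(M-1)2} \end{pmatrix} \\
\begin{pmatrix} c_{100} \\ c_{101} \\ c_{102} \end{pmatrix} & \cdots & \begin{pmatrix} c_{1(M-1)0} \\ c_{1(M-1)1} \\ c_{1(M-1)2} \end{pmatrix} \\
\vdots & \ddots & \vdots \\
\begin{pmatrix} c_{(L-1)00} \\ c_{(L-1)01} \\ c_{(L-1)02} \end{pmatrix} & \cdots & \begin{pmatrix} c_{(L-1)(M-1)0} \\ c_{(L-1)(M-1)1} \\ c_{(L-1)(M-1)2} \end{pmatrix}
\end{pmatrix}
=
\begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \end{pmatrix}
\otimes
\begin{pmatrix}
\begin{pmatrix} b_{000} \\ b_{001} \\ b_{002} \\ 1 \end{pmatrix} & \cdots & \begin{pmatrix} b_{0(M-1)0} \\ b_{0(M-1)1} \\ b_{0(M-1)2} \\ 1 \end{pmatrix} \\
\begin{pmatrix} b_{100} \\ b_{101} \\ b_{102} \\ 1 \end{pmatrix} & \cdots & \begin{pmatrix} b_{1(M-1)0} \\ b_{1(M-1)1} \\ b_{1(M-1)2} \\ 1 \end{pmatrix} \\
\vdots & \ddots & \vdots \\
\begin{pmatrix} b_{(L-1)00} \\ b_{(L-1)01} \\ b_{(L-1)02} \\ 1 \end{pmatrix} & \cdots & \begin{pmatrix} b_{(L-1)(M-1)0} \\ b_{(L-1)(M-1)1} \\ b_{(L-1)(M-1)2} \\ 1 \end{pmatrix}
\end{pmatrix}
\tag{14}
$$

where the operation $\otimes$ can be defined as follows: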
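Each vector $(c_{ij0}\ c_{ij1}\ c_{ij2})^T$ is the result of the product

$$
\begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \end{pmatrix}
\times
\begin{pmatrix} b_{ij0} \\ b_{ij1} \\ b_{ij2} \\ 1 \end{pmatrix}
$$

where $c_{ijk}$ represent the output image colour space components and

$$
A = \begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \end{pmatrix}
$$

represents one of the constant matrices in (11) and (12). The $c_{ijk}$ elements can be computed using the following equation:

$$
c_{ijk} = \sum_{m=0}^{3} a_{km} \times b_{ijm}
\qquad (0 \le i \le L-1,\ 0 \le j \le M-1,\ 0 \le k \le 2)
\tag{15}
$$
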
where $\{a_{km}\}$ are l-bit constants and $\{b_{ijm}\}$ are written in the unsigned binary representation as shown in (16)

$$
b_{ijm} = \sum_{l=0}^{W-1} b_{ijm,l} \times 2^{l}
\qquad (0 \le i \le L-1,\ 0 \le j \le M-1,\ 0 \le m \le 2)
\tag{16}
$$

where $b_{ijm,l}$ is the lth bit of $b_{ijm}$, which is zero or one, and W is the word length used, which represents the resolution for each colour component of a pixel.

Substituting (16) in (15)

$$
c_{ijk} = \sum_{m=0}^{3} a_{km} \times \left( \sum_{l=0}^{W-1} b_{ijm,l} \times 2^{l} \right)
= \sum_{l=0}^{W-1} \left( \sum_{m=0}^{3} a_{km} \times b_{ijm,l} \times 2^{l} \right)
\tag{17}
$$

Define

$$ Z_l = \sum_{m=0}^{3} a_{km} \times b_{ijm,l} \tag{18} $$

Therefore, $c_{ijk}$ can be computed as

$$ c_{ijk} = \sum_{l=0}^{W-1} Z_l \times 2^{l} \tag{19} $$

The idea is that since the term $Z_l$ depends on the $b_{ijm,l}$ values and has only $2^4$ possible values, it is possible to precompute and store them in ROMs. An input set of four bits ($b_{ij0,l}$, $b_{ij1,l}$, $b_{ij2,l}$, $b_{ij3,l}$) is used as an address to retrieve the corresponding $Z_l$ value. The ROM content is different and depends on the constant matrix A coefficients.
Fig. 8 R′G′B′ to Y′CrCb conversion
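These intermediate results are accumulated in W clock cycles to produce the $c_{ijk}$ coefficients.
Since all the input components are in the range 0–255, eight bits (W = 8) are required to represent them. Equation (19) becomes

$$ c_{ijk} = \sum_{l=0}^{7} Z_l \times 2^{l} \tag{20} $$

$$
\begin{aligned}
c_{ijk} = {} & (a_{k0} \times b_{ij0,0} + a_{k1} \times b_{ij1,0} + a_{k2} \times b_{ij2,0} + a_{k3} \times b_{ij3,0}) \times 2^{0} + {} && (l = 0) \\
& (a_{k0} \times b_{ij0,1} + a_{k1} \times b_{ij1,1} + a_{k2} \times b_{ij2,1} + a_{k3} \times b_{ij3,1}) \times 2^{1} + {} && (l = 1) \\
& (a_{k0} \times b_{ij0,2} + a_{k1} \times b_{ij1,2} + a_{k2} \times b_{ij2,2} + a_{k3} \times b_{ij3,2}) \times 2^{2} + {} && (l = 2) \\
& \qquad \vdots \\
& (a_{k0} \times b_{ij0,7} + a_{k1} \times b_{ij1,7} + a_{k2} \times b_{ij2,7} + a_{k3} \times b_{ij3,7}) \times 2^{7} && (l = 7)
\end{aligned}
\tag{21}
$$

Since the element $b_{ij3}$ is always equal to one

$$
b_{ij3,l} = \begin{cases} 1 & \text{for } l = 0 \\ 0 & \text{for } l \ne 0 \end{cases}
\tag{22}
$$

Equation (20) can be rewritten as

$$ c_{ijk} = \sum_{l=0}^{7} Z^{*}_l \times 2^{l} + a_{k3} \tag{23} $$

where

$$ Z^{*}_l = \sum_{m=0}^{2} a_{km} \times b_{ijm,l} \tag{24} $$

$$
\begin{aligned}
c_{ijk} = {} & (a_{k0} \times b_{ij0,0} + a_{k1} \times b_{ij1,0} + a_{k2} \times b_{ij2,0}) \times 2^{0} + {} && (l = 0) \\
& (a_{k0} \times b_{ij0,1} + a_{k1} \times b_{ij1,1} + a_{k2} \times b_{ij2,1}) \times 2^{1} + {} && (l = 1) \\
& (a_{k0} \times b_{ij0,2} + a_{k1} \times b_{ij1,2} + a_{k2} \times b_{ij2,2}) \times 2^{2} + {} && (l = 2) \\
& \qquad \vdots \\
& (a_{k0} \times b_{ij0,7} + a_{k1} \times b_{ij1,7} + a_{k2} \times b_{ij2,7}) \times 2^{7} + {} && (l = 7) \\
& a_{k3} \\
= {} & PP_{ijk0} + PP_{ijk1} + PP_{ijk2} + PP_{ijk3} + PP_{ijk4} + PP_{ijk5} + PP_{ijk6} + PP_{ijk7} + a_{k3}
\end{aligned}
\tag{25}
$$

It is worth mentioning that the size of the ROMs has been reduced to $2^3$. Table 3 gives the content of each ROM.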
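The derivation in (23) to (25) can be checked with a short software model of ROM-based DA. This is a bit-serial sketch in floating point, assuming the address ordering of Table 3 (the bit of $b_{ij0}$ is the most significant address bit); the function names and the sample pixel are hypothetical, and the model does not reproduce the pipelining or the fixed-point widths of the hardware.

```python
def build_rom(a0, a1, a2):
    """ROM with 2^3 entries: each address bit selects one coefficient."""
    return [a0 * ((addr >> 2) & 1) + a1 * ((addr >> 1) & 1) + a2 * (addr & 1)
            for addr in range(8)]

def da_component(row, pixel, width=8):
    """Compute c = sum over l of Z*_l x 2^l, plus a_k3 (equation 23)."""
    a0, a1, a2, a3 = row
    rom = build_rom(a0, a1, a2)
    total = a3
    for l in range(width):
        # Address built from bit l of each of the three input components.
        addr = ((((pixel[0] >> l) & 1) << 2)
                | (((pixel[1] >> l) & 1) << 1)
                | ((pixel[2] >> l) & 1))
        total += rom[addr] * (1 << l)  # partial product for bit position l
    return total

Y_ROW = (0.257, 0.504, 0.098, 16.0)  # first row of the matrix in (11)
pixel = (200, 100, 50)               # hypothetical R'G'B' pixel
da_result = da_component(Y_ROW, pixel)
direct = 0.257 * 200 + 0.504 * 100 + 0.098 * 50 + 16.0
print(abs(da_result - direct) < 1e-9)  # True
```

No multiplication by a pixel value appears in `da_component`: each bit position contributes one ROM look-up scaled by a power of two, which in hardware is a shift.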
3.2 Proposed architecture
Equation (23) can be mapped into the proposed architecture as shown in Fig. 9.
The architecture consists of eight identical PEs, $PE_n$ ($0 \le n \le 7$). Each PE comprises three parallel signed integer adders, three n right shifters and one ROMs block. Each ROMs block consists of three ROMs with the size $2^3$ each (Fig. 10). The ROM content is different and depends on the matrix A coefficients, which depend on the conversion type.
Table 3: Content of the ROM i

b_ij0,l   b_ij1,l   b_ij2,l   Content of the ROM i
0         0         0         0
0         0         1         a_i2
0         1         0         a_i1
0         1         1         a_i1 + a_i2
1         0         0         a_i0
1         0         1         a_i0 + a_i2
1         1         0         a_i0 + a_i1
1         1         1         a_i0 + a_i1 + a_i2
Fig. 9 Proposed architecture based on DA principles (PE: processor element)
Fig. 10 ROMs block structure (ROM1, ROM2, ROM3)
It is worth noting that the architecture has a latency of W and a throughput rate equal to one. The entire image conversion can be carried out in (Latency + (L × M) × Throughput) = 8 + (L × M) clock cycles, while using the standard algorithm (Fig. 11), the conversion can be carried out in (3 × 4 × L × M) clock cycles, where (3 × 4) is the constant matrix A size.
Figure 12 shows the functional analysis diagram of the proposed architecture.
Table 4 gives the content of the ROMs used for R′G′B′ to Y′CrCb conversions.
The proposed architecture can be used for the inverse conversion (Y′CrCb to R′G′B′) and for other conversions based on matrix–vector multiplication. The content of the ROMs for the case of Y′CrCb to R′G′B′ conversion is shown in Table 5.
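The two cycle counts quoted above can be compared for a concrete image size. The 512 × 512 image below is a hypothetical example, and the function names are assumptions for illustration; the formulas are the ones given in the text.

```python
def da_cycles(rows, cols, latency=8, throughput=1):
    """Pipelined DA converter: latency + (L x M) x throughput cycles."""
    return latency + rows * cols * throughput

def standard_cycles(rows, cols, matrix_rows=3, matrix_cols=4):
    """Sequential algorithm: one multiply-accumulate per matrix entry per pixel."""
    return matrix_rows * matrix_cols * rows * cols

L, M = 512, 512  # hypothetical image height and width
print(da_cycles(L, M))        # 262152
print(standard_cycles(L, M))  # 3145728
```

For any image of more than a few pixels the latency term is negligible, so the ratio between the two counts approaches 12, the number of entries in the constant matrix A.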
The precomputed partial products are stored in the ROMs using a 13-bit fixed-point representation (seven bits for the integer part, one sign bit and five bits for the fractional part). 13-bit arithmetic is used inside the architecture. The architecture’s inputs and outputs are represented using eight bits and the outputs are rounded. Rounding usually examines the fractional part and, if it is greater than or equal to 0.5, increases the result by one. This requires a comparison followed by another arithmetic operation. A more efficient way to round a number is to add 0.5 to the result and truncate the fractional part. This technique has been applied in our proposed architecture. The initial value for each first PE’s adder is set in advance to ($a_{i3}$ + 0.5), where $0 \le i \le 2$.
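The equivalence of the two rounding methods can be illustrated for non-negative values in a format with five fractional bits. This is a minimal sketch with hypothetical function names; it models only the rounding step, not the adder that is preloaded with ($a_{i3}$ + 0.5) in the architecture.

```python
FRAC_BITS = 5                # fractional bits of the 13-bit format in the text
HALF = 1 << (FRAC_BITS - 1)  # the value 0.5 in this fixed-point format

def round_compare(raw):
    """Round by inspecting the fraction, then conditionally adding one."""
    integer = raw >> FRAC_BITS
    fraction = raw & ((1 << FRAC_BITS) - 1)
    return integer + 1 if fraction >= HALF else integer

def round_add_half(raw):
    """Round by adding 0.5 and truncating the fractional bits."""
    return (raw + HALF) >> FRAC_BITS

# Raw fixed-point samples: 0, 15/32, 16/32, 17/32, 100.5 and 235 + 1/32.
samples = [0, 15, 16, 17, 100 * 32 + 16, 235 * 32 + 1]
rounded = [round_add_half(s) for s in samples]
print(rounded)  # [0, 0, 1, 1, 101, 235]
```

Adding the constant once at the start of the accumulation chain removes the comparison from the output path, which is why the 0.5 can be folded into the initial adder value.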
3.3 FPGA implementation
Like the previous implementation, the proposed architecture based on the DA technique has been implemented and verified using the Celoxica RC1000-PP FPGA development board. The architecture consumes 193 slices and can run at a maximum clock frequency of 188 MHz. The parallel signed adders have been implemented using Xilinx's CoreGen utility. The shifters and the ROM initialisation have been implemented in VHDL. In order to make a fair and consistent comparison with the existing FPGA-based colour space converters, the XCV50E-8 FPGA device has
for i = 1 to L do                    // scanning image rows
    for j = 1 to M do                // scanning image columns
        for k = 1 to 3 do            // scanning the three RGB values of a pixel
            for m = 1 to 3 do        // scanning columns of the constant conversion matrix
                c_{ijk} += a_{km} × b_{ijm}
            end for
        end for
    end for
end for
Fig. 11 Pseudocode for the standard algorithm
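The pseudocode of Fig. 11 can be sketched in Python as follows. The coefficients are read from the single-bit rows of Table 4. The offsets (16, 128, 128) are an assumption, being the usual values for this conversion, since the a_{i3} terms are not listed in this section. The function name and the list-of-tuples image layout are hypothetical.

```python
# Rows: Y', Cr, Cb; columns: R', G', B' (coefficients taken from Table 4).
A = [
    [0.257, 0.504, 0.098],
    [0.439, -0.368, -0.071],
    [-0.148, -0.291, 0.439],
]
# Assumed offsets a_i3 (not given in this section).
OFFSET = [16.0, 128.0, 128.0]


def convert_image(image):
    """Convert an L x M list of (R', G', B') tuples to (Y', Cr, Cb) tuples."""
    out = []
    for row in image:                   # scanning image rows
        out_row = []
        for pixel in row:               # scanning image columns
            c = list(OFFSET)
            for k in range(3):          # one output component per matrix row
                for m in range(3):      # columns of the conversion matrix
                    c[k] += A[k][m] * pixel[m]
            # Round by adding 0.5 and truncating (results are non-negative here).
            out_row.append(tuple(int(v + 0.5) for v in c))
        out.append(out_row)
    return out


if __name__ == "__main__":
    print(convert_image([[(0, 0, 0), (255, 255, 255)]]))
    # [[(16, 128, 128), (235, 128, 128)]]
```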
[Figure omitted: timing diagram showing the partial products PP passing through PE1 to PE8, separated by delay elements, over successive clock cycles (1st CC to 9th CC) and producing the outputs C00, C01, ...]
Fig. 12 Functional analysis diagram
Table 4: Content of the ROMs (R′G′B′ to Y′CrCb)
R′_{ij0,l}  G′_{ij1,l}  B′_{ij2,l}  ROM1    ROM2     ROM3
0           0           0           0       0        0
0           0           1           0.098   −0.071   0.439
0           1           0           0.504   −0.368   −0.291
0           1           1           0.602   −0.439   0.148
1           0           0           0.257   0.439    −0.148
1           0           1           0.355   0.368    0.291
1           1           0           0.761   0.071    −0.439
1           1           1           0.859   0        0
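The entries of Table 4 are the sums of the conversion coefficients selected by the three address bits, which is the core idea of distributed arithmetic. The Python sketch below rebuilds the three ROMs from the single-bit rows and evaluates a dot product bit plane by bit plane. The names are hypothetical, and the loop over bit positions is a software model of what the architecture computes in parallel with one ROMs block and one shifter per input bit.

```python
COEFFS = {
    "ROM1": (0.257, 0.504, 0.098),     # Y'  coefficients for R', G', B'
    "ROM2": (0.439, -0.368, -0.071),   # Cr coefficients
    "ROM3": (-0.148, -0.291, 0.439),   # Cb coefficients
}


def build_rom(coeffs):
    """Return the 8 ROM entries indexed by the address (R' bit, G' bit, B' bit)."""
    rom = []
    for addr in range(8):
        bits = ((addr >> 2) & 1, (addr >> 1) & 1, addr & 1)
        rom.append(round(sum(b * c for b, c in zip(bits, coeffs)), 3))
    return rom


def da_dot(rom, pixel, width=8):
    """Distributed-arithmetic dot product for unsigned `width`-bit inputs."""
    r, g, b = pixel
    acc = 0.0
    for l in range(width):
        # Bit l of each input forms the 3-bit ROM address.
        addr = (((r >> l) & 1) << 2) | (((g >> l) & 1) << 1) | ((b >> l) & 1)
        acc += rom[addr] * (1 << l)    # weight the partial product by 2^l
    return acc


if __name__ == "__main__":
    rom1 = build_rom(COEFFS["ROM1"])
    print(rom1)  # [0, 0.098, 0.504, 0.602, 0.257, 0.355, 0.761, 0.859]
    print(round(da_dot(rom1, (200, 100, 50)), 3))  # 106.7
```

No multiplier appears in `da_dot`: each step is a ROM lookup, a shift and an addition, which is why the hardware needs only ROMs, shifters and adders.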
Table 5: Content of the ROMs (Y′CrCb to R′G′B′)
Y′_{ij0,l}  Cr_{ij1,l}  Cb_{ij2,l}  ROM1    ROM2     ROM3
0           0           0           0       0        0
0           0           1           0       −0.392   0
0           1           0           1.596   −0.813   1.596
0           1           1           1.596   −1.205   1.596
1           0           0           1.164   1.164    1.164
1           0           1           1.164   0.772    1.164
1           1           0           2.76    0.351    2.76
1           1           1           2.76    −0.041   2.76
been targeted. Table 6 illustrates the performance obtained for the proposed architecture in terms of the area consumed and the speed that can be achieved.
The proposed architecture shows significant improvements in comparison with the existing FPGA implementations in terms of the area consumed and the throughput achieved. In addition, Fig. 13 illustrates the R′G′B′ ‘Baboon’ image (512 × 512) converted to the Y′CrCb format using:
• FPGA implementation: the entire image conversion can be carried out in approximately 1.2 ms.
• Software-based implementation: using a 2.0 GHz Pentium 4 processor with 1 Gbyte of SDR RAM and C++ Builder V5.0, the entire image conversion can be carried out in approximately 126 ms.
It can be seen that the FPGA implementation produces the same converted image as the software conversion, roughly 100 times faster.
In the second part of this paper, a novel platform-independent and fully pipelined architecture based on the DA approach for R′G′B′ to Y′CrCb conversion has been reported. The proposed architecture has a low latency and a high throughput rate. It can be used for other conversions based on matrix–vector multiplication by setting up the ROM content in advance. In addition, the architecture has been implemented and verified using the Celoxica RC1000-PP FPGA development board.
4 Conclusions
FPGAs have grown in capacity, improved in performance and decreased in cost. They have become a viable solution for performing computationally intensive tasks, with the ability to tackle applications for custom chips and programmable DSP devices. Owing to the importance and the use of matrix multiplication in many image and signal processing applications, two matrix multipliers have been proposed for FPGA implementation in this paper. The first proposed multiplier is dedicated to 3-D affine transformations, while the second one is for colour space conversion. The two multipliers have been implemented and verified using the Celoxica RC1000-PP FPGA development board. Results obtained for the first multiplier have shown that a low-cost FPGA implementation can achieve the performance of a graphics card when performing 3-D affine transformations. The performance, in terms of the area used and the maximum clock frequency, has been assessed for the second multiplier and has shown that it can run at a higher frequency and consumes less area when compared with existing systems.
5 References
1 Bensaali, F., Amira, A., and Bouridane, A.: ‘An FPGA based coprocessor for large matrix product implementation’. Proc. IEEE Int. Conf. on Field-Programmable Technology (FPT’03), Tokyo, Japan, December 2003, pp. 292–295
2 Huss-Lederman, S., Jacobson, E.M., Tsao, A., Turnbull, T., and Johnson, J.R.: ‘Implementation of Strassen’s algorithm for matrix multiplication’. Presented at ACM/IEEE Conf. on Supercomputing, PA, USA, November 1996
3 Amira, A., Bouridane, A., Milligan, P., and Belatreche, A.: ‘Design of efficient architectures for discrete orthogonal transforms using bit level systolic structures’, IEE Proc., Comput. Digit. Tech., 2002, 149, (1), pp. 17–24
4 Bensaali, F., Amira, A., Uzun, I.S., and Ahmedsaid, A.: ‘An FPGA implementation of 3-D affine transformations’. Proc. 10th IEEE Int. Conf. on Electronics, Circuits and Systems (ICECS’03), Sharjah, UAE, December 2003, Vol. 2, pp. 715–718
5 Bensaali, F., Amira, A., Uzun, I.S., and Ahmedsaid, A.: ‘Efficient implementation of large parallel matrix product for DOTs’. Presented at Int. Conf. on Computer, Communication and Control Technologies (CCCT’03), FL, USA, July 2003
6 ‘Xilinx CoreGen and Handel-C’. Application note AN 58 v1.0, 2001
7 URL: www.celoxica.com
8 Amira, A., Bouridane, A., Milligan, P., and Roula, M.: ‘Novel FPGA implementations of Walsh Hadamard transforms for signal processing’, IEE Proc., Vis. Image Signal Process., 2001, 148, (6), pp. 377–383
9 Ohlsson, H., and Wanhammar, L.: ‘Maximally fast numerically equivalent state-space recursive digital filters using distributed arithmetic’. Proc. IEEE Nordic Signal Processing Symp. (NORSIG2000), Kolmarden, Sweden, June 2000, pp. 295–298
10 Gustafsson, O., and Wanhammar, L.: ‘Implementation of a digital beamformer in an FPGA using distributed arithmetic’. Proc. IEEE Nordic Signal Processing Symp. (NORSIG2000), Kolmarden, Sweden, June 2000, pp. 295–298
11 URL: www.xilinx.com
12 Styles, H., and Luk, W.: ‘Customising graphics applications: techniques and programming interface’. Proc. IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), Napa, CA, USA, April 2000, pp. 77–87
13 Singh, S., and Bellec, P.: ‘Virtual hardware for graphics applications using FPGAs’. Proc. IEEE Workshop on FPGAs for Custom Computing Machines, Los Alamitos, CA, USA, April 1994, pp. 49–58
Table 6: Performance comparison with existing CSC cores
Design parameters      Slices  Throughput (mega-conversions/s)
Proposed architecture  193     234
CAST, Inc [22]         222     112
ALMA, Tech [23]        222     105
Amphion Ltd [21]       204     90
Fig. 13 ‘Baboon’ (512 × 512) test image
a Original R′G′B′ image
b Converted image in Y′CrCb format using the proposed system
c Software-based implementation (C++)
14 Ye, A.G., and Lewis, D.M.: ‘Procedural texture mapping on FPGAs’. Proc. ACM/SIGDA Int. Symp. on Field Programmable Gate Arrays, Monterey, CA, USA, February 1999, pp. 112–120
15 Watt, A.: ‘3-D computer graphics’ (Addison–Wesley, 2000)
16 Ferguson, R.S.: ‘Practical algorithms for 3-D computer graphics’ (A K Peters, 2001)
17 Eadie, D., Shevlin, F., and Nisbet, A.: ‘Correction of geometric image distortion using FPGAs’, Proc. SPIE - Int. Soc. Opt. Eng., 2003, 4877, pp. 28–37
18 URL: www.opengl.org
19 Payette, B.: ‘Color space converter: R′G′B′ to Y′CrCb’. Xilinx Application Note, XAPP637, V1.0, September 2002
20 Gonzalez, R.C., and Woods, R.E.: ‘Digital image processing’ (Prentice Hall Inc, 2002, 2nd edn.)
21 ‘Color space converters’. Datasheet, Amphion Semiconductor Ltd, (www.amphion.com) DS6400 V1.1, April 2002
22 ‘CSC color space converter’. Application note, CAST Inc, (www.cast-inc.com) April 2002
23 ‘High performance color space converter’. Datasheet, ALMA Technologies, (www.alma-tech.com) May 2002
24 Albiol, A., Torres, L., and Delp, E.J.: ‘An unsupervised color image segmentation algorithm for face detection applications’. Proc. Int. Conf. on Image Processing, October 2001, Vol. 2, pp. 681–684
25 Kuchi, P., Gabbur, P., Bhat, P.S., and David, S.: ‘Human face detection and tracking using skin color modelling and connected component operators’, IETE J. Res., 2002, 48, pp. 289–293
26 Mitchell, J.L., and Pennebaker, W.B.: ‘MPEG video compression standard’ (Chapman & Hall, 1996)
27 Bartkowiak, M.: ‘Optimisations of color transformation for real time video decoding’. Presented at Digital Signal Processing for Multimedia Communications and Services, EURASIP ECMCS 2001, Budapest, Hungary, September 2001