
Accelerating matrix product on reconfigurable hardware for image processing applications

F. Bensaali, A. Amira and A. Bouridane

Abstract: Matrix multiplication is very important in many types of applications, including image and signal processing. The suitability of reconfigurable hardware devices, in the form of field programmable gate arrays (FPGAs), is investigated as a low-cost solution for implementing two matrix multipliers for 3-D affine transformations and colour space conversion. A first solution based on processing large matrix multiplication, for large 3-D models, and for the evaluation of the Celoxica fixed-point library and Xilinx CoreGen performance has been reported. A novel architecture for efficient implementation of a colour space converter (CSC) based on distributed arithmetic (DA) principles has been presented. The two multipliers have been developed and implemented on the RC1000-PP Celoxica board-based development platform. Results show that the FPGA-based first parallel multiplier can achieve the performance of a graphics card when performing 3-D affine transformations, while the second multiplier, which is fully pipelined and platform-independent, has a low latency (8 cycles) and is capable of a sustained data rate of over 234 mega-conversions per second.

1 Introduction

Matrix algorithms are commonly used in the areas of graph theory, numerical algorithms, digital control and signal processing. A close examination of these algorithms reveals that many of the fundamental operations involve matrix computations, such as matrix multiplication, which requires enormous computing power. Moreover, medical imaging, 3-D image manipulation, edge detection for object recognition and other applications involve large matrix multiplication [1]. Multiplying matrices of this kind is time-consuming, since the complexity is O(N^3), where N is the dimension of a square matrix. Because most current applications require higher computational throughput, many researchers have tried to improve the performance of matrix multiplication. Even with improvements such as Strassen's algorithm-based partitioning for sequential matrix multiplication [2], performance is limited. For this reason, parallel approaches, which have complexity O(N^3/p) when using p parallel processors, have been examined for decades [1]. As part of an ongoing research project to develop a hardware accelerator for image and signal processing algorithms based on matrix computations at Queen's University of Belfast [3-5], this paper proposes:

• A parallel matrix multiplier for 3-D affine transformations, with an evaluation of the performance of the Xilinx CoreGen [6] and Celoxica fixed-point library [7] for the implementation.

• A novel architecture for colour space conversion based on matrix–vector multiplication using DA principles. DA is a bit-level rearrangement of a multiply-accumulate that hides the multiplications: it distributes arithmetic operations rather than grouping them as multipliers do. Conventional DA, called ROM-based DA, decomposes the variable input of the inner product to bit level in order to generate precomputed data. ROM-based DA uses a ROM table to store the precomputed data, which makes it regular and efficient in the use of silicon area in a VLSI implementation. The advantage of a ROM-based DA approach is its efficiency of implementation: the basic operations required are a sequence of ROM look-ups, additions, subtractions and shifts of the input data sequence [8]. A small software sketch of this idea is given below; examples of the use of DA can be found in [8-10].
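To make the ROM-based DA idea concrete, the following minimal C sketch (illustrative only; the coefficients and word length are hypothetical and are not those used later in the paper) computes a four-term inner product by precomputing all 2^4 coefficient sums and accumulating one ROM look-up per bit plane:

    /* Minimal illustrative sketch of ROM-based DA: the inner product
     * y = sum_m a[m]*x[m] for constant a[] and W-bit unsigned inputs x[]
     * is computed by precomputing all 2^4 partial sums of a[] and then
     * accumulating one ROM look-up per bit plane, shifted by the bit weight. */
    #include <stdio.h>

    #define W 8                                      /* input word length (bits) */

    int main(void)
    {
        const int a[4] = {3, -5, 7, 2};              /* hypothetical constants */
        const unsigned char x[4] = {12, 200, 33, 1}; /* hypothetical inputs */
        int rom[16];

        /* Precompute the 2^4 possible sums of coefficients (the "ROM" contents). */
        for (int addr = 0; addr < 16; addr++) {
            rom[addr] = 0;
            for (int m = 0; m < 4; m++)
                if (addr & (1 << m))
                    rom[addr] += a[m];
        }

        /* Bit-serial accumulation: one ROM access per bit plane, no multipliers. */
        int y = 0;
        for (int l = 0; l < W; l++) {
            int addr = 0;
            for (int m = 0; m < 4; m++)
                addr |= ((x[m] >> l) & 1) << m;      /* gather bit l of each input */
            y += rom[addr] << l;                     /* add the shifted ROM value */
        }

        /* Check against the direct inner product. */
        int ref = 0;
        for (int m = 0; m < 4; m++)
            ref += a[m] * x[m];
        printf("DA result = %d, direct result = %d\n", y, ref);
        return 0;
    }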

The target hardware for the implementation and verification of the proposed architectures is a Celoxica RC1000-PP PCI-based FPGA development board equipped with a Xilinx XCV2000E Virtex FPGA [7, 11]. The structure of this paper can be split into three parts: the first part is concerned with the 3-D affine transformations, the second part is concerned with the colour space conversion, while a general conclusion is given in the third part.

2 3-D affine transformations

Computer graphics algorithms are generally computationally expensive, which is why considerable effort is made to accelerate them by any reasonable means. The traditional sources of speedup are faster processors, parallelism or dedicated hardware. Recent advances in digital circuit technology, especially the rapid development of FPGAs, offer an alternative route to acceleration. Attempts to implement such algorithms on FPGAs have been the subject of several research efforts. In [12] various techniques for improving the cost effectiveness of graphics applications have been described; methods for exploiting custom data formats and datapath widths, and for optimising graphics operations such as texture mapping and hidden-surface removal, have been studied.

The authors are with the School of Computer Science, Queen's University of Belfast, Belfast BT7 1NN, UK

© IEE, 2005

IEE Proceedings online no. 20040838

doi:10.1049/ip-cds:20040838

Paper first received 9th December 2003 and in revised form 7th June 2004. Originally published online: 3rd June 2005


Customised architectures have been implemented on the Xilinx 4000 and Virtex FPGAs in [12] using Handel-C, a C-like language supporting parallelism and flexible data sizes. Singh and Bellec [13] have shown that FPGAs can implement simple and complex graphics algorithms with a performance level that places them comfortably between custom graphics chips and general processors with specialised graphics instruction sets. A new method of hardware texture mapping, in which texture images are synthesised using FPGAs, was presented in [14]. The conclusion from this work was that, using FPGAs, procedural textures can be synthesised at high speed with low hardware cost. It is the aim of this work to use FPGAs as a low-cost accelerator to develop and implement a matrix multiplier for 3-D affine transformations using Handel-C, and to evaluate the performance of the Xilinx CoreGen and Celoxica fixed-point library for the implementation.

2.1 Review

In computer graphics the most popular method for representing an object is the polygon mesh model. In the simplest case, a polygon mesh is a structure that consists of polygons represented by a list of (x, y, z) co-ordinates that are the polygon vertices. Thus the information stored to describe an object is finally a list of points or vertices [15] (Fig. 1).

3-D affine transformations are the transformations that involve rotation, scaling, shear and translation. A matrix can represent an affine transformation, and a set of affine transformations can be combined into a single overall affine transformation. Technically, an affine transformation is made up of any combination of linear transformations (rotation, scaling and shear) followed by translation (strictly, translation is not a linear transformation) [15]. A set of vertices or three-dimensional points belonging to an object can be transformed into another set of points by a linear transformation, and matrix notation is used in computer graphics to describe such transformations. Using matrix notation, a vertex V is transformed to V* (* denotes the transformed vertex) under translation, scaling and rotation, which are the most commonly used transformations in computer graphics, as

\[ V^{*} = D + V, \qquad V^{*} = S \cdot V, \qquad V^{*} = R \cdot V \tag{1} \]

where D is a translation vector, and S and R are the scaling and rotation matrices [15, 16].

A uniform representation of all transformations in matrix notation is necessary for implementing these transformations in hardware. As it is not possible to describe translation in matrix notation in Cartesian co-ordinates, homogeneous co-ordinates have to be used; it is, however, very easy to transform Cartesian into homogeneous co-ordinates and vice versa. In a homogeneous system a vertex V(x, y, z) is represented as V(X, Y, Z, w) for any scale factor w ≠ 0. The three-dimensional Cartesian co-ordinate representation is then

\[ x = X/w, \qquad y = Y/w, \qquad z = Z/w \tag{2} \]

In computer graphics w is always taken to be one and the matrix representation of a point is (x y z 1)^T. Translation can now be treated as a matrix multiplication operation, like the other two transformations, and becomes

\[
\begin{pmatrix} x^{*} \\ y^{*} \\ z^{*} \\ 1 \end{pmatrix} =
\begin{pmatrix} 1 & 0 & 0 & T_x \\ 0 & 1 & 0 & T_y \\ 0 & 0 & 1 & T_z \\ 0 & 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}
= T \cdot \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} \tag{3}
\]

Therefore, in homogeneous co-ordinates it is possible to describe any transformation in matrix notation:

\[
\begin{pmatrix} x^{*} \\ y^{*} \\ z^{*} \\ 1 \end{pmatrix} =
\begin{pmatrix} A & D & G & J \\ B & E & H & K \\ C & F & I & L \\ 0 & 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} \tag{4}
\]

This universal transformation matrix can be divided into four functional blocks:

\[
\begin{pmatrix} \text{scaling and rotation} & \text{translation} \\ \text{part of the homogeneous representation} & 1 \end{pmatrix} \tag{5}
\]

The matrix representations for the two other most commonly used transformations are as follows:

• Scaling:

\[ V^{*} = S \cdot V \tag{6} \]

\[
S = \begin{pmatrix} S_x & 0 & 0 & 0 \\ 0 & S_y & 0 & 0 \\ 0 & 0 & S_z & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \tag{7}
\]

Here S_x, S_y and S_z are the scaling factors. For a uniform scaling S_x = S_y = S_z.

• Rotation:

To rotate an object in three-dimensional space, an axis of rotation needs to be specified. This can have any spatial orientation, but it is easier to consider rotations about axes that are parallel to one of the co-ordinate axes. The transformation matrices for rotation about the X, Y and Z-axes, respectively, are

\[
R_x = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\theta & \sin\theta & 0 \\ 0 & -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad
R_y = \begin{pmatrix} \cos\theta & 0 & -\sin\theta & 0 \\ 0 & 1 & 0 & 0 \\ \sin\theta & 0 & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad
R_z = \begin{pmatrix} \cos\theta & \sin\theta & 0 & 0 \\ -\sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \tag{8}
\]

Fig. 1 Data structure for object representation (a cube with eight numbered vertices, its polygon faces and the corresponding vertex list of (x, y, z) co-ordinates)

It is worth noting that a sequence of transformations can be represented by a single matrix \( T = T_1 \cdot T_2 \cdots T_N \).
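As a concrete illustration (a minimal C sketch, not the paper's Handel-C implementation), the following fragment composes a Z-axis rotation (8) with a translation (3) into a single 4 × 4 matrix and applies it to one homogeneous vertex:

    /* Illustrative sketch: compose a Z-axis rotation and a translation into one
     * 4x4 matrix T = Rz * Tr, then transform a homogeneous vertex (x, y, z, 1). */
    #include <math.h>
    #include <stdio.h>

    static void matmul4(const double a[4][4], const double b[4][4], double r[4][4])
    {
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++) {
                r[i][j] = 0.0;
                for (int k = 0; k < 4; k++)
                    r[i][j] += a[i][k] * b[k][j];
            }
    }

    int main(void)
    {
        double theta = 0.7853981634;           /* 45-degree rotation about Z */
        double rz[4][4] = {{ cos(theta), sin(theta), 0, 0},
                           {-sin(theta), cos(theta), 0, 0},
                           { 0,          0,          1, 0},
                           { 0,          0,          0, 1}};
        double tr[4][4] = {{1, 0, 0,  2.0},    /* translation by (2, -1, 3) */
                           {0, 1, 0, -1.0},
                           {0, 0, 1,  3.0},
                           {0, 0, 0,  1.0}};
        double t[4][4];
        matmul4(rz, tr, t);                    /* single combined transform */

        double v[4]  = {1.0, 0.0, 0.0, 1.0};   /* vertex in homogeneous form */
        double vt[4] = {0.0, 0.0, 0.0, 0.0};
        for (int i = 0; i < 4; i++)
            for (int k = 0; k < 4; k++)
                vt[i] += t[i][k] * v[k];

        printf("transformed vertex: (%f, %f, %f)\n", vt[0], vt[1], vt[2]);
        return 0;
    }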

Figure 2 shows different examples of a cube containing eight vertices when different transformations are applied.

2.2 3-D affine transformation-based large matrix multiplication

Consider an object represented with N vertices. The new position (NP) of the object when a transformation is applied can be calculated as follows:

\[ NP = T \cdot OP \tag{9} \]

where T is the transformation matrix, OP is a (4, N) matrix containing the old vertex positions and NP is a (4, N) matrix containing the new vertex positions.

\[
\begin{pmatrix}
x^{*}_0 & x^{*}_1 & \cdots & x^{*}_{N-1} \\
y^{*}_0 & y^{*}_1 & \cdots & y^{*}_{N-1} \\
z^{*}_0 & z^{*}_1 & \cdots & z^{*}_{N-1} \\
1 & 1 & \cdots & 1
\end{pmatrix} =
\begin{pmatrix} A & D & G & J \\ B & E & H & K \\ C & F & I & L \\ 0 & 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix}
x_0 & x_1 & \cdots & x_{N-1} \\
y_0 & y_1 & \cdots & y_{N-1} \\
z_0 & z_1 & \cdots & z_{N-1} \\
1 & 1 & \cdots & 1
\end{pmatrix} \tag{10}
\]

Figure 3 shows two 3-D objects: a foot skeleton (N = 2154) and a face (N = 5597).

2.3 FPGA implementation

2.3.1 Implementation approach: Handel-C is a high-level language at the heart of a hardware compilation system known as the Celoxica Development Kit (DK) [7], which is designed to compile programs written in a C-like high-level language into synchronous hardware.

Fig. 2 3-D transformation examples (a cube in its original position, and after z-axis rotation, x-axis scaling and y-axis translation)

Fig. 3 Examples of 3-D objects: (a) foot skeleton containing 2154 vertices; (b) face containing 5597 vertices

Fig. 4 Hardware–software tools used for the implementation: (a) Handel-C design flow (system-level model, hardware/software partitioning, C code compiled for the host processor and Handel-C code simulated and compiled in the Celoxica DK2 IDE, EDIF output to the Xilinx place-and-route tools and FPGA bitstream configuration, with external cores supplied as schematics, VHDL or CoreGen); (b) schematic view of the FPGA/memory-bank part of the RC1000-PP board (XCV2000E, four DMA-accessible banks, 8-bit control and status ports)


One of the advantages of using hardware is the ability to exploit parallelism directly. Because standard C is a sequential language, Handel-C has additional constructs to support the parallelisation of code and to allow fine control over what hardware is generated.

DK produces a netlist file, which is used during the place and route stage to generate the image or bitstream file [7] (Fig. 4).

The RC1000-PP co-processor board used is a standard PCI bus card equipped with a large FPGA chip. It has 8 Mbytes of SRAM directly connected to the FPGA in four 32-bit wide memory banks, all accessible by the FPGA and any device on the PCI bus. Different methods of data transfer from the host PC or the environment to the FPGA are available as follows:

• Bulk transfer of data between the FPGA and the PCI bus is performed through the memory banks 0–3.

• Streams of bytes are most conveniently communicated through the unidirectional 8-bit control and status ports (Fig. 4).

The RC1000-PP board is supported by a macro library that simplifies the process of initialising and talking to the hardware. This library comprises a set of driver functions with the following functionality:

• initialisation and selection of a board

• handling of FPGA configuration files

• data transfer between the PC and the RC1000-PP board

• functions to help with error checking and debugging.

These library functions can be included in a C or C++ program that runs on the host PC and performs data transfer via the PCI bus [7].

Figure 5 shows the proposed parallel matrix multiplier (PMM), which can be used to perform the matrix multiplication described in Section 2.2. The multiplier has been implemented using Handel-C and compiled using DK version 2 (DK2) on the RC1000-PP board. In order to perform this multiplication the four external memory banks have been exploited.

Since the vertex co-ordinates are real numbers, floating-point or fixed-point representations can be used. Celoxica provides two libraries (floating-point and fixed-point) that allow different widths to be specified, thus allowing designers to use the minimum number of bits to represent data and consequently to generate smaller hardware. It can be seen from [17] that, when the two libraries are used, the floating-point representation has large resource requirements and consequently lower performance. If the range of real-number values that must be represented is small, or can be scaled to make it smaller, fixed-point arithmetic is a way of providing cheap, fast non-integer support. Fixed-point arithmetic is appropriate for our application because the range of the values is small. The proposed architecture consists of p identical PEs, where p should be a multiple of four. Each PE comprises a fixed-point multiply accumulator (MAC) and a register for final result storage.

Fig. 5 Proposed parallel fixed-point matrix multiplier for 3-D affine transformations (the four memory banks hold the rows of matrix T and the column-wise blocks of matrix OP; buffers TBuf1–TBuf4 feed the p processor elements, each containing a fixed-point multiply accumulator and a storage element, and four storage processors return the results NP_ij to the banks; PE: processor element, SP: storage processor, SE: storage element)


The MAC has been implemented using two approaches:

(i) MAC based on the Celoxica fixed-point library: this is a device-independent hardware library that allows the widths of the fractional and integer parts of the number to be defined and provides macros to execute arithmetic operations [7].

(ii) MAC based on Xilinx CoreGen: Xilinx's CoreGen utility contains many designs that can often save time for a programmer, and it is possible to integrate Xilinx CoreGen blocks with a Handel-C program using the interface declaration [6]. Two components have been used: a parallel signed integer multiplier and a parallel signed integer adder, both suitable for the Xilinx XCV2000E-6 Virtex-E FPGA (Fig. 6).
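For illustration, a software model of the fixed-point MAC of Fig. 6 might look as follows (a minimal sketch assuming the 22-bit operand format described below, i.e. one sign bit, seven integer bits and 14 fractional bits, so the post-multiply shift equals the fractional width):

    /* Minimal software model of the fixed-point MAC of Fig. 6 (illustrative only).
     * Operands use a signed fixed-point format with 14 fractional bits, so the
     * wide product is shifted right by 14 before being accumulated. */
    #include <stdint.h>

    #define FRAC_BITS 14

    typedef int32_t fixed22;     /* 22-bit value held in a 32-bit container */

    static int32_t mac_step(int32_t acc, fixed22 a, fixed22 b)
    {
        int64_t product = (int64_t)a * (int64_t)b;     /* up to 44 significant bits */
        return acc + (int32_t)(product >> FRAC_BITS);  /* rescale and accumulate */
    }

    /* Example: convert a real value to this fixed-point format, e.g. 1.5 -> 1.5 * 2^14 */
    static fixed22 to_fixed(double x) { return (int32_t)(x * (1 << FRAC_BITS)); }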

In both cases, the vertex co-ordinates are represented with 22 bits (14 bits for the fractional part, seven bits for the integer part and one sign bit). The input transformation matrix T is partitioned into four row-wise blocks, which gives one row per block; each block is stored in one of the four available banks. The matrix OP is partitioned into four column-wise blocks and, likewise, each block is stored in one of the banks. Because different elements stored in the same SRAM cannot be accessed simultaneously, four buffers (TBuf1, TBuf2, TBuf3, TBuf4) storing the four rows of T have been used to avoid memory conflicts. Data are transferred from the banks to the buffers. Columns of the matrix OP are transferred from the SRAMs to the PEs in parallel. Each PE computes one element of the output matrix NP. The four storage processors (SPs), which have access to the PE registers, are used to transfer the final results to the banks and operate as an interface between the p PEs and the four memory banks. Each group of four PEs works as a matrix–vector multiplier; therefore, four PEs are used to compute the new position of a vertex. The entire computation of the matrix NP can be carried out in [2 × (4 × N)/p + BI + N/NB] clock cycles, where:

• 2 is the number of clock cycles needed by the multiply accumulator for one accumulation

• N is the number of object vertices

• p is the number of PEs used

• BI = 4 is the number of clock cycles needed for buffer initialisation

• NB = 4 is the number of memory banks available

• N/NB is the number of clock cycles needed for final result storage.

It is worth noting that the number of vertices N is always rounded up to a multiple of four; therefore, the last partition of the matrix OP should be padded with, at most, three columns of zeros (e.g. for N = 2154 two columns of zeros should be appended to the matrix OP, in which case each partition has a size of 539 × 4). Since the number of clock cycles needed to perform a transformation is determined by the slowest processor element, adding columns of zeros does not affect it.
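As a worked check of the cycle formula (an illustrative calculation using the clock rates reported in Table 1, not a new measurement), take the foot skeleton with N = 2154 and p = 4 PEs:

\[ 2 \times \frac{4 \times 2154}{4} + 4 + \frac{2154}{4} \approx 4308 + 4 + 539 = 4851 \text{ clock cycles} \]

At the 22 MHz reported for the Celoxica fixed-point MAC with four PEs this corresponds to roughly \( 4851 / 22\,\text{MHz} \approx 220\ \mu\text{s} \), which is consistent with the corresponding computation-time entry in Table 2.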

Table 1 illustrates the performance obtained for the proposed architecture when using the two different approaches for the MAC implementation.

The buffers in our multiplier have been implemented using look-up tables (LUTs). The multiplier can perform transformations on an object with a number of vertices N of up to 2^18. The implementation using the MAC based on Xilinx CoreGen shows better performance than the one based on the Celoxica library, owing to the suitability of the cores used for the FPGA chip available on our board.

There exist expensive 3-D graphics cards that support the manipulation of co-ordinate transformations in hardware. Our PC (Pentium 4 CPU, 2.00 GHz) is equipped with an ATI RADEON FSC 32 MB graphics card, which belongs to this category. This card delivers immersive, realistic colour and 3-D graphics at high frame rates. The RADEON Charisma Engine, which takes care of the geometry processing, and the Pixel Tapestry architecture, which is the rendering engine, support full transformation, clipping and lighting for improved 3-D detail. With full support for DirectX and OpenGL, it accelerates today's top 3-D games.

Table 2 shows the performance obtained, in terms of minimum period and computation time, for the RADEON graphics card and for our FPGA implementation with different numbers of PEs when performing a transformation on the foot skeleton object, which contains 2154 vertices.

Fig. 6 Fixed-point MAC (22-bit parallel signed integer multiplier, logical right shift of the 44-bit product by 14 bits, and 32-bit parallel signed integer adder)

Table 1: Area/speed implementation report for the proposed parallel fixed-point matrix multiplier for two different approaches

MAC used                       Number of PEs   Area (%)   Speed (MHz)
Celoxica fixed-point library   4               17         22
                               8               35         20
                               16              75         17
Xilinx CoreGen                 4               10         35
                               8               20         30
                               16              42         24

Table 2: Computation time comparison of the proposed structure with the RADEON FSC 32 MB graphics card

Implementation                        Number of PEs   Minimum period (µs)   Computation time (µs)
FPGA, Celoxica fixed-point library    4               1/22                  220.47
                                      8               1/20                  134.825
                                      16              1/17                  31.905
FPGA, Xilinx CoreGen                  4               1/35                  138.585
                                      8               1/30                  89.883
                                      16              1/24                  22.600
RADEON FSC 32 MB graphics card        –               –                     14.806



It can be seen from Table 2 that an improvement in the MAC gives better results in terms of computation time. Based on the number of clock cycles obtained for our design, a 37 MHz clock frequency (minimum period 1/37 µs) with p = 16, which gives a computation time less than that obtained when using the graphics card, would be enough to outperform it. The performance of the matrix multiplier dedicated to 3-D affine transformations demonstrates that the FPGA can be used as an effective low-cost solution. Although the RADEON FSC 32 MB card is approximately twice as fast, a lower-cost alternative may be preferable for applications that do not require the additional performance.

2.3.2 Proposed environment for 3-D affine transformations on FPGA: Figure 7 shows a general view of the entire proposed system. The environment consists of a host application (GUI), a 3-D object database, the open graphics library (OpenGL) [18] and the single-FPGA coprocessor based on the RC1000-PP development board.

• 3-D object database: contains the 3-D model files (.OBJ, .3DS, .ASE).

• OpenGL: the specification of a powerful set of more than 350 graphics routines for 2-D and 3-D graphics processing. OpenGL includes facilities for:

  defining and rendering 3-D primitives such as points, lines, polygons, spheres and cones

  viewing 2-D projections of 3-D scenes

  manipulating co-ordinate transformations

  lighting: light sources and material properties

  texture mapping.

• Coprocessor: performs the 3-D affine transformations.

• Host application (GUI): implemented using Borland C++, it gives the user the ability to select a 3-D model from the 3-D object database and display it in the available 3-D viewer. The user can apply different algorithms to the object, such as texturing, lighting and antialiasing, which involve calls to the OpenGL functions. Since C++ does not support fixed-point formats, a floating-point to fixed-point converter has been implemented. The vertex co-ordinates are converted from floating point to fixed point before performing the DMA transfer, and the inverse operation is applied to the result in order to reconstruct the transformed 3-D model.
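A host-side conversion of this kind could be sketched as follows (illustrative only; it assumes the 22-bit fixed-point format with 14 fractional bits used by the hardware and is not the actual Borland C++ code of the GUI):

    /* Illustrative host-side conversion of an array of vertex co-ordinates to the
     * 22-bit fixed-point format assumed by the hardware (14 fractional bits)
     * before a DMA transfer, and back to floating point afterwards. */
    #include <stdint.h>

    #define FRAC_BITS 14

    void vertices_to_fixed(const double *in, int32_t *out, int count)
    {
        for (int i = 0; i < count; i++)
            out[i] = (int32_t)(in[i] * (1 << FRAC_BITS));  /* scale and truncate */
    }

    void vertices_to_float(const int32_t *in, double *out, int count)
    {
        for (int i = 0; i < count; i++)
            out[i] = (double)in[i] / (1 << FRAC_BITS);     /* undo the scaling */
    }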

3 Colour space conversion

Colour is a visual sensation produced by light in the visible region of the spectrum incident on the retina. Since the human visual system has three types of colour photoreceptor cone cells, three components are necessary and sufficient to describe a colour [19].

Colour spaces (also called colour models or colour systems) provide a standard method of defining and representing colours. There are many existing colour spaces and most of them represent each colour as a point in a 3-D co-ordinate system. Each colour space is optimised for a well-defined application area [20].

The most popular colour models are RGB (used in computer graphics); YIQ, YUV and YCrCb (used in video systems); and CMYK (used in colour printing). All of these colour spaces can be derived from the RGB information supplied by devices such as cameras and scanners.

Processing an image in the RGB colour space, with a set of RGB values for each pixel, is not the most efficient method. To speed up some processing steps, many broadcast, video and imaging standards use luminance and colour-difference video signals, such as YCrCb, thus making a mechanism for converting between formats necessary [21]. Several cores for RGB to YCrCb conversion designed for FPGA implementation can be found on the market, such as those proposed by Amphion Ltd [21], CAST Inc. [22] and ALMA Technologies [23].

It is the aim of this work to propose a novel architecture for RGB to YCrCb colour space conversion based on matrix–vector multiplication using DA ROM-accumulator principles. In the rest of this Section, the gamma-corrected RGB values are denoted R′G′B′.

3.1 Converting from R′G′B′ to Y′CrCb

Decomposing an R′G′B′ colour image into one luminance image and two chrominance images is the method that has been used in most commercial applications, such as face detection [24, 25], as well as in the JPEG and MPEG imaging standards [26, 27].

The calculation of the Y′CrCb colour components from the R′G′B′ components consumes up to 40% of the processing power in a highly optimised decoder [27]. Accelerating this operation is therefore useful for accelerating the whole process.

Fig. 7 Proposed system for 3-D affine transformations on FPGA (graphical user interface, 3-D object database, OpenGL, and the RC1000-PP-based coprocessor exchanging the transformation matrix T, the old-position matrix OP and the new-position matrix NP with the host via DMA and the four memory banks)


A colour in the R′G′B′ colour space is converted to the Y′CrCb colour space using the following equation [23]:

\[
\begin{pmatrix} Y' \\ Cr \\ Cb \end{pmatrix} =
\begin{pmatrix}
0.257 & 0.504 & 0.098 & 16 \\
0.439 & -0.368 & -0.071 & 128 \\
-0.148 & -0.291 & 0.439 & 128
\end{pmatrix}
\begin{pmatrix} R' \\ G' \\ B' \\ 1 \end{pmatrix} \tag{11}
\]

The inverse conversion can be carried out using the following equation [23]:

\[
\begin{pmatrix} R' \\ G' \\ B' \end{pmatrix} =
\begin{pmatrix}
1.164 & 1.596 & 0.0 & -222.912 \\
1.164 & -0.813 & -0.392 & 135.616 \\
1.164 & 0.0 & 2.017 & -276.8
\end{pmatrix}
\begin{pmatrix} Y' \\ Cr \\ Cb \\ 1 \end{pmatrix} \tag{12}
\]
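As a point of reference for the hardware described below, a direct software implementation of (11) can be sketched as follows (illustrative only; it uses double-precision arithmetic and clamps to the conventional 8-bit range, rather than the 13-bit fixed-point format of the hardware):

    /* Reference software conversion of one R'G'B' pixel to Y'CrCb using (11). */
    static unsigned char clamp8(double x)
    {
        if (x < 0.0)   return 0;
        if (x > 255.0) return 255;
        return (unsigned char)(x + 0.5);          /* round to nearest */
    }

    void rgb_to_ycrcb(unsigned char r, unsigned char g, unsigned char b,
                      unsigned char *y, unsigned char *cr, unsigned char *cb)
    {
        *y  = clamp8( 0.257 * r + 0.504 * g + 0.098 * b +  16.0);
        *cr = clamp8( 0.439 * r - 0.368 * g - 0.071 * b + 128.0);
        *cb = clamp8(-0.148 * r - 0.291 * g + 0.439 * b + 128.0);
    }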

3.1.1 Mathematical background: Consider an L × M image (L image height, M image width) (Fig. 8). Let each image pixel be represented by b_ijk (0 ≤ i ≤ L−1, 0 ≤ j ≤ M−1, 0 ≤ k ≤ 2), where

\[
\begin{cases}
b_{ij0} = R'_{ij}, & \text{the red component of the pixel in row } i \text{ and column } j \\
b_{ij1} = G'_{ij}, & \text{the green component of the pixel in row } i \text{ and column } j \\
b_{ij2} = B'_{ij}, & \text{the blue component of the pixel in row } i \text{ and column } j
\end{cases} \tag{13}
\]

The image can be converted using the following mathematical formula:

\[
\begin{pmatrix}
\begin{pmatrix} c_{000} \\ c_{001} \\ c_{002} \end{pmatrix} & \cdots &
\begin{pmatrix} c_{0(M-1)0} \\ c_{0(M-1)1} \\ c_{0(M-1)2} \end{pmatrix} \\
\vdots & \ddots & \vdots \\
\begin{pmatrix} c_{(L-1)00} \\ c_{(L-1)01} \\ c_{(L-1)02} \end{pmatrix} & \cdots &
\begin{pmatrix} c_{(L-1)(M-1)0} \\ c_{(L-1)(M-1)1} \\ c_{(L-1)(M-1)2} \end{pmatrix}
\end{pmatrix}
=
\begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \end{pmatrix}
\otimes
\begin{pmatrix}
\begin{pmatrix} b_{000} \\ b_{001} \\ b_{002} \\ 1 \end{pmatrix} & \cdots &
\begin{pmatrix} b_{0(M-1)0} \\ b_{0(M-1)1} \\ b_{0(M-1)2} \\ 1 \end{pmatrix} \\
\vdots & \ddots & \vdots \\
\begin{pmatrix} b_{(L-1)00} \\ b_{(L-1)01} \\ b_{(L-1)02} \\ 1 \end{pmatrix} & \cdots &
\begin{pmatrix} b_{(L-1)(M-1)0} \\ b_{(L-1)(M-1)1} \\ b_{(L-1)(M-1)2} \\ 1 \end{pmatrix}
\end{pmatrix} \tag{14}
\]

where the operation ⊗ is defined as follows: each vector \( (c_{ij0}, c_{ij1}, c_{ij2})^{T} \) is the result of the product

\[
\begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \end{pmatrix}
\begin{pmatrix} b_{ij0} \\ b_{ij1} \\ b_{ij2} \\ 1 \end{pmatrix}
\]

where the c_ijk represent the output image colour space components and

\[
A = \begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \end{pmatrix}
\]

represents one of the constant matrices in (11) and (12). The c_ijk elements can be computed using the following equation:

\[
c_{ijk} = \sum_{m=0}^{3} a_{km} \, b_{ijm}, \qquad 0 \le i \le L-1,\; 0 \le j \le M-1,\; 0 \le k \le 2 \tag{15}
\]

where {a_km} are l-bit constants and {b_ijm} are written in unsigned binary representation as shown in (16):

\[
b_{ijm} = \sum_{l=0}^{W-1} b_{ijm,l} \, 2^{l}, \qquad 0 \le i \le L-1,\; 0 \le j \le M-1,\; 0 \le m \le 2 \tag{16}
\]

where b_ijm,l is the lth bit of b_ijm, which is zero or one, and W is the word length used, which represents the resolution of each colour component of a pixel.

Substituting (16) in (15):

\[
c_{ijk} = \sum_{m=0}^{3} a_{km} \left( \sum_{l=0}^{W-1} b_{ijm,l} \, 2^{l} \right)
        = \sum_{l=0}^{W-1} \left( \sum_{m=0}^{3} a_{km} \, b_{ijm,l} \right) 2^{l} \tag{17}
\]

Define

\[
Z_l = \sum_{m=0}^{3} a_{km} \, b_{ijm,l} \tag{18}
\]

Therefore, c_ijk can be computed as

\[
c_{ijk} = \sum_{l=0}^{W-1} Z_l \, 2^{l} \tag{19}
\]

The idea is that, since the term Z_l depends only on the bit values b_ijm,l and has only 2^4 possible values, it is possible to precompute these values and store them in ROMs. An input set of four bits (b_ij0,l, b_ij1,l, b_ij2,l, b_ij3,l) is used as an address to retrieve the corresponding Z_l value. The ROM content depends on the coefficients of the constant matrix A.

Fig. 8 R′G′B′ to Y′CrCb conversion of an L × M image


These intermediate results are accumulated over W clock cycles to produce the c_ijk coefficients.

Since all the input components are in the range 0–255, eight bits (W = 8) are required to represent them. Equation (19) becomes

\[
c_{ijk} = \sum_{l=0}^{7} Z_l \, 2^{l} \tag{20}
\]

\[
\begin{aligned}
c_{ijk} = {} & (a_{k0} b_{ij0,0} + a_{k1} b_{ij1,0} + a_{k2} b_{ij2,0} + a_{k3} b_{ij3,0}) \, 2^{0} \; + && l = 0 \\
             & (a_{k0} b_{ij0,1} + a_{k1} b_{ij1,1} + a_{k2} b_{ij2,1} + a_{k3} b_{ij3,1}) \, 2^{1} \; + && l = 1 \\
             & (a_{k0} b_{ij0,2} + a_{k1} b_{ij1,2} + a_{k2} b_{ij2,2} + a_{k3} b_{ij3,2}) \, 2^{2} \; + && l = 2 \\
             & \qquad \vdots \\
             & (a_{k0} b_{ij0,7} + a_{k1} b_{ij1,7} + a_{k2} b_{ij2,7} + a_{k3} b_{ij3,7}) \, 2^{7} && l = 7
\end{aligned} \tag{21}
\]

Since the element b_ij3 is always equal to one,

\[
b_{ij3,l} = \begin{cases} 1 & \text{for } l = 0 \\ 0 & \text{for } l \ne 0 \end{cases} \tag{22}
\]

Equation (20) can therefore be rewritten as

\[
c_{ijk} = \sum_{l=0}^{7} Z^{*}_{l} \, 2^{l} + a_{k3} \tag{23}
\]

where

\[
Z^{*}_{l} = \sum_{m=0}^{2} a_{km} \, b_{ijm,l} \tag{24}
\]

\[
\begin{aligned}
c_{ijk} = {} & (a_{k0} b_{ij0,0} + a_{k1} b_{ij1,0} + a_{k2} b_{ij2,0}) \, 2^{0} \; + && l = 0 \\
             & (a_{k0} b_{ij0,1} + a_{k1} b_{ij1,1} + a_{k2} b_{ij2,1}) \, 2^{1} \; + && l = 1 \\
             & (a_{k0} b_{ij0,2} + a_{k1} b_{ij1,2} + a_{k2} b_{ij2,2}) \, 2^{2} \; + && l = 2 \\
             & \qquad \vdots \\
             & (a_{k0} b_{ij0,7} + a_{k1} b_{ij1,7} + a_{k2} b_{ij2,7}) \, 2^{7} + a_{k3} && l = 7 \\[4pt]
        = {} & PP_{ijk0} + PP_{ijk1} + PP_{ijk2} + PP_{ijk3} + PP_{ijk4} + PP_{ijk5} + PP_{ijk6} + PP_{ijk7} + a_{k3}
\end{aligned} \tag{25}
\]

It is worth mentioning that the size of each ROM has thus been reduced to 2^3 entries. Table 3 gives the content of each ROM.
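The following C sketch models this DA scheme in software (illustrative only; the hardware of Section 3.2 evaluates all eight bit planes in parallel in a pipelined fashion, whereas this model iterates over them):

    /* Software model of the DA scheme of (23)-(25): for each output component k,
     * an 8-entry ROM holds the precomputed sums of a[k][0..2], addressed by one
     * bit from each of R', G' and B'; the constant a[k][3] is added at the end. */
    void csc_da(const double a[3][4], unsigned char r, unsigned char g,
                unsigned char b, double c[3])
    {
        for (int k = 0; k < 3; k++) {
            double rom[8];
            for (int addr = 0; addr < 8; addr++)           /* precompute ROM k */
                rom[addr] = (addr & 1 ? a[k][0] : 0.0)
                          + (addr & 2 ? a[k][1] : 0.0)
                          + (addr & 4 ? a[k][2] : 0.0);

            double acc = a[k][3];                          /* offset term a_k3 */
            for (int l = 0; l < 8; l++) {                  /* one bit plane per step */
                int addr = ((r >> l) & 1)
                         | (((g >> l) & 1) << 1)
                         | (((b >> l) & 1) << 2);
                acc += rom[addr] * (double)(1 << l);       /* Z*_l weighted by 2^l */
            }
            c[k] = acc;
        }
    }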

3.2 Proposed architecture

Equation (23) can be mapped onto the proposed architecture as shown in Fig. 9. The architecture consists of eight identical PEs, PE_n (0 ≤ n ≤ 7). Each PE comprises three parallel signed integer adders, three shifters (each shifting by n bit positions) and one ROMs block. Each ROMs block consists of three ROMs of size 2^3 each (Fig. 10). The ROM content is different and depends on the coefficients of matrix A, which in turn depend on the conversion type.

Table 3: Content of ROM i

b_ij0,l   b_ij1,l   b_ij2,l   Content of ROM i
0         0         0         0
0         0         1         a_i2
0         1         0         a_i1
0         1         1         a_i1 + a_i2
1         0         0         a_i0
1         0         1         a_i0 + a_i2
1         1         0         a_i0 + a_i1
1         1         1         a_i0 + a_i1 + a_i2

Fig. 9 Proposed architecture based on DA principles (eight pipelined PEs, each containing a block of three ROMs addressed by the bits b_ij0,l, b_ij1,l and b_ij2,l, three adders and shifters <<1 to <<7; the initial values of the first adders are a_03 + 0.5, a_13 + 0.5 and a_23 + 0.5, and the outputs are c_ij0, c_ij1 and c_ij2)

Fig. 10 ROMs block structure (ROM1, ROM2 and ROM3)


It is worth noting that the architecture has a latency of W cycles and a throughput rate of one conversion per cycle. The entire image conversion can therefore be carried out in (latency + (L × M)/throughput) = 8 + (L × M) clock cycles, whereas using the standard algorithm (Fig. 11) the conversion requires (3 × 4 × L × M) clock cycles, where (3 × 4) is the size of the constant matrix A.
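For example, for the 512 × 512 test image used in Section 3.3 this gives

\[ 8 + 512 \times 512 = 262{,}152 \text{ clock cycles} \]

which, at the sustained rate of over 234 mega-conversions per second, corresponds to roughly 1.1 ms, in line with the conversion time quoted in Section 3.3; the standard algorithm would instead require \( 3 \times 4 \times 512 \times 512 \approx 3.1 \times 10^{6} \) cycles.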

Figure 12 shows the functional analysis diagram of the proposed architecture.

Table 4 gives the content of the ROMs used for R′G′B′ to Y′CrCb conversion. The proposed architecture can also be used for the inverse conversion (Y′CrCb to R′G′B′) and for other conversions based on matrix–vector multiplication. The content of the ROMs for the case of Y′CrCb to R′G′B′ conversion is shown in Table 5.

The precomputed partial products are stored in the ROMs using a 13-bit fixed-point representation (seven bits for the integer part, one sign bit and five bits for the fractional part), and 13-bit arithmetic is used inside the architecture. The architecture's inputs and outputs are represented using eight bits and the outputs are rounded. Rounding usually examines the fractional value and, if it is greater than or equal to 0.5, increases the result by one; this implies a comparison followed by another arithmetic operation. A more efficient way to round a number is to add 0.5 to the result and then truncate the fractional value. This technique has been applied in the proposed architecture: the initial value of the first adder of each output channel is set in advance to (a_i3 + 0.5), where 0 ≤ i ≤ 2.
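In software terms the trick amounts to the following (a small sketch assuming the 13-bit format above, with five fractional bits):

    /* Round-by-truncation: 0.5 (i.e. 1 << 4 when there are five fractional bits)
     * is pre-added to the accumulator, so dropping the fractional bits gives
     * round-to-nearest without a separate comparison. */
    #include <stdint.h>

    #define CSC_FRAC_BITS 5

    static uint8_t fixed_round(int32_t acc_with_half)
    {
        /* acc_with_half already contains the +0.5 added at initialisation. */
        int32_t v = acc_with_half >> CSC_FRAC_BITS;   /* truncate the fraction */
        if (v < 0)   v = 0;                           /* clamp to 8-bit output */
        if (v > 255) v = 255;
        return (uint8_t)v;
    }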

3.3 FPGA implementation

Like the previous implementation, the proposed DA-based architecture has been implemented and verified using the Celoxica RC1000-PP FPGA development board. The architecture consumes 193 slices and can run at a maximum clock frequency of 188 MHz. The parallel signed adders have been implemented using Xilinx's CoreGen utility, while the shifters and the ROM initialisation have been implemented using VHDL. In order to make a fair and consistent comparison with existing FPGA-based colour space converters, the XCV50E-8 FPGA device has been targeted.

for i = 1 to L do                     // scanning image rows
    for j = 1 to M do                 // scanning image columns
        for k = 1 to 3 do             // scanning the three colour components of a pixel
            for m = 1 to 4 do         // scanning columns of the constant conversion matrix
                c[i][j][k] += a[k][m] * b[i][j][m]
            end for
        end for
    end for
end for

Fig. 11 Pseudocode for the standard algorithm

Fig. 12 Functional analysis diagram (the partial products PP_ijkn flow through the eight pipelined PEs and their delay registers; the first result c_00 emerges after eight clock cycles and one further result follows on every subsequent cycle)

Table 4: Content of the ROMs (R′G′B′ to Y′CrCb)

R′_ij0,l   G′_ij1,l   B′_ij2,l   ROM1     ROM2      ROM3
0          0          0          0        0         0
0          0          1          0.098    −0.071    0.439
0          1          0          0.504    −0.368    −0.291
0          1          1          0.602    −0.439    0.148
1          0          0          0.257    0.439     −0.148
1          0          1          0.355    0.368     0.291
1          1          0          0.761    0.071     −0.439
1          1          1          0.859    0         0

Table 5: Content of the ROMs (Y′CrCb to R′G′B′)

Y′_ij0,l   Cr_ij1,l   Cb_ij2,l   ROM1     ROM2      ROM3
0          0          0          0        0         0
0          0          1          0        −0.392    2.017
0          1          0          1.596    −0.813    0
0          1          1          1.596    −1.205    2.017
1          0          0          1.164    1.164     1.164
1          0          1          1.164    0.772     3.181
1          1          0          2.76     0.351     1.164
1          1          1          2.76     −0.041    3.181


Table 6 illustrates the performance obtained for the proposed architecture in terms of the area consumed and the speed that can be achieved.

Table 6: Performance comparison with existing CSC cores

Design                   Slices   Throughput (mega-conversions/s)
Proposed architecture    193      234
CAST Inc. [22]           222      112
ALMA Technologies [23]   222      105
Amphion Ltd [21]         204      90

The proposed architecture shows significant improvements over the existing FPGA implementations in terms of the area consumed and the throughput achieved. In addition, Fig. 13 illustrates the R′G′B′ 'Baboon' image (512 × 512) converted to the Y′CrCb format using:

• FPGA implementation: the entire image conversion can be carried out in approximately 1.2 ms.

• Software-based implementation: using a 2.0 GHz Pentium 4 processor with 1 Gbyte of SDR RAM and C++ Builder V5.0, the entire image conversion can be carried out in approximately 126 ms.

It can be seen that the same converted image as that produced by the software conversion is obtained considerably faster when using the FPGA implementation.

Fig. 13 'Baboon' (512 × 512) test image: (a) original R′G′B′ image; (b) converted image in Y′CrCb format using the proposed system; (c) converted image using the software-based implementation (C++)

In this second part of the paper, a novel platform-independent and fully pipelined architecture based on the DA approach for R′G′B′ to Y′CrCb conversion has been reported. The proposed architecture has a low latency and a high throughput rate, and can be used for other conversions based on matrix–vector multiplication by setting up the ROM contents in advance. The architecture has been implemented and verified using the Celoxica RC1000-PP FPGA development board.

4 Conclusions

FPGAs have grown in capacity, improved in performance and decreased in cost. They have become a viable solution for performing computationally intensive tasks, with the ability to tackle applications previously reserved for custom chips and programmable DSP devices. Owing to the importance and wide use of matrix multiplication in many image and signal processing applications, two matrix multipliers have been proposed for FPGA implementation in this paper. The first multiplier is dedicated to 3-D affine transformations, while the second is for colour space conversion. The two multipliers have been implemented and verified using the Celoxica RC1000-PP FPGA development board. Results obtained for the first multiplier have shown that a low-cost FPGA implementation can achieve the performance of a graphics card when performing 3-D affine transformations. The performance of the second multiplier, in terms of the area used and the maximum clock frequency, has been assessed and shows that it can run at a higher frequency and consumes less area than existing systems.

5 References

1 Bensaali, F., Amira, A., and Bouridane, A.: 'An FPGA based coprocessor for large matrix product implementation'. Proc. IEEE Int. Conf. on Field-Programmable Technology (FPT'03), Tokyo, Japan, December 2003, pp. 292–295

2 Huss-Lederman, S., Jacobson, E.M., Tsao, A., Turnbull, T., and Johnson, J.R.: 'Implementation of Strassen's algorithm for matrix multiplication'. Presented at ACM/IEEE Conf. on Supercomputing, PA, USA, November 1996

3 Amira, A., Bouridane, A., Milligan, P., and Belatreche, A.: 'Design of efficient architectures for discrete orthogonal transforms using bit level systolic structures', IEE Proc., Comput. Digit. Tech., 2002, 149, (1), pp. 17–24

4 Bensaali, F., Amira, A., Uzun, I.S., and Ahmedsaid, A.: 'An FPGA implementation of 3-D affine transformations'. Proc. 10th IEEE Int. Conf. on Electronics, Circuits and Systems (ICECS'03), Sharjah, UAE, December 2003, Vol. 2, pp. 715–718

5 Bensaali, F., Amira, A., Uzun, I.S., and Ahmedsaid, A.: 'Efficient implementation of large parallel matrix product for DOTs'. Presented at Int. Conf. on Computer, Communication and Control Technologies (CCCT'03), FL, USA, July 2003

6 'Xilinx CoreGen and Handel-C'. Application note AN 58 v1.0, 2001

7 URL: www.celoxica.com

8 Amira, A., Bouridane, A., Milligan, P., and Roula, M.: 'Novel FPGA implementations of Walsh Hadamard transforms for signal processing', IEE Proc., Vis. Image Signal Process., 2001, 148, (6), pp. 377–383

9 Ohlsson, H., and Wanhammar, L.: 'Maximally fast numerically equivalent state-space recursive digital filters using distributed arithmetic'. Proc. IEEE Nordic Signal Processing Symp. (NORSIG2000), Kolmarden, Sweden, June 2000, pp. 295–298

10 Gustafsson, O., and Wanhammar, L.: 'Implementation of a digital beamformer in an FPGA using distributed arithmetic'. Proc. IEEE Nordic Signal Processing Symp. (NORSIG2000), Kolmarden, Sweden, June 2000, pp. 295–298

11 URL: www.xilinx.com

12 Styles, H., and Luk, W.: 'Customising graphics applications: techniques and programming interface'. Proc. IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), Napa, CA, USA, April 2000, pp. 77–87

13 Singh, S., and Bellec, P.: 'Virtual hardware for graphics applications using FPGAs'. Proc. IEEE Workshop on FPGAs for Custom Computing Machines, Los Alamitos, CA, USA, April 1994, pp. 49–58

14 Ye, A.G., and Lewis, D.M.: 'Procedural texture mapping on FPGAs'. Proc. ACM/SIGDA Int. Symp. on Field Programmable Gate Arrays, Monterey, CA, USA, February 1999, pp. 112–120

15 Watt, A.: '3-D computer graphics' (Addison-Wesley, 2000)

16 Ferguson, R.S.: 'Practical algorithms for 3-D computer graphics' (A K Peters, 2001)

17 Eadie, D., Shevlin, F., and Nisbet, A.: 'Correction of geometric image distortion using FPGAs', Proc. SPIE - Int. Soc. Opt. Eng., 2003, 4877, pp. 28–37

18 URL: www.opengl.org

19 Payette, B.: 'Color space converter: R′G′B′ to Y′CrCb'. Xilinx Application Note, XAPP637, V1.0, September 2002

20 Gonzalez, R.C., and Woods, R.E.: 'Digital image processing' (Prentice Hall Inc., 2002, 2nd edn.)

21 'Color space converters'. Datasheet, Amphion Semiconductor Ltd (www.amphion.com), DS6400 V1.1, April 2002

22 'CSC color space converter'. Application note, CAST Inc. (www.cast-inc.com), April 2002

23 'High performance color space converter'. Datasheet, ALMA Technologies (www.alma-tech.com), May 2002

24 Albiol, A., Torres, L., and Delp, E.J.: 'An unsupervised color image segmentation algorithm for face detection applications'. Proc. Int. Conf. on Image Processing, October 2001, Vol. 2, pp. 681–684

25 Kuchi, P., Gabbur, P., Bhat, P.S., and David, S.: 'Human face detection and tracking using skin color modelling and connected component operators', IETE J. Res., 2002, 48, pp. 289–293

26 Mitchell, J.L., and Pennebaker, W.B.: 'MPEG video compression standard' (Chapman & Hall, 1996)

27 Bartkowiak, M.: 'Optimisations of color transformation for real time video decoding'. Presented at Digital Signal Processing for Multimedia Communications and Services, EURASIP ECMCS 2001, Budapest, Hungary, September 2001
