
Reconfigurable computing for shape-adaptive video processing

J. Gause, P.Y.K. Cheung and W. Luk

Abstract: Various reconfigurable computing strategies are examined regarding their suitability for implementing shape-adaptive video processing algorithms of typical object-oriented multimedia applications. The utilisation of reconfigurability at different levels is investigated and the implications of designing reconfigurable shape-adaptive video processing circuits are addressed. Simple models for representing arbitrarily shaped objects and for mapping them into object-specific hardware designs are developed. Based on these models, several design and reconfiguration strategies, targeting an efficient mapping of shape-adaptive video processing tasks to a given reconfigurable computing architecture, are investigated. A number of real applications are analysed to study the trade-offs between these strategies. These include a shape-adaptive discrete cosine transform characterised by a limited number of different data-dependent computations and a shape-adaptive template matching method consisting of a virtually unlimited number of different computation possibilities. It is argued that shape-adaptive video processing algorithms with a relatively small number of different configuration contexts can often be more efficiently implemented as a static or multiconfiguration design, while a design employing dynamic or partial reconfiguration will be more suitable or even necessary if the number of different computation possibilities is relatively large.

1 Introduction

In recent years, reconfigurable computing has been identified as a powerful computing methodology that can combine the advantages of the speed of customised application-specific integrated circuits with the flexibility of software run on general-purpose microprocessors. It is based on reconfigurable logic devices, namely (SRAM based) field programmable gate arrays (FPGAs). These work as coprocessors in assistance to a host processor to hardware-accelerate computing intensive tasks that would otherwise be carried out in software. Reconfigurable computing allows user-level programmability at a low level and facilitates general-purpose computing owing to its reconfigurability. Thus, many applications can use the same hardware [1]. Dynamic or run-time reconfiguration (RTR) is an advanced application area within reconfigurable computing. It is based on the ability of FPGAs to be reconfigured at the run-time of an application. Using RTR, hardware resources can be provided as required, resulting in circuit specialisation and optimisation opportunities that are otherwise not available [2]. However, RTR results in overheads due to additional hardware and reconfiguration time. Consequently, it involves a trade-off between the performance and area advantages of optimising a hardware design and the drawbacks of associated reconfiguration costs. Although much research has been carried out within the area of reconfigurable computing, a lot of work still needs to be accomplished to fully understand and evaluate RTR and quantify the trade-offs of run-time reconfigurable devices and systems in a suitable application area [1].

The development of multimedia technology and associated standards like MPEG-4 [3] for coding of audio-visual objects in multimedia applications and MPEG-7 [4] for description and search of audio and visual multimedia content leads to new types of algorithms to process multimedia data and therefore new challenges for their hardware implementation. In addition to very high processing demands, many multimedia processing algorithms are characterised by a decreasing structural regularity and predictability of operations compared to conventional block-based video or audio processing algorithms. Typical examples are algorithms to process arbitrarily shaped multimedia objects: the computations to be performed need to be adapted to the size and the shape of the object. This calls for architectures with increased flexibility and adaptability at run time [5]; a functionality that can be provided through reconfiguration.

This paper investigates various reconfigurable computing strategies regarding their suitability for implementing shape-adaptive video processing algorithms of typical object-oriented multimedia applications. Simple models for representing arbitrarily shaped objects and for mapping them into object-specific hardware designs are developed. Furthermore, the utilisation of reconfigurability at different levels is examined and the implications of designing reconfigurable shape-adaptive video processing circuits are addressed. The main focus is given to an efficient mapping of shape-adaptive video processing tasks to a given reconfigurable computing architecture. Different types of shape-adaptive video processing algorithms are analysed to find reconfigurable hardware architectures that provide efficient implementations. This includes the examination of a two-dimensional shape-adaptive discrete cosine transform (2D SA-DCT) which is characterised by a limited number of different data-dependent computations to be performed. To also investigate algorithms with a virtually unlimited number of different computation possibilities, a shape-adaptive template matching (SA-TM) method to retrieve arbitrarily shaped objects within video frames is employed. For this particular application, various reconfiguration strategies are analysed and experimental results for implementations of different reconfigurable designs presented.

© IEE, 2004

IEE Proceedings online no. 20040530

doi: 10.1049/ip-cdt:20040530

J. Gause is with Panasonic System LSI Design Europe (PSDE), West Forest Gate, Wellington Road, Wokingham, Berkshire, RG40 2AQ, U.K.

P.Y.K. Cheung is with the Department of Electrical and Electronic Engineering, Imperial College of Science, Technology and Medicine, Exhibition Road, London, SW7 2BT, U.K.

W. Luk is with the Department of Computing, Imperial College of Science, Technology and Medicine, 180 Queen’s Gate, London, SW7 2BZ, U.K.

Paper first received 20th August 2003 and in revised form 17th February 2004

IEE Proc.-Comput. Digit. Tech., Vol. 151, No. 5, September 2004

2 Shape-adaptive video processing

Shape-adaptive video processing can be defined as performing computations on arbitrarily shaped visual multimedia objects within video data. A typical video frame consists of a number of semantically defined objects. Consequently, instead of applying the traditional approach of performing operations on a simple array of pixel values, the objects can be processed individually, preserving their semantics. To represent an arbitrarily shaped object within an image or video frame, colour data is not sufficient. Additional shape and size information for the object is required. Hence, the following object model is used throughout this paper in accordance with the MPEG-4 video verification model [6]. The size of the object is given by the width and height of its bounding box, that is, the smallest rectangle surrounding the object. Shape information can be embodied by adding a so-called mask bit for each pixel which is ‘1’ if the pixel is part of the object being considered, or ‘0’ if it is outside the object. An example of the representation model for an arbitrarily shaped object within an image or video frame is shown in Fig. 1.
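The object model above can be illustrated with a minimal sketch. It assumes the object is supplied as a frame-sized array of mask bits; the function name `object_model` and the sample frame are hypothetical and only serve the illustration.

```python
def object_model(frame_mask):
    """Return (top, left, height, width, mask) for the object in a frame.

    frame_mask is a list of rows of 0/1 mask bits covering the whole frame.
    The returned mask is the part of frame_mask inside the bounding box.
    """
    # Rows and columns that contain at least one object pixel.
    rows = [y for y, row in enumerate(frame_mask) if any(row)]
    if not rows:
        raise ValueError("frame contains no object pixels")
    cols = [x for x in range(len(frame_mask[0]))
            if any(row[x] for row in frame_mask)]

    # The bounding box is the smallest rectangle surrounding the object.
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    mask = [row[left:right + 1] for row in frame_mask[top:bottom + 1]]
    return top, left, bottom - top + 1, right - left + 1, mask


frame_mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0],
]
top, left, h, w, mask = object_model(frame_mask)
```

For this frame the bounding box is 3 × 3 pixels with its top-left corner at row 1, column 1, and two of the nine mask bits inside the box are ‘0’.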

Typical features of many video processing algorithms are local data dependencies [5], that is, the value of an output pixel depends on the value of an input pixel at a certain position and possibly also on other values of pixels near that position. Hence, a circuit for processing an arbitrarily shaped visual object can be modelled to have the same or a similar structure as that object, with neighbouring pixels resulting in adjacent processing elements (PEs) as illustrated in Fig. 2. Examples include algorithms based on matching filters or histograms that use visual examples to search for a query object. The operation performed within a PE depends on the task to be accomplished. The connections between adjoining PEs can be spatial or temporal.

To process an arbitrarily shaped visual object, the computations to be performed need to be adapted to the shape and size of that particular object. While the basic tasks that are executed may be similar or equal for each pixel, the number of operations and the amount of data processed vary. Both the software tools to describe and to synthesise the designs, and the hardware architectures to perform the computations, should therefore be flexible enough to adapt to processing a particular visual object during application run-time. Using reconfigurable logic, this is possible by means of dynamic reconfiguration. We illustrate this approach by examples in Sections 5 and 6.2, but first we discuss some common reconfiguration strategies.

3 Reconfiguration strategies

An application can be split into temporal partitions by means of FPGA reconfiguration. There are different ways the reconfiguration facilities of FPGAs can be used for shape-adaptive video processing. The two temporal possibilities are static or compile-time reconfiguration and dynamic or run-time reconfiguration (RTR). Static reconfiguration takes place before the start of an application, while dynamic reconfiguration can occur at any instant in time during the run-time of an application. Furthermore, reconfiguration can be applied to the FPGA either globally or partially. Whereas global reconfiguration means downloading the configuration bit stream each time for the entire FPGA, independent of the size of the hardware design, partial reconfiguration allows changing of parts of the device configuration, while the rest remains unaltered.

3.1 Static reconfiguration

A statically reconfigurable design (static design) does not change at all during the run-time of an application. The design is compiled once and the resulting configuration bit stream is downloaded to the FPGA before or at the start of the application run-time. The associated design must therefore contain all possible computations that may occur. Regarding shape-adaptive video processing, a static design must be able to perform the task for all possible arbitrarily shaped objects. However, as the number of object shapes and sizes may be unlimited, only a subset of all options can be implemented. A solution is viable if the size of the video frame where the objects are processed is fixed, as the object cannot be larger than the frame size.

Advantages of a static design are that neither recompilation of the design description nor reconfiguration of the FPGA is necessary. However, the large external memory needed to store all possible object pixels and mask bits makes the design slower and more complex than an optimal design.

Fig. 1 Representation of arbitrarily shaped visual object within image or video frame

Fig. 2 Circuit model for shape-adaptive video processing


In addition, a large and therefore generally more expensive device is necessary to cover for all possible object sizes. Examples of a statically reconfigurable design can be found in Sections 5 and 6.2.

3.2 Dynamic reconfiguration

In this approach the device is reconfigured for every possible object size and shape, and for every possible video frame size. This generally includes the recompilation of the design code due to the infinite number of solutions. Since the arbitrarily shaped object can be part of the configuration data and circuits can be optimised, an efficient solution can be achieved and fewer resources than within a static design are required. In contrast to a static design, the dynamically reconfigurable design (dynamic design) allows constants, such as pixel values for a particular visual object, to be propagated directly into the processing elements [7]. This capability to modify circuits by means of reconfiguration at run-time overcomes the inability of static designs to exploit circuit specialisation. Although dynamic reconfiguration provides opportunities for circuit specialisation and optimisation, additional time is required to synthesise the new design and to reconfigure the FPGA. The compilation time and reconfiguration time depend on the size of the circuit to be implemented. The reconfiguration time is generally proportional to the number of resources to be reconfigured, but also depends on the device used and its reconfiguration facilities. Compilation time is generally hard to estimate as it depends on the device and the tools used to translate and map the design description.

A dynamic design involves a run-time trade-off between improving the efficiency through circuit specialisation and the additional overhead of compilation and reconfiguration. Hence, it is only suitable for shape-adaptive video processing in real time if the improvements in adapting the circuit to processing a particular object outweigh the disadvantages of the extra time needed to change the design. This can be achieved only by efficient methods to significantly decrease mapping and reconfiguration times or by changing designs economically. Examples of this reconfiguration approach are presented in Sections 5 and 6.2.
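The trade-off described above can be made concrete with a simple break-even estimate. The sketch below is an illustrative model rather than one taken from the paper: it assumes that a specialised configuration is reused for a number of identical processing runs, and all timing figures in the example are hypothetical.

```python
import math


def break_even_runs(t_static, t_dynamic, t_compile, t_reconfig):
    """Smallest number of runs after which the dynamic design is faster.

    t_static   : time per run on the static design
    t_dynamic  : time per run on the specialised (dynamic) design
    t_compile  : one-off time to map the new design
    t_reconfig : one-off time to reconfigure the FPGA
    Returns None if the specialised circuit is not faster per run.
    """
    gain_per_run = t_static - t_dynamic
    if gain_per_run <= 0:
        return None  # specialisation never recovers the overhead
    overhead = t_compile + t_reconfig
    # Need n * gain_per_run > overhead, i.e. n strictly above the ratio.
    return max(1, math.floor(overhead / gain_per_run) + 1)


# Hypothetical figures in microseconds.
runs = break_even_runs(t_static=500, t_dynamic=300,
                       t_compile=2000, t_reconfig=1000)
```

With these assumed numbers the overhead of 3000 µs is recovered only from the 16th run onwards, which shows why both mapping time and reconfiguration time must be kept small if objects change frequently.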

3.3 Partial reconfiguration

To combine the advantages and to dispose of some of the disadvantages of the static design and the dynamic design, a third design is presented. It is called partially reconfigurable design and is based on the assumption that in a shape-adaptive video processing circuit which performs the same task on a large number of different objects, there will be many similarities between different configurations. For example, the processing elements to execute operations on pixel values may have the same function, but they use different internal constants or their signals have a different word length. Hence, reconfiguration time can be saved by only transmitting the differences between configuration bit streams of two designs instead of changing the configuration for the entire circuit or device. A partially reconfigurable design can be precompiled since the functionality of the circuit does not change when a different arbitrarily shaped object is to be processed. The difference to the static design is that a partially reconfigurable design can make use of circuit specialisation methods such as constant folding by storing the object pixel values and mask bits in on-chip memory available on most FPGAs. To change the object, only a reconfiguration of the memory parts is necessary; the structure of the circuit remains the same. However, the circuit must still be able to deal with a possibly unlimited number of object shapes and sizes and cover all computations that may occur. Hence, the size of the circuit, as in a static design, depends on the largest possible object to be processed. In addition, processing elements must work correctly for both pixels that belong to an object and those that do not.

The major advantage compared to a dynamic design is the reduction in reconfiguration time. In addition, no recompilation of the design may be necessary if the configuration bits to be changed can be easily identified. Compared with a static design, the main benefit is the localisation of the data flow and the reduction of I/O since in a partially reconfigurable design the object data can be stored on-chip. To make these designs viable, however, there must be facilities within the reconfigurable logic that allow partial reconfiguration and design tools that can efficiently use them. An example of this approach is shown in Section 6.2.
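The idea of transmitting only the differences between two configurations can be sketched as follows. The sketch assumes, purely for illustration, that a configuration is a list of equally addressed frames held as byte strings; real configuration formats are device-specific, and the names `changed_frames` and `apply_partial` are hypothetical.

```python
def changed_frames(old_config, new_config):
    """List (index, frame) pairs for frames that differ between designs."""
    if len(old_config) != len(new_config):
        raise ValueError("configurations must have the same frame count")
    return [(i, new) for i, (old, new)
            in enumerate(zip(old_config, new_config)) if old != new]


def apply_partial(config, updates):
    """Return a copy of config with only the listed frames rewritten."""
    patched = list(config)
    for index, frame in updates:
        patched[index] = frame
    return patched


# Hypothetical configurations: logic and routing stay the same,
# only one memory frame holding object data changes.
old = [b"LOGIC0", b"ROUTE0", b"MEM:AA", b"MEM:BB"]
new = [b"LOGIC0", b"ROUTE0", b"MEM:CC", b"MEM:BB"]

updates = changed_frames(old, new)
patched = apply_partial(old, updates)
```

Here a single frame out of four has to be transferred, so under the assumption that reconfiguration time scales with the amount of configuration data written, the update costs roughly a quarter of a global reconfiguration.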

3.4 Multiconfiguration

Another approach to reduce the reconfiguration overhead is to store more than one design configuration on the FPGA at the same time, as proposed in [8]. To switch between different designs, one only needs to select another configuration bit-stream which is already on the chip by means of multiplexing. This avoids having to read the configuration data from off-chip memory. For shape-adaptive video processing, however, the number of different configurations may be unlimited or very large. Hence, only a subset of possible configurations can be stored on-chip. If a required configuration is not available, it has to be loaded from outside the FPGA chip, resulting in longer reconfiguration times. Although multiconfiguration devices have been considered by major FPGA vendors [8], they have so far not been made commercially available.
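The behaviour of a multiconfiguration device with a limited number of on-chip contexts can be modelled roughly as a small cache. The sketch below is an assumption-laden illustration: the least-recently-used replacement policy, the class name `ContextStore` and the timing values are not taken from the paper or from [8].

```python
from collections import OrderedDict


class ContextStore:
    """Toy model of an FPGA holding a few configuration contexts on-chip."""

    def __init__(self, slots, t_switch, t_load):
        self.slots = slots          # number of contexts that fit on-chip
        self.t_switch = t_switch    # time to multiplex to a stored context
        self.t_load = t_load        # time to load a context from off-chip
        self.contexts = OrderedDict()

    def activate(self, name):
        """Switch to a context and return the time this takes."""
        if name in self.contexts:
            self.contexts.move_to_end(name)    # mark as recently used
            return self.t_switch
        if len(self.contexts) >= self.slots:
            self.contexts.popitem(last=False)  # evict least recently used
        self.contexts[name] = True
        return self.t_load + self.t_switch


store = ContextStore(slots=2, t_switch=1, t_load=100)
times = [store.activate(name) for name in ["A", "B", "A", "C", "B"]]
total = sum(times)
```

In this hypothetical sequence only the third request is served from the chip; the other four need an off-chip load, illustrating how the benefit shrinks when the set of required configurations is much larger than the number of stored contexts.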

4 Design strategies

Before a shape-adaptive video processing algorithm can be executed it needs to be implemented by the hardware designer and efficiently mapped to the computing platform on which it will later be used, possibly following one of the reconfiguration strategies described in the previous Section. To describe and map reconfigurable hardware designs for processing arbitrarily shaped objects, flexible methods for design entry are required. As the object to be processed is not known at compile-time, it needs to be possible to map a parameterised description of the circuit associated with that object to the reconfigurable architecture at run-time. This normally involves changing and recompiling the design description due to the possibly unlimited number of different objects, and downloading the altered configuration stream to the FPGA as illustrated in Fig. 3. Consequently, an efficient design description method is required which

• is able to exploit the inherent parallelism within video processing algorithms
• can easily adapt the design to varying arbitrarily shaped objects (parameterisation)
• makes effective use of the reconfigurable logic resources of the FPGA
• takes advantage of the reconfiguration capabilities of the device
• can be synthesised in a reasonable amount of time
• is easy to learn and use

Hardware description languages (HDLs) are typically used to describe static circuits. Regarding shape-adaptive video processing, however, there are important differences between these HDLs in their capabilities to efficiently adapt a design description to varying visual objects, including run-time reconfiguration support, parameterisation and conditional compilation facilities. While some approaches have been made to utilise reconfigurable logic more efficiently [9–11], none of the currently available HDLs and tools provide sufficient support for the description of flexible shape-adaptive video processing circuits.

Shape-adaptive video processing involves performing an identical task on a great variety of objects with different shapes and sizes. Although the same operation may be performed for each pixel, the structure of the entire circuit performing the computations needs to be adapted to the features of a particular object. Consequently a varying number of modules that carry out a particular operation and a changing interconnection structure to link these modules will be necessary [12]. In addition, requirements regarding performance and utilisation of logic resources may vary at different points in time. To quickly and efficiently implement algorithms that can deal with a large number of arbitrarily shaped and sized objects, it is advantageous to parameterise these processing modules and store them in hardware libraries [13]. These modules can then be reused in a number of applications, reducing the overall design effort. Parameterisation of predefined processing elements or circuit blocks could also be performed at run-time of an application, rather than at design time. This would enable applications to adapt their functionality to changing conditions during execution. However, this method implies that the application can manipulate the underlying structure of the target hardware automatically [12]. Run-time parameterisation of hardware cores has been investigated in a number of projects such as [10, 14, 15], but is currently limited to specific devices and tools.

Dynamic FPGA reconfiguration is necessary to adapt a reconfigurable computing architecture to the processing of a particular visual object during the run-time of an application. This generally involves synthesising a new design description and downloading an altered configuration stream to the FPGA. However, traditional hardware synthesis tools are targeted to static designs which do not change. For reconfigurable logic, synthesis involves parsing the design description file, mapping logic into LUTs and placing these, together with registers, into logic cells (LCs) on the device. The LCs then have to be interconnected with each other and with I/O blocks by configuring the switches of the routing resources of the FPGA, before the resulting configuration bit-stream can be downloaded to the device. This implies that the HDL description, which itself remains unchanged, must be able to read the altered object pixel values and mask bits from a file and synthesise a new circuit based on those new parameters. Thus, conditional compilation of a parameterised design description is necessary. All these steps can be very time-consuming, especially for large designs, making this design approach unsuitable if frequent reconfigurations are required.

To make shape-adaptive video processing using reconfigurable computing feasible at run-time, the main objective must be to find efficient methods to keep the design mapping (compilation) time and the device reconfiguration time as low as possible. A solution for shape-adaptive video processing would be to directly map a new design to the configuration bit stream of the device, bypassing the other steps in the synthesis flow as much as possible. Changing the arbitrarily shaped object to be processed may not require the whole circuit to be changed as operations to be performed within particular PEs often remain constant; they simply operate on different data. Hence some parts of the configuration bit-stream which contain the mapping and routing information may not change when the circuit is updated to process a new object. However, configuration data are device-specific and generally difficult or impossible to read or write manually [15]. Due to the unavailability of suitable design tools supporting dynamic reconfigurability as described, the examples presented in the following Sections were designed and reconfigured ‘by hand’.

5 Reconfigurable shape-adaptive DCT designs

The shape-adaptive discrete cosine transform (SA-DCT) is an example of a shape-adaptive video processing algorithm where the number of different possibilities of computations that may occur is limited. The SA-DCT has been included in the MPEG-4 standard [6] for coding pixels in arbitrarily shaped object segments. A hardware implementation of a two-dimensional SA-DCT (2D SA-DCT) is not as straightforward as the implementation of the standard 8 × 8 DCT [16] where the transform is always performed on eight pixels per row and eight pixels per column. On the contrary, the SA-DCT consists of a number of DCT-N computations where N is variable and depends on the number of pixels in a row and column of the object segment. The actual DCT-N calculation carried out depends on the number of object pixels within an 8 × 8 block and is not known at compile-time.
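How the variable-length DCT-N computations arise can be sketched in software. The following is a minimal illustration of the commonly described SA-DCT procedure (object pixels shifted to the top of each column and transformed, then coefficients shifted left in each row and transformed); it assumes an orthonormal DCT-II for every DCT-N, which may differ from the scaling used in MPEG-4, and the sample block is hypothetical.

```python
import math


def dct_n(values):
    """Orthonormal DCT-II of a sequence; N is simply len(values)."""
    n = len(values)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(
            v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, v in enumerate(values)))
    return out


def sa_dct_2d(block, mask):
    """Return a ragged list of coefficient rows for the masked object."""
    size = len(block)
    # Vertical pass: gather the object pixels of each column (which
    # shifts them to the top) and apply a DCT-N of matching length.
    columns = []
    for x in range(size):
        pixels = [block[y][x] for y in range(size) if mask[y][x]]
        columns.append(dct_n(pixels) if pixels else [])
    # Horizontal pass: row y collects the y-th coefficient of every
    # column that has one (shifting them left), then applies DCT-N.
    rows = []
    for y in range(size):
        coeffs = [col[y] for col in columns if len(col) > y]
        rows.append(dct_n(coeffs) if coeffs else [])
    return rows


block = [[10, 20, 0], [30, 40, 50], [0, 60, 0]]
mask = [[1, 1, 0], [1, 1, 1], [0, 1, 0]]
coefficients = sa_dct_2d(block, mask)
```

For this 3 × 3 example the column pass uses a DCT-2, a DCT-3 and a DCT-1, and the number of output coefficients equals the number of object pixels. The value of N is only known once the mask is available, which is the property that complicates a hardware implementation.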

Fig. 3 Design flow for reconfigurable shape-adaptive video processing

A static implementation of the 2D SA-DCT as presented in Section 3.1 needs to be able to calculate the right result for every possible shape of the object within the 8 × 8 image block. The configuration data of the FPGA must therefore contain all possible DCT-Ns for 1 ≤ N ≤ 8. Hardware resources can be shared amongst PEs that produce the same output signal but for different values of N. The total number of PEs required to implement a static 1D SA-DCT design is 64. This is equal to the number of PEs necessary for a DCT-8 implementation, but 68.6% less compared with the number of PEs in a straightforward implementation without hardware resource sharing between DCT-N entities [17]. By sharing PEs, however, the PE itself becomes more complex as not all resources can be shared within a PE which is to work in different contexts.
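The quoted saving can be checked with a short calculation. It assumes that a DCT-N entity needs N × N PEs, an assumption that is not stated explicitly here but is consistent with the 64 PEs given for DCT-8.

```python
# Assumed PE count per entity: N * N for a DCT-N.
dct_sizes = range(1, 9)                        # N = 1 .. 8

unshared_pes = sum(n * n for n in dct_sizes)   # separate entity per N
shared_pes = 8 * 8                             # static design with sharing

saving_percent = 100 * (unshared_pes - shared_pes) / unshared_pes
```

Under this assumption the unshared implementation needs 204 PEs, so sharing removes 140 of them, which rounds to the 68.6% stated above.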

In a dynamic implementation of the 2D SA-DCT according to Section 3.2, the configuration of the FPGA depends on the shape of the object to be transformed. Since hardware resources are shared temporally, rather than spatially, amongst different DCT-N entities by means of reconfiguration, a particular DCT-N configuration can be optimised through circuit specialisation. While the actual computation time of a dynamic SA-DCT implementation is shorter than that of a static design due to a shorter critical path through a PE and a smaller number of PEs for N < 8, additional reconfiguration time is required to switch between DCT-N entities, significantly influencing the overall performance of the design [17].

Due to the fact that an SA-DCT design has a relatively small complexity since it is limited to an 8 × 8 image block, a static design would be preferable [17]. As all eight instances that can occur are known at compile-time, resources can be shared very efficiently on-chip. In addition, there are no reconfiguration overheads and the computation time is relatively short, in contrast to other design approaches which require a large number of reconfigurations and a comparatively long time to undertake them. In the following Section, a shape-adaptive video processing algorithm is considered, where the number of possible design entities is virtually unlimited and where the exact circuit requirements are not known at compile-time.

6 Reconfigurable shape-adaptive template matching designs

The aim of the shape-adaptive template matching (SA-TM) method is to find a template object of arbitrary shape and size within a search image or video frame of any size using a reconfigurable computing architecture [18]. The SA-TM algorithm is defined as follows. The search frame consists of W × H pixels, while the template object consists of p opaque pixels and can have any shape. It is given by its bounding box of size w × h, that is, the smallest rectangle surrounding the object. Within this bounding box, each pixel contains one mask bit which is ‘1’ if the pixel belongs to the object, or ‘0’ otherwise.

The template is shifted through every possible location of the image that can contain the entire template and compared to the respective image or video subframe of the same size. For all (W − w + 1) × (H − h + 1) possible positions (y, x) of the template within the image, calculate

$$\mathrm{SAD}(y, x) = \sum_{i=0}^{h-1} \sum_{j=0}^{w-1} \bigl( |I(i + y, j + x) - T(i, j)| \cdot M(i, j) \bigr) \qquad (1)$$

where I is the pixel value of the search image or frame, and T and M are the pixel value and the mask bit of the template object, respectively. The difference from traditional template matching algorithms is that here only pixels of the subframe that correspond to pixels belonging to the template object are taken into account.
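The definition in (1) can be illustrated in software. The following is a minimal reference sketch, not the hardware design discussed in this paper; the function name `sa_tm_sad` and the small test data are assumptions chosen for illustration only.

```python
def sa_tm_sad(image, template, mask):
    """Evaluate eq. (1): SAD(y, x) for every position where the template fits.

    image    : H rows of W pixel values (list of lists)
    template : h rows of w pixel values (the bounding box of the object)
    mask     : h rows of w mask bits (1 = pixel belongs to the object)
    """
    H, W = len(image), len(image[0])
    h, w = len(template), len(template[0])
    sad = []
    for y in range(H - h + 1):
        row = []
        for x in range(W - w + 1):
            total = 0
            for i in range(h):
                for j in range(w):
                    if mask[i][j]:  # pixels outside the object contribute nothing
                        total += abs(image[i + y][j + x] - template[i][j])
            row.append(total)
        sad.append(row)
    return sad


# Illustrative data: an L-shaped object (p = 3 opaque pixels) in a 2 x 2 bounding box.
image = [
    [10, 10, 10, 10],
    [10, 50, 99, 10],
    [10, 60, 70, 10],
]
template = [[50, 0], [60, 70]]
mask = [[1, 0], [1, 1]]

sad = sa_tm_sad(image, template, mask)
print(sad[1][1])  # 0: the value 99 lies under a mask bit of 0 and is ignored
```

At position (y, x) = (1, 1) the three opaque template pixels match the frame exactly, so the masked sum is zero even though the fourth pixel of the bounding box differs.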

6.1 Computation flow analysis of SA-TM algorithm

The computation flow graph (CFG) for a generic SA-TM architecture, which is defined by a set of processing elements (PEs) connected by directed edges showing the direction of the computation flow, is shown in Fig. 4a. It comprises an array of (W × H) × ((W − w + 1) × (H − h + 1)) PEs labelled by the parameters (a, b, c, d) on which the computation in a given PE depends. A PE at position ((a, b), (c, d)) represents the computations performed using the input signal I(a, b) and resulting in a contribution to the output signal SAD(c, d). The CFG provides a representation of the computation flow that is valid for all possible SA-TM instances. This helps to examine how hardware resources can be shared between SA-TM calculations for different template objects and search frame sizes.
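To give a feeling for the size of the generic CFG, the PE count stated above can be evaluated for concrete dimensions. The helper name `cfg_pe_count` and the frame and template sizes below are illustrative assumptions, not values from the paper.

```python
def cfg_pe_count(W, H, w, h):
    """Number of PEs in the generic CFG: one per (input pixel, output position) pair."""
    inputs = W * H                        # input signals I(a, b)
    outputs = (W - w + 1) * (H - h + 1)   # output signals SAD(c, d)
    return inputs * outputs


# Illustrative sizes: a 16 x 16 search frame and an 8 x 8 bounding box.
print(cfg_pe_count(16, 16, 8, 8))  # 256 inputs x 81 outputs = 20736
```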

The internal structure of a PE for the SA-TM computation is visualised in Fig. 4b. Each PE computes the absolute difference (AD) between a constant C_T(a, b, c, d) and the value of the signal coming in from the left and adds (ADD) the result to the value of the signal coming in from the top of the PE. Either the result of that operation or the unaltered value of the signal coming in from the top of the PE is selected by a multiplexer and carried through to the output at the bottom of the PE. The multiplexer is controlled by a one bit wide constant C_M(a, b, c, d).
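The behaviour of a single PE as described above can be sketched as a small function. This is a behavioural model only, under the assumption that word lengths are unbounded; the function and argument names are hypothetical.

```python
def processing_element(left, top, c_t, c_m):
    """Behavioural model of one PE as described for Fig. 4b.

    left : signal arriving from the left (a search-frame pixel value)
    top  : partial sum arriving from the top
    c_t  : constant C_T used by the absolute-difference unit
    c_m  : one-bit constant C_M controlling the multiplexer
    """
    ad = abs(left - c_t)          # AD: absolute difference with the constant
    added = top + ad              # ADD: accumulate onto the incoming partial sum
    return added if c_m else top  # multiplexer: updated sum or unaltered input


print(processing_element(left=12, top=100, c_t=10, c_m=1))  # 102
print(processing_element(left=12, top=100, c_t=10, c_m=0))  # 100
```

With the mask bit cleared, the PE simply forwards the partial sum, which is how pixels outside the object are excluded from the result.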

To match a w × h template with p pixels belonging to the object (p ≤ w × h) within a W × H search frame (W > w and H > h) at position (c, d), only p pixel values of the search frame are taken into account to compute a particular output value SAD(c, d). In the CFG in Fig. 4, the PEs that correspond to pixels of the sub-frame where the template is matched and hence calculate results that may contribute to a particular output signal are shaded, while the unshaded PEs are not required. Out of a possible W × H input signals, a particular output signal requires a maximum of w × h input signals. This fact is taken into account through the definition of the PE constants C_T(a, b, c, d) and C_M(a, b, c, d). The constant coefficient C_T(a, b, c, d) for the absolute difference computation represents the pixel values of the template object where required and depends on the four CFG parameters a, b, c and d:

$$C_T(a, b, c, d) = \begin{cases} T(a - c,\, b - d) & \text{if } (0 \le (a - c) < h) \text{ and } (0 \le (b - d) < w) \\ \text{don't care} & \text{else} \end{cases} \qquad (2)$$

The bit controlling the multiplexer is defined as

$$C_M(a, b, c, d) = \begin{cases} M(a - c,\, b - d) & \text{if } (0 \le (a - c) < h) \text{ and } (0 \le (b - d) < w) \\ 0 & \text{else} \end{cases} \qquad (3)$$

Note that PEs with the same values of (a − c) and (b − d), depicted in the same shade of grey in the CFG, correspond to the same pixel values of the template. Hence, these PEs are identical and can be shared amongst the calculations of different output values if not required at the same time. The way that PEs can be shared amongst different SA-TM entities depends on the reconfiguration strategy applied.
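The constants in (2) and (3), and the observation that PEs with equal (a − c) and (b − d) are identical, can be checked with a short sketch. The name `pe_constants`, the use of `None` to stand for "don't care", and the example template are assumptions made for illustration.

```python
DONT_CARE = None  # stands in for the "don't care" entries of eq. (2)


def pe_constants(a, b, c, d, template, mask):
    """Return (C_T, C_M) for the PE at position ((a, b), (c, d))."""
    h, w = len(template), len(template[0])
    i, j = a - c, b - d            # offset of input pixel (a, b) inside the bounding box
    if 0 <= i < h and 0 <= j < w:
        return template[i][j], mask[i][j]
    return DONT_CARE, 0            # outside the bounding box: multiplexer bypasses the PE


# Illustrative L-shaped template object in a 2 x 2 bounding box.
template = [[50, 0], [60, 70]]
mask = [[1, 0], [1, 1]]

# Two PEs with the same (a - c, b - d) = (1, 1) carry identical constants.
print(pe_constants(1, 1, 0, 0, template, mask))  # (70, 1)
print(pe_constants(2, 3, 1, 2, template, mask))  # (70, 1)
# An input pixel outside the bounding box at this output position is bypassed.
print(pe_constants(0, 0, 1, 1, template, mask))  # (None, 0)
```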

IEE Proc.-Comput. Digit. Tech., Vol. 151, No. 5, September 2004 317


6.2 Reconfigurable computing strategies for SA-TM

The design approaches presented here are based on a generic semisystolic SA-TM array adapted to the shape of the template object as proposed in [18]. The PEs are arranged in a two-dimensional manner by mapping the template object into the top-left corner of the search frame and connecting the PEs accordingly. The pixel values of the search frame are broadcast to all PEs sequentially and in a line-by-line raster-scan fashion. For simplicity, only luminance values are used. In addition to the PEs, registers are required at all places between PEs where signals have to be delayed to realise a correct computation flow.

In a dynamic design approach as presented in Section 3.2, the device is reconfigured for every possible template size and shape, and for every possible search frame size. This generally includes the recompilation of the design code, as the number of possible solutions is infinite. Since the template can be part of the configuration data and word lengths can be optimised, an efficient semisystolic array solution can be achieved [18]. As an alternative, a static design as described in Section 3.1, where the configuration of the FPGA is not changed when a new template is matched, can be used. No recompilation of the design code or reconfigurations of the device are necessary for a constant search frame size as the template is stored off-chip. However, the large, external RAM to store all possible W × H template pixels and mask bits makes the design slower and more complex than an optimal design [18]. To combine the advantages of the dynamic and the static design, a partially reconfigurable design according to Section 3.3 is also possible. The difference from the static design is that this design stores the template pixels and mask bits in on-chip memory available on most FPGAs. To change the template, only a reconfiguration of the memory parts is necessary; the rest of the circuit remains the same [18].

6.3 FPGA implementation and results

Processing elements (PEs) for the three reconfigurable designs presented were implemented targeting Xilinx Virtex XCV1000E devices and investigated regarding area and execution time requirements. While the number of logic cells to implement the static design and the partially reconfigurable design is constant for a certain frame size, the dynamic design is adapted to the template object, leading to significant savings in area, especially if relatively small templates are used [18].

Fig. 4 Computation flow graph for generic SA-TM architecture

a Computation flow graph for SA-TM
b Structure of processing element

The total execution time for a reconfigurable design to find a template object in N consecutive video frames consists of the computation time to calculate the SA-TM operations, the reconfiguration time to update the FPGA for a new template, and the compilation time which is required since the number of template objects and therefore the number of different device configurations is unlimited. The computation time is determined by the maximum frequency a PE can run at and the number of computations to be performed. The reconfiguration time is generally proportional to the number of resources to be reconfigured, but also depends on the device used and the reconfiguration strategy. Compilation time depends on the hardware and software used to translate and map the code and is hard to estimate.

The maximum frequency of PEs with different internal word lengths for all three design implementations is illustrated graphically in Fig. 5. As expected, the maximum clock frequency is highest for the dynamic design and lowest for the static design. Area requirements increase linearly with the word length. For a particular word length, the logic resource requirements are smallest for a dynamic design and largest for a static design.

Based on the results of the implementation of a single PE, complete semisystolic SA-TM arrays can be considered. However, due to limitations in current HDLs, it is difficult to adapt a design description to a particular object. Furthermore, complex designs take too long to be synthesised or do not fit into currently available FPGAs.

Using the maximum clock frequency of a PE (not taking into account FPGA routing delays) and the reconfiguration times (if applicable) for a Virtex XCV1000E device, total execution times of complex SA-TM designs can still be investigated using suitable estimation models [18]. The results show that the suitability of a particular design regarding total execution time of the algorithm strongly depends on the number of consecutive frames the operations are carried out on with the same template. Since the static design does not suffer from reconfiguration overheads, it is most suitable for an operation on one or only a few frames. However, as the partially reconfigurable design and especially the dynamic design can operate at higher clock frequencies, they perform better if the matching algorithm is executed on a large number of frames.
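The dependence on the number of frames can be illustrated with a deliberately simple timing sketch. This is not the estimation model of [18]; it assumes that both designs need the same number of clock cycles per frame, that the device is reconfigured once per template, and that compilation time is lumped into a single overhead term. All function names and numerical values are hypothetical.

```python
import math


def total_time(n_frames, cycles_per_frame, f_max_hz, t_reconfig=0.0, t_compile=0.0):
    """Total time in seconds to match one template on n_frames consecutive frames."""
    t_compute = n_frames * cycles_per_frame / f_max_hz
    return t_compile + t_reconfig + t_compute


def break_even_frames(cycles_per_frame, f_slow_hz, f_fast_hz, extra_overhead_s):
    """Smallest frame count at which the faster-clocked design is strictly quicker.

    extra_overhead_s is the additional one-off time (reconfiguration plus
    compilation) that the faster design pays compared with the slower one.
    """
    saving = cycles_per_frame / f_slow_hz - cycles_per_frame / f_fast_hz
    if saving <= 0:
        return None  # the "fast" design is not actually faster per frame
    return math.floor(extra_overhead_s / saving) + 1


# Hypothetical figures: 100 000 cycles per frame, a static design at 50 MHz
# with no overhead, and a dynamic design at 100 MHz with 20.5 ms overhead.
n = break_even_frames(100_000, 50e6, 100e6, extra_overhead_s=0.0205)
print(n)  # 21
```

Under these assumed figures the static design is quicker for up to 20 frames, after which the higher clock frequency of the dynamic design outweighs its one-off overhead.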

7 Conclusions

This paper has investigated reconfigurable computing strategies regarding their suitability for implementing shape-adaptive video processing algorithms used in object-oriented multimedia applications. To evaluate the suitability of reconfigurable computing for shape-adaptive video processing, simple models for representing arbitrarily shaped objects and for mapping them into object-specific hardware designs have been presented. Based on these models, various design and reconfiguration strategies targeting an efficient way of mapping shape-adaptive video processing tasks to a given reconfigurable computing architecture have been investigated. Regarding design tools, comprehensive parameterisation facilities are necessary to describe generic circuits used to process a large number of varying objects and to allow design reuse. Although parameterisation of designs is often possible, most HDLs and tools have only limited facilities to exploit run-time reconfiguration efficiently.

Four different reconfiguration strategies, namely static, dynamic, partial, and multiconfiguration, have been presented. Whereas the latter three strategies have the potential to provide great benefits in performance and resource usage through circuit specialisation by means of run-time reconfiguration, they also require the devices used to support these strategies and suffer from reconfiguration overheads. In studying the trade-offs between these strategies regarding their suitability for implementing shape-adaptive video processing algorithms, a number of real applications have been investigated.

A shape-adaptive video processing algorithm characterised by a virtually unlimited number of different computation possibilities has been considered. As an example of such an algorithm, a shape-adaptive template matching (SA-TM) method to retrieve arbitrarily shaped objects within video frames has been used to examine various reconfigurable computing strategies. The suitability of a particular design regarding total execution time of the algorithm strongly depends on the number of consecutive frames the operations are carried out on with the same template.

This paper has demonstrated that shape-adaptive video processing algorithms with a relatively small number of different configuration contexts can often be more efficiently implemented as a static or multiconfiguration design, mainly due to the large reconfiguration overheads associated with dynamic designs when currently available FPGA devices are used and frequent reconfiguration is required. While a static design, as also used for any non-adaptive video processing algorithm, can select between different context-adapted circuits by means of multiplexers and run-time parameters that are folded into the architecture, multiconfiguration designs can switch between a limited number of device configurations associated with certain context-adapted circuits. If the number of different computation possibilities is relatively large or unlimited, however, a static or multiconfiguration design is no longer viable due to the large amount of hardware resources involved. In these cases, a dynamic or partially reconfigurable design will be more suitable or even necessary, and the main design objective would be to keep the reconfiguration overhead as low as possible.

In addition to the reconfiguration strategies examined, other approaches may also be suitable for shape-adaptive video processing and could be evaluated. This could include combinations of the methods presented here or novel techniques that have not yet been considered. Dynamic reconfiguration will become a more important design methodology for reconfigurable computing if the reconfiguration overhead can be reduced drastically. This requires both architectural modifications, to improve reconfiguration time and partial reconfiguration, and enhanced design tools, to simplify the development of dynamically reconfigurable designs.

Fig. 5 Maximum clock frequencies of different sized PEs for dynamic (D), static (S) and partially reconfigurable (PR) designs

8 References

1 Villasenor, J., and Hutchings, B.: ‘The flexibility of configurable computing’, IEEE Signal Process. Mag., 1998, pp. 67–84

2 Wirthlin, M., and Hutchings, B.: ‘Improving functional density through run-time reconfiguration’, IEEE Trans. VLSI Syst., 1998, 6, (2), pp. 247–256

3 ‘Overview of the MPEG-4 standard’, MPEG Group: ISO/IEC JTC1/SC29/WG11 N4030, 2001

4 ‘Overview of the MPEG-7 standard’, MPEG Group: ISO/IEC JTC1/SC29/WG11 N4031, 2001

5 Pirsch, P., Reuter, C., Wittenburg, J.P., Kulaczewski, M.B., and Stolberg, H.-J.: ‘Architecture concepts for multimedia signal processing’, J. VLSI Signal Process., 2001, 29, (3), pp. 157–165

6 ‘MPEG-4 video verification model version 18.0’, MPEG Group: ISO/IEC JTC1/SC29/WG11 N3908, 2001

7 Wirthlin, M., and Hutchings, B.: ‘Improving functional density through run-time constant propagation’. Proc. ACM 5th Int. Symp. on FPGA, 1997, pp. 86–92

8 Trimberger, S., Carberry, D., Johnson, A., and Wong, J.: ‘A time-multiplexed FPGA’. Proc. IEEE Workshop on FPGAs for Custom Computing Machines, 1997, pp. 22–28

9 Bellows, P., and Hutchings, B.: ‘JHDL - an HDL for reconfigurable systems’. Proc. IEEE Symp. on FPGAs for Custom Computing Machines, 1998, pp. 175–184

10 Guccione, S., Levi, D., and Sundararajan, P.: ‘JBits: Java-based interface for reconfigurable computing’. Proc. Conf. on Military and Aerospace Application of Programmable Devices and Technology, 1999

11 Luk, W., and McKeever, S.W.: ‘Pebble: a language for parameterised and reconfigurable hardware design’, Lect. Notes Comput. Sci., 1998, 1482, pp. 9–18

12 Derbyshire, A., Gause, J., and Luk, W.: ‘Incremental routing for run-time parametrisable designs’. Proc. 3rd Workshop on Reconfigurable Computing and Applications, 2003

13 Luk, W., Kean, T., Derbyshire, A., Gause, J., McKeever, S.W., and Yeow, A.: ‘Parameterised hardware libraries for configurable system-on-chip technology’. Proc. of World Multiconference on Systemics, Cybernetics and Informatics, 2001, vol. XV, pp. 223–228

14 James-Roxby, P., Cerro-Prada, E., and Charlwood, S.: ‘Core-based design methodology for reconfigurable computing applications’, IEE Proc., Comput. Digit. Tech., 2000, 147, (3), pp. 142–146

15 Derbyshire, A., and Luk, W.: ‘Compiling run-time parametrisable designs’. Proc. IEEE Int. Conf. on Field-programmable Technology, 2002, pp. 44–51

16 Sikora, T., and Makai, B.: ‘Shape-adaptive DCT for generic coding of video’, IEEE Trans. Circuits Syst. Video Technol., 1995, 5, pp. 59–62

17 Gause, J., Cheung, P.Y.K., and Luk, W.: ‘Static and dynamic reconfigurable designs of a 2D shape-adaptive DCT’, Lect. Notes Comput. Sci., 2000, 1896, pp. 96–105

18 Gause, J., Cheung, P.Y.K., and Luk, W.: ‘Reconfigurable shape-adaptive template matching architectures’. Proc. IEEE Symp. on Field-Programmable Custom Computing Machines, 2002, pp. 98–107
