
A Fast and Area Efficient 2-D Convolver for Real Time Image Processing

Dharmendra Kumar Yadav, Ajay Kumar Gupta and Amit Kumar Mishra
Electronics and Communication Department, Indian Institute of Technology Guwahati

Abstract: A two-dimensional (2-D) convolver is a basic processing unit used in real-time video and image processing algorithms. VLSI chip area and external memory bus bandwidth are the two major concerns in designing an efficient and fast 2-D convolver. Until now, these two parameters have been considered separately. Achieving low complexity and high packing density at the same time is a difficult task. This paper proposes a new FPGA-oriented architecture that combines a multiplierless 2-D convolver with an area-efficient buffering scheme, achieving high speed while saving up to 80% of chip area.

I. INTRODUCTION

Image and video processing algorithms need to handle huge amounts of data [1]. Real-time applications need fast processing, which can be achieved with good buffering schemes. A 2-D convolver for this type of application is both computationally expensive and memory-intensive [2]. For a convolution mask of size R × S, a throughput rate of 1 clock/pixel requires R × S multiplications and R × (S - 1) additions, along with R × S accesses to the input data.

2-D convolvers using parallel architectures, which exploit the inherent data and instruction level parallelism of 2-D convolution, were proposed to speed up the calculation [3]-[4]. Initially SRAM and DRAM were used to hold the input data, but their memory bandwidth cannot directly satisfy the requirements of most real-time applications. To avoid direct access to external memories, on-chip data buffers can be designed to obtain sufficiently large bandwidth by attaching multiple data ports to the internal data buffers [5]. Full buffering and single-windowed partial buffering schemes have been proposed for the R × S convolver architecture. In full buffering, a large portion of the data buffers is used as delay lines that temporarily hold the data. This method is not economical when the convolution mask is large, because a large amount of FPGA resources is then needed. Another buffering scheme, known as single-windowed partial buffering, was proposed in [3]; it substantially reduces the on-chip FPGA resources because the delay lines are eliminated. However, it increases the external memory bandwidth by a factor of R for the same throughput rate. To achieve high performance as well as small chip area, we propose a scheme known as multiwindowed partial buffering (MWPB), which provides a good balance between on-chip resource utilization and external memory bandwidth.

Chip area can be reduced further by reducing the area of the 2-D convolver block (Fig. 1). A 2-D convolver for real-time image processing needs a large number of multipliers, which require a large amount of chip area. Several architectures have been introduced to reduce the chip area using a number of shift-and-accumulate (SA) units and adders [6], [7]. The architecture proposed in this paper uses a 2-D convolver block with only a single SA, without degrading the performance. The combination of this 2-D convolver block and the MWPB scheme improves performance in terms of both throughput and chip area.

The rest of the paper is organized as follows. The next section discusses two major architectures from the literature and introduces the proposed architecture. The third section analyzes the performance of the proposed architecture and compares it with those from the literature. The last section ends the paper with some concluding remarks.

II. PROPOSED ARCHITECTURE

In this section we first introduce the components of the proposed architecture and compare them with various existing architectures. We then present the architectural details of the proposed buffering scheme and of an area-efficient 2-D convolver. Throughout this paper we consider an input image of size M × N and a convolution mask of size R × S.

A. MWPB scheme

In general, the on-chip data buffer is used to avoid direct access to external memories. In some papers a full buffering (FB) scheme was adopted in order to provide data at a fast rate to the 2-D convolver [3]. In this scheme the external memory pixels are shifted into the buffers line by line until R - 1 raster lines and the first S pixels of the next line are loaded (see Fig. 2). To achieve a throughput rate of 1 clock/pixel, each newly shifted pixel effectively moves the convolution window to the next position.

In the FB scheme, R - 1 delay lines, each of length N - S, and R sets of register arrays, each consisting of S shift registers, are employed to provide the data used by the 2-D convolver. The advantage of the FB scheme is that only a single datapath is needed to feed data to the internal buffers. The disadvantage is that the delay lines require a large number of shift registers and hence large FPGA resources [5], [8]. If the external memory bus word length is larger than the pixel data length, an input first-in first-out (FIFO) buffer is also included in this architecture.

As an alternative to FB, the single-windowed partial buffering (SWPB) scheme was proposed, in which only a small number of image pixels are stored in the on-chip buffers [3], [5], [8]. In the SWPB scheme each set of shift register arrays in the convolution window receives the pixels belonging to consecutive rows of the input image through a FIFO. With pixels shifted from the FIFOs into the shift register arrays, a column of data is read into the convolution window, and consequently the window moves to the next position (see Fig. 3).

Compared with FB, SWPB eliminates the delay lines completely (Fig. 3). A large reduction in shift registers is therefore achieved at the expense of a small increase in FIFO storage. The disadvantage of this method is its large external memory bus bandwidth requirement, which limits the processing speed: to achieve a throughput rate of 1 clock/pixel, the external memory bus must deliver R pixels/clock.

Fig. 1. Generic block diagram of proposed architecture

Figure 4 illustrates the proposed multiwindow partial buffering (MWPB) scheme. The basic idea of the MWPB scheme is to reuse data that are already stored in the internal buffers [9]. It requires R + S - 1 shift register arrays to hold all the pixels in the (R + S - 1) × S area (Fig. 4). Unlike in the FB and SWPB schemes, the S pixels in each shift register array are not fed to the 2-D convolver simultaneously but serially. Only one register in each shift register array is accessible in each cycle, and a rotationally incremented pointer addresses the output register. Therefore, a total of R + S - 1 pixels of the same column of the input image, belonging to S neighboring windows in the column-wise direction, are provided to the 2-D convolver in each cycle. After S cycles, all the data in the current (R + S - 1) × S area have been fed to the 2-D convolver. The shift register arrays are then updated: a new column of data is shifted in from the FIFOs, which effectively moves the (R + S - 1) × S area to the next position.
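The data reuse described above can be illustrated with a minimal Python sketch. It is not taken from the paper: the helper names `windows_in_block` and `serial_columns` and the toy 3 × 3 mask are assumptions made for illustration. The sketch checks that an (R + S - 1) × S block of buffered pixels contains S vertically adjacent R × S windows, and that reading one register position per cycle delivers R + S - 1 pixels of a single image column.

```python
def windows_in_block(block, R, S):
    """Return every R x S window contained in an (R + S - 1) x S pixel block."""
    rows = len(block)
    if rows != R + S - 1 or any(len(row) != S for row in block):
        raise ValueError("block must have R + S - 1 rows of S pixels")
    # Sliding the window down one row at a time yields rows - R + 1 = S windows.
    return [[row[:] for row in block[top:top + R]] for top in range(rows - R + 1)]


def serial_columns(block):
    """Yield one column per cycle, mimicking the rotating output pointer."""
    for pointer in range(len(block[0])):
        yield [row[pointer] for row in block]


R, S = 3, 3  # hypothetical mask size chosen for the example
# Toy block: the pixel at (row r, column c) is encoded as 10 * r + c.
block = [[10 * r + c for c in range(S)] for r in range(R + S - 1)]

windows = windows_in_block(block, R, S)
columns = list(serial_columns(block))
print(len(windows), "windows share one block")
print(len(columns), "cycles,", len(columns[0]), "pixels per cycle")
```

Under these assumptions, one block load serves S windows, which is why the buffers need refreshing only once every S cycles.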

Fig. 2. Full buffering Scheme

Fig. 3. Single Windowed Partial Buffering Scheme

For the MWPB scheme, multiple dataflows must also be provided to update the convolution window. But unlike in the SWPB scheme, the convolution window in the MWPB scheme is updated only every S cycles, which means that the shift register arrays work at a much lower frequency. Therefore, a total of R + S - 1 pixels are fetched from the external memory every S cycles, and the resulting memory bus bandwidth is (R + S - 1)/S pixels/clock. For most 2-D convolution masks, this is only about twice the external memory bus bandwidth of the FB scheme. One disadvantage of the MWPB scheme is that the output pixels are no longer in raster scan format; instead, a column-major zigzag scan format is generated. A row-major zigzag scan path can also be generated by modifying the buffering scheme shown in Fig. 4 so that each FIFO contains column-wise data. The advantage of this scheme is a small external memory bus bandwidth requirement combined with a small amount of on-chip FPGA resources for buffers.
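The bandwidth figures quoted for the three schemes can be compared with a short Python sketch. The function name `bandwidth` and the choice of square masks are assumptions for illustration; the formulas themselves (1, R and (R + S - 1)/S pixels/clock at a throughput of 1 clock/pixel) are the ones stated in the text.

```python
from fractions import Fraction


def bandwidth(scheme, R, S):
    """External memory bandwidth (pixels/clock) needed for 1 clock/pixel."""
    if scheme == "FB":
        return Fraction(1)             # one new pixel per clock
    if scheme == "SWPB":
        return Fraction(R)             # one new column of R pixels per clock
    if scheme == "MWPB":
        return Fraction(R + S - 1, S)  # R + S - 1 pixels every S clocks
    raise ValueError(f"unknown scheme: {scheme}")


for size in (3, 5, 7, 9, 11):
    row = [float(bandwidth(s, size, size)) for s in ("FB", "SWPB", "MWPB")]
    print(f"{size} x {size}: FB={row[0]}, SWPB={row[1]}, MWPB={row[2]:.2f}")
```

For square masks the MWPB value is (2R - 1)/R, which stays below 2 pixels/clock for any R, while the SWPB requirement grows linearly with R.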

B. Multiplierless 2-D convolver

This section describes the modified multiplierless 2-D convolver algorithm [6]. Consider a pixel position (x, y) in the image data and a convolution filter with a mask size of 3 × 3. Let H(x, y) be the mask coefficients, F(x, y) the input sequence and G(x, y) the output sequence. Using the 2-D convolution method, the output sequence is given by equation (1):

$$G(x, y) = \sum_{i=0}^{2} \sum_{j=0}^{2} H(i, j)\, F(x - i, y - j) \qquad (1)$$

Implementing the above equation requires nine 8-bit × 8-bit multipliers and a 16-bit tree adder (eight 16-bit adders), as shown in Fig. 5. Since multipliers require a large area on a VLSI chip, the equation can be modified to avoid multipliers and hence save area.
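As a reference point for the multiplierless variants, the direct form of equation (1) can be written as a minimal Python sketch. The function name `convolve_pixel` and the sample data are assumptions for illustration, and the sketch only handles positions with x, y ≥ 2 so that the indices x - i and y - j stay inside the image.

```python
def convolve_pixel(F, H, x, y):
    """Direct form of equation (1): nine multiplications and eight additions."""
    total = 0
    for i in range(3):
        for j in range(3):
            total += H[i][j] * F[x - i][y - j]  # one multiplier per mask tap
    return total


# Hypothetical 4 x 4 image and 3 x 3 mask used only to exercise the function.
F = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
H = [[1, 0, 2],
     [0, 3, 0],
     [4, 0, 5]]

print(convolve_pixel(F, H, 2, 2))
```

Each of the nine products in the inner loop corresponds to one of the nine hardware multipliers that the following derivation removes.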

Considering 8-bit coefficients, each coefficient can be written as follows:

$$H(i, j) = \sum_{k=0}^{7} h_k(i, j)\, 2^k \qquad (2)$$

Fig. 4. Multi Windowed Partial Buffering Scheme

Fig. 5. The filter architecture using multipliers based on (1).

where $h_k(i, j)$ is the bit-formatted data, i.e., either 0 or 1, and k is the weight of each partial product. Substituting (2) into (1), equation (1) can be written as

$$G(x, y) = \sum_{i=0}^{2} \sum_{j=0}^{2} \left[ \sum_{k=0}^{7} F(x - i, y - j)\, h_k(i, j)\, 2^k \right] \qquad (3)$$

Hence, the summation over the weight k can be computed by shift-and-accumulate operations instead of multiplications, each multiplier being replaced by an SA. Several architectures based on equation (3) have been proposed [10], [11]. Implementing equation (3) directly requires nine 16-bit shift-and-accumulators (SAs) and the 16-bit tree adder (eight 16-bit adders), as shown in Fig. 6. Architectures based on equation (3) therefore need a smaller VLSI area than architectures based on equation (1) [12].

Fig. 6. The filter architecture using SAs based on (3).
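The replacement of one multiplier by one SA can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the circuit from the cited architectures: the names `shift_and_accumulate` and `convolve_eq3` are hypothetical, operands are unsigned, and Python integers do not model the 16-bit register width.

```python
def shift_and_accumulate(pixel, coeff, bits=8):
    """Multiply pixel by coeff using only bit tests, shifts and additions."""
    acc = 0
    for k in reversed(range(bits)):        # MSB first
        h_k = (coeff >> k) & 1             # coefficient bit h_k
        partial = pixel if h_k else 0      # AND of the pixel with one bit
        acc = (acc << 1) + partial         # 1-bit left shift, then accumulate
    return acc


def convolve_eq3(pixels, coeffs):
    """Equation (3): one SA per tap, followed by an adder tree."""
    return sum(shift_and_accumulate(p, h) for p, h in zip(pixels, coeffs))


# Hypothetical window pixels and mask coefficients in row-major order.
pixels = [12, 200, 7, 255, 0, 33, 91, 128, 64]
coeffs = [1, 2, 1, 2, 4, 2, 1, 2, 1]
print(convolve_eq3(pixels, coeffs))
```

The nine calls to `shift_and_accumulate` correspond to the nine SAs of this structure; the outer `sum` plays the role of the 16-bit tree adder.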

A further modification of equation (3) reduces the number of SAs. Exchanging the order of summation in equation (3) gives the modified equation (4):

$$G(x, y) = \sum_{k=0}^{7} \left[ \sum_{i=0}^{2} \sum_{j=0}^{2} F(x - i, y - j)\, h_k(i, j) \right] 2^k \qquad (4)$$

Note that the result inside the brackets in (3) is 16 bits wide, whereas the result inside the brackets in (4) is 8 bits wide. Therefore, (4) requires eight 8-bit adders instead of the eight 16-bit adders needed by (3). Fig. 7 shows the proposed filter architecture based on (4), which consists of 72 two-input AND gates, an 8-bit adder tree (eight 8-bit adders) and only one SA. Since all the partial products of the nine AND gate blocks have the same weight, the 8-bit adder tree simply adds them. The result of the adder tree is then added directly to the left-shifted value held in the SA. To prevent overflow, either a wider adder should be used instead of the 8-bit adder tree or the image data should be scaled down before addition.
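The overflow remark can be quantified with a small worked calculation. The sketch below assumes the worst case of unsigned 8-bit pixels and coefficients that are all at their maximum value, which is an assumption made here and not a condition stated in the paper; practical masks are usually scaled so that much narrower datapaths suffice.

```python
def bits_needed(value):
    """Number of bits required to hold a non-negative integer."""
    return max(1, value.bit_length())


TAPS = 9          # 3 x 3 mask
MAX_PIXEL = 255   # assumed unsigned 8-bit pixel
MAX_COEFF = 255   # assumed unsigned 8-bit coefficient

tree_max = TAPS * MAX_PIXEL              # largest sum of nine partial products
acc_max = TAPS * MAX_PIXEL * MAX_COEFF   # largest accumulated output

print("adder tree:", tree_max, "->", bits_needed(tree_max), "bits")
print("accumulator:", acc_max, "->", bits_needed(acc_max), "bits")
```

Under these worst-case assumptions the adder tree output needs 12 bits and the accumulator 20 bits, which illustrates why a wider adder or prior scaling of the image data is recommended.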

Fig. 7. The proposed filter architecture based on (4).

The 3 × 3 convolution operation based on (4) proceeds as follows. First, the logic AND gates produce nine partial products. Second, the tree adder consisting of eight 8-bit adders sums the nine partial products in the brackets of (4). Third, the SA accumulates the result of the tree adder sequentially with the 1-bit left-shifted previous result. The coefficient bits are shifted to the left so that the partial products are generated sequentially from the MSB (most significant bit) to the LSB (least significant bit). Hence, eight accumulations produce one filtered output sample, which is the same number of accumulations as in Fig. 6.

The architecture in Fig. 7 thus uses one 16-bit SA instead of the nine 16-bit SAs of Fig. 6. In addition, the 8-bit tree adder in Fig. 7, rather than the 16-bit tree adder of Fig. 6, sums the nine partial products before the shift-and-accumulation (in terms of k) is performed. Therefore, this architecture requires only 72 two-input AND gates, one 8-bit tree adder and one SA.

As Fig. 7 shows, the proposed filter consists of nine register units (RUs) and a computation unit (CU). In each clock cycle the nine RUs produce nine partial products, each from one pixel and the MSB of the corresponding coefficient. Each RU produces eight partial products of its pixel with all bits of the coefficient in eight clock cycles. Hence, the RUs produce 72 partial products in total (9 partial products per clock cycle). The CU sums nine partial products in each clock cycle and performs eight shift-and-accumulations of the summation results in 8 clock cycles.

Each RU consists of an 8-bit data register, an 8-bit coefficient register and eight logical AND gates. These gates multiply the 8-bit data by a 1-bit coefficient. The value in the data register is transferred to the next RU after 8 clock cycles. The coefficient bits are rotated left at each clock cycle. When the MSB of the coefficient register is 1, the output of the AND gates is the pixel data itself; when it is 0, the output is all zeros, that is, a zero partial product. Hence each RU produces eight 8-bit partial products of a pixel and a coefficient sequentially in 8 clock cycles without using any parallel multiplier.

The 8-bit tree adder, composed of eight group CLAs (carry look-ahead adders), is a pipelined structure that sums the nine partial products from the nine RUs. The SA adds the 8-bit value from the tree adder to the 1-bit left-shifted accumulator value. In 8 clock cycles, the SA performs 8 accumulations. Since the CU performs the shift-and-accumulation after the summation of the nine partial products, the operand size of the adders in the tree adder is reduced from 16 bits to 8 bits and the number of SAs is reduced from 9 to 1 compared with previous architectures [7], [10], [12]. Hence, the VLSI area is reduced significantly.
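The RU and CU behavior described above can be modeled cycle by cycle in a short Python sketch. It is a behavioral illustration only, with assumptions stated plainly: the name `convolve_bit_serial` is hypothetical, values are unsigned, pipelining of the adder tree is ignored, and Python integers are unbounded, so the overflow issue discussed in the text is not modeled.

```python
def convolve_bit_serial(pixels, coeffs, bits=8):
    """Behavioral model of equation (4): nine RUs, one adder tree, one SA."""
    mask = (1 << bits) - 1
    coeff_regs = list(coeffs)   # one coefficient register per RU
    acc = 0                     # the single shift-and-accumulate register
    for _ in range(bits):       # one coefficient bit-plane per clock cycle
        # RUs: AND each pixel with the MSB of its coefficient register.
        partials = [p if (h >> (bits - 1)) & 1 else 0
                    for p, h in zip(pixels, coeff_regs)]
        tree_sum = sum(partials)        # adder tree: all partials share a weight
        acc = (acc << 1) + tree_sum     # SA: 1-bit left shift, then add
        # Rotate every coefficient register left by one bit.
        coeff_regs = [((h << 1) | (h >> (bits - 1))) & mask for h in coeff_regs]
    return acc


# Hypothetical window pixels and mask coefficients in row-major order.
pixels = [12, 200, 7, 255, 0, 33, 91, 128, 64]
coeffs = [1, 2, 1, 2, 4, 2, 1, 2, 1]
direct = sum(p * h for p, h in zip(pixels, coeffs))
print(convolve_bit_serial(pixels, coeffs) == direct)
```

The loop body runs eight times per output sample, matching the eight accumulations described in the text, and the result agrees with the direct sum of products.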

III. PERFORMANCE ANALYSIS

A. MWPB Scheme

The main features of the three buffering schemes discussed in the previous section are summarized in Table I. The FIFO depth (P in Table I) is assumed to be 8 for all buffering schemes. For the case study we take an input image of size 1024 × 1024, a convolution mask of size 5 × 5 and a 32-bit SRAM as the external memory, and we assume that a single read (or write) operation fetches (or stores) four byte-sized pixels from (or to) the external memory.

Flip-flop counts and throughput rates for window sizes from 3 × 3 to 11 × 11 are given in Table II for the three buffering schemes (derived from [8]). The area utilization of the PB schemes depends mainly on the depth of the FIFOs (see Table I). The MWPB scheme offers a good tradeoff between FPGA resources and available external memory bandwidth.
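The flip-flop counts in Tables I and II can be reproduced from the Table I formulas with the following Python sketch. The function name `ff_count` is hypothetical, and the factor of 8 flip-flops per stored pixel is an inference from the byte-sized pixels of the case study; it reproduces the tabulated values but is not stated explicitly in the paper.

```python
def ff_count(scheme, R, S, N=1024, P=8, pixel_bits=8):
    """Flip-flop count: (shift-register pixels + memory pixels) * bits per pixel."""
    if scheme == "FB":
        registers, memory = R * S, (R - 1) * (N - S) + P
    elif scheme == "SWPB":
        registers, memory = R * S, R * P
    elif scheme == "MWPB":
        registers, memory = (R + S - 1) * S, (R + S - 1) * P
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return pixel_bits * (registers + memory)


for size in (3, 5, 7, 9, 11):
    counts = [ff_count(s, size, size) for s in ("FB", "SWPB", "MWPB")]
    print(f"{size} x {size}: FB={counts[0]}, SWPB={counts[1]}, MWPB={counts[2]}")
```

For the 5 × 5 case study this gives 32872, 520 and 936 flip-flops for FB, SWPB and MWPB respectively, in agreement with the tables.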

B. Multiplierless 2-D convolver

Table III compares the computation unit architectures based on (1), (3) and (4). The gate counts are taken from the Samsung 0.8 µm SOG cell library (KG60K) data book [14]. The 8-bit × 8-bit multiplier has 525 gates, the two-input AND gate has 2 gates, and the 16-bit SA has 320 gates. The 16-bit SA is composed of an 8-bit full adder, an 8-bit half adder and two 16-bit registers. The 16-bit adder has 186 gates and the 8-bit adder has 88 gates. As shown in Table III, the proposed computation unit reduces the gate count by approximately 80% (from 6,213 to 1,168 gates) compared with the computation unit based on (1) [12] and by more than 70% (from 4,512 to 1,168 gates) compared with computation units based on (3) [10]-[12].
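The gate totals follow directly from the per-cell gate counts quoted above, as the following Python sketch shows. The dictionary keys and the function name `total_gates` are illustrative assumptions; the gate counts and module quantities are those given in the text and in Table III.

```python
# Gate counts per cell, as quoted from the cell library data book [14].
GATES = {"mult8x8": 525, "and2": 2, "sa16": 320, "add16": 186, "add8": 88}


def total_gates(parts):
    """Sum the gate counts for a mapping of cell name to number of instances."""
    return sum(GATES[name] * count for name, count in parts.items())


eq1 = total_gates({"mult8x8": 9, "add16": 8})           # multipliers + 16-bit tree
eq3 = total_gates({"and2": 72, "add16": 8, "sa16": 9})  # nine SAs + 16-bit tree
eq4 = total_gates({"and2": 72, "add8": 8, "sa16": 1})   # one SA + 8-bit tree

print(eq1, eq3, eq4)
print(f"saving vs (1): {100 * (1 - eq4 / eq1):.1f}%")
print(f"saving vs (3): {100 * (1 - eq4 / eq3):.1f}%")
```

The total for the architecture based on (4) is reached only with eight 8-bit adders (8 × 88 = 704 gates) in the tree adder.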

IV. CONCLUSION

In this paper we have presented an area-efficient and fast 2-D convolver structure. To make it fast for a given external memory bus bandwidth, we use the multiwindowed partial buffering (MWPB) scheme, which also uses fewer on-chip FPGA resources. To achieve high packing density we use a multiplierless 2-D convolver, which reduces the chip area by about 80% at low cost. The proposed architecture operates at sufficiently high speed and can be a practical solution for real-time applications, achieving small chip area and high speed at the same time.

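TABLE I
COMPARISON OF DIFFERENT BUFFERING SCHEMES FOR AN R × S CONVOLVER.

| Scheme | Throughput (clock/pixel) | Area utilization: shift registers (pixels) | Area utilization: memory (pixels) | Bandwidth (pixel/clock) | Case study: area utilization (FF count) | Case study: bandwidth (pixel/clock) |
|--------|--------------------------|--------------------------------------------|-----------------------------------|-------------------------|-----------------------------------------|-------------------------------------|
| FB     | 1                        | R × S                                      | (R - 1) × (N - S) + P             | 1                       | 32872                                   | 1                                   |
| SWPB   | 1                        | R × S                                      | R × P                             | R                       | 520                                     | 5                                   |
| MWPB   | 1                        | (R + S - 1) × S                            | (R + S - 1) × P                   | (R + S - 1)/S           | 936                                     | 1.8                                 |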

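TABLE II
COMPARISON OF AREA UTILIZATION AND THROUGHPUT RATE.

| Window  | FB area (FF count) | FB throughput (clock/pixel) | SWPB area (FF count) | SWPB throughput (clock/pixel) | MWPB area (FF count) | MWPB throughput (clock/pixel) |
|---------|--------------------|-----------------------------|----------------------|-------------------------------|----------------------|-------------------------------|
| 3 × 3   | 16472              | 1                           | 264                  | 1                             | 440                  | 1                             |
| 5 × 5   | 32872              | 1                           | 520                  | 1.5                           | 936                  | 1                             |
| 7 × 7   | 49272              | 1                           | 840                  | 2                             | 1560                 | 1                             |
| 9 × 9   | 65672              | 1                           | 1224                 | 2.5                           | 2312                 | 1                             |
| 11 × 11 | 82072              | 1                           | 1672                 | 3                             | 3192                 | 1                             |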

TABLE III
COMPARISONS AMONG COMPUTATION UNITS BASED ON (1), (3) AND (4).

| Modules                | Architecture based on (1)      | Architecture based on (3) | Architecture based on (4) |
|------------------------|--------------------------------|---------------------------|---------------------------|
| Multipliers            | Nine 8-bit × 8-bit multipliers | -                         | -                         |
| AND gates              | -                              | 72 two-input AND gates    | 72 two-input AND gates    |
| Tree adder             | Eight 16-bit adders            | Eight 16-bit adders       | Eight 8-bit adders        |
| Shift-and-accumulators | -                              | Nine 16-bit SAs           | One 16-bit SA             |
| Total gate count       | 6,213                          | 4,512                     | 1,168                     |

REFERENCES

[1] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 2002.

[2] C. Torres-Huitzil and M. Arias-Estrada, "Real-time image processing with a compact FPGA-based systolic architecture," Real-Time Imaging, vol. 10, no. 3, pp. 177-187, Jun. 2004.

[3] B. Bosi, G. Bois, and Y. Savaria, "Reconfigurable pipelined 2-D convolvers for fast digital signal processing," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 7, no. 3.

[4] S. Perri, M. Lanuzza, P. Corsonello, and G. Cocorullo, "A high-performance fully reconfigurable FPGA-based 2-D convolution processor," Microprocess. Microsyst., vol. 29.

[5] X. Liang, J. S. N. Jean, and K. Tomko, "Data buffering and allocation in mapping generalized template matching on reconfigurable systems," J. Supercomput., vol. 19, no. 1, pp. 77-91, 2001.

[6] Se Young Eun and Myung H. Sunwoo, "An efficient 2-D convolver chip for real-time image processing," Design Automation Conference, Proceedings of the ASP-DAC, Asia and South Pacific, pp. 329-330, Feb. 1998.

[7] K. Khoo, A. Kwentus, and A. N. Willson, Jr., "An efficient 175 MHz programmable FIR digital filter," IEEE Int. Conf. Circuits Syst.

[8] F. Cardells-Tormo and P. Molinet, "Area-efficient 2-D shift-variant convolvers for FPGA-based digital image processing," IEEE Trans. Circuits Syst. II: Exp. Briefs, vol. 53, no. 2, pp. 105-109, Feb. 2006.

[9] Hui Zhang, Mingxin Xia, and Guangshu Hu, "A multiwindow partial buffering scheme for FPGA-based 2-D convolvers," IEEE Trans. Circuits Syst.

[10] Woo Jin Oh and Yong Hoon Lee, "Implementation of programmable multiplierless FIR filters with power-of-two coefficient," IEEE Trans. Circuits Syst., vol. 42.

[11] T. Yoshino, R. Jain, P. T. Yang, H. Davis, W. Gass, and A. H. Shah, "A 100-MHz 64-tap FIR digital filter in 0.8-µm BiCMOS gate array," IEEE J. Solid-State Circuits, vol. 25, pp. 1494-1501, Dec. 1990.

[12] HARRIS Semiconductor Inc., Digital Signal Processing, 1994.

[13] Xilinx Inc., Spartan-3 FPGA Family: Complete Datasheet [Online]. Available: http://www.xilinx.com/.

[14] Samsung Electronics, SEC KGL 60K Cell Library Data Book, 1995.
