

An Adaptive Bilateral Motion Estimation Algorithm and its Hardware Architecture

Abdulkadir Akin, Mert Cetin, Burak Erbagci, Ozgur Karakaya, and Ilker Hamzaoglu

Electronics Engineering, Sabanci University

Istanbul, Turkey

{abdulkadir, mertc, berbagci, ozgurkarakaya}@su.sabanciuniv.edu, [email protected]

Abstract—In this paper, we propose an adaptive bilateral motion estimation (Bi-ME) algorithm for frame rate up-conversion of High Definition (HD) video. The proposed algorithm can be used as a refinement step after a true motion estimation algorithm. It refines the motion vector field between successive frames by employing a spiral search pattern and by adaptively assigning weights to candidate search locations. In addition, we propose a high performance hardware architecture for implementing the proposed Bi-ME algorithm. The proposed hardware uses an efficient memory organization and a novel data reuse scheme in order to reduce the memory bandwidth and control overhead. The proposed hardware consumes 24% of the slices in a Xilinx 2V8000FF1517-5 FPGA. It can work at 107 MHz in the same FPGA and is capable of processing 124 1920x1080 full HD frames per second (fps), therefore doubling the frame rate to 248 fps.

Keywords—Bilateral Motion Estimation, True Motion Estimation, Frame Rate Up-conversion, Hardware Implementation, FPGA.

I. INTRODUCTION

With the advancement in display and video technologies, the demand for and availability of large flat panel High Definition Television (HDTV) and PC displays with frame rates of up to 100, 120 and, most recently, 240 Hz are increasing. On the other hand, movie material is recorded at 24, 25 or 30 frames per second, and HDTV and various other video materials have 50 or 60 Hz temporal resolutions. To display these formats correctly on high frame rate panels, new frames must be generated and inserted into the original sequence to increase its frame rate. Therefore, Frame Rate Up-Conversion (FRUC) has become a necessity [1].

The existing FRUC algorithms are mainly classified into two types [2]. The first class of FRUC algorithms does not take the motion of objects into account, e.g. frame repetition or linear interpolation. However, at high spatial and temporal resolutions, these algorithms produce visual artifacts such as motion blur and judder. The second class of FRUC algorithms takes the motion of objects into account in order to reduce these artifacts and construct better interpolated frames.

Motion Compensated Frame Rate Up-Conversion (MC-FRUC) algorithms consist of two main stages: Motion Estimation (ME) and Motion Compensated Interpolation (MCI). In ME, a Motion Vector (MV) is calculated between successive frames, and in the MCI step an interpolated frame is generated using the motion vector data obtained from the previous step. Among the available ME algorithms, Block Matching (BM) is the most preferred method because of its easy implementation and compact representation of the motion field. For MC-FRUC applications, it is important that the motion vectors represent the real motion of the objects. This is called true motion. ME algorithms that find the best Sum of Absolute Differences (SAD) match are sufficient for video compression applications. However, the motion vectors giving the best SAD match may not represent the true motion of the objects. Therefore, these ME algorithms, in general, perform poorly for MC-FRUC applications [1].

Figure 1. MC-FRUC System

In MC-FRUC, interpolated frames are generated by performing interpolation between reference frames (RF) and current frames (CF) based on MVs obtained by a true ME algorithm. These MVs are obtained by an ME process that assumes objects move along the motion trajectory. However, during this process holes and overlapped areas may be produced in the interpolated frames, in regions where no motion trajectory passes through and where multiple motion trajectories pass through, respectively [3]. This degrades the quality of the generated frames. The problem can be addressed by median filtering overlapped pixels [4], by using spatial interpolation methods for holes [5], or by prediction methods that analyze MV fields for covered and uncovered regions [3][6]. However, these methods require complex operations and give unsatisfactory results in cases of non-static backgrounds and camera motion.

Bilateral ME (Bi-ME) algorithms have recently been proposed to avoid holes and overlapped areas in interpolated frames more effectively [7]-[10]. Bi-ME algorithms construct a MV field for the interpolated frame and therefore do not produce any overlapped areas or holes during interpolation. As shown in Fig. 1, Bi-ME algorithms can be used as a refinement step after ME [7].

In this paper, we propose an adaptive Bi-ME algorithm for FRUC of HD video. The proposed algorithm can be used as a refinement step after a true ME algorithm. There are several ME algorithms that aim to extract the true motion between the frames of video sequences. 3D Recursive Search (3DRS) [11] is one of the best true motion estimation algorithms. Therefore, we used the 3DRS algorithm for generating the initial MV field.

The proposed Bi-ME algorithm refines the motion vector field between successive frames by employing a spiral search pattern and by adaptively assigning weights to candidate search locations. The spiral search pattern [12] is commonly used for ME in video compression applications [13]. However, to the best of our knowledge, this is the first use of the spiral search pattern for Bi-ME. The proposed Bi-ME algorithm searches for the best SAD match in the Bilateral Search Window (BSW), starting from the center, and evaluates the candidate search locations by assigning weights. It preserves the true motion property of the motion vector field by favoring the candidate search locations near the center, where the initial MV points.

In addition, we propose a high performance hardware architecture for implementing the proposed adaptive Bi-ME algorithm. To the best of our knowledge, this is the first Bi-ME hardware implementation in the literature. In conventional BM ME algorithms, the current macroblock (MB) pixels do not change during the search process of a MB; only the reference MB pixels change for each search location. However, in Bi-ME algorithms, both the current MB pixels and the reference MB pixels change during the search process. This increases the control overhead and the number of memory accesses. The all-connected 256 processing element (PE) systolic array, ladder type memory organization, symmetric data placement in memory and data alignment techniques proposed in this paper reduce the amount of memory accesses by enabling a high degree of data reuse.

The rest of the paper is organized as follows. Section II explains the proposed Bi-ME algorithm and presents the experimental results. Section III explains the proposed hardware architecture for implementing this algorithm. Section IV concludes the paper.

II. PROPOSED BILATERAL ME ALGORITHM

As shown in Fig. 1, the proposed adaptive Bi-ME algorithm refines the MVs found by a true ME algorithm to improve the FRUC quality. As shown in Fig. 2, the search process is performed by calculating the SAD between the 16x16 current MB and the 16x16 reference MB at each candidate search location in their respective BSWs. After the SAD value for a search location is calculated, the current MB and the reference MB are moved symmetrically in the CF BSW and the RF BSW.

The proposed Bi-ME algorithm determines the BSWs in the CF and the RF in the same way as the Bi-ME algorithm proposed in [7]. The first difference between the initial MV refinement step in [7] and our algorithm is that we use a [-4, +4] search range instead of [-2, +2]. The second difference is that, instead of the row by row or column by column pattern of the full search, we use a weighted spiral search.

As proposed in [7]-[10], for each search location, the SAD between the corresponding current and reference MBs in the CF and RF is calculated. However, we used different weights (W) during the spiral search as the distance from the center increases. Therefore, instead of the sum of bilateral absolute difference (SBAD) criterion proposed in [8], we used the weighted SBAD (WSBAD) criterion (4) for search location comparison. The initial MVs for the intermediate frame are refined by searching all the candidate search locations in the BSWs to find the search location that gives the minimum WSBAD.

The SBAD and WSBAD formulas are given in (1), (2), (3) and (4), where $B_{i,j}$ is the interpolated 16x16 MB in $f_{n-1/2}$, $s$ is a pixel position in $B_{i,j}$, $V_{initial}$ is the initial MV found by the ME step of the MC-FRUC system, and $v$ is a candidate search location for Bi-ME.

$$S_f = s - \frac{V_{initial}}{2} - v \qquad (1)$$

$$S_b = s + \frac{V_{initial}}{2} + v \qquad (2)$$

$$SBAD\left[B_{i,j}, v\right] = \sum_{s \in B_{i,j}} \left| f_n\left[S_b\right] - f_{n-1}\left[S_f\right] \right| \qquad (3)$$

$$WSBAD\left[B_{i,j}, v\right] = W + \sum_{s \in B_{i,j}} \left| f_n\left[S_b\right] - f_{n-1}\left[S_f\right] \right| \qquad (4)$$
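As an informal illustration of (1)-(4), the following Python sketch computes the WSBAD of one candidate search location. The frame layout (8-bit grayscale arrays), the integer halving of $V_{initial}$ and the helper names are assumptions made for the sketch, not details of the proposed hardware.

```python
import numpy as np

def wsbad(f_prev, f_curr, mb_top_left, v_initial, v, weight, mb_size=16):
    """WSBAD of one candidate search location v, following (1)-(4).

    f_prev, f_curr : reference frame f_{n-1} and current frame f_n (2-D numpy arrays)
    mb_top_left    : top-left pixel (y, x) of the interpolated MB B_{i,j} in f_{n-1/2}
    v_initial      : initial MV (dy, dx) found by the true ME step
    v              : candidate search location offset (dy, dx)
    weight         : weight W taken from Table I
    """
    sbad = 0
    for dy in range(mb_size):
        for dx in range(mb_size):
            s = (mb_top_left[0] + dy, mb_top_left[1] + dx)            # pixel s in B_{i,j}
            # (1): pixel position S_f in the reference frame f_{n-1}
            # (integer halving of V_initial is an assumption of this sketch)
            s_f = (s[0] - v_initial[0] // 2 - v[0], s[1] - v_initial[1] // 2 - v[1])
            # (2): pixel position S_b in the current frame f_n
            s_b = (s[0] + v_initial[0] // 2 + v[0], s[1] + v_initial[1] // 2 + v[1])
            # (3): accumulate the bilateral absolute difference
            sbad += abs(int(f_curr[s_b]) - int(f_prev[s_f]))
    # (4): WSBAD simply adds the weight W to the SBAD value
    return weight + sbad
```

The refined MV for $B_{i,j}$ is then obtained from the candidate $v$ that minimizes this value over the BSW.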

Two main criteria are used for adaptively changing the weight coefficients: the L1 norm of the initial MV and the spatial distance of the search location from the center of the BSW. The initial MVs are classified into 3 types based on their L1 norm: low motion (LM), intermediate motion (IM) and high motion (HM), as shown in (5). The spatial distances are classified into 5 types (Center, SD1, SD2, SD3 and SD4), as illustrated for the top-left pixel of a MB in Fig. 3. The weight coefficients for the different cases are shown in Table I. Since the initial MVs are assumed to represent the true motion of the objects, larger coefficients are used as the spatial distance increases, in order to favor the search locations near the center. If the L1 norm of the initial MV is large, the selection of search locations with higher spatial distances is allowed by using smaller weight coefficients.

$$\text{Vector Type} = \begin{cases} LM, & \text{if } 0 \le \left\| MV \right\|_1 < 3 \\ IM, & \text{if } 3 \le \left\| MV \right\|_1 < 19 \\ HM, & \text{if } 19 \le \left\| MV \right\|_1 \end{cases} \qquad (5)$$

TABLE I. WEIGHT COEFFICIENTS FOR DIFFERENT CASES

        Center   SD1    SD2    SD3    SD4
LM         0    1000   2500   4000   5500
IM         0    1000   2000   3000   4000
HM         0       0    500   1000   1500

Figure 3. Spatial Distances for Spiral Bi-ME
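To make the weight selection concrete, the sketch below encodes (5), Table I and a center-outward visiting order for the [-4, +4] range. The mapping of candidate offsets to Center/SD1..SD4 and the exact spiral order follow Fig. 3 and [12], which are not reproduced here, so the Chebyshev-distance classification and the ring-by-ring order used below are illustrative assumptions.

```python
# Table I: weight W per (vector type, spatial distance class Center..SD4)
WEIGHTS = {
    "LM": [0, 1000, 2500, 4000, 5500],
    "IM": [0, 1000, 2000, 3000, 4000],
    "HM": [0,    0,  500, 1000, 1500],
}

def vector_type(v_initial):
    """Equation (5): classify the initial MV by its L1 norm."""
    l1 = abs(v_initial[0]) + abs(v_initial[1])
    if l1 < 3:
        return "LM"      # low motion
    if l1 < 19:
        return "IM"      # intermediate motion
    return "HM"          # high motion

def weight_for(v, v_initial):
    """Pick W from Table I; the spatial distance class is assumed here to be
    the Chebyshev distance of the candidate offset v from the BSW center."""
    sd = min(max(abs(v[0]), abs(v[1])), 4)
    return WEIGHTS[vector_type(v_initial)][sd]

def spiral_offsets(search_range=4):
    """Candidate offsets visited from the center outwards, ring by ring
    (81 locations for the [-4, +4] range); an illustrative order only."""
    yield (0, 0)
    for r in range(1, search_range + 1):
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                if max(abs(dy), abs(dx)) == r:
                    yield (dy, dx)
```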


Figure 2. Bilateral Motion Vector Refinement. $f_{n-1}$: Reference Frame, $f_{n-1/2}$: Interpolated Frame, $f_n$: Current Frame

TABLE II. PSNR RESULTS OF FRUC ALGORITHMS

Sequence       Frame Size   Zero Motion     A       B      Prop.
Football       352x240        19.89       20.56   21.39   21.64
Mobile         352x288        25.23       27.74   26.29   28.08
Foreman        352x288        29.86       31.85   33.12   33.13
Spiderman      720x576        23.69       23.98   24.29   24.39
Irobot         720x576        23.49       24.33   24.55   24.74
Gladiator      720x576        22.06       22.90   23.77   24.26
ParkJoy        1280x720       20.11       24.10   24.11   24.32
SthlmPan       1280x720       23.96       34.00   34.13   34.69
NewMobCal      1280x720       29.76       33.70   33.85   34.82
ParkJoy        1920x1080      20.15       24.27   23.57   24.40
CrowdRun       1920x1080      24.24       27.16   28.56   28.65
DucksTakeOff   1920x1080      29.74       29.86   29.37   29.93

The Peak Signal-to-Noise Ratio (PSNR) results in dB obtained by three different FRUC algorithms for several video sequences are shown in Table II. Non-motion compensated frame interpolation (Zero Motion) results are also given as reference. Spiderman, Irobot, and Gladiator video sequences are taken from Spiderman II, Irobot and Gladiator movies. The other video sequences are commonly used benchmark videos.

The A, B, and Prop. algorithms use motion vector fields generated by the 3DRS ME algorithm between the previous and current frames [15], and use motion compensated field averaging (MC-FAVG) [1] for MCI. A does not use any refinement algorithm. B uses the bilateral refinement algorithm proposed in [7], which uses a symmetric full search over a given search range. Prop. uses the proposed refinement algorithm. A [-4, +4] search range is used for B instead of the original [-2, +2].

To evaluate the performance of the FRUC algorithms, the even-numbered frames are removed from the original video sequence, lowering the frame rate by a factor of 2. Then, ME and MCI are applied to each frame pair to interpolate the intermediate frames in order to restore the original frame rate. Finally, the PSNR between the original even-numbered frames and the interpolated frames is computed. In this experiment, 50 odd-numbered frames of each video sequence are used to interpolate 49 even-numbered frames, doubling the frame rate.
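The evaluation loop can be summarized with the following sketch, where interpolate_pair stands in for the FRUC pipeline under test (3DRS, optional bilateral refinement, MC-FAVG) and the frames are assumed to be 8-bit grayscale numpy arrays; both the function name and the data layout are assumptions of the sketch.

```python
import numpy as np

def psnr(original, interpolated, peak=255.0):
    """PSNR in dB between an original frame and its interpolated replacement."""
    diff = original.astype(np.float64) - interpolated.astype(np.float64)
    mse = np.mean(diff * diff)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak * peak / mse)

def evaluate_fruc(frames, interpolate_pair):
    """Drop every second frame, re-create it from its two kept neighbors with
    the FRUC pipeline under test, and average the PSNR against the originals."""
    scores = []
    for k in range(1, len(frames) - 1, 2):                 # each dropped frame
        recreated = interpolate_pair(frames[k - 1], frames[k + 1])
        scores.append(psnr(frames[k], recreated))
    return sum(scores) / len(scores)
```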

The results show that, for all video sequences, the proposed algorithm produces better PSNR results than the basic 3DRS algorithm (A) and the previous bilateral refinement algorithm (B).

III. PROPOSED HARDWARE ARCHITECTURE

The block diagram of the proposed adaptive Bi-ME hardware is shown in Fig. 4. The hardware is composed of 16 BRAMs, 2 Vertical Rotators, 2 Horizontal Splitters, 2 Horizontal Shifters, a PE Array, a Control Unit, an Adder Tree and a Comparator & MV Updater. The hardware finds the refined MV of a 16x16 MB using the proposed adaptive Bi-ME algorithm in a [-4, +4] pixel search range. Its latency is 9 clock cycles: 1 cycle for the Control Unit, 1 cycle for the synchronous read from memory, 1 cycle for the Vertical Rotator, 4 cycles for the Adder Tree and 2 cycles for the Comparator & MV Updater. The Control Unit generates the required address and control signals to compute the WSBAD values of the candidate search locations in the BSWs in the RF and CF.

After the WSBAD value of a search location is calculated, Comparator & MV Updater compares this WSBAD value with the minimum WSBAD value in order to determine the search location that produces minimum WSBAD value and the corresponding refined MV.

There are large intersections between the BSWs of the consecutive search locations of 16x16 MBs in RF and CF. Therefore, performing Bi-ME for these MBs using a systolic ME hardware allows significant data reuse.


Figure 4. Top-level Block Diagram of Proposed Bi-ME Hardware

A. Systolic PE Array and Data Reuse Scheme

There are 256 PEs in the PE array. The architecture of the PE array is shown in Fig. 5. After the PE array computes the SBAD value of a search location, it computes the SBAD value of the next search location in the same line or column. The BSW pixels needed for computing the SBAD value of the first search location are loaded from the BRAMs into the PE array. The PE array reuses the BSW pixels for computing the SBAD values of the neighboring search locations. Each PE is connected to all of its neighboring PEs in order to shift the BSW pixels in one of the four directions (up, down, left, right). Therefore, after the first search location, the PE array needs only 32 new pixels (16 pixels for the reference MB and 16 pixels for the current MB) to compute the SBAD value of each search location.
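This data reuse can be illustrated with a small behavioral model (not the RTL): two 16x16 arrays stand for the CF and RF pixels held in the PE registers, and a horizontal step to the next search location only shifts the two arrays in opposite directions and loads 16 + 16 = 32 new boundary pixels, mirroring the symmetric block movement described above. The class and method names below, and the use of numpy, are conveniences of the sketch.

```python
import numpy as np

class PEArrayModel:
    """Behavioral model of the 256-PE array registers (one CF and one RF pixel per PE)."""

    def __init__(self, cf_block, rf_block):
        self.cf = np.array(cf_block, dtype=np.int32)   # 16x16 current-MB pixels
        self.rf = np.array(rf_block, dtype=np.int32)   # 16x16 reference-MB pixels

    def step_horizontal(self, new_cf_col, new_rf_col, direction=1):
        """Move to the horizontally adjacent search location.
        CF and RF move symmetrically: when the CF block slides one pixel to the
        right (direction=1), the RF block slides one pixel to the left, so only
        one new 16-pixel column is loaded into each array (32 pixels total)."""
        self.cf = np.roll(self.cf, -direction, axis=1)
        self.cf[:, -1 if direction == 1 else 0] = new_cf_col
        self.rf = np.roll(self.rf, direction, axis=1)
        self.rf[:, 0 if direction == 1 else -1] = new_rf_col

    def sbad(self):
        """The adder tree sums the 256 absolute differences produced by the PEs."""
        return int(np.abs(self.cf - self.rf).sum())
```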

The architecture of a PE is shown in Fig. 6. Each PE performs an absolute difference operation between a CF BSW pixel and a RF BSW pixel. The results of the absolute difference operations performed by all PEs in the PE array for a search location in the BSW are added by an Adder Tree in order to compute the SBAD value of that search location.

Each PE contains two registers, one for storing a CF pixel and one for storing a RF pixel. Each flip-flop is connected to the flip-flops in the four neighboring PEs to reuse the pixel values for a new search location. The current MB and the reference MB are moved symmetrically in the BSWs for the next search location. Therefore, in order to reduce the control overhead of the multiplexers, neighboring PEs are inversely connected to the multiplexers, and all 2x256 multiplexers are controlled by the same shifting direction signal. For example, when the shifting direction signal is 3, all 256 PEs shift the CF pixels to the PEs on their right and the RF pixels to the PEs on their left. The PEs at the edge of the PE array are connected to the proper Horizontal Splitters or Horizontal Shifters and to neighboring PEs.

Data alignment for the PEs is achieved by using 2 Vertical Rotators, 2 Horizontal Splitters and 2 Horizontal Shifters. Uppermost and lowermost 16 PEs in the PE array receive the new pixels from Horizontal Shifters. Rightmost and leftmost 16 PEs in the PE array receive the new pixels from Horizontal Splitters. 1 Vertical Rotator, 1 Horizontal Splitter and 1 Horizontal Shifter are used for data alignment of RF BSW pixels. 1 Vertical Rotator, 1 Horizontal Splitter and 1 Horizontal Shifter are used for data alignment of CF BSW pixels.

B. Memory Organization and Data Alignment

The memory organization of the proposed Bi-ME hardware is shown in Fig. 7 and Fig. 8. The proposed memory organization is based on the ladder shaped memory organization presented in [14] for implementing four step search (FSS) algorithm [16]. It enables access to the horizontally and vertically adjacent pixels in one clock cycle which is called 2-D random access.

Both horizontally and vertically adjacent pixels of reference MB and current MB can be read with one clock cycle latency using the proposed ladder-shaped BSW data organization. There are several differences between ladder shaped memory organization presented in [14] and the proposed memory organization. Each address of a BRAM in [14] contains one pixel, whereas in the proposed hardware each address of a BRAM contains four pixels. Since there is only one SW in a conventional ME algorithm, one SW is used in [14]. However, the proposed hardware has two BSWs which require more control overhead. 16 dual-port BRAMs in the FPGA are used to store the two 24x24 BSWs. All BRAMs contain pixels from both BSWs.

In Fig. 7, the numbers show which BRAM contains the corresponding pixel in BSWs. The proposed memory organization enables accessing the same BRAMs for getting the new BSW pixels of current MB and reference MB to the PE array. The control overhead of address signals used for reading from BRAMs and the control overhead of Vertical Rotators, Horizontal Splitters, Horizontal Shifters and the multiplexers in the PEs are reduced by symmetric arrangement of RF and CF pixels in the BRAMs.

As it can be seen in Fig. 7, while the spiral search pattern is in vertical direction the new pixels come from at most 5 different BRAMs, and while the spiral search pattern is in horizontal direction the new pixels come from all 16 BRAMs. Since the MB size is 16x16 pixels, using 16 BRAMs guarantees loading the pixels from different BRAMs for every horizontal search pattern of the spiral search.


Figure 5. All Connected PE Array

Figure 6. PE Architecture

Dual-port BRAMs are used in the proposed architecture, and the BSWs of the CF and RF are loaded in parallel. Two address signals are sent to the two ports of the BRAMs with an offset value. The address values for the BSWs are shown in Fig. 8. In Fig. 8, p and L denote pixel and line numbers, respectively, as shown in Fig. 7. The address values above the bold line are for the pixels in the CF BSW, and the address values below the bold line are for the pixels in the RF BSW. The offset values of the BRAMs are the address numbers below the bold line. For example, the offset value of BRAM 1 is 8. Therefore, for the fourth search location, while address 7 is sent to BRAM 1 for getting the CF pixel, in the same cycle address 15 is sent to BRAM 1 for getting the RF pixel. In this way, the cost of calculating the address values for the two BSWs is decreased.
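A minimal sketch of this dual-port addressing, under the assumption that each BRAM stores its CF BSW words at low addresses and its RF BSW words at a fixed per-BRAM offset above them (the offset of 8 for BRAM 1 is taken from the example above):

```python
def bram_addresses(cf_word_addr, bram_offset):
    """Addresses driven on the two ports of one dual-port BRAM in the same cycle:
    port A reads a CF BSW word, port B reads the matching RF BSW word.
    Only the CF address has to be computed; the RF address is obtained by
    adding the per-BRAM offset, which reduces the address generation cost."""
    return cf_word_addr, cf_word_addr + bram_offset

# Example from the text: BRAM 1 has offset 8, so for the fourth search location
# address 7 (CF pixel) and 7 + 8 = 15 (RF pixel) are issued in the same cycle.
assert bram_addresses(7, 8) == (7, 15)
```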

Two Vertical Rotators are used to rotate the BSW pixels read from the BRAMs in order to match them with the corresponding PEs in the PE array. Two Horizontal Splitters which are controlled by split pixel signal are used to select the required pixels from the outputs of the two Vertical Rotators. Each Horizontal Splitter is connected to one Vertical Rotator.

Horizontal Splitters are used to align the pixels while the search pattern is in horizontal direction. Two Horizontal Shifters which are controlled by shift amount signal are used to align the pixels while the search pattern is in vertical direction.

C. Implementation Results

The proposed Bi-ME hardware architecture is implemented in Verilog HDL. The Verilog RTL codes are synthesized to a Xilinx 2V8000FF1152-5 FPGA using Mentor Graphics Precision RTL 2005b and mapped to the same FPGA using Xilinx ISE 8.2i. The hardware implementations are verified with post place & route simulations using Mentor Graphics Modelsim 6.1c.

The hardware implementation consumes 11602 slices (22118 LUTs and 7275 DFFs), which is 24% of all slices of a 2V8000FF1517-5 FPGA. The PE array consumes 6810 slices (12544 LUTs), one of the two Vertical Rotators consumes 1178 slices (2048 LUTs), one of the two Horizontal Splitters consumes 294 slices (512 LUTs), one of the two Horizontal Shifters consumes 294 slices (512 LUTs), the Adder Tree consumes 1969 slices (2287 LUTs), and the remaining slices are used for the Comparator & MV Updater, the Control Unit and the multiplexers before the address ports of the BRAMs. In addition, 9216 bits of on-chip memory are used for storing the two BSWs, one in the CF and one in the RF. These 9216 bits are stored in 16 BRAMs.

The proposed hardware has an initial latency of 16 clock cycles before the search starts. Because of the [-4, +4] search range, there are 81 search locations. Therefore, in each of the 81 clock cycles following the initial latency, the PE array starts computing the WSBAD value of a new search location. The 9-stage pipeline adds 9 clock cycles of latency. Therefore, 16 + 81 + 9 = 106 clock cycles are required by the Bi-ME hardware to refine the MV of one MB with the proposed Bi-ME algorithm. The proposed hardware can work at 107 MHz on the same FPGA after place & route. Therefore, it is capable of processing 124 1920x1080 full HD frames per second, doubling the frame rate to 248 fps.
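As a rough consistency check of these figures (assuming one 106-cycle refinement per 16x16 MB and no overlap between consecutive MBs, which is an assumption about the scheduling rather than a statement from the paper):

$$\frac{107 \times 10^{6} \ \text{cycles/s}}{\left(1920 \times 1080 / 256\right) \ \text{MBs/frame} \times 106 \ \text{cycles/MB}} = \frac{107 \times 10^{6}}{8100 \times 106} \approx 124.6 \ \text{frames/s},$$

which is consistent with the reported 124 fps.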


Figure 7. Memory Organization in Proposed Bi-ME Hardware: (a) RF, (b) CF

Figure 8. Data Allocation in BRAMs

IV. CONCLUSION

In this paper, we proposed an adaptive Bi-ME algorithm for FRUC of HD video. The proposed algorithm can be used as a refinement step after a true ME algorithm. It refines the motion vector field between successive frames by employing a spiral search pattern and by adaptively assigning weights to candidate search locations. In addition, we proposed a high performance hardware architecture for implementing the proposed Bi-ME algorithm. The proposed hardware uses an efficient memory organization and a novel data reuse scheme in order to reduce the memory bandwidth and control overhead. The proposed hardware consumes 24% of the slices in a Xilinx 2V8000FF1517-5 FPGA. It can work at 107 MHz in the same FPGA and is capable of processing 124 1920x1080 full HD frames per second, doubling the frame rate to 248 fps.

REFERENCES

[1] G. de Haan, Video Processing for Multimedia Systems. Univ. Press Eindhoven, ISBN 90-9014015-8, 2001.

[2] O. A. Ojo and G. de Haan, "Robust motion-compensated video upconversion," IEEE Trans. Consum. Electron., vol. 43, no. 4, pp. 1045-1056, Nov. 1997.

[3] B.-W. Jeon, G.-I. Lee, S.-H. Lee, and R.-H. Park, "Coarse-to-fine frame interpolation for frame rate up-conversion using pyramid structure," IEEE Trans. Consum. Electron., vol. 49, no. 3, pp. 499-508, Aug. 2003.

[4] T.-Y. Kuo and C.-C. J. Kuo, "Motion-compensated interpolation for low-bit-rate video quality enhancement," in Proc. SPIE Visual Communications and Image Processing, vol. 3460, pp. 277-288, July 1998.

[5] A. Kaup and T. Aach, "Efficient prediction of uncovered background in interframe coding using spatial extrapolation," in Proc. ICASSP, vol. 5, pp. 501-504, 1994.

[6] R. J. Schutten and G. de Haan, "Real-time 2-3 pull-down elimination applying motion estimation/compensation in a programmable device," IEEE Trans. Consum. Electron., vol. 44, no. 3, pp. 501-504, Aug. 1998.

[7] B.-T. Choi, S.-H. Lee, and S.-J. Ko, "New frame rate up-conversion using bi-directional motion estimation," IEEE Trans. Consum. Electron., vol. 46, no. 3, pp. 603-609, Aug. 2000.

[8] B.-D. Choi, J.-W. Han, C.-S. Kim, and S.-J. Ko, "Motion-compensated frame interpolation using bilateral motion estimation and adaptive overlapped block motion compensation," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 4, pp. 407-416, Apr. 2007.

[9] S.-J. Kang, K.-R. Cho, and Y. H. Kim, "Motion compensated frame rate up-conversion using extended bilateral motion estimation," IEEE Trans. Consum. Electron., vol. 53, no. 4, pp. 1759-1767, Nov. 2007.

[10] S.-J. Kang, D.-G. Yoo, S.-K. Lee, and Y. H. Kim, "Multiframe-based bilateral motion estimation with emphasis on stationary caption processing for frame rate up-conversion," IEEE Trans. Consum. Electron., vol. 54, no. 4, pp. 1830-1838, Nov. 2008.

[11] G. de Haan, P. W. A. C. Biezen, H. Huijgen, and O. A. Ojo, "True-motion estimation with 3-D recursive search block matching," IEEE Trans. Circuits Syst. Video Technol., vol. 3, no. 5, pp. 368-379, Oct. 1993.

[12] R. W. Hall, "Efficient spiral search in bounded spaces," IEEE Trans. Pattern Anal. Mach. Intell., vol. 4, no. 2, pp. 208-215, March 1982.

[13] ITU-T Recommendation H.263 software implementation, Digital Video Coding Group, Telenor R&D, 1995.

[14] T. Chen, Y. Chen, S. Tsai, S. Chien, and L. Chen, "Fast algorithm and architecture design of low power integer motion estimation for H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 5, May 2007.

[15] G. de Haan, "Progress in motion estimation for consumer video format conversion," IEEE Trans. Consum. Electron., vol. 46, no. 3, pp. 449-459, Aug. 2000.

[16] L.-M. Po and W.-C. Ma, "A novel four-step search algorithm for fast block motion estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 3, pp. 313-317, Jun. 1996.
