overview september 2004

MAPLD 2005/P200, Chandrasekaran


FPGA Implementation of Reduced Bit Plane Motion Estimation

Shrutisagar Chandrasekaran, Abbes Amira and Faycal Bensaali


Outline

Research Objectives

Introduction

Reduced Bit-Plane Motion Estimation

Proposed Architecture

FPGA Implementations and Results

Conclusions

Future Work and Acknowledgments


Research Objectives

To evaluate and model the power consumption of FPGA-based designs at various levels of abstraction, and to develop and implement strategies for low-power, energy-efficient design

To develop efficient low-power architectures for image processing techniques such as Motion Estimation (ME)

To efficiently implement a reduced bit plane motion estimation algorithm on FPGA using Handel-C for onboard video compression


Introduction

Block Matching (BM) is a widely used Motion Estimation (ME) technique that calculates motion vectors by minimising a cost function

Optimal prediction is obtained when a Full Search (FS) algorithm is performed

The FS algorithm is computationally intensive and requires a large number of I/O pins and high bandwidth for real-time ME

An effective way to reduce the complexity of an ME architecture is to reduce the number of bit planes used for computing the motion vector

Most of the motion information is in the 6th bit plane, and a significant amount is also available in the 7th bit plane

The lower bit planes contain significantly less motion information as they represent the smooth areas of the image

Reduced bit-plane methods for ME, which use a small range of arithmetic units and simple Boolean operations, lead to power- and area-efficient architectures
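As an illustration of what a single bit plane is, the following minimal Python sketch extracts one plane from an 8-bit grayscale frame. The function name bit_plane and the sample pixel values are hypothetical, and the index k is assumed to count from 0 at the least significant bit; the slides do not state which numbering convention they use for the "6th" and "7th" planes.

```python
def bit_plane(frame, k):
    """Return bit plane k of an 8-bit grayscale frame.

    frame is a list of rows of integers in 0..255. k = 0 is assumed to be
    the least significant bit (an assumption, not taken from the slides).
    The result has the same shape, with one bit (0 or 1) per pixel.
    """
    return [[(pixel >> k) & 1 for pixel in row] for row in frame]


# Hypothetical 2x4 frame: a bright region on the left, a dark one on the right.
frame = [
    [200, 201, 60, 61],
    [198, 199, 62, 63],
]

print(bit_plane(frame, 6))  # [[1, 1, 0, 0], [1, 1, 0, 0]]
print(bit_plane(frame, 0))  # [[0, 1, 0, 1], [0, 1, 0, 1]]
```

In this small example the higher plane follows the edge between the two regions, while the lowest plane only toggles with small intensity changes.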


Reduced Bit-Plane ME

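Pseudo Code

% M, N: size of current_frame
% previous_frame is assumed to be padded by d pixels on every side
vec = [];                                   % one row [I, J] per block
for i = 1:dim:M-dim+1
    for j = 1:dim:N-dim+1
        ii = i + d;  jj = j + d;
        % search window centred on the current block position
        window = previous_frame(ii-d:ii+dim+d-1, jj-d:jj+dim+d-1);
        [m, n] = size(window);
        blk = current_frame(i:i+dim-1, j:j+dim-1);
        I = 1;  J = 1;
        val = sum(sum((blk - window(I:I+dim-1, J:J+dim-1)).^2));
        % full search: test every candidate position in the window
        for l = 1:m-dim+1
            for k = 1:n-dim+1
                val_t = sum(sum((blk - window(l:l+dim-1, k:k+dim-1)).^2));
                if val_t < val
                    I = l;  J = k;  val = val_t;
                end
            end
        end
        % convert the window position to a displacement relative to the block
        I = I - d - 1;  J = J - d - 1;
        vec = [vec; I, J];
    end
end

Where
dim : block size
d : border extension for window (square window)
vec : array of motion vectors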



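Proposed Architecture

[Figure: block diagram of the proposed architecture. Recoverable labels: Control Unit; registers with bit ranges 31..0 and 15..0 (16 bits); PSU (Processor Sub-Unit) and PSU 2; Adder 0 to Adder 15 (5 bits each) feeding an Adder (9 bits); Comparator (9 bits) and Register producing "New min" and "Least input location"; Intermediate motion vectors (5 bits); Final motion vectors (5 bits)]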



The architecture exploits the massive parallelism available in hardware to reduce the computation time

The search window is stored on-chip in an array of 32-bit-wide registers, the width of each register being equal to the size of the search window

The block size is taken to be 16x16 bits (1 bit per pixel), and the block is stored on-chip in an array of registers

Each Processor Sub-Unit (PSU) contains 256 Processor Elements (PEs), that is, 256 XOR gates plus 16 5-bit adders, which execute the block matching in parallel and estimate the SAD (Sum of Absolute Differences)
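With one bit per pixel, the absolute difference of two pixels equals their XOR, so the SAD of a 16x16 block reduces to counting mismatching bits. The following Python sketch models that arithmetic in software; it is not the Handel-C design, and the function names row_mismatches and block_sad are hypothetical. Each row is assumed to be held as one 16-bit integer.

```python
def row_mismatches(block_row, candidate_row):
    """Count differing pixels in one 16-pixel row: XOR, then count the ones.

    The result lies in 0..16, which fits in 5 bits.
    """
    return bin((block_row ^ candidate_row) & 0xFFFF).count("1")


def block_sad(block, candidate):
    """SAD of a 16x16 one-bit block against a candidate of the same shape.

    Both arguments are lists of 16 integers, one 16-bit row each.
    The result lies in 0..256, which fits in 9 bits.
    """
    row_sums = [row_mismatches(b, c) for b, c in zip(block, candidate)]
    return sum(row_sums)


block = [0xFFFF] * 16
print(block_sad(block, block))          # 0   (identical blocks)
print(block_sad(block, [0x0000] * 16))  # 256 (every pixel differs)
print((16).bit_length(), (256).bit_length())  # 5 9
```

The worst-case values (16 per row, 256 per block) are consistent with the 5-bit row adders and the 9-bit sum described for the PSU.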



Two PSUs are used to cover the entire search window by bitwise shifting the contents of the search window in the horizontal and vertical directions

The intermediate values of the motion vectors are stored in an on-chip array, with one location for each PSU

At the end, the global values of motion vectors are obtained using the intermediate values and the output of the comparators
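The search described above can be sketched in Python as follows. This is a software model under stated assumptions, not the authors' implementation: the window is taken to be 32 rows of 32 bits with the most significant bit as the leftmost pixel, the search range is [-8, 7] in both directions, and the way the work is divided between the two PSUs (upper and lower half of the vertical offsets) is a guess made for illustration. The names sad_at, psu_search and full_search are hypothetical.

```python
BLOCK = 16   # block is 16x16 pixels, one bit per pixel
WINDOW = 32  # search window is 32x32 pixels, one 32-bit integer per row


def sad_at(block, window, top, left):
    """SAD between the block and the window candidate at (top, left)."""
    shift = WINDOW - BLOCK - left  # align the candidate's 16 columns with the block
    total = 0
    for r in range(BLOCK):
        candidate_row = (window[top + r] >> shift) & 0xFFFF
        total += bin(block[r] ^ candidate_row).count("1")
    return total


def psu_search(block, window, tops):
    """One PSU: scan the given vertical offsets and all 16 horizontal offsets.

    Returns the intermediate result (cost, dx, dy) with the lowest cost.
    """
    best = None
    for top in tops:
        for left in range(16):
            cost = sad_at(block, window, top, left)
            if best is None or cost < best[0]:
                best = (cost, left - 8, top - 8)
    return best


def full_search(block, window):
    """Combine the intermediate results of two PSUs into the final vector."""
    intermediate = [
        psu_search(block, window, range(0, 8)),   # assumed share of PSU 1
        psu_search(block, window, range(8, 16)),  # assumed share of PSU 2
    ]
    return min(intermediate, key=lambda result: result[0])


# Block with a single set pixel in its top-left corner.
block = [0x8000] + [0] * 15
# Window containing that pixel at row 11, column 5 (displacement dx=-3, dy=3).
window = [0] * WINDOW
window[11] = 1 << (WINDOW - 1 - 5)

print(full_search(block, window))  # (0, -3, 3)
```

Each horizontal offset corresponds to one bit shift of the window rows, and each vertical offset to moving down one register, which mirrors the shift-based coverage of the window described above.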



Architecture   No. of PEs   Throughput          Search range
Proposed       256          1 MV/308 cycles     [-8, 7]
[1]            1024         1 MV/256 cycles     [-16, 15]
[2]            256          1 MV/496 cycles     [-8, 7]
[3]            256          1 MV/2209 cycles    [-8, 7]

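With the same number of PEs (256) and the same search range [-8, 7], the proposed architecture needs 308 cycles per motion vector, compared with 496 cycles for [2] and 2209 cycles for [3]; [1] needs 256 cycles but uses 1024 PEs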

[1] Y-H. Yeh and C-Y. Lee, IEEE Trans. VLSI Syst. 7, 345 (1999)
[2] T. Komarek and P. Pirsch, IEEE Trans. Circuits Syst. 36, 1301 (1989)
[3] C-H. Hsieh and T-P. Lin, IEEE Trans. Circuits Syst. Video Technol. 2, 169 (1992)



In order to verify the performance of the proposed architectures, designs have been prototyped on the Celoxica RC1000 board containing the Xilinx XCV2000E FPGA

Available on-chip logic resources include:
- Slices : 19200
- CLB Array : 80 x 120
- Block RAM : 655,360 bits
- Distributed RAM : 614,400 bits

The RC1000 has 4 memory banks which communicate with the host by means of DMA transfers



Design Flow



Handel-C adds constructs to ANSI-C to enable DK to directly implement hardware

Fully synthesizable HW programming language based on ANSI-C

Implements C algorithm direct to optimized FPGA or outputs RTL from C

Majority of ANSI-C constructs supported by DK: control statements (if, switch, case, etc.), integer arithmetic, functions, pointers, basic types (structures, arrays, etc.), #define, #include

Handel-C additions for hardware: parallelism, timing, interfaces, clocks, macro pre-processor, RAM/ROM, shared expression, communications, Handel-C libraries, FP library, bit manipulation

Software-only ANSI-C constructs: recursion, side effects, standard libraries, malloc



[Figure: implementation on the RC1000. Recoverable labels: 8x8 blocks; 16 pixels x 16 pixels; Bank0, Bank1, Bank2, Bank3; XCV2000E running the Reduced Bit-Plane ME core; Motion Vectors]


The bit-plane values from the current frame are sent from the host to SRAM Bank 0, and those from the previous frame are sent as 16-bit values to SRAM Bank 1

The motion vectors are computed by the ME core and stored in SRAM Bank 3

The host application reads the motion vectors and generates the predicted image in real time
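On the host side, sending a bit plane as 16-bit values implies packing 16 one-bit pixels into each word. A minimal Python sketch of such packing is given below. The function name pack_row is hypothetical, and placing the leftmost pixel in the most significant bit is an assumption, since the slides do not specify the bit order or the transfer API.

```python
def pack_row(bits, word_bits=16):
    """Pack a row of 0/1 pixels into integers of word_bits bits each.

    The leftmost pixel of each group goes into the most significant bit
    (an assumed convention). The row length must be a multiple of word_bits.
    """
    if len(bits) % word_bits != 0:
        raise ValueError("row length must be a multiple of word_bits")
    words = []
    for start in range(0, len(bits), word_bits):
        word = 0
        for bit in bits[start:start + word_bits]:
            word = (word << 1) | (bit & 1)
        words.append(word)
    return words


# A 32-pixel row with only its first and last pixels set.
row = [1] + [0] * 30 + [1]
print([hex(w) for w in pack_row(row)])  # ['0x8000', '0x1']
```

Packing this way sends one word per 16 pixels of the selected plane instead of one byte per pixel, which is in line with the reduced memory access claimed for the single-bit-plane approach.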



The proposed architecture is area efficient, as the motion estimation is performed on a single bit plane, requiring compact logic and greatly reduced on-chip memory size

The architecture is efficient and compact, and can be massively parallelised because the PEs consist only of simple 1-bit XOR gates

Memory access is greatly reduced because only a single bit plane is used, saving a considerable amount of I/O power




This, along with architectural-level optimisations including parallelism and pipelining, yields a power-efficient implementation

The implementation is carried out on the Celoxica RC1000 board equipped with a Xilinx XCV2000E FPGA, and the design is also synthesised for a Xilinx QPro Virtex-II FPGA

Results in terms of power/area/maximum frequency show that using reduced bit planes instead of full resolution images drastically reduces the FPGA resources used



Various performance metrics of the RBFSBM algorithm implemented on the Virtex-E and the QPro Virtex-II FPGAs

Performance metric        Virtex-E   QPro Virtex-II
Area occupied (slices)    1500       1488
Max frequency (MHz)       43.57      76.247
Max power (mW)            432.65     227.31
Energy/CIF frame (mJ)     1.934      1.40
Max throughput (FPS)      89.305     173


Conclusions

A reduced bit plane architecture for full search block matching has been proposed

The proposed architecture is low power, area efficient and suitable for VLSI/FPGA implementation

The developed architecture can be used for space applications such as onboard video compression, video conferencing, etc.


Future work and Acknowledgments

Develop a complete on-chip compression engine for real-time video compression, with applications ranging from onboard satellite compression to video conferencing

Explore the effect of algorithmic, architectural and RTL-level optimisations to minimise power consumption

Acknowledgments

Celoxica (Mr. Roger Gook) and EPSRC for supporting this work
