1 MAPLD 2005/P200Chandrasekaran
Overview September 2004
FPGA Implementation of Reduced Bit Plane Motion Estimation
Shrutisagar Chandrasekaran, Abbes Amira and Faycal Bensaali
Outline
Research Objectives
Introduction
Reduced Bit-Plane Motion Estimation
Proposed Architecture
FPGA Implementations and Results
Conclusions
Future Work and Acknowledgments
Research Objectives
To evaluate and model the power consumption of FPGA-based designs at various levels of abstraction, and to develop and implement strategies for low-power, energy-efficient design
To develop efficient low-power architectures for image processing techniques such as Motion Estimation (ME)
To efficiently implement a reduced bit-plane motion estimation algorithm on FPGA using Handel-C for onboard video compression
Introduction
Block Matching (BM) is a widely used Motion Estimation (ME) technique that calculates motion vectors by minimising a cost function
Optimal prediction is obtained when a Full Search (FS) algorithm is performed
The FS algorithm is computationally intensive and requires a large number of I/O pins and a large bandwidth for real-time ME
An effective method for reducing the complexity of the ME architecture is to reduce the number of bit planes used for computing the motion vector
Introduction
Most of the motion information is in the 6th bit plane, and a significant amount of the motion information is also available in the 7th bit plane
The lower bit planes contain significantly less motion information as they represent the smooth areas of the image
Reduced bit-plane methods for ME, which use a range of arithmetic units and simple Boolean operations, lead to power- and area-efficient architectures
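As a minimal sketch of the idea above, the following Python snippet extracts a single bit plane from an 8-bit greyscale frame. It assumes planes are numbered from 0 (least significant) to 7 (most significant); the slides do not state their numbering convention, and the function name `bit_plane` and the sample pixel values are illustrative only.

```python
def bit_plane(frame, k):
    """Return plane k of an 8-bit image given as a list of rows of ints (0..255).

    Assumption: k = 0 is the least significant plane, k = 7 the most significant.
    """
    if not 0 <= k <= 7:
        raise ValueError("k must be in 0..7")
    # Shift the wanted bit down to position 0 and mask everything else away.
    return [[(pixel >> k) & 1 for pixel in row] for row in frame]


# Hypothetical 2x3 frame used only to exercise the function.
frame = [
    [200, 64, 13],
    [255, 0, 128],
]
plane6 = bit_plane(frame, 6)
plane7 = bit_plane(frame, 7)
```

Each resulting plane holds one bit per pixel, which is the representation the block-matching stage then works on.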
Reduced Bit-Plane ME
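Pseudo Code
vec=[];  % array of motion vectors (initialisation added; not shown on the slide)
for i=1:dim:M-dim+1,
  for j=1:dim:N-dim+1,
    ii=i+d; jj=j+d;
    % search window around the block (assumes previous_frame is padded by d)
    window=previous_frame(ii-d:ii+dim+d-1,jj-d:jj+dim+d-1);
    [m,n]=size(window);
    I=1; J=1;
    val=sum(sum((current_frame(i:i+dim-1,j:j+dim-1)-window(I:I+dim-1,J:J+dim-1)).^2));
    % full search: test every candidate position inside the window
    for l=1:m-dim+1,
      for k=1:n-dim+1,
        val_t=sum(sum(abs(current_frame(i:i+dim-1,j:j+dim-1)-window(l:l+dim-1,k:k+dim-1)).^2));
        if val_t<val,
          I=l; J=k; val=val_t;
        end
      end
    end
    I=I-d-1; J=J-d-1;  % convert the window index to a displacement in [-d, d]
    vec=[vec;I,J];
  end
end
Where:
dim : block size
d : border extension for window (square window)
vec : array of motion vectors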
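The pseudo code can be restated for 1-bit planes, where the squared difference of two pixels equals their XOR. The Python sketch below is a port under that assumption, not the authors' implementation; the function name `full_search_1bit` and the list-of-rows data layout are hypothetical, and `previous` is assumed to be padded by `d` pixels on every side, as the window indexing in the pseudo code implies.

```python
def full_search_1bit(current, previous, dim, d):
    """Full-search block matching on two 1-bit planes (lists of rows of 0/1).

    Assumption: `previous` is padded by d pixels on every side.
    Returns one (dy, dx) displacement per dim x dim block, in raster order.
    """
    rows, cols = len(current), len(current[0])
    vectors = []
    for i in range(0, rows - dim + 1, dim):
        for j in range(0, cols - dim + 1, dim):
            best = None
            best_vec = (0, 0)
            # Visit all (2d+1) x (2d+1) candidate positions in the window.
            for l in range(2 * d + 1):
                for k in range(2 * d + 1):
                    # For 1-bit pixels, XOR is the absolute (and squared) difference.
                    cost = sum(
                        current[i + r][j + c] ^ previous[i + l + r][j + k + c]
                        for r in range(dim)
                        for c in range(dim)
                    )
                    # Strict comparison: the first minimum in scan order wins.
                    if best is None or cost < best:
                        best, best_vec = cost, (l - d, k - d)
            vectors.append(best_vec)
    return vectors
```

Keeping the strict `<` comparison mirrors the `if val_t<val` test in the pseudo code, so ties resolve to the first candidate visited.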
Proposed Architecture
[Figure: block diagram of the proposed architecture. Recoverable labels: Control Unit; 16-bit inputs; bit ranges 31..0 and 15..0; Adder 0 to Adder 15 (5 bits each) feeding a 9-bit Adder; PSU (Processor Sub-Unit) and PSU 2; Comparator (9 bits); Least input location (9 bits); Registers; Comparator; New min (2 bits); Intermediate motion vectors (5 bits); Final motion vectors (5 bits)]
Proposed Architecture
The architecture exploits the massive parallelism available in hardware to reduce the computation time
The search window is stored on-chip in an array of 32 bit wide registers, the width of each register being equal to the size of the search window
The block size is taken to be 16x16 pixels at 1 bit per pixel, and the block is stored on-chip in an array of registers
Each Processing Sub-Unit (PSU) contains 256 Processor Elements (PEs), built from 256 XOR gates and 16 5-bit adders, which execute the block matching in parallel and estimate the SAD (Sum of Absolute Differences)
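The PSU datapath described above can be modelled in software as follows. This is a behavioural sketch only: it assumes each 16-pixel row is packed into one integer, and the function name `psu_sad` is hypothetical. The 9-bit final adder is taken from the architecture diagram labels.

```python
def psu_sad(block_rows, candidate_rows):
    """Behavioural model of one PSU for a 16x16 block at 1 bit per pixel.

    block_rows, candidate_rows: 16 ints, each holding one 16-pixel row
    (packing a row into an int is an assumption of this sketch).
    """
    row_sums = []
    for b, c in zip(block_rows, candidate_rows):
        diff = (b ^ c) & 0xFFFF  # 16 XOR processing elements for this row
        # Row adder: counts mismatches, range 0..16, which fits in 5 bits.
        row_sums.append(bin(diff).count("1"))
    # Final adder: total mismatches, range 0..256, which fits in 9 bits.
    return sum(row_sums)
```

The bit widths follow from the ranges: a row of 16 XOR outputs sums to at most 16, and 16 such rows sum to at most 256.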
Proposed Architecture
Two PSUs are used to cover the entire search window by means of bitwise shifts of the contents of the search window in the horizontal and vertical directions
The intermediate values of the motion vectors are stored in an on-chip array, with one location for each PSU
At the end, the global values of motion vectors are obtained using the intermediate values and the output of the comparators
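The shift-and-compare scheme above can be sketched in Python as follows. The slides do not say how candidate positions are divided between the two PSUs, so the even/odd row split used here is purely an assumption, as are the function names and the choice to report raw (row offset, bit shift) pairs rather than signed motion vectors.

```python
def candidate_rows(window_rows, top, shift):
    """Select a 16x16 candidate from 32-bit window rows by row offset and bit shift."""
    return [(row >> shift) & 0xFFFF for row in window_rows[top:top + 16]]


def sad_1bit(a_rows, b_rows):
    """Count mismatching pixels between two blocks of packed 16-bit rows."""
    return sum(bin(a ^ b).count("1") for a, b in zip(a_rows, b_rows))


def search_with_two_psus(block_rows, window_rows):
    """window_rows: 32 ints, one per 32-bit window row.

    Returns (sad, top, shift) for the best candidate over 16 x 16 positions,
    matching the 16 offsets per axis implied by the stated range [-8, 7].
    """
    intermediate = []
    for psu in range(2):  # assumption: PSU 0 takes even row offsets, PSU 1 odd ones
        best = None
        for top in range(psu, 16, 2):
            for shift in range(16):
                sad = sad_1bit(block_rows, candidate_rows(window_rows, top, shift))
                if best is None or sad < best[0]:
                    best = (sad, top, shift)
        intermediate.append(best)  # one intermediate result per PSU
    return min(intermediate)  # final comparator across the per-PSU minima
```

Each PSU keeps only its own running minimum, so the final stage needs a single comparison between the two intermediate results.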
Proposed Architecture
Architecture   No. of PEs   Throughput          Search range
Proposed       256          1 MV/308 cycles     [-8, 7]
[1]            1024         1 MV/256 cycles     [-16, 15]
[2]            256          1 MV/496 cycles     [-8, 7]
[3]            256          1 MV/2209 cycles    [-8, 7]
The proposed architecture yields improved performance metrics when compared to other existing work
[1] Y-H. Yeh and C-Y. Lee, IEEE Trans. VLSI Syst. 7, 345 (1999)
[2] T. Komarek and P. Pirsch, IEEE Trans. Circuits Syst. 36, 1301 (1989)
[3] C-H. Hsieh and T-P. Lin, IEEE Trans. Circuits Syst. Video Technol. 2, 169 (1992)
FPGA Implementations and Results
To verify the performance of the proposed architecture, designs have been prototyped on the Celoxica RC1000 board, which contains the Xilinx XCV2000E FPGA
Available on-chip logic resources include:
- Slices : 19200
- CLB Array : 80 x 120
- Block RAM : 655,360 bits
- Distributed RAM : 614,400 bits
The RC1000 has 4 memory banks which communicate with the host by means of DMA transfers
FPGA Implementations and Results
Design Flow
FPGA Implementations and Results
Handel-C adds constructs to ANSI-C to enable DK to directly implement hardware
Fully synthesizable HW programming language based on ANSI-C
Implements C algorithm direct to optimized FPGA or outputs RTL from C
[Figure: overlap between ANSI-C and Handel-C]
Software-only ANSI-C constructs: recursion, side effects, standard libraries, malloc
Majority of ANSI-C constructs supported by DK: control statements (if, switch, case, etc.), integer arithmetic, functions, pointers, basic types (structures, arrays, etc.), #define, #include
Handel-C additions for hardware: parallelism, timing, interfaces, clocks, macro pre-processor, RAM/ROM, shared expression, communications, Handel-C libraries, FP library, bit manipulation
FPGA Implementations and Results
[Figure: implementation on the RC1000 board. Recoverable labels: 8x8 Blocks; 16 pixels x 16 pixels; Bank0, Bank1, Bank2, Bank3; XCV2000E; Reduced Bit-Plane ME; Motion Vectors]
FPGA Implementations and Results
The bit-plane values from the current frame are sent from the host to SRAM Bank 0, and those from the previous frame are sent as 16-bit values to SRAM Bank 1
The motion vectors are computed by the ME core and stored in SRAM Bank 3
The host application reads the motion vectors and generates the predicted image in real time
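The transfer of bit-plane data as 16-bit values can be illustrated with a small host-side packing routine. This is a sketch under stated assumptions: the bit order (first pixel in the most significant bit) and the function name `pack_row_16` are not taken from the slides, and the actual RC1000 DMA calls are deliberately not shown.

```python
def pack_row_16(bits):
    """Pack a row of 0/1 pixels into 16-bit words.

    Assumption: the first pixel of each group goes into the most significant bit.
    """
    if len(bits) % 16:
        raise ValueError("row length must be a multiple of 16")
    words = []
    for start in range(0, len(bits), 16):
        word = 0
        for bit in bits[start:start + 16]:
            word = (word << 1) | (bit & 1)  # append the next pixel as the new LSB
        words.append(word)
    return words
```

Packing 16 pixels per word means one 16-pixel block row occupies a single memory word, which is consistent with the 16-bit transfers mentioned above.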
FPGA Implementations and Results
The proposed architecture is area efficient, as the motion estimation is performed on a single bit plane, requiring compact logic and a greatly reduced on-chip memory size
The architecture is efficient and compact, and can be massively parallelised, as each PE consists of a simple 1-bit XOR gate only
Memory access is greatly reduced because only a single bit plane is used, saving a considerable amount of I/O power
FPGA Implementations and Results
This, along with architectural-level optimisations including parallelism and pipelining, yields a power-efficient implementation
The design is implemented on the Celoxica RC1000 board equipped with a Xilinx XCV2000E FPGA, and is also synthesised for a Xilinx QPro Virtex-II FPGA
Results in terms of power/area/maximum frequency show that using reduced bit planes instead of full resolution images drastically reduces the FPGA resources used
FPGA Implementations and Results
Various performance metrics of the RBFSBM algorithm implemented on the Virtex-E and the QPro Virtex-II FPGAs
Performance metric        Virtex-E   QPro Virtex-II
Area occupied (slices)    1500       1488
Max frequency (MHz)       43.57      76.247
Max power (mW)            432.65     227.31
Energy/CIF frame (mJ)     1.934      1.40
Max throughput (FPS)      89.305     173
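As a rough cross-check of the throughput figures, frame rate can be estimated from clock frequency, cycles per motion vector, and the number of blocks per frame. The sketch below assumes 308 cycles per motion vector (from the architecture comparison), a CIF frame of 352x288 pixels, and 8x8 blocks (suggested by the "8x8 Blocks" label in the implementation figure); the function name `frames_per_second` is hypothetical, and the model is not claimed to reproduce every table entry.

```python
def frames_per_second(clock_hz, cycles_per_mv, frame_w, frame_h, block):
    """Estimate frame rate when one motion vector is produced per block."""
    blocks = (frame_w // block) * (frame_h // block)
    return clock_hz / (cycles_per_mv * blocks)


# Assumptions: CIF is 352x288, blocks are 8x8, 308 cycles per motion vector.
fps = frames_per_second(43.57e6, 308, 352, 288, 8)
```

Under these assumptions the estimate at 43.57 MHz comes out at about 89.3 frames per second, close to the Virtex-E entry in the table.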
Conclusions
A reduced bit plane architecture for full search block matching has been proposed
The proposed architecture is low power, area efficient and suitable for VLSI/FPGA implementation
The developed architecture can be used for space applications such as onboard video compression, video conferencing, etc.
Future work and Acknowledgments
Develop a complete on-chip compression engine for real-time video compression, with applications ranging from onboard satellite compression to video conferencing
Explore the effect of algorithmic, architectural and RTL-level optimisations to minimise power consumption
Acknowledgments
Celoxica (Mr. Roger Gook) and EPSRC for supporting this work