


Abstract — This work presents a flexible and scalable motion estimation processor capable of supporting the processing requirements of high-definition (HD) video in the H.264 advanced video codec and suitable for FPGA implementation. The core can be programmed using a C-style syntax optimized for implementing fast block matching algorithms. The development tools compile the algorithm source code to the processor instruction set and explore the processor configuration space. A large configuration space enables the designer to generate different processor microarchitectures, varying the type and number of integer and fractional pel execution units together with other functional units. All these processor instantiations remain binary compatible, so recompilation of the motion estimation algorithm is not required. Thanks to this optimization process it is possible to match the processing requirements of the selected motion estimation algorithm and options to the hardware microarchitecture, leading to a very efficient implementation.

Index Terms — Video coding, motion estimation, reconfigurable processor, H.264, FPGA.

I. INTRODUCTION

The emergence of new advanced coding standards such as VC-1, AVS and especially H.264 with its multiple coding tools [1] has introduced new challenges in the motion estimation process used in inter-frame prediction. While previous standards such as MPEG-2 could only vary the search strategy, H.264 adds the freedom of using multiple motion vector candidates, sub-pixel resolutions, multiple reference frames, multiple partition sizes and rate-distortion optimization as tools to optimize the inter-prediction process. The potential complexity introduced by these tools operating on large reference areas containing lengthy motion vectors makes the full-search approach, which exhaustively tries each possible combination, less attractive. A flexible, reconfigurable and programmable motion estimation processor such as the one proposed in this work is well poised to address these challenges by fitting the core microarchitecture to the inter-frame prediction tools and algorithm of the selected H.264 encoding configuration. The concept was briefly introduced in [2] and is further developed and improved in this work.

The paper is organized as follows. Section II reviews relevant work in the field of hardware architectures for motion estimation, concentrating on reconfigurable and programmable solutions. Section III motivates the presented work by showing the effects of different motion estimation options and algorithms in high-definition video coding. Section IV presents the programming model and tools developed to explore the software/hardware design space of advanced motion estimation. Section V describes the processor microarchitecture details and Section VI analyses the complexity/performance/power of the proposed solution. Finally, Section VII concludes this paper.

II. MOTION ESTIMATION HARDWARE REVIEW

Full-search algorithms have been the preferred option for hardware implementations due to their regular dataflow, which makes them well suited to architectures using 1-D or 2-D systolic array principles with simple control and high hardware utilization. Full-search architectures implement SAD reuse strategies that make them especially suited to supporting the variable block sizes used in H.264. By combining the results of smaller blocks into larger blocks, only small increases in gate count are required over their conventional fixed-block counterparts, with little bearing on throughput, critical path or memory bandwidth. On the other hand, the hardware requirements needed to obtain enough parallelism to check all the possible search points in real time are very large. This is even more challenging if large search ranges, rate-distortion optimization and fractional-pel search are considered. A recent example of a high-performance integer-only full-search architecture is presented in []. This work considers a relatively large search range of 63×48 pixels and can vary the number of pixel processing units. A configuration using 16 pixel processing units can support 62 fps at 1920×1080 video resolution clocking at 200 MHz. Each pixel processing unit works on a different macroblock in parallel, obtaining 41 motion vectors (all block sizes) in parallel. By working on 16 adjacent 16x16-pixel macroblocks in parallel, data reuse can be exploited. The architecture needs around 154K LUTs implemented in a Virtex-5 XCV5LX330.
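The SAD-reuse idea described above can be sketched in a few lines: compute the sixteen 4x4 SADs of a macroblock once, then derive the larger-partition costs by addition alone. This is an illustrative Python sketch under our own naming, not the systolic-array implementation used by these architectures:

```python
def sad_4x4_blocks(cur, ref):
    """Compute the SAD of each 4x4 sub-block of a 16x16 macroblock.
    cur and ref are 16x16 lists of lists of pixel values."""
    sads = [[0] * 4 for _ in range(4)]
    for by in range(4):
        for bx in range(4):
            s = 0
            for y in range(4):
                for x in range(4):
                    s += abs(cur[by * 4 + y][bx * 4 + x] - ref[by * 4 + y][bx * 4 + x])
            sads[by][bx] = s
    return sads

def merge_sads(sads4x4):
    """Reuse the sixteen 4x4 SADs to derive the 8x8 and 16x16 partition
    costs by addition only, instead of recomputing pixel differences."""
    sads8x8 = [[sads4x4[2 * by][2 * bx] + sads4x4[2 * by][2 * bx + 1] +
                sads4x4[2 * by + 1][2 * bx] + sads4x4[2 * by + 1][2 * bx + 1]
                for bx in range(2)] for by in range(2)]
    sad16x16 = sum(sum(row) for row in sads8x8)
    return sads8x8, sad16x16
```

The extra hardware for variable block sizes is then essentially the adder tree that performs `merge_sads`, which is why the gate-count increase over a fixed-block design is small.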

In an effort to reduce the complexity of the motion search process, architectures for fast ME algorithms have been proposed, as seen in [9]. The challenges the designer faces in this case include unpredictable data flow, irregular memory access, low hardware utilization and sequential processing. Fast ME approaches use a number of techniques to reduce the

Manuscript received January 30, 2009. This work was supported by the UK EPSRC under grants EP/D011639/1 and EP/E061164/1. Jose Luis Nunez-Yanez, Atukem Nabina and George Vafiadis are with Bristol University, Department of Electronic Engineering, Bristol, UK (phone: 0117 3315128; fax: 0117 954 5206; e-mail: [email protected], [email protected], [email protected]). Eddie Hung is with ECE at the University of British Columbia, Vancouver, Canada (e-mail: [email protected]).

Cogeneration of Fast Motion Estimation Processors and Algorithms for Advanced Video Coding

Jose L. Nunez-Yanez, Atukem Nabina, Eddie Hung, George Vafiadis


number of search positions, and this inevitably affects the regularity of the data flow, eliminating one of the key advantages of systolic arrays: their inherent ability to exploit data locality for reuse. This is evident in the work done in [10], which compares a number of fast motion algorithms mapped onto a systolic array and finds that the required memory bandwidth does not scale at anywhere near the same rate as the gate count. A number of architectures have been proposed which follow the programmable approach, offering the flexibility of not having to define the algorithm at design time. The application-specific instruction-set processor (ASIP) presented in [11] uses a specialized data path and a minimal instruction set, similar to our own work. The instruction set consists of only 8 instructions operating on a RISC-like, register-register architecture designed for low-power devices. There is the flexibility to execute arbitrary block matching algorithms, and the basic SAD16 instruction computes the difference between two sets of sixteen pixels; in the proposed microarchitecture it takes sixteen clock cycles to complete using a single 8-bit SAD unit. The implementation using a standard-cell 0.13 μm ASIC technology shows that this processor enables real-time motion estimation for QCIF, operating at just 12.5 MHz to achieve low power consumption. An FPGA implementation using a Virtex-II Pro device is also presented with a complexity of 2052 slices and a clock of 67 MHz. In this work scaling can be achieved by varying the width of the SADU (the ALU equivalent for calculating SADs); by design, the maximum achievable parallelism would be reached if the SAD for an entire row could be calculated in a single clock cycle, in a 256-bit SIMD (Single Instruction Multiple Data) manner.

The programmable concept is taken a step further in [12]. This motion estimation core is also oriented to fast motion estimation and supports sub-pixel interpolation and variable block sizes. The interpolation is done on demand using a simplified non-standard filter, which will cause a mismatch between the coder output and a standard-compliant decoder. The core uses a technique to terminate the calculation of the macroblock SAD when this value is larger than a previously calculated SAD, but it does not include a Lagrangian-based RD optimization technique [13]. Scalability is limited since a single functional unit is available, although a number of configuration options are available to match the architecture to the motion algorithm, such as algorithm-specific instructions. The SAD instruction, comparable to our own pattern instruction, operates on 16 pixel pairs simultaneously, and 16 instructions are needed to complete a macroblock search point, taking up to 20 clock cycles. The processor uses 2127 slices in an implementation targeting a Virtex-II device with a maximum clock rate of 50 MHz. This implementation can sustain processing of 1024x750 frames at 30 frames per second. Xilinx has recently developed a processor capable of supporting high-definition 720p at 50 frames per second, operating at 225 MHz [14] in a Virtex-4 device with a throughput of 200,000 macroblocks per second. This Virtex-4 implementation uses a total of around 3000 LUTs, 30 DSP48 embedded blocks and 19 block RAMs. The algorithm is fixed and based on a full search of a

4x3 region around 10 user-supplied initial predictors, for a total of 120 candidate positions chosen from a search area of 112x128 pixels. The core contains a total of 32 SAD engines which, for a given motion vector candidate, continuously compute the 12 search positions that surround it.

III. THE CASE FOR FAST MOTION ESTIMATION HARDWARE

Most of the available literature indicates that full-search algorithms deliver the best performance in terms of PSNR and bit rate compared with fast motion estimation algorithms. However, the research done in papers such as [19-20] suggests that a well-designed fast block matching algorithm can not only speed up the motion estimation process but also improve the rate-distortion performance in state-of-the-art video coders such as H.264. The introduction of motion vector candidates as starting search points obtained from neighboring macroblocks, together with early termination techniques, tends to produce smoother motion vectors with a smaller delta between the predicted motion vector and the selected motion vector. This in turn translates into fewer bits needed to code the motion vectors over a large range of macroblocks, and it can produce better results than full-search algorithms that check all the possible motion vectors available in the search range and select the one that minimizes a decision criterion such as the sum of absolute differences (SAD). Additionally, the effective costing of the motion vector with a rate-distortion-optimization (RDO) Lagrangian technique is not generally considered in full-search architectures, although it can typically obtain a 10% reduction in bit rate for the same quality. Figs. 2, 3 and 4 explore the rate-distortion performance of 1080p high-definition sequences extracted from [17] with varying degrees of motion complexity (high in Crowdrun, medium in Pedestrian and low in Sunflower). The algorithm selected is the popular hexagon-based fast search strategy for the integer search, followed by a diamond-based search for the fractional search, as available in x264. The search area has been increased to 112x128 pixels as used in our own core. The figures evaluate the performance of full search at the integer-pel level as a reference.
The full-search algorithm considered works in the traditional way, checking all the points and selecting the one with the lowest SAD without any Lagrangian optimization technique. It can be seen that it performs worse than the equivalent fast-search full-pel technique without Lagrangian optimization. This is especially the case for the Pedestrian and Sunflower sequences, which correspond to smooth object motion. These two sequences also show that enabling the Lagrangian optimization in the fast-motion integer-pel-only option is beneficial. This is not the case for the Crowdrun sequence, which contains more local motion components that do not benefit from this optimization. Fractional-pel outperforms the integer-pel modes in all the sequences. Finally, using sub-blocks offers little benefit for Pedestrian and Sunflower since the motion complexity is lower and a single larger block can capture it correctly. From this analysis it can be concluded that different video sequences benefit differently from the different options available as part of motion estimation. Reconfigurable and programmable hardware can be used to better match the


motion estimation algorithm and the hardware the algorithm runs on.
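The Lagrangian motion-vector costing discussed above can be illustrated with a small sketch. The exp-Golomb code length used here as the rate term R(mv) is a common approximation and an assumption on our part, not the exact rate model of any particular encoder; lambda is supplied by the caller.

```python
def ue_bits(v):
    """Length in bits of the unsigned exp-Golomb code of v (an
    approximation of the motion vector delta rate)."""
    return 2 * (v + 1).bit_length() - 1

def mv_bits(mv, pred):
    """Approximate bits to code a motion vector as the exp-Golomb
    lengths of its two delta components, with the usual signed-to-
    unsigned mapping (positive d -> 2d-1, non-positive d -> -2d)."""
    bits = 0
    for d in (mv[0] - pred[0], mv[1] - pred[1]):
        code = 2 * d - 1 if d > 0 else -2 * d
        bits += ue_bits(code)
    return bits

def rd_cost(sad, mv, pred_mv, lam):
    """Lagrangian cost J = SAD + lambda * R(mv - pred_mv): a vector
    close to its prediction can win over a lower-SAD distant one."""
    return sad + lam * mv_bits(mv, pred_mv)
```

For example, with lam = 4, a candidate at the predicted position with SAD 100 costs 108, while a candidate two pels away with SAD 95 costs 119, so the smoother vector wins even though its SAD is higher.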

IV. INSTRUCTION SET AND PROGRAMMING MODEL

Fast motion estimation algorithms have not been standardized, and multiple trade-offs between algorithm complexity and quality of the results can be made, making a programmable architecture beneficial. The following sections present the microarchitecture and programming model of the hardware/software solution, named LiquidMotion, developed according to the principles of configurability and programmability.

A. LiquidMotion Instruction Set Architecture

The instruction set should be able to express the inherent parallelism available in the motion estimation algorithm in a simple way, to minimize the overheads of instruction fetch and decode and keep the execution units of the core as busy as possible. The number of execution units available in the proposed processor varies depending on the implementation, so it is important that binary compatibility between different hardware implementations is achieved: a program only needs to be compiled once and can be executed on any implementation. The instruction set architecture consists of a total of 9 different instructions and is illustrated in Fig. 5. There are two arithmetic instructions for integer and fractional pattern searches, a total of 6 control instructions that change the program flow, and one mode instruction that sets the partition mode and reference frame to be used by the arithmetic instructions. The arithmetic instructions exploit the most obvious form of parallelism, which is available at the search-point level. For example, in a simple small diamond pattern there are four points that can be calculated in parallel if enough execution resources have been implemented. The arithmetic instructions express this parallelism with two fields that identify the number of points used by the pattern and the position in the point memory where the offsets for that pattern are defined. The control unit can then execute the instruction with a parallelism level that ranges from issuing each of these points to a different execution unit in a fully parallel hardware configuration, to issuing each point to the same execution unit in a base hardware configuration. The same approach applies to fractional instructions. The set mode instruction is used to change the active partition mode and reference frame of the core and configures the internal control logic to operate with different address boundaries and data sources.
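This point-level parallelism can be modeled as splitting a pattern instruction's points into issue groups sized by the number of execution units present; the same binary runs on any configuration, only the number of passes changes. A minimal sketch (the function name is ours):

```python
def issue_pattern(points, num_eus):
    """Split the search points of one pattern instruction into groups of
    at most num_eus points; each group is issued to the execution units
    in one pass, so fewer units simply mean more passes."""
    return [points[i:i + num_eus] for i in range(0, len(points), num_eus)]
```

A 5-point small diamond thus takes five passes on the base single-unit configuration and two passes on a 4-unit configuration, without recompiling the program.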

There are a total of 32 32-bit registers available. These registers include the command register, motion vector candidate registers, results registers and profiling registers. The motion vector candidate registers are used to store motion vectors supplied by the user from surrounding macroblocks or from macroblocks in different frames. These candidate motion vectors, together with their associated SAD values, are loaded into the current motion vector and current SAD registers with the issue of the set mode instruction and are used as the starting point for subsequent pattern instructions.

[Figure 3. High Complexity HD Motion estimation RD performance analysis: PSNR (dB) vs. bit rate (Mbits/s) for the Crowdrun sequence.]

[Figure 4. Low Complexity HD Motion estimation RD performance analysis: PSNR (dB) vs. bit rate (Mbits/s) for the Sunflower sequence.]

[Figure 2. Medium Complexity HD Motion estimation RD performance analysis: PSNR (dB) vs. bit rate (Mbits/s) for the Pedestrian area sequence. Curves: Fractional-pel search; Integer-pel search; Fractional-pel and all blocks; Integer-pel only with Lagrangian disabled; Full-search.]


[Figure 5. LiquidMotion ISA. Each instruction consists of an op code, a field A and a field B:
- 0000 Integer pattern instruction (pattern address, number of points)
- 0001 Fractional pattern instruction (pattern address, number of points)
- 0010 Conditional jump to label (if winner field = winner id then jump to immediate8; winner id = 0 means no winner in the pattern, otherwise it identifies the winning execution unit)
- 0011 Unconditional jump to label (immediate8)
- 0100 Conditional jump to label (if condition bit set, jump to immediate8)
- 0101 Compare (if less than, set condition bit) (reg, immediate12)
- 0110 Compare (if greater than, set condition bit) (reg, immediate12)
- 0111 Compare (if equal, set condition bit) (reg, immediate12)
- 1000 Set mode for MV candidate/reference frame/partition (MVC, reference frame, partition mode)]

The core has no instructions to access external memory, relying instead on an external DMA engine to move the reference frame data and the current macroblock data before processing starts. This external DMA engine moves an initial 7x8-macroblock search area at the beginning of each row for a search range of 112x128 pixels. For the remaining macroblocks in each row, only the newest column needs to be loaded, and the loading of the new macroblock column can take place in parallel with data processing, as explained in Section V.C. Once the input data is ready, processing can start by writing to the command register. Advanced motion estimation techniques, such as the adaptive thresholds used in the UMH [21] or PMVFAST algorithms, can be implemented by modifying the program memory contents directly, inserting modified immediate field contents into the compare instructions.
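The column-wise reuse of the search area can be sketched as a sliding window of seven 16-pixel-wide macroblock columns: when the core advances one macroblock, six columns are kept and only the newest one is DMA-transferred. A simplified model using column identifiers in place of pixel data:

```python
from collections import deque

def slide_search_area(area, new_col, width_cols=7):
    """Advance the 7x8-macroblock search window by one macroblock:
    append the newly loaded column and retire the oldest one, so only
    one column per macroblock crosses the external memory bus."""
    area.append(new_col)   # models the DMA load of one new macroblock column
    while len(area) > width_cols:
        area.popleft()     # oldest column falls out of the search range
    return area
```

Because the load of the new column overlaps with the processing of the current macroblock, the DMA traffic per macroblock is one column rather than a full search area.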

B. LiquidMotion Programming Model and Design Flow

The processor offers a simple programming model so that a motion estimation algorithm programmer can access the functionality of the hardware without detailed knowledge of the microarchitecture. The toolset is composed of a compiler, a cycle-accurate simulator and analysis functions, and it enables the programmer to test different motion search techniques before deciding on the one that obtains the required quality of results in terms of rate-distortion performance and the required throughput in terms of macroblocks or frames per second. At this point the programmer can instruct the tools to generate an RTL configuration file for the processor. Commercial synthesis tools such as Xilinx ISE or Synplify can then be used to process this configuration file together with the LiquidMotion RTL hardware library and generate a hardware netlist and FPGA bitstream with the right number and type of execution units matching the software requirements.

[Figure 5. LiquidMotion design flow. A high-level ME algorithm is processed by the SharpEye compiler into assembly code, which the SharpEye assembler/linker turns into a program binary and a point binary. These run on the cycle-accurate simulator/configurator against energy/throughput/quality/area constraints. If the constraints are not met, a new ME algorithm or a new hardware configuration (number and type of functional units: integer and fractional pel, Lagrangian, motion vector candidates, etc.) is tried; once they are met, a configuration RTL file is generated and processed together with the RTL component library by standard synthesis/place&route FPGA tools to produce the processor bitstream.]

This design flow is illustrated in Fig. 5. This scalable architecture can be easily programmed using an intuitive C-style language called EstimoC. EstimoC is a high-level language, powerful enough to express a broad range of motion estimation algorithms in a natural way. The EstimoC code is written in the embedded editor or any other compatible editor and is interpreted by the EstimoC compiler. The language has a natural syntax with elements from C and special structures for the development of motion estimation algorithms. Typical constructs such as for loops, if-else and while loops are supported. The algorithm designer can use these constructs to create arbitrary block-matching motion estimation algorithms, ranging from the classical full search to advanced algorithms such as UMH. Part of the language is dedicated to the preprocessor and other parts to the core decoding unit. The preprocessor is a crucial part of the compiler because it provides syntax facilities for the development of sophisticated algorithms. For example, EstimoC provides two ways to specify search patterns: a static pattern specification, as in pattern(hexagon) {pattern instructions}, or dynamic pattern generation. In the second case the programmer writes a sequence of simple check instructions in the form check(x,y); followed by the update; syntax element. A simple program example and a section of the compiler output with all the loops unrolled are shown in Fig. 6. The algorithm corresponds to a 4-point diamond pattern followed by a full-search fractional-pel refinement, which also illustrates that it is possible to implement exhaustive search approaches if they are required. The example starts by setting an initial step size of 8 that defines the size of the diamond. An initial check is done at the center


point (defined by the motion vector loaded in the motion vector candidate register, or zero if none is available) and at the 4-point diamond surrounding it. This results in a single integer-pattern instruction with 5 points (instruction zero in the sample code). Then a number of diamond steps are conducted, reducing the step size until it is smaller than 1, which corresponds to fractional searches. Each 4-point diamond generates a single check instruction. Finally, a small full search is conducted with the two for loops, which results in a single fractional instruction with a total of 25 points (-0.5, -0.25, 0, 0.25, 0.5 for each of the i and j indexes; instruction 31 in the sample code). The example program also shows a specific if-break syntax that is used to terminate the search early, as described in Section C; it corresponds to instruction opcode 2 in the sample code.

S = 8; // Initial step size

check(0, 0); check(0, S); check(0, -S); check(S, 0); check(-S, 0); update;

do {
    S = S / 2;
    for (i = 0 to 4 step 1) {
        check(0, S); check(0, -S); check(S, 0); check(-S, 0); update;
        #if (WINID == 0) #break;
    }
} while (S > 1);

for (i = -0.5 to 0.5 step 0.25)
    for (j = -0.5 to 0.5 step 0.25)
        check(i, j);
update;

Compiler output (excerpt):

 0  0 05 00  chk     NumPoints: 5   startAddr: 0
 1  0 04 05  chk     NumPoints: 4   startAddr: 5
 2  2 00 0B  chkjmp  WIN: 0  goto: 11
 3  0 04 05  chk     NumPoints: 4   startAddr: 5
 ...
11  0 04 0A  chk     NumPoints: 4   startAddr: 9
12  2 00 15  chkjmp  WIN: 0  goto: 21
 ...
21  0 04 0D  chk     NumPoints: 4   startAddr: 13
22  2 00 1F  chkjmp  WIN: 0  goto: 31
 ...
31  1 19 11  chkfr   NumPoints: 25  startAddr: 17

Figure 6. LiquidMotion programming example and compiler output (chk: integer check pattern instruction; chkjmp: conditional jump instruction; chkfr: fractional check pattern instruction).

All of the check constructs between update constructs result in a single integer or fractional pattern instruction. The compiler processes this source code and generates two binary files. The first file, called program_memory, contains the program instructions themselves; the second file, called point_memory, contains the x and y offsets of the basic search pattern (e.g. [-1,0], [1,0], [0,1], [0,-1] for the diamond search) that are combined with the current motion vector candidate to identify the location of each new search point to be checked.
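This split between the two binaries can be illustrated with a toy compiler step: every run of check(x, y) calls terminated by update collapses into one pattern instruction that stores only a point-memory start address and a point count, while the offsets themselves go into the point binary. Names here are ours, not the SharpEye compiler's internals:

```python
def compile_pattern(checks, point_memory):
    """Emit one pattern instruction for a run of check(x, y) calls: the
    x/y offsets are appended to point_memory and the instruction records
    where they start and how many there are."""
    start = len(point_memory)
    point_memory.extend(checks)
    return {"op": "chk", "startAddr": start, "numPoints": len(checks)}
```

Compiling the 5-point initial diamond followed by a 4-point diamond yields instructions with start addresses 0 and 5, matching the style of the Fig. 6 output.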

C. Early Termination and Search-Point Duplication Avoidance Implementation

Early termination is a very important feature used to speed up execution in fast motion estimation algorithms. Typically, if a pattern fails to improve the SAD of the previous iteration, the algorithm terminates the current search loop. To implement this technique, each completing check pattern instruction sets a best_eu register indicating which search point has improved upon the current cost. This register is set to zero before each instruction starts executing, so the value of the best_eu register at the end of execution indicates whether the instruction has improved the cost value (best_eu different from zero) and, if so, which search point has achieved this improvement. The conditional jump instruction checks this register and changes the execution flow as required. The same hardware can be used to support a technique that avoids searching duplicated points by coding optimized sub-patterns in software. For example, in a hexagon search pattern the first pattern contains six different points, but subsequent patterns only add three new points to the search sequence. To avoid checking the same point more than once, the best_eu register can be checked to identify the winning search point, and this information can be used by the hardware to decide which instruction to execute next. For this optimization to work in the hexagon case, the program needs to be extended to contain one full pattern sequence and six short pattern sequences. The complexity of identifying possible duplicated search points and avoiding them is built into the compiler, so the algorithm designer does not need to get involved in this process; this also helps to keep the hardware simple.
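The best_eu mechanism can be modeled in software as follows: each pattern evaluation returns zero when no point beats the current cost (terminating the loop) or the 1-based index of the winning point otherwise. This is an illustrative sketch with a toy cost function, not the hardware register logic itself:

```python
def run_pattern(points, center, cost_fn, best_cost):
    """Check a pattern around center; return (best_cost, best_point,
    best_eu) where best_eu is 0 if no point improved the cost and the
    1-based index of the winning point otherwise."""
    best_eu, best_pt = 0, center
    for i, (dx, dy) in enumerate(points, start=1):
        cand = (center[0] + dx, center[1] + dy)
        c = cost_fn(cand)
        if c < best_cost:
            best_cost, best_eu, best_pt = c, i, cand
    return best_cost, best_pt, best_eu

def diamond_search(start, cost_fn):
    """Repeat a small diamond until best_eu == 0 (early termination)."""
    diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    center, best = start, cost_fn(start)
    while True:
        best, center, winner = run_pattern(diamond, center, cost_fn, best)
        if winner == 0:
            return center, best
```

The winning index also tells the program which optimized sub-pattern to issue next, which is how the duplicate-point avoidance described above reuses the same register.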

V. PROCESSOR MICROARCHITECTURE

The microarchitectures of two configurations are illustrated in Fig. 7 and Fig. 8. Fig. 7 corresponds to the base configuration with a single integer-pel execution unit, while Fig. 8 corresponds to a complex configuration with 4 integer-pel execution units, 2 fractional-pel execution units and one interpolation execution unit. One integer-pel pipeline must always be present, as shown in Fig. 7, to generate a valid processor configuration, but the other units are optional and are configured at compile time. In addition to the number of fractional and integer execution units, the hardware includes support for other motion estimation options, as shown in Table 1. Notice that independent state machines are used in the control unit to support variable block sizes. The set mode instruction can be used to set the core for a particular partition. Partitions are calculated sequentially, one after another.

A. Integer-pel Execution Units (IPEU)

Each functional unit uses a 64-bit wide word and a deep pipeline to achieve a high throughput. All accesses to the reference and macroblock memories are done through 64-bit wide data buses, and the SAD engine also operates on 64-bit data in parallel. The memory is organized in 64-bit words and typically all accesses are unaligned, since they refer to macroblocks that can start at any position inside a word. By performing 64-bit read accesses in parallel to two memory blocks, the desired 64 bits inside the two words can be selected by the vector alignment unit. The number of integer-pel execution units is configurable from a minimum of one to a maximum of 16 and is generally limited by the available resources in the technology selected for implementation. Each execution unit has its own copy of the point memory and processes 64 bits of data in parallel with the rest of the execution units. The point memories are 256x16 in size and contain the x and y offsets of the search patterns.
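The vector alignment unit's behavior can be modeled with a byte array: two parallel aligned 64-bit reads always cover any unaligned 8-byte access, and the byte offset selects the desired lane. A sketch (function name ours):

```python
def unaligned_read(memory, byte_addr):
    """Model of the vector alignment unit: read the two aligned 64-bit
    (8-byte) words that straddle byte_addr in parallel, then select the
    eight bytes starting at the unaligned address."""
    word = byte_addr // 8
    lo = memory[word * 8:(word + 1) * 8]        # first aligned read
    hi = memory[(word + 1) * 8:(word + 2) * 8]  # second aligned read
    offset = byte_addr % 8
    return (lo + hi)[offset:offset + 8]
```

In hardware the two reads hit two separate memory blocks in the same cycle, so the alignment costs multiplexing rather than an extra memory access.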


Configuration option | Options available | Complexity (Virtex-5 LUTs) | Memory (Virtex-5 BRAMs)
Base processor with one IPEU, one reference frame and 16x16 block size | N/A | 1464 | 9
Number of integer-pel execution units | ipeu from 1 to 16 | +(ipeu-1)*1015 | +(ipeu-1)*4
Number of fractional-pel execution units | fpeu from 0 to 16 | +4057 + fpeu*1104 | +1 + fpeu*9
Additional partition sizes supported | 8x8 (8x8, 16x8, 8x16) or 4x4 (4x8, 8x4, 4x4) | +75 / +160 | 0
Motion vector candidates supported | Enable or Disable | +132 | 0
Lagrangian optimization | Enable or Disable | +133 | 0
Number of additional reference frames | 1 | +20 | +4

Table 1. LiquidMotion configuration options

Figure 7. Microarchitecture with a single execution unit

Figure 8. Microarchitecture with a total of six execution units

For example, a typical diamond search pattern with a radius of 1 uses 4 positions in the point memory with the values [-1,0], [0,-1], [1,0] and [0,1]. Any pattern can be specified in this way, and multiple instructions specifying the same pattern can point to the same position in the point memory, saving memory resources. Each integer-pel execution unit receives an incremented address for the point memory, so each of them can compute the SAD for a different search point of the same pattern. This means that the optimal number of integer-pel execution units is four for the diamond search pattern and six for the hexagon pattern. A further optimization that avoids searching duplicated points can halve the number of search points for many regular patterns. In algorithms that combine different search patterns, such as UMH, a compromise can be found to optimize the hardware and software components together. This illustrates the idea that the hardware configuration and the software motion estimation algorithm can be optimized together to generate different processors depending on the software algorithm to be deployed.
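The distribution of one pattern instruction over several IPEUs can be sketched as follows (a hypothetical software model, not the actual instruction set; the helper name is ours):

```python
DIAMOND = [(-1, 0), (0, -1), (1, 0), (0, 1)]  # radius-1 diamond pattern

def pattern_iterations(points, n_ipeu):
    """Split one pattern instruction into iterations: each iteration
    issues up to n_ipeu search points, one per execution unit."""
    return [points[i:i + n_ipeu] for i in range(0, len(points), n_ipeu)]
```

With four IPEUs the diamond pattern completes in a single iteration, which is why four is the optimal count for this pattern.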

B. Fractional-pel Execution Unit (FPEU) and Interpolation Execution Unit (IEU).

The engine supports half and quarter pel motion estimation thanks to a half-pel interpolator execution unit and specifically designed fractional-pel execution units. The number of half-pel interpolation execution units is limited to one but the number of fractional-pel execution units can also be configured at compile time. The IEU interpolates the 20x20 pixel area that contains the 16x16 macroblock corresponding to the winning integer motion vector. The interpolation hardware is cycled 3 times to calculate first the horizontal pixels then the vertical pixels and finally the diagonal pixels. The IEU calculates the half pels through a 6-tap filter as defined in the H.264 standard. The IEU has a total of 8 systolic 1-D interpolation processors with 6 processing elements each. The objective is to balance the internal memory bandwidth with the processing power so in each cycle a total of 8 valid pixels are presented to one interpolator. The interpolator starts processing these 8 pixels producing one new half-pel sample after each clock cycle. In parallel with the completion of 1-D interpolation of the first 8-pixel vector, the memory has already been read another 7 times and its output assigned to the other 7 interpolators. The data read during memory cycle 9 can then be assigned back to the first interpolator obtaining high hardware utilization. The horizontally interpolated area contains enough pixels for the diagonal interpolation to also complete successfully. A total of 24 rows with 24 bytes each are read. Each interpolator is enabled 9 times so that a total of 72 8-byte vectors are processed. Due to the effects of filling and emptying the systolic pipeline before the half-pel samples are available, a total of 141 clock cycles are needed to complete half-pel horizontal interpolation. During this time, the integer pipeline is stalled, since the memory ports for the reference memory are in use. 
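The 6-tap filter applied by the IEU is the one defined in the H.264 standard; a minimal scalar sketch of one half-pel sample (the systolic organization of the hardware is not modelled) is:

```python
def half_pel(e: int, f: int, g: int, h: int, i: int, j: int) -> int:
    """H.264 6-tap half-pel filter (1, -5, 20, 20, -5, 1) with rounding
    and clipping, applied between samples g and h."""
    v = (e - 5 * f + 20 * g + 20 * h - 5 * i + j + 16) >> 5
    return max(0, min(255, v))
```

On a flat area the filter returns the input value, e.g. `half_pel(100, 100, 100, 100, 100, 100)` gives 100.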
Once horizontal interpolation completes, and in parallel with the calculation of the vertical and diagonal half-pel pixels and the fractional-pel motion estimation, the processing of the next macroblock or partition can start in the integer-pel execution units. Completion of the vertical and diagonal pixel interpolation takes a further 170 clock cycles, after which the motion estimation using the fractional pels can start. Quarter-pel interpolation is done on the fly as required, simply by reading the data from two of the four memories containing the half and full pel positions and averaging according to the H.264 standard. The fractional pipeline is as fast as the integer pipeline, requiring the same number of cycles to compute each search position, as explained in Section VI.
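The on-the-fly quarter-pel computation reduces to the H.264 rounding average of two neighbouring samples:

```python
def quarter_pel(a: int, b: int) -> int:
    """Quarter-pel sample: H.264 rounding average of two neighbouring
    full-pel/half-pel samples, computed on the fly."""
    return (a + b + 1) >> 1
```

This is why no extra interpolation memory is needed for quarter-pel positions: each sample is derived as it is read.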

C. Reference memory organization

The implemented reference memory can accommodate a search area with a width of 128 pixels. Limiting the horizontal search range to 112 pixels leaves a 16-pixel-wide memory area available to be reloaded with a new column for the next macroblock, in parallel with the processing of the current macroblock, using a shifting-window technique. The shifting window means that the reference addresses are offset so that reads are not performed on the memory area being loaded with a new column of reference pixels for the next macroblock. The implementation of the reference area in Xilinx Virtex-5 Block-RAMs uses a total of 4 Block-RAMs. Each Block-RAM is organized as 1024 words of 4 bytes each in a dual-port configuration. Fig. 8 shows a simplified view of the reference memory organization. The key feature is that the 8-pixel words that form the reference area are stored in an interleaved organization in the Block-RAMs. For example, the first row of the first 16x16 macroblock is formed by words 0 and 1. Word 0 is stored in BRAMs 1 and 2 while word 1 is stored in BRAMs 3 and 4, as shown on the left of Fig. 8. The least significant bit of the address is used to activate the reading of the appropriate BRAMs. Since a motion vector can point to any location in this reference window, accesses are generally misaligned and, for example, the last 3 bytes of the word read from BRAMs 1/2 must be concatenated with the first 5 bytes of the word read from BRAMs 3/4 to form 64 bits of valid data. Notice that if, for example, the motion vector points to the middle of memory word 1, then a few bytes from memory word 2 are also needed to form 64 bits of valid data. In this case the address must be incremented by one to access the right location for memory word 2 (the second position in BRAMs 1/2). The effect of the memory interleaving technique is that the Block-RAMs always have one memory port free.
The free port can be used to load new reference data for the next macroblock in parallel with the processing of the current macroblock. This is very important since, if processing and loading of new data had to be done in sequence, performance would typically halve. The simultaneous reading and writing means that the next macroblock data is loaded by an external DMA engine while the current macroblock is processed, masking the effects of limited bus bandwidth. In our prototype the bus width is 64 bits, so the DMA engine can load a new 64-bit word in each clock cycle. A new column of 8 macroblocks (2048 bytes) can then be loaded in 256 clock cycles.
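The interleaved storage and the misaligned concatenation can be modelled in software as follows. This is an illustrative model of one reference row only, under our reading of the addressing scheme; the class and method names are hypothetical.

```python
class InterleavedRefMem:
    """Model of the interleaved reference storage: even 8-pixel words in
    BRAM pair 1/2, odd words in pair 3/4, so any two consecutive words
    can be read in the same cycle."""

    def __init__(self, row: bytes):
        words = [row[i:i + 8] for i in range(0, len(row), 8)]
        self.pair12 = words[0::2]   # even words (BRAMs 1 and 2)
        self.pair34 = words[1::2]   # odd words (BRAMs 3 and 4)

    def read_unaligned(self, byte_addr: int) -> bytes:
        w, off = byte_addr >> 3, byte_addr & 7
        if w % 2 == 0:              # even first word: pair 1/2 then 3/4
            lo, hi = self.pair12[w // 2], self.pair34[w // 2]
        else:                       # odd first word: pair 3/4 plus the
            lo, hi = self.pair34[w // 2], self.pair12[w // 2 + 1]
        return (lo + hi)[off:off + 8]
```

The odd-word branch corresponds to the case in the text where the address into BRAMs 1/2 must be incremented by one to reach memory word 2.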



VI. HARDWARE PERFORMANCE EVALUATION AND IMPLEMENTATION

For the implementation we have selected the Virtex-5 LX110T device included in the XUPV5 development platform. This device offers a high level of density within the Virtex-5 family and can be considered mainstream, being fabricated in 65 nm CMOS technology.

A. Performance/complexity analysis

The results of implementing the processor with different numbers and types of execution units are shown in Table 2. The basic configuration is small, using only 2% of the available logic resources and 6% of the memory blocks.

Virtex-5 LX110T (XUP V5 board)

Configuration (number of execution units) | Slice LUTs used / available | Memory blocks used / available | Critical path (ns)
1 IPEU / 0 FPEU | 1,464/69,120 (2%) | 9/148 (6%) | 4.551
2 IPEU / 0 FPEU | 2,479/69,120 (3%) | 13/148 (8%) | 4.420
3 IPEU / 0 FPEU | 3,461/69,120 (5%) | 18/148 (12%) | 4.620
1 IPEU / 1 FPEU | 6,625/69,120 (9%) | 18/148 (12%) | 4.695
2 IPEU / 1 FPEU | 7,567/69,120 (11%) | 23/148 (15%) | 4.470

Each new execution unit adds around 1000 Virtex-5 LUTs and 4 embedded memory blocks to the complexity. The fractional and integer execution units have been carefully pipelined and all the configurations can achieve a clock rate of 200 MHz in this part. Obtaining a performance value in terms of macroblocks per second is not as straightforward as in full-search hardware, which always computes the same number of SADs for each macroblock. In this case the amount of motion in the video sequence, the type of algorithm and the hardware configuration vary the number of macroblocks per second that the engine can process. The cycle-accurate simulator that is part of the toolset has been used to measure the performance of the core processing the same high-definition files introduced in Section III. The performance values obtained from the cycle-accurate simulator have been verified against a prototype implementation of the system using the XUPV5 board. Overall, the microarchitecture always uses 33 cycles per search point, although there is an overhead of 11 clock cycles needed to empty the integer pipeline before the best motion vector can be found in each pattern iteration and the next pattern started from the current winning position. The microarchitecture stops an execution unit if the current SAD calculation becomes larger than the cost obtained during a previous calculation, to save power, but it does not try to start the next search point earlier. The main reason this optimization is not used is as follows: since the core uses multiple execution units, it is very important that all the execution units remain synchronized so that a single control unit can issue the same control signals to all of them. Execution units starting at different clock cycles would invalidate this requirement.

Integer-pel performance is evaluated using three different fast motion estimation algorithms: diamond, hexagon and UMH (Uneven Multi-Hexagon Cross Search), all of them followed by an 8-point square refinement as implemented in the x264 codec.

Figure 8. Reference memory internal organization

Table 2. Processor complexity

Figs. 9 to 11 show the performance in terms of frames per second as the number of integer execution units changes for different minimum sub-partitions. The 8x8 mode considers the 16x16, 8x16, 16x8 and 8x8 partitions, while the 4x4 mode considers all the partitions. As the number of partitions considered increases, performance decreases, since the core must compute one partition at a time. It is not possible to reuse partition results and calculate them in parallel for the fast motion estimation algorithms considered, since each partition may follow a different search direction. It is important to notice that not all the partitions are checked: the inter-mode selection algorithm of the x264 codec selects which sub-partitions to test. For example, if the 8x8 partition has not improved over the 16x16 partition, then 4x4 is not considered. The figures show that more complex algorithms exhibit better scalability with the available number of execution units. For example, the optimal configuration for a diamond search pattern includes four IPEUs, although in these experiments performance keeps increasing for configurations with more than 4 IPEUs due to the presence of the final square refinement, which includes 8 points in its search pattern. It is also important to notice that a configuration with three IPEUs needs the same number of cycles as one with two IPEUs for the diamond search. The reason is that whilst the first iteration enables all three IPEUs, a second iteration is still required to complete the pattern instruction, and during it only one IPEU is enabled.
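A rough cycle model consistent with the figures quoted in the text (33 cycles per search point, an 11-cycle pipeline drain per pattern iteration, and up to one point per IPEU per iteration) reproduces this behaviour; the formula is our reading of the text, not measured hardware data.

```python
from math import ceil

def pattern_cycles(n_points: int, n_ipeu: int,
                   per_point: int = 33, drain: int = 11) -> int:
    """Rough cycle model: each iteration evaluates up to n_ipeu points
    in parallel in 33 cycles, plus an 11-cycle drain per iteration."""
    return ceil(n_points / n_ipeu) * per_point + drain
```

For the 4-point diamond, `pattern_cycles(4, 3)` equals `pattern_cycles(4, 2)`, matching the observation that three IPEUs give no advantage over two for this pattern.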


Figures 9 to 11 also show that the simpler motion in video sequences such as Sunflower and Pedestrian results in higher frame rates. This could be exploited in hardware by lowering the clock frequency to maintain a constant frame rate in a real application. The complex motion present in Crowdrun makes the probability of selecting the smaller sub-blocks much higher and increases the impact of these sub-blocks on performance. For example, to maintain a frame rate of 30 frames per second over the Crowdrun sequence when all the block sizes are used, 16 IPEUs are needed, as shown in Fig. 11. Another form of parallelism, not described in this paper but certainly possible, would be a multi-core implementation in which some ME processors are dedicated to particular sub-blocks and only activated if needed. This would enable the further scaling of the presented architecture to higher frame rates for complex algorithms.

Figure 9. Analysis of integer-pel performance in the Pedestrian area sequence

Figure 10. Analysis of integer-pel performance in the Sunflower sequence

Figure 11. Analysis of integer-pel performance in the Crowdrun sequence

The current microarchitecture can run the integer-pel and fractional-pel searches in parallel. To obtain the same level of fractional and integer-pel performance, each fractional-pel execution unit needs two alignment units, because quarter-pel interpolation requires two half-pel data words to be read and aligned. The complex part of executing the fractional-pel refinement is the half-pel interpolation using the standard 6-tap filter. In the current microarchitecture this interpolation must complete before the fractional-pel search can start, and the interpolator needs around 300 clock cycles to calculate the horizontal, vertical and diagonal pixels. Figs. 12 to 14 evaluate the performance of the fractional-pel searches using three fractional motion estimation algorithms: diamond, hexagon and square search. The fractional search does not require complex algorithms since the search area is limited to 20x20 pixels: the 16x16 pixel area corresponding to the winning integer macroblock extended by two pixels on each side. In all cases we consider a search loop formed by two half-pel checks followed by two quarter-pel checks, following the same approach as the x264 codec. Sub-partitions are processed as in the low-complexity mode of x264: the fractional refinement is only performed on the best partition after the integer search completes. This option keeps the interpolation complexity low; the alternative of performing a fractional refinement over each possible partition would need a multi-core implementation, since the single interpolator available in the microarchitecture would not be able to cope. Similarly to the integer-pel search, the figures show that simpler motion sequences translate into higher performance, as expected.
In this case we can also observe that the scalability of the fractional-pel search performance with the number of FPEUs is more limited than in the integer-pel case. The reason is the half-pel interpolation stage that must run before the search can start and always needs a constant number of clock cycles, independently of how many FPEUs are available.
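This limited scaling is an Amdahl-style effect that can be sketched as follows; the ~300-cycle interpolation figure comes from the text, while the point count and per-point cost used here are illustrative assumptions.

```python
from math import ceil

def fractional_cycles(n_points: int, n_fpeu: int,
                      interp: int = 300, per_point: int = 33) -> int:
    """Amdahl-style model: the fixed interpolation stage precedes the
    search, so extra FPEUs only speed up the search part."""
    return interp + ceil(n_points / n_fpeu) * per_point
```

Doubling the FPEUs from one to two for a 16-point refinement gives well under a 2x speed-up, because the interpolation stage is unchanged.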


Figure 12. Analysis of fractional-pel performance in the Pedestrian area sequence

Figure 13. Analysis of fractional-pel performance in the Sunflower sequence

Figure 14. Analysis of fractional-pel performance in the Crowdrun sequence

Finally, Table 3 compares the performance and complexity figures of the base configuration of the LiquidMotion processor against the ASIP cores proposed in [11] and [12]. The figures measured on the general-purpose P4 processor with all assembly optimizations enabled are also presented as a reference, although the power consumption and cost of this general-purpose processor make it unsuitable for the embedded applications this work targets. These comparisons are difficult since the features of each implementation vary. For example, our base configuration does not support fractional-pel searches, and the addition of the interpolator and fractional-pel execution unit in parallel with the integer-pel execution unit increases complexity by a factor of 3. The core presented in [12] does support fractional-pel searches, although with a non-standard interpolator, and both searches must run sequentially. Overall, Table 3 shows that our core with one execution unit offers a level of integer performance for the diamond search algorithm similar to the ASIP developed in [12], and this can be almost doubled if the configuration instantiates two execution units, as shown in the last row. For these experiments our core was retargeted to a Virtex-II device, since this is the technology used in [11] and [12], to obtain a fair comparison. The pipeline of the proposed solution can clock at double the frequency, as shown in the table, and this helps to justify why our solution with a single execution unit can support 1080p HD formats while the solution presented in [12] is limited to 720p HD formats. The measurements of cycles per macroblock were obtained processing the same CIF sequences as used in [12].

Approach | Cycles per MB (diamond search) | FPGA complexity (slices) | FPGA clock (MHz, Virtex-II) | Memory (BRAMs)
Intel P4 assembly | ~3,000 | N/A | N/A | N/A
Dias et al. [11] | 4,532 | 2,052 | 67 | 4 (external reference area)
Babionitakis et al. [12] | 660 | 2,127 | 50 | 11 (1 reference area of 48x48 pixels)
Proposed, one integer-pel execution unit | 510 | 1,231 | 125 | 21 (2 reference areas of 112x128 pixels)
Proposed, two integer-pel execution units | 287 | 2,051 | 125 | 38 (2 reference areas of 112x128 pixels)

Table 3. Performance/complexity comparison

The diamond search corresponds to the implementation available in x264, which includes up to 8 diamond iterations followed by a square refinement, using a single reference frame and a single macroblock size (16x16).

B. Power analysis.

Power is a major consideration in hardware design, so it is important to investigate how effective the core is from a power-efficiency point of view. Unfortunately, no power results have been reported in [10] and [11] for the FPGA implementations. In any case, most of the available literature reporting power consumption in FPGAs relies on the tools provided by the vendors. The standard approach is to use a tool such as Xilinx XPower together with a VCD activity file obtained from simulating the netlist back-annotated with timing information. This should accurately capture the logic glitches largely responsible for dynamic power consumption, together with the switching behavior of flip-flops and LUTs. Applied to our core, this flow translates into unreasonable running times, or into a low level of confidence in the power results because only a portion of the signals/logic is activated in short simulation runs. A timing simulation of around 1000 ns, which only contains 100 clock cycles of activity at a 100 MHz clock rate, results in XPower needing more than 12 hours to complete the analysis on a P4 computer due to the complexity of the core. A more accurate approach involves measuring the amount of power consumed by the chip when deployed, and this is the method we have used in this work. The core is deployed as part of a SoC using a modified Xilinx XUP V5 board with an isolated vcore power supply connected to a purposely designed voltage regulator. For the analysis we have clock-gated the ME processor to be able to isolate the power consumed by the ME core from the power of the rest of the SoC mapped in the FPGA. The SoC uses a soft-core processor to move data from external DDR memories to the internal ME memories using the AMBA bus. The movement of data is done first with the clock of the ME core gated, and in a second run with the clock running and the real motion estimation performed. The difference between these two measurements corresponds to the dynamic power of the ME core. Fig. 15 shows the dynamic power of the ME cores with one and two integer-pel cores. As expected, power increases linearly with core frequency and is proportional to core complexity.

Figure 15. Dynamic power of the ME engines with one (ME 1IU) and two (ME 2IU) integer-pel execution units, in mW, for clock frequencies between 18.75 and 50 MHz

To consider static power in FPGA devices it is possible to make a distinction between the configured and unconfigured states. In the unconfigured state the bitstream has not been loaded and the FPGA fabric is set to a default low-leakage state as described in [x].


Column 2 in Table 4 shows the power measured after configuring the static region with the SoC but keeping the reconfigurable region empty. The value corresponding to frequency 0 shows the static power consumption of the FPGA. It can be observed that static power is the main cause of power consumption in the device. The second column shows the power after the region has been configured with the motion estimation core while both the SoC and the ME remain in an idle state. Power increases from 29 mW to 78 mW depending on the clock frequency. The increase in power with the clocks is expected since, although no useful work is being done, the activation of the clocks increases the switching activity of the logic cells, digital clock managers and other logic present in the chip. The large increase in power resulting from simply configuring the region is, however, remarkable. It suggests that the unconfigured state is much more power-efficient than the configured state, and that if a region of the FPGA is not going to be used for some time it could be unconfigured to save power. Column 3 corresponds to the region configured with the ME core but only the SoC running: the SoC processor writes reference and current macroblock data to the ME memories but does not activate the core. Finally, column 4 is equivalent to column 3 but with the SoC processor activating the ME core to calculate the motion vectors as defined by the motion search algorithm. A diamond search is used for these experiments, and the difference between column 3 and column 4 is the dynamic power of the running motion estimation core, which measures around 22 mW for the 50 MHz clock. The total dynamic power can be estimated by adding the power consumed by the core with the clocks running but doing no useful work, obtained from the difference between columns 2 and 1 at 50 MHz once the static power at 0 MHz has been subtracted ((576-454) - (520-425) = 27 mW). This translates into a total dynamic power of 49 mW. The equivalent table for the mex2 configuration, in which the ME processor has two integer execution units, gives a total dynamic power of 74 mW.
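The power bookkeeping above can be reproduced directly; the variable names are ours and the values are those quoted in the text for the Table 4 columns at 50 MHz and 0 MHz.

```python
# Values in mW read from the text (Table 4 columns at 50 MHz and 0 MHz).
col2_50, col2_0 = 576, 454   # region configured, clocks running, idle
col1_50, col1_0 = 520, 425   # reconfigurable region left empty
search_dynamic = 22          # column 4 minus column 3 at 50 MHz

# Clock-only dynamic power: column difference minus the static baseline.
clock_dynamic = (col2_50 - col2_0) - (col1_50 - col1_0)
total_dynamic = clock_dynamic + search_dynamic
print(clock_dynamic, total_dynamic)  # prints 27 49
```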

Table 4. Power analysis

The value of the static power consumed is very high, but it corresponds to the whole device and not just the portion of the device used by the core. Since the core with one integer execution unit occupies around 8% of the device, we can estimate that the share of static power corresponding to the core itself is approximately 63 mW (34 mW unconfigured plus 29 mW added after configuration). The mex2 version uses approximately 13% of the device, and its static power is approximately 98 mW (55 mW unconfigured plus 43 mW added after configuration). In both cases static power is higher than dynamic power. Techniques such as the voltage scaling and error correction approach used in [22] for motion estimation could also be applied to the execution units in this work to reduce both the static and dynamic power consumption.

VII. CONCLUSION

The main features of the presented processor are the support of arbitrary fast motion estimation algorithms, the seamless integration of fractional and integer-pel support, the availability of a software toolset to ease the development of new motion estimation algorithms and processors, and a scalable, configurable architecture whose number of execution units is determined by the algorithm and throughput requirements. The combination of these features constitutes a significant advance over the work reviewed in Section II. Compared with traditional full-search hardware, the presented core scales well to large search ranges without a linear increase in hardware resources and, consequently, power consumption. The power analysis based on measured data has shown the large effect of static power. The power values have been added to the cycle-accurate simulator that is part of the toolset (available at http://sharpeye.borelspace.com/), which can then be used to configure the processor according to power, performance and complexity constraints.

[1] Ostermann, J., Bormans, J., List, P., Marpe, D., Narroschke, M., Pereira, F., Stockhammer, T. and Wedi, T., "Video coding with H.264/AVC: tools, performance and complexity", IEEE Circuits Syst. Mag., vol. 4, no. 1, pp. 7-28, 2004.

[2] Nunez-Yanez, J.L.; Hung, E.; Chouliaras, V., "A configurable and programmable motion estimation processor for the H.264 video codec", in Proc. Int. Conf. on Field Programmable Logic and Applications (FPL), pp. 149-154, Sept. 2008.

[3] Huang, Y.-W., Wang, T.-C., Hsieh, B.-Y., Chen L.-G. “Hardware Architecture Design for Variable Block Size Motion Estimation in MPEG-4 AVC/JVT/ITU-T H.264”. ISCAS. May 2003.

[4] Ching-Yeh Chen; Shao-Yi Chien; Yu-Wen Huang; Tung-Chien Chen; Tu-Chih Wang; Liang-Gee Chen, "Analysis and architecture design of variable block-size motion estimation for H.264/AVC", IEEE Trans. Circuits Syst. I: Reg. Papers, vol. 53, no. 3, pp. 578-593, March 2006.

[5] Yap, S.Y.; Mccanny, J.V., ‘A VLSI architecture for advanced video coding motion estimation’, ASAP, pp. 293-301, 24-26 June 2003

[6] Chao-Yung Kao and Youn-Long Lin, “An AMBA-Compliant Motion Estimator For H.264 Advanced Video Coding” IEEE International SOC Conference (ISOCC), Seoul, Korea, October 2004

[7] Brian M. Li , Philip H. Leong, “Serial and Parallel FPGA-based Variable Block Size Motion Estimation Processors”, Journal of Signal Processing Systems, Vol. 51 , No. 1, pp. 77-98 April 2008

[8] B.-F. Wu, H.-Y. Peng, and T.-L. Yu, "Efficient hierarchical motion estimation algorithm and its VLSI architecture," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 10, pp. 1385-1398, Oct. 2008.

[9] Y.-W. Huang, C.-Y. Chen, C.-H. Tsai, C.-F. Shen, and L.-G. Chen, "Survey on block matching motion estimation algorithms and architectures with new results," J. VLSI Signal Process., vol. 42, no. 3, pp. 297-320, Mar. 2006.

[10] S.-C. Cheng and H.-M. Hang, "A comparison of block-matching algorithms mapped to systolic-array implementation," IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 5, pp. 741-757, Oct. 1997.

[11] T. Dias, S. Momcilovic, N. Roma, and L. Sousa, "Adaptive motion estimation processor for autonomous video devices," EURASIP J. Embedded Syst., vol. 2007, no. 1, Jan. 2007.

[12] K. Babionitakis et al., "A real-time motion estimation FPGA architecture," J. Real-Time Image Process., vol. 3, no. 1-2, pp. 3-20, Mar. 2008.

[13] G. J. Sullivan and T. Wiegand, "Rate-distortion optimization for video compression," IEEE Signal Process. Mag., vol. 15, no. 6, pp. 74-90, Nov. 1998.

[14] Information available at http://www.xilinx.com/products/ipcenter/DO-DI-H264-ME.htm

[15] S. Saponara, K. Denolf, G. Lafruit, C. Blanch, and J. Bormans, "Performance and complexity co-evaluation of the advanced video coding standard for cost-effective multimedia communications," EURASIP J. Appl. Signal Process., no. 2, pp. 220-235, Feb. 2004.

[16] Information available at http://www.videolan.org/developers/x264.html

[17] 1080p HD sequences obtained from http://nsl.cs.sfu.ca/wiki/index.php/Video_Library_and_Tools#HD_Sequences_from_CBC

[18] JM reference software. [Online]. Available: https://bs.hhi.de/suehring/tml/download

[19] D. Alfonso, F. Rovati, D. Pau, and L. Celetto, "An innovative, programmable architecture for ultra-low power motion estimation in reduced memory MPEG-4 encoder," IEEE Trans. Consum. Electron., vol. 48, no. 3, pp. 702-708, Aug. 2002.

[20] H.-Y. C. Tourapis and A. M. Tourapis, "Fast motion estimation within the H.264 codec," in Proc. IEEE Int. Conf. Multimedia and Expo (ICME), vol. 3, pp. III-517-520, July 2003.

[21] T. Toivonen and J. Heikkila, "Improved unsymmetric-cross multi-hexagon-grid search algorithm for fast block motion estimation," in Proc. IEEE Int. Conf. Image Processing (ICIP), pp. 2369-2372, Oct. 2006.

[22] G. V. Varatkar and N. R. Shanbhag, "Error-resilient motion estimation architecture," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 10, pp. 1399-1412, Oct. 2008.
