hardware 734 ece - university of...

Hardware Optimized DCT/IDCT Implementation on Verilog HDL

In this report, I explore four implementations of a hardware-based, pipelined DCT/IDCT in Verilog HDL. Conventional DCT/IDCT implementations require a large amount of hardware for storage and computation. This project attempts to reduce these requirements and compares the four implementations to identify the best design point for a hardware DCT/IDCT. The Serial In implementation consumes about 6% less area than the 8 Parallel In implementation, at a performance degradation of only about 4%.

Rahul Srikumar

ECE


Table of Contents

Motivation

Prior Work

The Discrete Cosine Transform

Introduction

Four Implementations

Serial In

2 Parallel In

4 Parallel In

8 Parallel In

Optimizations

Synthesis and Results

Conclusion

References

Motivation

The Discrete Cosine Transform (DCT) is one of the most important transforms used in image compression and other image processing applications. It involves a large number of multiplications and additions and also has a substantial memory requirement. Several algorithms have been proposed over the last couple of decades to reduce the number of computations and the memory required to compute the DCT.

Any algorithm that reduces the total number of additions or multiplications, or the memory requirement, would be of considerable significance to the image processing domain.


Prior Work

There has been a lot of research, in both industry and academia, on how to efficiently implement fast DCT/IDCT hardware. Dae Won Kim et al. [1] proposed and implemented a hardware Distributed Arithmetic (DA) method with radix-2 multibit coding that minimizes resource requirements by using a transpose memory. Atitallah et al. [2] compared the Loeffler and DA algorithms for implementing compression in H.264 and MPEG. Martuza et al. [3] presented a hybrid architecture for IDCT computation based on the symmetric structure of the transform matrices and the similarity in their matrix operations. The architecture proposed in this report draws on all of these well-established examples.

The Discrete Cosine Transform

A discrete cosine transform (DCT) expresses a finite sequence of data points as a sum of cosine functions oscillating at different frequencies; that is, it transforms a signal from a spatial representation into a frequency representation. In an image, most of the energy is concentrated in the lower frequencies, so if I transform an image into its frequency components and discard the higher-frequency coefficients, I can reduce the amount of data needed to describe the image without sacrificing too much image quality. This is why the DCT is widely used in image compression algorithms. The DCT used in image processing is a sum of weighted cosine functions at different frequencies.

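In its standard orthonormal (DCT-II) form, the DCT of an N-point sequence x(n) is expressed as follows:

\[ X(k) = C(k) \sum_{n=0}^{N-1} x(n) \cos\left[\frac{(2n+1)k\pi}{2N}\right], \quad k = 0, 1, \ldots, N-1 \tag{1} \]

\[ C(k) = \sqrt{\frac{1}{N}}, \quad k = 0 \tag{2} \]

\[ C(k) = \sqrt{\frac{2}{N}}, \quad k = 1, \ldots, N-1 \tag{3} \]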

Since images are 2-D objects, a 2-D DCT is required to get all pixels transformed into

the frequency domain. This computation involves 2 major steps.

(i) Computing the 1-D DCT of the rows of the pixel matrix.

(ii) Computing the 1-D DCT of the columns of the pixel matrix by computing the DCT of

the transpose of the matrix obtained in (i).

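The 2-D DCT of an N x N image block x(m, n), in its standard orthonormal form, is expressed as follows:

\[ X(k,l) = C(k)\, C(l) \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} x(m,n) \cos\left[\frac{(2m+1)k\pi}{2N}\right] \cos\left[\frac{(2n+1)l\pi}{2N}\right] \tag{4} \]

\[ C(k) = \sqrt{\frac{1}{N}}, \quad k = 0 \tag{5} \]

\[ C(k) = \sqrt{\frac{2}{N}}, \quad k = 1, \ldots, N-1 \tag{6} \]

with C(l) defined in the same way as C(k).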
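The two-step row-column procedure described above can be checked numerically. The following is a minimal Python sketch, not the Verilog of this report. It assumes the standard orthonormal DCT-II scaling and N = 8, and the function names (dct_1d, dct_2d_row_column, dct_2d_direct) are illustrative only. It compares the row-column result against a direct double-sum evaluation.

```python
import math

N = 8  # block size assumed throughout this sketch


def c(k):
    """Orthonormal DCT-II scale factor."""
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


def dct_1d(seq):
    """1-D DCT-II of an N-point sequence."""
    return [c(k) * sum(seq[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                       for n in range(N))
            for k in range(N)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dct_2d_row_column(block):
    """Step (i): 1-D DCT of every row. Step (ii): 1-D DCT of every column."""
    rows_done = [dct_1d(row) for row in block]
    cols_done = [dct_1d(col) for col in transpose(rows_done)]
    return transpose(cols_done)


def dct_2d_direct(block):
    """Direct double-sum 2-D DCT, used here only as a reference."""
    out = [[0.0] * N for _ in range(N)]
    for k in range(N):
        for l in range(N):
            acc = 0.0
            for m in range(N):
                for n in range(N):
                    acc += (block[m][n]
                            * math.cos((2 * m + 1) * k * math.pi / (2 * N))
                            * math.cos((2 * n + 1) * l * math.pi / (2 * N)))
            out[k][l] = c(k) * c(l) * acc
    return out


# Arbitrary 8-bit test block (made-up data, not taken from the report).
block = [[(3 * i + 5 * j) % 256 for j in range(N)] for i in range(N)]
separable = dct_2d_row_column(block)
direct = dct_2d_direct(block)
max_error = max(abs(separable[k][l] - direct[k][l])
                for k in range(N) for l in range(N))
```

For this sample block the two results agree to within floating-point rounding, which is the property that allows a single 1-D datapath to be used for both passes.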

Introduction

In my implementation, I explore four design points of the hardware implementation in Verilog HDL and evaluate the area-performance trade-off. The design comprises four modules per design point: one module for DCT computation, one module for IDCT computation, one top module that instantiates both the DCT and IDCT modules, and a test bench to test the entire design.

The core idea is to implement a fully pipelined architecture that takes in 8 inputs and produces a single DCT output, which in turn is used to compute the IDCT. A 1D-DCT is first performed on the input pixels. Its output, the intermediate value, is stored in a RAM. The second 1D-DCT operation is performed on this stored value to give the final 2D-DCT output, dct_2d. The inputs are 8 bits wide and the 2D-DCT outputs are 9 bits wide. In the inverse path, a 1D-IDCT is performed on the input DCT values. This intermediate value is stored in a RAM. The second 1D-IDCT operation is performed on this stored value to give the final 2D-IDCT output, idct_2d. The inputs are 9 bits wide and the 2D-IDCT outputs are 8 bits wide. The details of the four design points are given in the sections that follow.

Four Implementations

Serial In

1st 1D section

The input signals are taken one pixel at a time in the order x00 to x07, x10 to x17, and so on up to x77. These inputs are fed into an 8-bit shift register. The outputs of the shift register are registered by the divide-by-8 clock, which is the CLOCK signal divided by 8. This allows 8 pixels (one row) to be registered at a time. The pixels are paired up in an adder/subtractor in the order xk0,xk7 : xk1,xk6 : xk2,xk5 : xk3,xk4. The adder/subtractor is tied to CLOCK. On every clock, the adder/subtractor module alternates between addition and subtraction; this selection is made by a toggle flop. The output of the adder/subtractor is fed into a multiplier whose other input is connected to stored values in registers acting as memory. The outputs of the 4 multipliers are added on every clock in the final adder. The output of the adder, z_out, is the sequence of 1D-DCT values, given out in the order in which the inputs were read in.
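The pairing xk0,xk7 : xk1,xk6 : xk2,xk5 : xk3,xk4 works because the 8-point DCT basis is symmetric for even output indices and antisymmetric for odd ones, so each output needs only 4 products. The sketch below is a behavioural Python model under stated assumptions: orthonormal scaling, floating-point arithmetic, and no separate sign/absolute-value stage. The names coeff and dct_1d_butterfly are illustrative and do not come from the Verilog source.

```python
import math

N = 8


def coeff(k, n):
    """Orthonormal DCT-II coefficient for output k and input n."""
    scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    return scale * math.cos((2 * n + 1) * k * math.pi / (2 * N))


def dct_1d_butterfly(row):
    """One 1-D DCT using paired add/sub inputs and 4 multiplies per output."""
    out = []
    for k in range(N):
        subtract = (k % 2 == 1)  # models the toggle flop: add, sub, add, ...
        paired = [row[n] - row[N - 1 - n] if subtract
                  else row[n] + row[N - 1 - n]
                  for n in range(N // 2)]
        products = [paired[n] * coeff(k, n) for n in range(N // 2)]  # 4 multipliers
        out.append(sum(products))  # final adder
    return out


def dct_1d_reference(row):
    """Plain 8-multiply-per-output DCT used only for comparison."""
    return [sum(row[n] * coeff(k, n) for n in range(N)) for k in range(N)]


row = [52, 55, 61, 66, 70, 61, 64, 73]  # made-up pixel row
fast = dct_1d_butterfly(row)
slow = dct_1d_reference(row)
butterfly_error = max(abs(a - b) for a, b in zip(fast, slow))
```

Even-indexed outputs use the sums and odd-indexed outputs use the differences, which matches an adder/subtractor that toggles on every clock while the outputs are produced in index order.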

It takes 8 clocks to read in the first set of inputs, 1 clock to register the inputs, 1 clock for the add/sub, 1 clock to get the absolute value, 1 clock for the multiplication, and 2 clocks for the final adder, for a total of 14 clocks to get the first z_out value. Every subsequent clock gives out the next z_out value, so getting all 64 values takes 14 + 63 = 77 clocks.

Storage/RAM section

The outputs z_out of the adder are stored in RAMs. Two RAMs are used so that data can be written continuously. The first valid input for RAM1 is available at the 15th clock, so the RAM1 enable is active after 15 clocks. After this, the write operation continues for 64 clocks. At the 65th clock, since z_out is continuous, we get the next valid z_out_00. This second set of valid 1D-DCT coefficients is written into RAM2, which is enabled at 15 + 64 clocks. So at the 65th clock, RAM1 goes into read mode for the next 64 clocks and RAM2 is in write mode. The two RAMs alternate between read and write every 64 clock cycles.
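The alternation between the two RAMs is a double-buffering (ping-pong) scheme. The following Python model is only a sketch of that behaviour: the class name PingPongRam and its methods are hypothetical, and the start-up latency before the first valid write is ignored.

```python
BLOCK = 64  # one 8x8 block of intermediate values


class PingPongRam:
    """Two 64-entry buffers that swap read/write roles every 64 writes."""

    def __init__(self):
        self.rams = [[None] * BLOCK, [None] * BLOCK]
        self.write_sel = 0  # index of the RAM currently in write mode
        self.count = 0      # writes into the current block so far

    def write(self, value):
        """Store one value and return the index of the RAM that was written."""
        used = self.write_sel
        self.rams[used][self.count] = value
        self.count += 1
        if self.count == BLOCK:  # block complete: swap roles
            self.count = 0
            self.write_sel ^= 1
        return used

    def read_ram(self):
        """Contents of the RAM currently in read mode."""
        return self.rams[self.write_sel ^ 1]


ram = PingPongRam()
written_to = [ram.write(v) for v in range(3 * BLOCK)]
```

Because one buffer is always free for writing, the first 1-D stage never has to stall while the second stage reads the previous block.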


2nd 1D-DCT section

After the first 77 clocks, when RAM1 is full, the second set of 1D calculations can start. The second 1D implementation is the same as the first, with the inputs now coming from either RAM1 or RAM2. The inputs are read one column at a time, in the order z00 to z70, z01 to z71, and so on up to z77. The outputs from the adder in the second section are the 2D-DCT coefficients.
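Reading the stored values column by column is what performs the transpose between the two 1-D passes. A small Python sketch of the read-address sequence follows; it assumes the intermediate values were written in row-major order at address i*8 + j, which is an assumption of this sketch and is not stated in the report.

```python
N = 8


def write_address(i, j):
    """Row-major address at which z[i][j] is assumed to be stored."""
    return i * N + j


def column_read_addresses():
    """Addresses in the order z00, z10, ..., z70, z01, ..., z77."""
    return [write_address(i, j) for j in range(N) for i in range(N)]


addrs = column_read_addresses()
first_column = addrs[:N]
```

Under this layout the read address simply steps by 8 within a column and wraps to the next column after 8 reads.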

1st 1D-IDCT section

The input signals are taken one pixel at a time in the order x00 to x07, x10 to x17, and so on up to x77. These inputs are fed into an 8-bit shift register. The outputs of the shift register are registered at every 8th clock. This allows 8 pixels (one row) to be registered at a time. The pixels are fed into a multiplier whose other input is connected to stored values in registers that act as memory. The outputs of the 8 multipliers are added on every CLOCK in the final adder. The output of the adder, z_out, is the sequence of 1D-IDCT values, given out in the order in which the inputs were read in. It takes 8 clocks to read in the first set of inputs, 1 clock to get the absolute value of the input, 1 clock for the multiplication, and 2 clocks for the final addition, for a total of 12 clocks to get the first z_out value. Every subsequent clock gives out the next z_out value, so getting all 64 values takes 12 + 64 = 76 clocks.
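The inverse stage has no add/subtract pairing in front of the multipliers, so each output is a sum of 8 products. The Python sketch below models that multiply-accumulate structure and checks that it undoes a forward 1-D DCT. It assumes orthonormal scaling and floating-point arithmetic, and the function names are illustrative only.

```python
import math

N = 8


def coeff(k, n):
    """Orthonormal DCT-II coefficient for frequency k and sample n."""
    scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    return scale * math.cos((2 * n + 1) * k * math.pi / (2 * N))


def dct_1d(row):
    return [sum(row[n] * coeff(k, n) for n in range(N)) for k in range(N)]


def idct_1d(freq):
    """Each output sample is the sum of 8 coefficient products."""
    out = []
    for n in range(N):
        products = [freq[k] * coeff(k, n) for k in range(N)]  # 8 multipliers
        out.append(sum(products))  # final adder
    return out


row = [52, 55, 61, 66, 70, 61, 64, 73]  # made-up pixel row
recovered = idct_1d(dct_1d(row))
round_trip_error = max(abs(a - b) for a, b in zip(recovered, row))
```

With orthonormal scaling the inverse uses the same 64 coefficients as the forward transform, indexed the other way round.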

Storage / RAM section

The outputs z_out of the adder are stored in RAMs. Two RAMs are used so that data can be written continuously. The first valid input for RAM1 is available at the 12th clock, so the RAM1 enable is active after 11 clocks. After this, the write operation continues for 64 clocks. At the 65th clock, since z_out is continuous, we get the next valid z_out_00. This second set of valid 1D-IDCT values is written into RAM2, which is enabled at 12 + 64 clocks. So at the 65th clock, RAM1 goes into read mode for the next 64 clocks and RAM2 is in write mode. After this, read and write switch between the two RAMs every 64 clocks.

2nd 1D-IDCT section

After the first 76 clocks, when RAM1 is full, the second set of 1D calculations can start. The second 1D implementation is the same as the first, with the inputs now coming from either RAM1 or RAM2. The inputs are read one column at a time, in the order z00 to z70, z01 to z71, and so on up to z77. The outputs from the adder in the second section are the 2D-IDCT values.

2 Parallel In

1st 1D section

The input signals are taken 2 pixels at a time in the order x00:x01, x02:x03, and so on up to x06:x07. A divide-by-4 clock is used to clock in 4 sets of 2 pixels to get 8 pixels. The pixels are paired up in an adder/subtractor in the order xk0,xk7 : xk1,xk6 : xk2,xk5 : xk3,xk4. The adder/subtractor is tied to CLOCK. On every clock, the adder/subtractor module does 4 additions and 4 subtractions. The output of the add/sub is fed into a multiplier whose other input is connected to stored values in registers that act as memory. The outputs of the 8 multipliers are added on every CLOCK in the final adder. The output of the adder, z_out, is the sequence of 1D-DCT values, given out in the order in which the inputs were read in.
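The input stage of each design point differs only in how many pixels arrive per clock and therefore how many clocks are needed to assemble one row of 8. A short Python sketch of that grouping follows; the helper name assemble_rows is hypothetical and the pixel values are placeholders.

```python
ROW = 8


def assemble_rows(stream, width):
    """Group a pixel stream arriving `width` pixels per clock into rows of 8.

    Returns (rows, clocks_per_row).
    """
    if ROW % width != 0:
        raise ValueError("width must divide 8")
    clocks_per_row = ROW // width
    # One word is what arrives on a single clock.
    words = [stream[i:i + width] for i in range(0, len(stream), width)]
    rows = []
    for r in range(0, len(words), clocks_per_row):
        rows.append([p for word in words[r:r + clocks_per_row] for p in word])
    return rows, clocks_per_row


pixels = list(range(64))  # placeholder 8x8 block, row-major
rows2, clocks2 = assemble_rows(pixels, 2)
rows1, clocks1 = assemble_rows(pixels, 1)
rows8, clocks8 = assemble_rows(pixels, 8)
```

The clocks-per-row figure corresponds to the clock divider used to register a full row: divide by 8 for Serial In, divide by 4 for 2 Parallel In, and so on.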

The difference from the Serial In implementation is that it takes 4 clocks to register the inputs and perform sign extension, 1 clock for the add/sub, 1 clock to separate the sign and absolute value, 1 clock for the multiplication, and 2 clocks for the final adder, for a total of 9 clocks to get the first z_out value. Every subsequent clock gives out the next z_out value, so getting all 64 values takes 9 + 63 = 72 clocks.

The remaining portions of the DCT/IDCT computation are similar to the Serial In implementation.

4 Parallel In

The input signals are taken 4 pixels at a time in the order x00:x03 and x04:x07. A divide-by-2 clock is used to clock in 2 sets of 4 pixels to get 8 pixels. The pixels are paired up in an adder/subtractor in the order xk0,xk7 : xk1,xk6 : xk2,xk5 : xk3,xk4. The adder/subtractor is tied to CLOCK. On every clock, the adder/subtractor module does 4 additions and 4 subtractions. The output of the add/sub is fed into a multiplier whose other input is connected to stored values in registers that act as memory. The outputs of the 8 multipliers are added on every CLOCK in the final adder. The output of the adder, z_out, is the sequence of 1D-DCT values, given out in the order in which the inputs were read in.

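In this implementation, it takes 2 clocks to register the inputs and perform sign extension, 1 clock for the add/sub, 1 clock to separate the sign and absolute value, 1 clock for the multiplication, and 2 clocks for the final adder, for a total of 7 clocks to get the first z_out value. Every subsequent clock gives out the next z_out value, so getting all 64 values takes 7 + 63 = 70 clocks.

The remaining portions of the DCT/IDCT computation are similar to the Serial In implementation.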

8 Parallel In

The input signals are taken 8 pixels at a time in the order x00:x07. The pixels are paired up in an adder/subtractor in the order xk0,xk7 : xk1,xk6 : xk2,xk5 : xk3,xk4. The adder/subtractor is tied to CLOCK. On every clock, the adder/subtractor module does 4 additions and 4 subtractions. The output of the add/sub is fed into a multiplier whose other input is connected to stored values in registers that act as memory. The outputs of the 8 multipliers are added on every CLOCK in the final adder. The output of the adder, z_out, is the sequence of 1D-DCT values, given out in the order in which the inputs were read in.

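In this implementation, it takes 1 clock to register the inputs and perform sign extension, 1 clock for the add/sub, 1 clock to separate the sign and absolute value, 1 clock for the multiplication, and 2 clocks for the final adder, for a total of 6 clocks to get the first z_out value. Every subsequent clock gives out the next z_out value, so getting all 64 values takes 6 + 63 = 69 clocks.

The remaining portions of the DCT/IDCT computation are similar to the Serial In implementation.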
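The first-output latencies of the four front ends follow one pattern: the clocks needed to bring in a row of 8 pixels, plus a fixed set of pipeline stages. The Python sketch below redoes that arithmetic. The stage list is taken from the Serial In and 2 Parallel In descriptions; applying the same stages to the 4 Parallel In and 8 Parallel In designs is an assumption of this sketch, although the resulting totals match the 1D DCT cycle counts in Table 1. The function names are illustrative.

```python
# Fixed pipeline stages after the inputs are registered (clocks per stage).
FIXED_STAGES = {
    "add/sub": 1,
    "sign + absolute value": 1,
    "multiply": 1,
    "final adder": 2,
}


def first_output_latency(input_clocks, extra_register_clock=0):
    """Clocks until the first z_out value appears."""
    return input_clocks + extra_register_clock + sum(FIXED_STAGES.values())


def clocks_for_block(first_latency, outputs=64):
    """One output per clock once the pipeline is full."""
    return first_latency + (outputs - 1)


latency = {
    # Serial In spends 8 clocks shifting pixels in, then 1 more to register them.
    "Serial In": first_output_latency(8, extra_register_clock=1),
    "2 Parallel In": first_output_latency(4),
    "4 Parallel In": first_output_latency(2),
    "8 Parallel In": first_output_latency(1),
}
totals = {name: clocks_for_block(lat) for name, lat in latency.items()}
```

Only the input-loading term changes between design points, which is why the totals differ by just a few clocks out of roughly 70.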

Optimizations

One of the optimizations I included is the use of two RAMs for storage. Each RAM can store 64 values. When the first 1D-DCT value is available, the first RAM goes into write mode and remains in write mode for the next 63 clocks. Afterwards, it switches to read mode and the second RAM goes into write mode. The next set of 1D-DCT coefficients is stored in the second RAM while the values in the first RAM are used for the 2D-DCT computation. As a result, the two RAMs alternate between read and write every 64 clocks. This helps achieve a fully pipelined design.

An 8-point DCT computation needs 64 cosine coefficients to be stored. Another main optimization in my design is to use only 8 registers, which receive 8 coefficients every clock cycle. These values change every clock cycle, providing the multipliers with the appropriate DCT cosine coefficients. This effectively reduces the coefficient storage requirement to one eighth of that of conventional designs.
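The idea of 8 registers whose contents change on every clock can be illustrated with a small Python generator that produces one row of the 8x8 coefficient matrix per clock. This is a hedged sketch only: how the registers are actually refreshed in the Verilog design is not described here, and coefficient_registers is a hypothetical name. The sketch also counts the distinct coefficient magnitudes, which for the orthonormal 8-point DCT is 7.

```python
import math

N = 8


def coeff(k, n):
    """Orthonormal DCT-II coefficient for output k and input n."""
    scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    return scale * math.cos((2 * n + 1) * k * math.pi / (2 * N))


def coefficient_registers():
    """Yield, clock by clock, the 8 values held in the coefficient registers."""
    k = 0
    while True:
        yield [coeff(k, n) for n in range(N)]
        k = (k + 1) % N  # next output index needs the next matrix row


# Distinct magnitudes across all 64 matrix entries (rounded to hide float noise).
distinct = sorted({round(abs(coeff(k, n)), 6)
                   for k in range(N) for n in range(N)})
```

Since the 64 entries share only 7 magnitudes and differ otherwise in sign and position, a small constant set plus selection logic would be one plausible way to feed the 8 registers.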

Synthesis and Results

Figure 1 shows the ModelSim simulation results for the Serial In implementation of the DCT computation.

Figure 1: ModelSim simulation of the Serial In DCT computation

All four implementations were synthesized in Quartus for an Altera Cyclone IV FPGA. Some of the results obtained from Quartus are shown in Figure 2.

Figure 2: Synthesis summary of the Serial In DCT implementation

Figure 3: Combinational blocks in 4 implementations

[Chart "Combinational Blocks": axis from 5600 to 6400 combinational blocks; series: 8 Parallel, 4 Parallel In, 2 Parallel In, Serial In.]

Figure 4: Number of registers for 4 implementations

[Chart "Registers": axis from 4520 to 4720 registers; series: 8 Parallel, 4 Parallel In, 2 Parallel In, Serial In.]

Figure 5: Total Computation time for 4 implementations

S No.  Design Type    Registers  Combinational  Pins  Cycles to  Cycles to  Cycles to  Cycles to
                                 blocks               1D DCT     2D DCT     1D IDCT    2D IDCT
1      8 Parallel     4706       6390           74    69         146        161        236
2      4 Parallel In  4706       6390           42    70         147        162        237
3      2 Parallel In  4702       6380           26    72         149        164        239
4      Serial In      4587       5869           18    77         154        169        246

Table 1: Resource usage and number of cycles to compute various results at the 4 design points.

It can be noted from Figures 3, 4 and 5 that the total computation time of the Serial In implementation is 246 cycles while that of the 8 Parallel In implementation is 236 cycles, even though the hardware requirement of the Serial In implementation is considerably lower.
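The area and performance comparison between the two extreme design points can be reproduced from the Table 1 figures. The Python sketch below assumes that area is approximated by the sum of registers and combinational blocks; the report does not state how its area percentage was derived, so that proxy is an assumption of this sketch.

```python
# Figures copied from Table 1.
designs = {
    "8 Parallel In": {"registers": 4706, "comb": 6390, "cycles_2d_idct": 236},
    "Serial In": {"registers": 4587, "comb": 5869, "cycles_2d_idct": 246},
}


def percent_change(new, old):
    """Signed change of `new` relative to `old`, in percent."""
    return 100.0 * (new - old) / old


# Assumed area proxy: registers plus combinational blocks.
area = {name: d["registers"] + d["comb"] for name, d in designs.items()}

area_saving = -percent_change(area["Serial In"], area["8 Parallel In"])
slowdown = percent_change(designs["Serial In"]["cycles_2d_idct"],
                          designs["8 Parallel In"]["cycles_2d_idct"])
```

Under this proxy the Serial In design uses roughly 5.8% less area and needs roughly 4.2% more cycles than the 8 Parallel In design.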

[Figure 5 chart "Total Computation Time": cycles to 2D IDCT of an 8*8 block, axis from 230 to 246; series: 8 Parallel, 4 Parallel In, 2 Parallel In, Serial In.]

Conclusion

It can be concluded that the Serial In implementation consumes about 6% less area than the 8 Parallel In implementation at a performance degradation of only about 4%. Hence, for low-power, low-area applications that are not performance critical, the Serial In implementation should be preferred over the other implementations.

References

[1]. Dae Won Kim, Taek Won Kwon, Jung Min Seo, Jae Kun Yu, Suk Kyu Lee, Jung Hee Suk, Jun Rim Choi, "A compatible DCT/IDCT architecture using hardwired distributed arithmetic."

[2]. A. Ben Atitallah, P. Kadionik, F. Ghozzi, P. Nouel, N. Masmoudi, Ph. Marchegay, "Optimization and implementation on FPGA of the DCT/IDCT algorithm."

[3]. Muhammad Martuza, Carl McCrosky and Khan Wahid, "A fast hybrid DCT architecture supporting H.264, VC-1, MPEG-2, AVS and JPEG codecs."

[4]. Taizo Suzuki and Masaaki Ikehara, "Integer DCT Based on Direct-Lifting of DCT-IDCT for Lossless-to-Lossy Image Coding."

[5]. Hui-Cheng Hsu, Kun-Bin Lee, Nelson Yen-Chung Chang, and Tian-Sheuan Chang, "Architecture Design of Shape-Adaptive Discrete Cosine Transform and Its Inverse for MPEG-4 Video Coding."

[6]. Kibum Suh, Kyung Yuk Min, Kyeounsoo Kim, Jong-Seog Koh, Jong-Wha Chong, "A design of DPCM hybrid coding loop using single 1-D DCT in MPEG-2 video encoder."
