digital signal processing on reconfigurable...

Digital Signal Processing on Reconfigurable Computing Systems

Oliver Liu

ENGG*6090: Reconfigurable Computing Systems

Winter 2007

References

• Maya B. Gokhale, Reconfigurable Computing: Accelerating Computation with Field-Programmable Gate Arrays, Chapter 5.
• C. Maxfield, The Design Warrior's Guide to FPGAs, Chapter 12.
• Andrew Y. Lin, Implementation Consideration for FPGA-Based Adaptive Transversal Filter Design. Master's Thesis, University of Florida, 2003.
• Ali M. Al-Haj, Fast Discrete Wavelet Transformation Using FPGAs and Distributed Arithmetic. Department of Electronics Engineering, Princess Sumaya University for Technology, Al-Jubeiha P.O. Box 1438, Amman 11941, Jordan, 2003.

Introduction

Why Use Reconfigurable Computing for DSP?
• Advantages and disadvantages of RC for DSP.
• Explorations in parallel DSP processing in FPGAs.

Some Basic DSP Application Building Blocks
• MAC: multiply-accumulate unit for DSP.
• Bit-serial adder, parallel distributed arithmetic multiplier.
• DSP components.

Some FPGA-Centric DSP Design Tools
• Assembly, C/C++, Handel-C, RTL, Xilinx Core Generator, MATLAB/Simulink, Xilinx System Generator

Advantages and Disadvantages of RC for DSP (1)

Technology   Performance   Cost     Power        Flexibility   Memory BW   I/O BW
GPP          Low           Low      High         High          Low         Low
PDSP         Med-High      Medium   Medium       Medium        Medium      Low
ASIC         High          High     Low          Low           High        High
FPGA         Medium        Low      Low-Medium   High          High        High

Advantages and Disadvantages of RC for DSP (2)

Advantages
• Parallel processing capability achieves high performance.
• A flexible architecture reduces the risk of product development.
• The design can be changed during the evolution of the product.
• Word widths can be flexible.
• Lower power than a DSP processor.
• Price is becoming lower.

Disadvantages
• Power consumption is higher and performance is lower than for an ASIC.

Explorations in Parallel DSP Processing in Reconfigurable Computing Systems (1)

[Figure: two register-based datapaths. In each, Data In feeds a chain of registers (Reg0 to Reg3) with coefficients a(0) to a(3) and produces Data Out; the second datapath contains additional registers.]

Explorations in Parallel DSP Processing in Reconfigurable Computing Systems (2)

• Most DSP applications require several operations, such as FIR filters and transforms, to process each incoming data stream. This provides the potential to exploit coarse-grained parallelism in an FPGA.
• DSP applications often use fixed coefficients or constants. By "folding" the constants directly into the hardware, i.e., customizing the hardware for a given constant, the area and speed of operations can be significantly improved.
• Reconfigurable computing's ability to supply flexible and significant memory bandwidth also increases the parallelism that can be extracted in DSP applications.
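As a hedged illustration of constant folding, the sketch below multiplies by a fixed coefficient using only shifts and adds, one shifted term per set bit of the constant. This is roughly what a multiplier specialised for that constant reduces to in hardware. The function name and the example coefficient 10 are assumptions made for illustration and do not come from the slides.

```python
def make_constant_multiplier(coefficient):
    """Return a function computing x * coefficient with shifts and adds only.

    coefficient must be a non-negative integer; each set bit contributes
    one shifted copy of x, mirroring one adder input in hardware.
    """
    if coefficient < 0:
        raise ValueError("this sketch handles non-negative constants only")
    shifts = [bit for bit in range(coefficient.bit_length())
              if (coefficient >> bit) & 1]

    def multiply(x):
        total = 0
        for shift in shifts:
            total += x << shift
        return total

    multiply.adder_terms = len(shifts)   # number of shifted terms summed
    return multiply


times_10 = make_constant_multiplier(10)   # 10 = 0b1010 -> (x << 3) + (x << 1)
print(times_10(7), times_10.adder_terms)
```

A general-purpose multiplier must handle every possible second operand; the folded version only wires up the terms this particular constant needs.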

Some DSP Application Building Blocks (1)

The most commonly used DSP functions are:
• FIR (Finite Impulse Response) filters
• IIR (Infinite Impulse Response) filters
• FFT (Fast Fourier Transform)
• DCT (Discrete Cosine Transform)
• Encoder/decoder and error correction/detection functions

All of these blocks perform intensive arithmetic operations such as add, subtract, multiply, multiply-add, or multiply-accumulate.
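To make the role of multiply-accumulate concrete, here is a minimal software sketch of a direct-form FIR filter in which every output sample is a sequence of MAC operations. The function name and the test coefficients are illustrative assumptions; a hardware version would run the taps in parallel rather than in a loop.

```python
def fir_filter(samples, coefficients):
    """Direct-form FIR filter built from one multiply-accumulate per tap."""
    delay_line = [0] * len(coefficients)      # most recent sample first
    output = []
    for sample in samples:
        delay_line = [sample] + delay_line[:-1]
        accumulator = 0
        for tap, coeff in zip(delay_line, coefficients):
            accumulator += tap * coeff        # the MAC operation
        output.append(accumulator)
    return output


# Feeding a unit impulse returns the coefficients themselves.
print(fir_filter([1, 0, 0, 0, 0], [3, 2, 1]))
```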

[Figure: MAC unit (multiplier, adder, and accumulator; inputs A[n:0] and B[n:0], output Y[(2n - 1):0]) and bit-serial adder unit (inputs A and B, output Sum, with a D flip-flop having Clr and Clk inputs).]
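The bit-serial adder can be modelled in a few lines. The sketch below assumes the usual arrangement, in which a single full adder processes one bit pair per clock (least significant bit first) and the flip-flop holds the carry between clocks. That arrangement is an assumption about the figure, and the function name is hypothetical.

```python
def bit_serial_add(a, b, width):
    """Add two unsigned integers one bit per clock, LSB first.

    Returns (sum truncated to `width` bits, final carry-out).
    """
    carry = 0                 # carry flip-flop, cleared before the first clock
    result = 0
    for clock in range(width):
        bit_a = (a >> clock) & 1
        bit_b = (b >> clock) & 1
        sum_bit = bit_a ^ bit_b ^ carry
        carry = (bit_a & bit_b) | (carry & (bit_a ^ bit_b))
        result |= sum_bit << clock
    return result, carry


print(bit_serial_add(5, 3, 4))
```

The trade-off is area against latency: one full adder serves any word width, at the cost of one clock per bit.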

[Figure: 8-bit by 8-bit parallel distributed arithmetic multiplier. Input[7:0] is split into Input[7:4] and Input[3:0]; each half drives Addr[3:0] of a 16x12-bit ROM, giving partial products UPP[11:0] and LPP[11:0]. A 12-bit adder produces the Sum; LPP[3:0] is labelled separately.]
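A software model may help in reading this structure. The sketch assumes the second operand is an unsigned 8-bit constant baked into both ROMs, so that each ROM maps a 4-bit nibble to a 12-bit partial product, and that LPP[3:0] bypasses the adder to form the low nibble of the result. Both points are inferred from the signal names, and the function names are hypothetical.

```python
def build_partial_product_rom(constant):
    """16-entry ROM: address n holds n * constant (always fits in 12 bits)."""
    if not 0 <= constant <= 0xFF:
        raise ValueError("constant must be an unsigned 8-bit value")
    return [nibble * constant for nibble in range(16)]


def da_multiply(input_byte, rom):
    """Multiply an unsigned byte by the constant stored in `rom`."""
    upper_pp = rom[(input_byte >> 4) & 0xF]   # UPP[11:0], weight 16
    lower_pp = rom[input_byte & 0xF]          # LPP[11:0], weight 1
    # Adder stage: UPP plus the upper eight bits of LPP.
    high_bits = upper_pp + (lower_pp >> 4)
    # LPP[3:0] needs no addition and becomes the low nibble of the product.
    return (high_bits << 4) | (lower_pp & 0xF)


rom = build_partial_product_rom(201)
print(da_multiply(100, rom) == 100 * 201)
```

Two table lookups and one addition replace a full array multiplier, which is why this structure maps well onto LUT-based FPGA fabric.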

• Efficient memory structures (LUTs)
• Filters: IIR, FIR, LMS, etc.
• Fast Fourier Transform (FFT)
• Discrete Cosine Transform (DCT)
• Discrete Wavelet Transform (DWT)

Some FPGA-Centric DSP Design Tools and Languages

• Assembly, C, C++
• VHDL/Verilog (RTL code)
• Xilinx EDK
• Xilinx ISE
• Mentor Graphics ModelSim
• Xilinx Core Generator
• MATLAB/Simulink
• Xilinx System Generator

Topics Covered

• Implementation Consideration for FPGA-Based Adaptive Transversal Filter Design
• Fast Discrete Wavelet Transformation Using FPGAs and Distributed Arithmetic
• ENGG*6090 project status: Image Compression using Wavelet Filter Bank on Reconfigurable Computing System

Implementation Consideration for FPGA-Based Adaptive Transversal Filter Design

Andrew Y. Lin, Master's Thesis, University of Florida, 2003

Problem Statement and Purpose of the Design

Due to finite precision in digital hardware, quantization must be performed in any or all of the following areas:

• Input and reference signals;
• Product quantization in the convolution stage;
• Coefficient quantization in the adaptation stage.

Quantization noise is introduced in all of the above areas, and the thesis discusses its effects. The thesis also compares FPGAs and DSP processors in terms of speed and power consumption.

Adaptive Algorithms, Finite Precision Effects on Adaptive Algorithms (1) -- Rounding Effects

• E is the expectation of the rounding error
• X is the error caused by rounding
• q is the quantizing step
• P is the probability density function (pdf)
• σ² is the power (variance) of the rounding error
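The formulas on this slide did not survive extraction. Under the usual assumption that the rounding error x is uniformly distributed over one quantizing step q, they would read as follows. This is the standard textbook model, offered as a sketch rather than a transcription of the slide.

```latex
p(x) =
\begin{cases}
  1/q, & -q/2 < x \le q/2 \\
  0,   & \text{otherwise}
\end{cases}
\qquad
E[x] = \int_{-q/2}^{q/2} x\,p(x)\,dx = 0
\qquad
\sigma^2 = \int_{-q/2}^{q/2} x^2\,p(x)\,dx = \frac{q^2}{12}
```

Rounding is therefore unbiased: it adds noise power but no DC offset.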

Adaptive Algorithms, Finite Precision Effects on Adaptive Algorithms (2) -- Truncation Effects

• E is the expectation of the truncation error
• X is the error caused by truncation
• q is the quantizing step
• P is the probability density function (pdf)
• σ² is the power (variance) of the truncation error
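The formulas for this slide are also missing from the extracted text. Assuming two's-complement truncation, where the error x always lies within one step below zero and is uniformly distributed, the standard model gives the following. Sign-magnitude truncation behaves differently and is not covered by this sketch.

```latex
p(x) =
\begin{cases}
  1/q, & -q < x \le 0 \\
  0,   & \text{otherwise}
\end{cases}
\qquad
E[x] = \int_{-q}^{0} x\,p(x)\,dx = -\frac{q}{2}
\qquad
\sigma^2 = E\big[(x - E[x])^2\big] = \frac{q^2}{12}
```

The variance matches the rounding case, but the non-zero mean is a bias that can accumulate in recursive computations such as coefficient updates.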

Adaptive Algorithms, Finite Precision Effects on Adaptive Algorithms (3) -- Rounding Effects on LMS Filter

Input Quantization Effects (A/D)
• ε(nT) is the quantization noise

Arithmetic Rounding Effects
• Product rounding effects
• Coefficient rounding effects
• Rounding effects at the adaptation stage
• Rounding effects at the convolution stage

LMS Filter Slowdown and Stalling

Saturation
• With the clamping technique, upon detecting saturation the result is "clamped" to the most positive or most negative number, depending on the sign bit.
• Alternatively, the sign algorithm is another way to reduce or avoid stalling.
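The clamping technique can be sketched as a saturating addition, shown next to the two's-complement wrap-around it replaces. The 16-bit default width and the function names are assumptions chosen for illustration; the sign algorithm is not shown.

```python
def saturating_add(a, b, bits=16):
    """Add two signed integers and clamp the result to a `bits`-wide range."""
    most_positive = (1 << (bits - 1)) - 1
    most_negative = -(1 << (bits - 1))
    total = a + b
    if total > most_positive:
        return most_positive
    if total < most_negative:
        return most_negative
    return total


def wrapping_add(a, b, bits=16):
    """Same addition with two's-complement wrap-around, for comparison."""
    mask = (1 << bits) - 1
    total = (a + b) & mask
    return total - (1 << bits) if total >> (bits - 1) else total


print(saturating_add(30000, 10000), wrapping_add(30000, 10000))
```

On overflow the wrapped result changes sign, which would push a filter output or coefficient in the wrong direction; the clamped result stays at the nearest representable value.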

Implementation of an Integer-Based Adaptive Noise Canceller in Stratix Devices (1) -- Software Simulation

• The sampled desired signal, composed of both the speaker's speech and the vacuum noise, serves as the noise canceller's reference signal; a second, separately sampled vacuum noise serves as the filter's primary input signal.
• During processing, the vacuum noise is reduced as the filter tap weights adapt.
• The error signal produced by the adaptive system closely resembles the original speech.
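A floating-point sketch of such a noise canceller follows, using the LMS update. Everything concrete here is an assumption made for illustration: the sine wave standing in for speech, the uniform random noise, the two-tap leakage path, the filter length, and the step size. The thesis itself works with integer arithmetic, which this sketch does not model.

```python
import math
import random


def lms_noise_canceller(desired, noise_input, num_taps=8, step_size=0.01):
    """Return the error signal e[n] = desired[n] - w . recent noise samples."""
    weights = [0.0] * num_taps
    history = [0.0] * num_taps
    errors = []
    for d, x in zip(desired, noise_input):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))  # convolution stage
        error = d - estimate
        weights = [w + step_size * error * h                     # adaptation stage
                   for w, h in zip(weights, history)]
        errors.append(error)
    return errors


def mean_square(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
n = 4000
speech = [0.5 * math.sin(2 * math.pi * 0.01 * k) for k in range(n)]
noise = [random.uniform(-1.0, 1.0) for _ in range(n)]
# Assumed acoustic path: the noise reaches the speech channel scaled and delayed.
leaked = [0.8 * noise[k] + (0.4 * noise[k - 1] if k > 0 else 0.0) for k in range(n)]
desired = [s + v for s, v in zip(speech, leaked)]
cleaned = lms_noise_canceller(desired, noise)

tail = slice(n - 1000, n)   # compare after the weights have adapted
print(mean_square(desired[tail], speech[tail]), mean_square(cleaned[tail], speech[tail]))
```

The two printed values are the residual noise power before and after cancellation over the last 1000 samples; the second should be much smaller once the weights have converged.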

-- Software Simulation Results

Since quantization of the primary and reference signals is unavoidable due to A/D conversion, the only source of error that the designer can control is the product quantization noise at the convolution stage and the adaptation stage.

-- Hardware Implementation

• The newest FPGA families, for example Altera's Stratix device family, incorporate embedded DSP blocks: dedicated circuitry within the FPGA chip for common DSP operations, including multiply and accumulate.
• This family of FPGA devices is compared with another family that does not include embedded DSP blocks.
• DSP applications, including adaptive systems, have traditionally been implemented on general-purpose DSP processors because of their ability to perform fast arithmetic operations.

-- Hardware Implementation

[Figure: hardware block diagram]

Conclusion (1) -- Stratix vs Traditional FPGAs

Speed and area comparison
• Stratix: with on-chip DSP components
• APEX: traditional FPGA without DSP components

Conclusion (2) -- FPGAs vs DSP Processors

Power consumption comparison
• Stratix: with on-chip DSP components
• TMS320VC33, DSP56390: traditional DSP devices

Fast Discrete Wavelet Transformation Using FPGAs and Distributed Arithmetic

Ali M. Al-Haj, Department of Electronics Engineering, Princess Sumaya University for Technology, Al-Jubeiha P.O. Box 1438, Amman 11941, Jordan, 2003

Problem Statement and Purpose of the Design

• Programming multiprocessor systems is a tedious, difficult, and time-consuming task.
• Multiprocessor implementations of the discrete wavelet transform are not cost-effective, since parallelism comes at the expense of augmenting the system with more processing engines operating in parallel.
• Custom VLSI circuits are inherently inflexible, and their development is costly and time-consuming, so they are not an attractive option for implementing the wavelet transform.
• FPGAs maintain the advantages of the custom functionality of VLSI ASIC devices while avoiding the high development costs and the inability to make design modifications after production. Furthermore, FPGAs inherit the design flexibility and adaptability of software implementations.
• The paper's discrete wavelet transform implementation exploits the natural match between the Virtex architecture and distributed arithmetic.

Basic Wavelet Computation

[Figure: system diagram and wavelet coefficients]

Distributed Arithmetic & Virtex FPGAs (1) -- Distributed Arithmetic

Let the variable Y hold the result of an inner product operation between a data vector x and a coefficient vector a. The slide then gives the conventional representation of the inner product and its expansion over the bits of the input words.

[Equations: inner product Y and its bit-level expansion]

Here the input data words x_i are in two's complement representation, scaled to bound number growth under multiplication. The variable x_ij is the j-th bit of the word x_i and is Boolean, B is the number of bits in each input data word, and x_i0 is the sign bit.
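The equations themselves are missing from the extracted slide. With the symbols defined above, the standard distributed arithmetic formulation would be as follows. The number of terms N is an assumed symbol, and the form is the textbook one, not a transcription.

```latex
Y = \sum_{i=1}^{N} a_i x_i ,
\qquad
x_i = -x_{i0} + \sum_{j=1}^{B-1} x_{ij}\, 2^{-j}

Y = -\sum_{i=1}^{N} a_i x_{i0}
    + \sum_{j=1}^{B-1} \left[ \sum_{i=1}^{N} a_i x_{ij} \right] 2^{-j}
```

Each bracketed sum depends only on one bit from each of the N input words, so it takes one of 2^N values and can be precomputed and stored in a look-up table.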

Distributed Arithmetic & Virtex FPGAs (1) -- Distributed Arithmetic

[Figure: distributed arithmetic implemented in FPGA]
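A bit-level software model of a distributed arithmetic inner product is sketched below. It uses integer two's-complement words rather than the fractional scaling usually used in the derivation, processes one bit plane per iteration (least significant first), and gives the sign-bit plane negative weight. The function names and the 4-tap example are assumptions for illustration, not taken from the paper.

```python
def build_da_lut(coefficients):
    """LUT entry `address` = sum of the coefficients whose address bit is set."""
    n = len(coefficients)
    return [sum(c for i, c in enumerate(coefficients) if (address >> i) & 1)
            for address in range(1 << n)]


def da_inner_product(samples, lut, bits):
    """Inner product of `bits`-wide signed integers with the LUT's coefficients."""
    mask = (1 << bits) - 1
    words = [s & mask for s in samples]         # two's-complement bit patterns
    accumulator = 0
    for j in range(bits):                       # one bit plane per clock
        address = 0
        for i, word in enumerate(words):
            address |= ((word >> j) & 1) << i   # bit j of every input word
        partial = lut[address] * (1 << j)
        if j == bits - 1:
            accumulator -= partial              # sign-bit plane has negative weight
        else:
            accumulator += partial
    return accumulator


lut = build_da_lut([1, 2, 3, 4])
print(da_inner_product([3, -2, 5, -8], lut, bits=4))
```

No multiplier appears anywhere: the work is a table lookup, a shift, and an add or subtract per bit plane, which is what makes the scheme attractive for LUT-based devices.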

Distributed Arithmetic Implementation

[Figure: distributed arithmetic filter implemented in FPGA]

Functional Simulation

[Figure: forward and inverse DWT function simulation]

Performance Evaluation (1)

[Figure: speed comparison between the conventional arithmetic implementation and the distributed arithmetic implementation]

Performance Evaluation (2)

[Figure: resource usage comparison between the conventional arithmetic implementation and the distributed arithmetic implementation]

Conclusion and Further Work (1) -- Conclusions

• Four implementations were compared: two hardware implementations on highly parallel Virtex field-programmable gate array (FPGA) devices, and two software implementations, one on the TMS320C6711 digital signal processor and the other on an 800 MHz Intel Pentium III processor.
• The implementation based on the distributed arithmetic algorithm achieved the best performance.
• The two software implementations were far inferior to the FPGA implementations in terms of execution speed.
• The TMS320C6711 digital signal processor performed much better than the Pentium III; however, its performance is still much lower than that of the least efficient, direct FPGA implementation.
• Using FPGAs, coupled with reformulating the computation of the wavelet transform in accordance with the distributed arithmetic algorithm, yields the performance levels required for real-time implementations.

Conclusion and Further Work (2) -- Further Work

• Having completed the FPGA implementation of the discrete wavelet transform and its inverse, the authors are working on integrating a whole wavelet-based image compression system on a single, dynamically run-time reconfigurable FPGA.
• A typical image compression system consists of an encoder and a decoder. At the encoder side, an image is first transformed to the frequency domain using the forward discrete wavelet transform.
• The non-negligible wavelet coefficients are then quantized and finally encoded using an appropriate entropy encoder. The decoder side reverses the whole encoding procedure.
• Transforming the 2-D image data can be done simply by inserting a matrix transpose module between two 1-D discrete wavelet transform modules such as those described in the paper.
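The transpose-based 2-D transform can be sketched in software as follows. The one-level Haar filter is an assumption chosen for brevity and is not necessarily the wavelet used in the paper; the final transpose back to the original orientation is also an addition made for readability.

```python
def haar_1d(row):
    """One-level Haar analysis: averages (low-pass) then differences (high-pass)."""
    pairs = [(row[k], row[k + 1]) for k in range(0, len(row), 2)]
    low = [(a + b) / 2 for a, b in pairs]
    high = [(a - b) / 2 for a, b in pairs]
    return low + high


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def dwt_2d(image):
    """1-D transform on rows, transpose, 1-D transform again, transpose back."""
    rows_done = [haar_1d(row) for row in image]
    columns_done = [haar_1d(row) for row in transpose(rows_done)]
    return transpose(columns_done)


print(dwt_2d([[1, 2], [3, 4]]))
```

Because the same 1-D module is reused for both passes, the only extra hardware is the transpose buffer between them.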

Image Compression using Wavelet Filter Bank on Reconfigurable Computing System

Oliver Liu

ENGG*6090: Project of Reconfigurable Computing Systems

Winter 2007

Outline

• Problem Statement and Purpose of the Design
• Experiment Environment
• Transform and Coding Algorithms
• Software Implementation
• SW/HW Implementation (ongoing)
• Hardware Implementation (ongoing)
• Results
• Conclusion

Problem Statement and Purpose of the Design (1) -- Introduction

• A typical image compression system consists of an encoder and a decoder. At the encoder side, an image is first transformed to the frequency domain using the forward discrete wavelet transform.
• The non-negligible wavelet coefficients are then quantized and finally encoded using an appropriate entropy encoder. The decoder side reverses the whole encoding procedure.
• An image compression system will be implemented on a reconfigurable computing platform.

Problem Statement and Purpose of the Design (2) -- System Diagram

[Figure: system diagram. Encoder: a forward wavelet filter bank produces an HP sub-image, which goes to Huffman coding, and an LP sub-image, which goes to run-length coding. Decoder: Huffman decoding and run-length decoding recover the HP and LP sub-images, which feed the backward wavelet filter bank.]

Problem Statement and Purpose of the Design (3) -- Problem Definition

• One implementation performs the transform, quantization, and coding all in software, running on a microprocessor on an FPGA.
• Other implementations move one or all of the transform, quantization, and coding stages to hardware, with the rest running on a microprocessor on the FPGA.
• An RTOS will be used to observe the performance of the different implementations, controlled by multiple processes.

Xilinx Multimedia Board

• The on-board Xilinx Virtex-II xc2v200 is used to implement the different architectures.
• The on-board external 2M memory will be used to store the original, compressed, and decompressed images.
• The MFS file system is used to store image files.
• The Xilinx real-time operating system kernel, xilkernel, is used in this design.

Transform and Coding Algorithms (1) -- Wavelet Filter Bank

[Figure: system diagram and wavelet coefficients]

Transform and Coding Algorithms (2) -- Huffman Coding

Source   Probability   Code   Length
a1       0.20          11     2
a2       0.19          10     2
a3       0.18          011    3
a4       0.17          010    3
a5       0.15          001    3
a6       0.10          0001   4
a7       0.01          0000   4
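The construction behind such a code table can be sketched with a priority queue that repeatedly merges the two least probable subtrees. The sketch assumes the seven probabilities on this slide belong to a1 through a7 in descending order. The exact bit patterns depend on how 0 and 1 are assigned at each merge, so they may differ from the slide's, but the code lengths should not.

```python
import heapq
from itertools import count


def huffman_code(probabilities):
    """Return {symbol: bit string} for a dict {symbol: probability}."""
    tiebreak = count()                        # keeps heap entries comparable
    heap = [(p, next(tiebreak), {symbol: ""}) for symbol, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)   # least probable subtree
        p1, _, codes1 = heapq.heappop(heap)   # second least probable subtree
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]


source = {"a1": 0.20, "a2": 0.19, "a3": 0.18, "a4": 0.17,
          "a5": 0.15, "a6": 0.10, "a7": 0.01}
codes = huffman_code(source)
lengths = {s: len(c) for s, c in codes.items()}
average = sum(source[s] * lengths[s] for s in source)
print(lengths, round(average, 2))
```

With these probabilities the average code length is 2.72 bits per symbol, compared with 3 bits for a fixed-length code over seven symbols.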

Transform and Coding Algorithms (3) -- Run-Length Coding (RLC)

Consider a run of 15 'A' characters, which would normally require 15 bytes to store:

AAAAAAAAAAAAAAA

15A

With RLC, this requires only two bytes: the count (15) is stored as the first byte and the symbol (A) as the second byte.
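A minimal sketch of this scheme is given below, storing each run as a (count, symbol) pair. Capping the count at 255 so that it fits in one byte is an assumption consistent with the two-byte description above; the function names are hypothetical.

```python
def rle_encode(text):
    """Return a list of (count, symbol) pairs, with counts capped at 255."""
    runs = []
    for symbol in text:
        if runs and runs[-1][1] == symbol and runs[-1][0] < 255:
            runs[-1] = (runs[-1][0] + 1, symbol)   # extend the current run
        else:
            runs.append((1, symbol))               # start a new run
    return runs


def rle_decode(runs):
    """Expand (count, symbol) pairs back into the original string."""
    return "".join(symbol * run_length for run_length, symbol in runs)


encoded = rle_encode("A" * 15)
print(encoded, rle_decode(encoded) == "A" * 15)
```

The scheme pays off only when runs are long: data with no repeats doubles in size, which is why the system diagram applies it to just one of the two sub-images.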

Software Implementation

Hardware Implementation

Thank You

Questions?
