DSP in FPGA

Upload: selene

Post on 23-Feb-2016

Page 1: DSP in FPGA

DSP in FPGA

Page 2: DSP in FPGA

Topics

Signal Processing
FPGA Applications with DSP
DSP milestones
PDSP Architecture
PDSP vs FPGA
Example: FIR Filter
DSP on FPGA
State of the Art
Flexibility
Multi-Channel Friendly
Resources
DSP Slice
Multiplication Modes
IP Blocks
IP Block Example: FIR Filter
Considerations
When not to use Floating Point
Example FP: Adder Hardware Circuit
Constant Cache
Data-path with Constant Cache
FFT Example
Other Examples: Simulink Equalizer
Routing Challenge
Routing Resources: Altera vs. Xilinx
Example: Matrix Multiplication
Hypothesis and Rules of Thumb
Results
Paper Analysis

Page 3: DSP in FPGA

Signal Processing

Transform or manipulate an analog or digital signal.

Most frequent application: filtering.

DSP has replaced related traditional analog signal processing systems in many applications.

Page 4: DSP in FPGA

FPGA Applications

Page 5: DSP in FPGA

Milestones

Cooley and Tukey, 1965: efficient algorithm to compute the discrete Fourier transform (DFT).

PDSP, 1970: compute a (fixed-point) “multiply-and-accumulate” in only one clock cycle.

Today's PDSPs: floating-point multipliers, barrel shifters, memory banks, zero-overhead interfaces to A/D and D/A converters.

Page 6: DSP in FPGA

PDSP Architecture

Single-DSP implementations have insufficient processing power for the complexity of today's systems.

Multiple-chip systems: more costly, complex and higher power requirements.

Solution: FPGAs

Page 7: DSP in FPGA

Managing Resources & Design Reliability

Page 8: DSP in FPGA

FPGA vs. PDSPs

PDSPs

RISC paradigm with MAC.
Advantage: multistage pipeline architectures can achieve MAC rates limited only by the speed of the array multiplier.
Dominate applications that require complicated algorithms (e.g. several if-then-else constructs).

FPGA

Implement MAC at higher cost.
High-bandwidth signal processing applications through multiple MAC cells on one chip.
Algorithms: CORDIC, NTT or error-correction algorithms.
Dominate more front-end (sensor) applications: FIR filters, CORDIC algorithms, FFTs.

Page 9: DSP in FPGA

FPGA Advantages

1. Ability to tailor the implementation to match system requirements.

2. Multiple-channel or high-speed systems: take advantage of the parallelism within the device to maximize performance.

3. Control logic implemented in hardware.

Page 10: DSP in FPGA

FIR Filter Example
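
As a minimal sketch of what an FIR filter computes, assuming integer samples and coefficients: every output is a chain of multiply-and-accumulate (MAC) steps over the most recent inputs. The function name `fir_filter` and the delay-line layout are illustrative choices, not taken from the slides.

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR: each output is a sum of coefficient * delayed sample."""
    delay_line = [0] * len(coeffs)       # most recent sample first
    outputs = []
    for x in samples:
        delay_line = [x] + delay_line[:-1]   # shift the new sample in
        acc = 0
        for c, d in zip(coeffs, delay_line):
            acc += c * d                     # one multiply-and-accumulate (MAC)
        outputs.append(acc)
    return outputs


# Feeding a unit impulse returns the coefficients themselves.
impulse_response = fir_filter([1, 0, 0, 0], [3, 2, 1])
```

In hardware, the inner loop is what gets mapped either onto one MAC unit reused over several clock cycles or onto several MAC units working in parallel.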

Page 11: DSP in FPGA

FPGA

Page 12: DSP in FPGA

State of the Art (Xilinx)

Page 13: DSP in FPGA

Flexibility

How many MACs do you need? For example, in an FIR filter, FPGAs can meet various throughput requirements.
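
The question "how many MACs do you need?" can be answered with simple arithmetic. The sketch below assumes one output per input sample and one multiplication per tap per output; the function name and the example rates are hypothetical.

```python
def macs_needed(taps, sample_rate_hz, clock_hz):
    """Number of parallel MAC units needed to keep up with the input rate."""
    macs_per_second = taps * sample_rate_hz      # work the filter must do
    return -(-macs_per_second // clock_hz)       # ceiling division on integers


# A slow channel fits in one time-shared MAC; a fast one needs many in parallel.
slow = macs_needed(taps=64, sample_rate_hz=1_000_000, clock_hz=256_000_000)
fast = macs_needed(taps=64, sample_rate_hz=100_000_000, clock_hz=400_000_000)
```

The same filter therefore scales from a single reused MAC to a fully parallel structure, depending only on the ratio of sample rate to clock rate.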

Page 14: DSP in FPGA

Multi-Channel Friendly

Parallelism enables efficient implementation of multiple channels in a single FPGA.

Many low-sample-rate channels can be multiplexed and processed at a higher rate.
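
A minimal sketch of the multiplexing idea, assuming the channels arrive interleaved sample by sample and share one set of coefficients. One filter datapath is reused for every channel, and only the delay line is kept per channel. The name `tdm_fir` is illustrative.

```python
from collections import deque


def tdm_fir(interleaved, coeffs, channels):
    """Filter several interleaved channels with one shared FIR datapath."""
    # One delay line per channel; the arithmetic is shared.
    lines = [deque([0] * len(coeffs), maxlen=len(coeffs)) for _ in range(channels)]
    out = []
    for i, sample in enumerate(interleaved):
        line = lines[i % channels]       # select the state of the current channel
        line.appendleft(sample)          # oldest sample drops off the far end
        out.append(sum(c * x for c, x in zip(coeffs, line)))
    return out


# Two channels, [1, 2, 3] and [10, 20, 30], interleaved into one stream.
result = tdm_fir([1, 10, 2, 20, 3, 30], coeffs=[1, 1], channels=2)
```

The shared datapath has to run at the channel count times the per-channel sample rate, which is the "higher rate" the slide refers to.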

Page 15: DSP in FPGA

Resources

Challenge: how to use the available resources in the most efficient manner?

Page 16: DSP in FPGA

DSP48E1 Slice

2 DSP48E1 slices per tile
Column structure to avoid routing delay
Pre-adder, 25x18-bit multiplier, accumulator
Pattern detect, logic operation, convergent/symmetric rounding
638 MHz Fmax
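
One common use of a pre-adder in front of the multiplier is a symmetric FIR filter: two samples that share a coefficient are added first, so only one multiplication is needed per coefficient pair. The sketch below assumes symmetric coefficients and models the arithmetic only, not the slice itself; both function names are illustrative.

```python
def fir_direct(window, coeffs):
    """Reference: one multiplication per tap."""
    return sum(c * x for c, x in zip(coeffs, window))


def fir_preadd(window, coeffs):
    """Symmetric FIR: pre-add mirrored samples, then multiply once per pair."""
    n = len(coeffs)
    acc = 0
    for k in range(n // 2):
        acc += coeffs[k] * (window[k] + window[n - 1 - k])   # pre-add, multiply, accumulate
    if n % 2:                                                # middle tap has no partner
        acc += coeffs[n // 2] * window[n // 2]
    return acc


coeffs = [1, 2, 3, 2, 1]          # symmetric: coeffs[k] == coeffs[n-1-k]
window = [5, -1, 4, 7, 2]
same = fir_direct(window, coeffs) == fir_preadd(window, coeffs)
```

For five taps this uses three multiplications instead of five, which is why the pre-adder roughly halves the multiplier count for symmetric filters.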

Flexibility

Page 17: DSP in FPGA

Multiplication Modes

Each DSP block in a Stratix device can implement:

Four 18x18-bit multiplications,
Eight 9x9-bit multiplications, or
One 36x36-bit multiplication.

While configured in the 36x36 mode, the DSP block can also perform floating-point arithmetic.
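
The relation between the modes can be illustrated with the schoolbook decomposition of one 36x36-bit product into four 18x18-bit partial products. This is a sketch of the arithmetic only, assuming unsigned operands; sign handling and the actual wiring inside the block are omitted.

```python
MASK18 = (1 << 18) - 1


def mul36_from_18(a, b):
    """Build a 36x36-bit unsigned product from four 18x18-bit products."""
    a_hi, a_lo = a >> 18, a & MASK18
    b_hi, b_lo = b >> 18, b & MASK18
    # Four 18x18 partial products, shifted to their weights and summed.
    return ((a_hi * b_hi) << 36) + ((a_hi * b_lo + a_lo * b_hi) << 18) + a_lo * b_lo


a = (1 << 36) - 1          # largest 36-bit value
b = 0x9ABCDEF12            # arbitrary 36-bit value
matches = mul36_from_18(a, b) == a * b
```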

Page 18: DSP in FPGA

DSP IP Portfolio

Comprehensive

Constraint Driven

Page 19: DSP in FPGA

IP Block Example

Overclocking automatically used to reduce DSP slice count.
Quick estimates provided by the IP compiler GUI.
Ensures best results for your design requirements.

Page 20: DSP in FPGA

Altera: DFPAU

Floating Point Arithmetic Coprocessor.

Replaces C software functions by fast hardware operations – accelerates system performance

Uses specialized algorithms to compute arithmetic functions

Page 21: DSP in FPGA

Altera: DFPAU

Page 22: DSP in FPGA

Hardware circuit for FP adder

Breaking up a number into exponent and mantissa requires pre- and post-processing.

Comprises:
Alignment (100 ALMs)
Operation (21 ALMs)
Normalization (81 ALMs)
Rounding (50 ALMs)

Normalization and rounding together occupy half of the circuit area

How to improve this?
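
To make the four stages concrete, here is a toy floating-point adder, a sketch under strong assumptions: positive operands only, values given as (mantissa, exponent) pairs meaning mantissa * 2**exponent, an 8-bit mantissa, three guard bits and round-half-up. None of these choices come from the slides. The ALM counts are the ones listed above.

```python
AREA_ALMS = {"alignment": 100, "operation": 21, "normalization": 81, "rounding": 50}


def fp_add(a, b, p=8, guard=3):
    """Add two positive toy floats (mantissa, exponent) with p-bit mantissas."""
    (ma, ea), (mb, eb) = a, b
    if ea < eb:                                   # keep the larger exponent in a
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    # Alignment: shift the smaller operand right by the exponent difference.
    ma <<= guard
    mb = (mb << guard) >> (ea - eb)
    # Operation: plain integer addition of the aligned mantissas.
    m, e = ma + mb, ea
    # Normalization: bring the sum back into the p-bit (plus guard) range.
    while m >= 1 << (p + guard):
        m >>= 1
        e += 1
    # Rounding: drop the guard bits, rounding half up.
    m = (m + (1 << (guard - 1))) >> guard
    if m >= 1 << p:                               # rounding overflowed the mantissa
        m >>= 1
        e += 1
    return m, e


post_processing_share = (AREA_ALMS["normalization"] + AREA_ALMS["rounding"]) / sum(AREA_ALMS.values())
```

With the listed figures, normalization and rounding take 131 of 252 ALMs, which matches the "half of the circuit area" statement. That is the part a fused datapath can save by normalizing and rounding once at the end of a chain of operations rather than after each one.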

Page 23: DSP in FPGA

When not to use Floating Point?

Algorithms designed for fixed point: greater precision and dynamic range are not helpful because the algorithms are bit exact.

E.g. the transform used to go to the frequency domain in video codecs is some form of DCT (Discrete Cosine Transform). Such transforms are designed to be performed on a fixed-point processor and are bit exact.

Also, when precision is not as important as speed
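
A small illustration of what "bit exact" means in practice, assuming a Q15 fixed-point format (15 fractional bits) chosen only for this example. Floating-point addition can give different bits depending on evaluation order, while integer (fixed-point) addition cannot.

```python
Q = 15  # number of fractional bits (Q15 is an assumption for this sketch)


def to_q15(x):
    """Quantize a real value to a Q15 integer."""
    return round(x * (1 << Q))


values = [0.1, 0.2, 0.3]

# Floating point: the result depends on the order of the additions.
float_lr = (values[0] + values[1]) + values[2]
float_rl = values[0] + (values[1] + values[2])

# Fixed point: integer addition is associative, so every order gives the same bits.
fixed = [to_q15(v) for v in values]
fixed_lr = (fixed[0] + fixed[1]) + fixed[2]
fixed_rl = fixed[0] + (fixed[1] + fixed[2])
```

A decoder that must reproduce an encoder's arithmetic exactly therefore gains nothing from a floating-point datapath, and a parallel implementation is free to reorder fixed-point sums.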

Page 24: DSP in FPGA

Constant Cache

Some applications load data from memory once and reuse it frequently. This could pose a bottleneck on performance.

What can we do? Copying data to local memory may not be enough, as each work group would have to perform the copy operation.

Solution: create a constant cache that only loads data when it is not already present in the cache, regardless of which work group requires the data.

E.g. FFT
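
The behaviour described above can be sketched in software as a shared lookup that goes to memory only on a miss. The class name `ConstantCache`, the loader function and the use of FFT twiddle factors as the constant data are assumptions made for this example.

```python
import cmath


class ConstantCache:
    """Loads a constant from memory only the first time any work group asks for it."""

    def __init__(self, load_from_memory):
        self._load = load_from_memory
        self._lines = {}
        self.misses = 0

    def read(self, address):
        if address not in self._lines:          # miss: fetch once and keep
            self.misses += 1
            self._lines[address] = self._load(address)
        return self._lines[address]             # hit: no memory access


N = 16


def load_twiddle(k):
    """Stand-in for a slow memory read of an FFT twiddle factor."""
    return cmath.exp(-2j * cmath.pi * k / N)


cache = ConstantCache(load_twiddle)
for workgroup in range(4):                      # four work groups share one cache
    for k in range(N // 2):
        cache.read(k)
```

Four work groups issue 32 reads in total, but only the first eight reach memory. With a per-work-group copy, every work group would have repeated the eight loads.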

Page 25: DSP in FPGA

Datapath with a Constant Cache

Page 26: DSP in FPGA

Example FFT

Large computation, can be pre-computed

Page 27: DSP in FPGA

Equalizer Example

Page 28: DSP in FPGA

Routing Challenge

Page 29: DSP in FPGA

Routing Challenge

Designed performance is achieved only when the data sets are readily accessed from fast on-chip SRAMs.

For large data sets, the main performance bottleneck is the off-chip memory bandwidth.

With DRAM, you can process data in stages, with only the portion of the data set that fits on chip operated on at a time. Available memory bandwidth determines performance.
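
Processing in stages can be sketched with a blocked (tiled) matrix multiplication, where only three small tiles need to be resident at any time. This assumes square matrices stored as lists of lists; the tile size stands in for whatever fits in on-chip SRAM.

```python
def blocked_matmul(a, b, tile):
    """Multiply square matrices tile by tile, so only small blocks are live at once."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Only the (i0,k0) tile of a, the (k0,j0) tile of b and the
                # (i0,j0) tile of c would have to be held on chip here.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for k in range(k0, min(k0 + tile, n)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
product = blocked_matmul(a, b, tile=2)
```

Larger tiles mean each value fetched from external memory is reused more often, which is how the tile size trades on-chip storage against off-chip bandwidth.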

Page 30: DSP in FPGA

Routing Resources

Xilinx: more local routing resources.

Synergistic with DSP because most DSP algorithms process data locally.

Altera: wide buses.

Also has value, because normally wide data vectors with 16 to 32 bits must be moved to the next DSP block.

Page 31: DSP in FPGA

Example: Matrix Multiplication

Double-precision FP cores (64 bits).

Matrix operations require all matrix element calculations to complete at the same time.

These parallelized or “vector” operations will occur at the slowest clock speed of all the FP functions in the FPGA.

Page 32: DSP in FPGA

Routing Challenge

Hypothesis (constrained performance prediction):

Estimated 15% of logic unusable (due to data-path routing, routing constraints, etc.).
Estimated 33% decrease in FP function clock speed.
Extra 24,000 ALUs for local SRAM memory controller and processor interface.

39 adders (+), 39 multipliers (×)
Clock speed: 200 MHz
Performance: 15.7 GFLOPS

Peak: 300 MHz, 25.5 GFLOPS
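
The figures in the hypothesis can be cross-checked with a few lines of arithmetic, assuming each adder or multiplier core delivers one floating-point operation per clock cycle. The constant and function names are illustrative.

```python
PEAK_CLOCK_MHZ = 300
CLOCK_DERATING = 0.33          # estimated decrease in FP function clock speed
ADDERS = 39
MULTIPLIERS = 39


def derated_clock_mhz(peak_mhz, derating):
    """Clock speed left after the estimated routing penalty."""
    return peak_mhz * (1 - derating)


def gflops(clock_mhz, operators):
    """One operation per operator per cycle, expressed in GFLOPS."""
    return clock_mhz * operators / 1000


clock = derated_clock_mhz(PEAK_CLOCK_MHZ, CLOCK_DERATING)    # about 201 MHz
predicted = gflops(200, ADDERS + MULTIPLIERS)                # 78 operators at 200 MHz
```

The derated clock comes out at about 201 MHz, consistent with the 200 MHz on the slide. 78 operators at 200 MHz give 15.6 GFLOPS, slightly below the quoted 15.7 GFLOPS; the difference is presumably rounding in the source. By the same formula, the 25.5 GFLOPS peak corresponds to 85 operators at 300 MHz.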

Page 33: DSP in FPGA

Routing Challenge

Considerations:

Latency of transfer of the A and B matrices from the microprocessor to local FPGA SRAM is not included in the benchmark time.

Challenge when using all double-precision FP cores: feeding them with data on every clock cycle.

When dealing with double-precision 64-bit data, and parallelizing many FP arithmetic cores, wide internal memory interfaces are needed.
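
A rough estimate shows why the interfaces must be wide. The sketch assumes the worst case in which every core needs two fresh 64-bit operands on every cycle with no operand reuse; the core count of 40 and the 200 MHz clock are example values.

```python
def operand_bits_per_cycle(cores, operands_per_core=2, word_bits=64):
    """Bits that must arrive every clock cycle to keep all cores busy."""
    return cores * operands_per_core * word_bits


def bandwidth_gbytes_per_s(bits_per_cycle, clock_mhz):
    """Convert a per-cycle bit count into gigabytes per second."""
    return bits_per_cycle * clock_mhz * 1e6 / 8 / 1e9


bits = operand_bits_per_cycle(cores=40)               # 5120 bits per cycle
bandwidth = bandwidth_gbytes_per_s(bits, clock_mhz=200)
```

Under these assumptions the multipliers alone would need a 5120-bit-wide data path, or 128 GB/s at 200 MHz. Reusing operands held in on-chip memory is what brings the real requirement down.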

Page 34: DSP in FPGA

Routing Challenge: Results

Average sustained throughput: 88 percent.

40 multiply and 40 adder tree cores: result every clock cycle.
Five additional adder cores used for blocking implementation: one value per clock cycle.

The GFLOPS calculation then is 200 MHz * 81 operators * 88 percent duty cycle = 14.25 GFLOPS.

Lower than expectation, due to the time needed to read and write values to the external SRAM.

With multiple SRAM banks providing higher memory bandwidth, the GFLOPS would be closer to the 15.7 GFLOPS number.

Power: the expected 15 GFLOPS performance of the Stratix EP2S180 FPGA running at 30 W is close to the sustained performance of a 60-W 3-GHz Intel Woodcrest CPU.
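
The GFLOPS and power figures above can be reproduced directly. The sketch uses only numbers quoted on the slide; the function names are illustrative.

```python
def sustained_gflops(clock_mhz, operators, duty_cycle):
    """Clock rate times operator count times the fraction of cycles doing useful work."""
    return clock_mhz * operators * duty_cycle / 1000


def gflops_per_watt(gflops, watts):
    return gflops / watts


measured = sustained_gflops(clock_mhz=200, operators=81, duty_cycle=0.88)
fpga_efficiency = gflops_per_watt(15, 30)     # expected 15 GFLOPS at 30 W
```

The product is about 14.26 GFLOPS, which the slide quotes as 14.25. At 30 W the FPGA delivers about 0.5 GFLOPS per watt, roughly twice the figure of a 60 W CPU with similar sustained throughput.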

Page 35: DSP in FPGA

I.S. Uzun, A. Amira and A. Bouridane

FPGA implementations of fast Fourier transforms for real-time signal & image processing

Page 40: DSP in FPGA

Functional block diagram of 1-D FFT processor architecture

Page 43: DSP in FPGA

AGU: Radix-2 DIF FFT

w_s := 1
for stage := log2(N) to 1 step -1 {    \\ stage loop
    m := 2^stage
    is := m/2
    w_index0 := 0
    for group := 0 to n - m step m {    \\ group loop
        for bfi := 0 to is - 1 {    \\ butterfly loop
            Index0 := r + j
            Index1 := Index0 + is;
        }
        w_index0 := w_index0 + w_s;
    }
    w_s := w_s << 1
}

Source: IEE Proc.-Vis. Image Signal Process., Vol. 152, No. 3, June 2005, p. 295
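
The address generation can be exercised in software. The sketch below follows the same three-loop structure (stage, group, butterfly) but makes its own assumptions where the listing is unclear: `index0` is taken as `group + bfi`, the twiddle index is advanced per butterfly, and the result is reordered by bit reversal at the end. It is a textbook radix-2 DIF FFT, not the authors' implementation.

```python
import cmath


def dif_addresses(n):
    """Yield (index0, index1, twiddle_index) for each butterfly; n is a power of two."""
    stages = n.bit_length() - 1              # log2(n)
    w_step = 1                               # twiddle stride, doubles every stage
    for stage in range(stages, 0, -1):       # stage loop
        m = 1 << stage
        half = m // 2
        for group in range(0, n, m):         # group loop
            for bfi in range(half):          # butterfly loop
                index0 = group + bfi
                yield index0, index0 + half, bfi * w_step
        w_step <<= 1


def fft_dif(x):
    """In-place radix-2 DIF FFT driven by the address generator (len(x) >= 2)."""
    n = len(x)
    data = [complex(v) for v in x]
    twiddles = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
    for i0, i1, tw in dif_addresses(n):
        a, b = data[i0], data[i1]
        data[i0] = a + b                     # upper butterfly output
        data[i1] = (a - b) * twiddles[tw]    # lower output, rotated by the twiddle
    bits = n.bit_length() - 1
    # DIF leaves the spectrum in bit-reversed order; undo that here.
    return [data[int(format(k, f"0{bits}b")[::-1], 2)] for k in range(n)]


spectrum = fft_dif([1, 2, 3, 4, 0, 0, 0, 0])
```

Separating `dif_addresses` from the butterfly arithmetic mirrors the hardware split between the address generation unit and the butterfly datapath.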
