introduction to dsp processors - המחלקה להנדסת …adcomplab/introduction to dsp...

DSP lab 1

ד''בס

12/19/2015 1

Introduction to DSP processors

Presented by:

ד''בס

12/19/2015 2

Contents:

The modern processor’s classification;

The digital signal processing methods &

algorithms;

The D(igital) S(ignal) P(rocessing) algorithms implementation

The SHARC processor - architecture; data types & formats, C & Assembler;

Getting started.

DSP lab 2

ד''בס

12/19/2015 3

FPGA & EPLD

The modern processor’s classification.

Today chips are distributed into three groups:

ASIC’s (Application

Specific Integrated Circuits)

Chips with hardware realization

of data processing algorithms

(microprocessors & microcontrollers)

ד''בס

12/19/2015 4

The modern processor’s classification.

Microprocessors & microcontrollers

DSP microprocessors.

The processors are intended for Real-Time

Digital Signal Processing systems.

General-purpose microprocessors.

This kind of processor is intended for

computer systems: PC, workstation &

parallel supercomputer.

Microcontrollers

Very especial processors are intended for embedded systems

and in different

household devices.

DSP lab 3

ד''בס

12/19/2015 5

Review: Processor Classes

General Purpose - high performance – Pentiums, Alpha's, SPARC

– 64-128 bit word size

– Used for general purpose software

– Heavy weight OS - UNIX, NT

– Multiply layers of cache memory

– Workstations, PC's

Embedded processors and processor cores – ARM9, ARC, 486SX, Hitachi

SH7000, NEC V800

– 32 bit word size

– Single program

– Lightweight, often real-time OS

– Code and Data memory cache, DSP support

– Cellular phones, consumer electronics (e. g. CD players)

DSP processors – SHARC, BlackFin, TMS320C55x,

TMS320C67x, TMS320C64x

– 16-32 bit word size

– Single program

– Lightweight, often real-time OS

– Super Harvard Architecture Support, MAC, Circular buffer, Dual-Port RAM

– Audio, Image and Video processing, Coding and Decoding, Cellular Base Station, Adaptive Filtering, Real Time operations

Microcontrollers – PIC, AVR, HC11, ARM7, 8051,80251

– Extremely cost sensitive

– Small word size - 8 bit common

– Highest volume processors by far

– Automobiles, toasters, thermostats, ...

ד''בס

12/19/2015 7

The digital signal processing

methods & algorithms.

The analog signal processing example:

fC

fjwR

iR

fR

x(t)y(t)

11

y(t)

+

-

Ri

Rf

Cf

x(t)

f fc

DSP lab 4

ד''בס

12/19/2015 8



The digital signal processing system:

x(t)

Anti-aliasing filter

x’(t) A/D

x’(n)

D/A y’(t)

Smoothing filter y(t)

Digital filter or digital

transform

y’(n)

N

kknxk

kCny

0

fc

|H(f)|

ד''בס

The DSP vs. ASP

5th Order Analog Sallen Key Low-Pass Filter

5th Order Digital Filter of Direct Form II

12/19/2015 9

DSP lab 5

ד''בס

12/19/2015 10



Analog signal processing:

Cheaper;

More compact;

Power dissipation.

Digital signal processing:

More accurate;

More stable for different environments.

Analog signal processing versus Digital signal processing

ד''בס

12/19/2015 11



Time sampling: Amplitude quantization:

The basis concepts of DSP:

D

Df

TT

f1

or 1

FTf D

2

1or 2F

x(nT) – one sample at time T;

T – sample rate (time);

fD - sampling frequency:

Niquest frequency:

F – the highest frequency of signal.

x(nT) ~ x(n) ;

Resolution:

eQ quantization error:

NR 2

12max N

Qe

N– bit number.

DSP lab 6

ד''בס

12/19/2015 12


methods & algorithms. The basis concepts of DSP:

time.algorithm

time;sample

a

T

T

a del

.lg

;

;

1

del

orithmns in af operatio- number oN

uencyr clk freq- processof

timeoperation than

fNT

CLKCPU

op

CLKCPU

opopaa

ד''בס

12/19/2015 13

The D(igital) S(ignal) P(rocessing)

algorithms implementation

The major tasks in DSP:

Filter design (Linear Filtering);

Speech detection, image recognition (Spectral Analysis);

Image & Speech compression (Timing-Frequency Analysis);

Image & signal processing (Adaptation Filtering);

Coding, median filters (Non-Linear processing);

Interpolation & decimation (Multi Speed Processing).

DSP lab 7

ד''בס

12/19/2015 14



FIR filter

IIR filter

FFT

STD

Convolution

Correlation

The very usable DSP algorithms.

ד''בס

12/19/2015 15



FIR – filter

(Finite Impulse Response)

1

0

N

i

i inxbny

+

x(n)

+

Z-1 Z-1

+

+

y(n)

b2 b1 b0

DSP lab 8

ד''בס

12/19/2015 16



IIR – filter (Infinite Impulse Response)

1

0

1

1

N

i

M

k

ki knyainxbny

x(n)

y(n)

–

+

–

+

Z-1 Z-1

+

+

b2 b1 b0

a1 a0

ד''בס



12/19/2015 18

Discrete correlation

MmmngmfN

nyM

m

0 1

0

Discrete autocorrelation

MmmnfmfN

nyM

m

0 1

0

DSP lab 9

ד''בס



12/19/2015 20

Discrete convolution

M

Mm

M

Mm

mgmnfmngmfngnfny *

Circular discrete convolution

mngkMmfnyM

m

N

Nk

1

0

STD – standard deviation

1

0

21 N

i

iN xxN

S

ד''בס

12/19/2015 21



1,...,0,1

)(

21

0

NkenxN

kX N

iknN

n

Discrete Fourier Transform

N

ikn

kn

N eW

2

DSP lab 10

ד''בס

12/19/2015 23



X(0) = x(0)W80 + x(1)W8

0 + x(2)W80 + x(3)W8

0 + x(4)W80 + x(5)W8

0 + x(6)W80 + x(7)W8

0

X(1) = x(0)W80 + x(1)W8

1 + x(2)W82 + x(3)W8

3 + x(4)W84 + x(5)W8

5 + x(6)W86 + x(7)W8

7

X(2) = x(0)W80 + x(1)W8

2 + x(2)W84 + x(3)W8

6 + x(4)W88 + x(5)W8

10 + x(6)W812 + x(7)W8

14

X(3) = x(0)W80 + x(1)W8

3 + x(2)W86 + x(3)W8

9 + x(4)W812 + x(5)W8

15 + x(6)W818 + x(7)W8

21

X(4) = x(0)W80 + x(1)W8

4 + x(2)W88 + x(3)W8

12 + x(4)W816 + x(5)W8

20 + x(6)W824 + x(7)W8

28

X(5) = x(0)W80 + x(1)W8

5 + x(2)W810 + x(3)W8

15 + x(4)W820 + x(5)W8

25 + x(6)W830 + x(7)W8

35

X(6) = x(0)W80 + x(1)W8

6 + x(2)W812 + x(3)W8

18 + x(4)W824 + x(5)W8

30 + x(6)W836 + x(7)W8

42

X(7) = x(0)W80 + x(1)W8

7 + x(2)W814 + x(3)W8

21 + x(4)W828 + x(5)W8

35 + x(6)W842 + x(7)W8

49

1

0

)(1

)(N

n

nk

NWnxN

kXTHE 8-POINT DFT:

ד''בס

12/19/2015 24



Direct computation of the DFT is basically

inefficient because it does not exploit the symmetry

and periodicity properties of the phase factor WN. In

particular, these two properties are:

Symmetry property:

where

Periodicity property:

k

N

Nk

N WW 2/

k

N

Nk

N WW

X(7) = x(0)W80+x(1)W8

7+x(2)W86+x(3)W8

5+x(4)W84+x(5)W8

3+x(6)W82+x(7)W8

1

X(7) = x(0)W88+x(1)W8

7+x(2)W814+x(3)W8

21+x(4)W828+x(5)W8

35+x(6)W842+x(7)W8

49

N

ikn

kn

N eW

2

DSP lab 11

ד''בס

12/19/2015 25



x(7) WN

4

x(3)

x(5) WN

4

x(1)

X(7)

X(6)

X(5)

X(4)

WN0

WN0

WN6

WN4

WN2

WN0

x(6) WN

4

x(2)

x(4) WN

4

x(0)

X(3)

X(2)

X(1)

X(0)

WN0

WN0

WN6

WN4

WN2

WN0

WN7

WN6

WN5

WN4

WN3

WN2

WN1

WN0

X(7) = x(0)W80 + x(1)W8

7 + x(2)W86 + x(3)W8

13 + x(4)W84 + x(5)W8

11 + x(6)W810 + x(7)W8

17

X(7) = x(0)W88 + x(1)W8

7 + x(2)W86 + x(3)W8

5 + x(4)W84 + x(5)W8

3 + x(6)W82 + x(7)W8

1

(W88= W8

0, W813= W8

5, W811= W8

3, W810= W8

2, W817= W8

1)

X(0) + X(4)WN4

X(2) + X(6)WN4

X(0) + X(4)WN4+

X(2)WN6+ X(6)WN

10

X(1) + X(5)WN4

X(3) + X(7)WN4

X(1) + X(5)WN4+

X(3)WN6+ X(7)WN

10

ד''בס

12/19/2015 26



x(7) WN

4

x(3)

x(5) WN

4

x(1)

X(7)

X(6)

X(5)

X(4)

WN0

WN0

WN6

WN4

WN2

WN0

x(6) WN

4

x(2)

x(4) WN

4

x(0)

X(3)

X(2)

X(1)

X(0)

WN0

WN0

WN6

WN4

WN2

WN0

WN7

WN6

WN5

WN4

WN3

WN2

WN1

WN0

X(2) = x(0)W80 + x(1)W8

2 + x(2)W84 + x(3)W8

6 + x(4)W88 + x(5)W8

10 + x(6)W812 + x(7)W8

14

X(2) = x(0)W80 + x(1)W8

2 + x(2)W84 + x(3)W8

6 + x(4)W80 + x(5)W8

2 + x(6)W84 + x(7)W8

6

(W88= W8

0, W810= W8

2, W812= W8

4, W814= W8

6)

X(0) + X(4)WN0

X(2) + X(6)WN0

X(0) + X(4)WN0+

X(2)WN4+ X(6)WN

4

X(1) + X(5)WN0

X(3) + X(7)WN0

X(1) + X(5)WN0+

X(3)WN4+ X(7)WN

4

000 – 000

100 – 001

010 – 010

110 – 011

001 – 100

101 – 101

011 – 110

111 - 111

Bit reverse operations

DSP lab 12

ד''בס

12/19/2015 27

The DSP processor ‘s architecture

Requirement for DSP processors:

1. High speed input data, different interface devices;

2. Input data wide dynamic range;

3. ADD, MULT & SHIFT hardware implementation. Parallel processing;

4. Flexible processing (possibility to “jump” from one process to another);

5. Algorithm’s regularity (Operation “come back”);

DSP processors features

1. Various interface highspeed ports and timers

2. Parallel access memory architecture;

3. Three mathematical units: ALU, barrel Shifter and Multiplier with fast MAC operation (MBR = MBR + Rx * Ry);

4. Cycles, branches & interrupt fast handling. Addressing special modes;

5. Circular buffer.

ד''בס

12/19/2015 28

Data types & formats.

Data types in DSP processors algorithms:

Integer (cycles, coefficients and arrays numbers);

Real (input & output data);

Complex (applications in frequency domain);

Logic (bitwise operation).

Data format in DSP processors :

Byte – 8 bit;

Short word – 16 bit;

Normal word – 32 bit;

Instruction word – 48 bit;

Extended normal word – 40 bit;

Long word – 64 bit.

DSP lab 13

ד''בס

C vs. Assembly

DSP programs are different from traditional software tasks in two important respects. – First, the programs are usually much shorter, say, one-

hundred lines versus ten-thousand lines.

– Second, the execution speed is often a critical part of the application.

If assembly is used at all, it is restricted to short subroutines that must run with the utmost speed.

ד''בס

C vs. Assembly

Programs in C are more flexible and quick to develop.

Programs in assembly often have better performance , they run faster and use less memory, resulting in lower cost.

12/19/2015 31

DSP lab 14

ד''בס

C vs. Assembly

Which language is best for your application?

– If you need flexibility and fast development, choose C.

– use assembly if you need the best possible performance.

How complicated is the program? – If it is large and intricate use C. – If it is small and simple, assembly

may be a good choice.

Are you pushing the maximum speed of the DSP?

– If so, assembly will give you the last drop of performance from the device.

– For less demanding applications, assembly has little advantage, and you should consider using C.

How many programmers will be working together?

– If the project is large enough for more than one programmer, lean toward C and use in-line assembly only for time critical segments.

Which is more important, product cost or development cost?

– If it is product cost, choose assembly;

– If it is development cost, choose C.

What is your background? – If you are experienced in assembly

(on other microprocessors), choose assembly for your DSP.

– If your previous work is in C, choose C for your DSP.

ד''בס

12/19/2015 33

The DSP processor ‘s architecture.

“Traditional” fon Neiman architecture

Harvard architecture

CPU Memory

data & instruction

Address bus

Data bus

Program Memory instruction

only PM data bus

Data Memory

data only DM data bus CPU

DM address bus PM address bus

DSP lab 15

ד''בס

12/19/2015 34


Super Harvard architecture

I/O Controller

Data

Program Memory instruction

only PM data bus

Data Memory

data only DM data bus

CPU DM address bus PM address bus

Instruction Cache

This is SHARC DSP processor structure

ד''בס

12/19/2015 36


The ADSP-21160 hardware structure.

SERIAL PORTS

(2)

LINK PORTS

(6)

DMA

CONTROLLER

ADDR BUS

MUX

IOD

64

IOA

18

IOP

REGISTERS

6

6

6x10

4

Dual-Ported SRAM

External Port

I/O Processor

PROCESSOR

PORT I/O

PORT ADDR DATA ADDR DATA

Two Independent,

Dual-Ported Memory

Blocks

ADDR DATA ADDR DATA

MULTIPROCESSOR

32

64

HOST PORT

INTERFACE

PM Address Bus 32

DM Address Bus 32

PM Data Bus 16/32/40/48/64

DM Data Bus 32/40 64

INSTRUCTION

CACHE 32 x 48-Bit

DA G 2 8 x 4 x 32

DA G 1 8 x 4 x 32

Core Processor

PROGRAM

SEQUENCER

TIMER

Connect

Bus

(PX)

7 JTAG

Test &

Emulation

P M D

D M D

E P D

I O D

BL

OC

K 0

BL

OC

K 1

DATA BUS

MUX

MULTIPLIER BARREL

SHIFTER ALU

DATA

REGISTER

FILE

16 x 40-Bit

DSP lab 16

ד''בס

12/19/2015 37


100 MHz - 600 MFLOPS- SIMD Core

1024 point, complex FFT benchmark: 90 us

4 Mbits on chip SRAM

14 zero overhead DMA channels

Sustained 700 Mbyte/sec over IOP bus

Two 50 mbit/sec Synchronous Serial Ports

Six 100 Mbyte/sec link ports

64 bit synchronous external port

Cluster multiprocessing support

ADSP-21160 Features

ד''בס

12/19/2015 38

Peak (technical) performance of microprocessor: Maximum theoretical microprocessor’s speed in ideal conditions. It’s

defined by number of calculating operation which had done in some time.

Real (sustained) performance of microprocessor: Real microprocessor’s speed in real conditions. The real performance is

calculated by execution of some popular programs. (like FIR,IIR or FFT).

The methods for computer performance measurement



DSP lab 17

ד''בס

12/19/2015 39


Pipe-Line command execution:

Instruction fetching (a);

Decoding (b);

Execution (c).

n-1 operation

n operation

n+1 operation

a b c

a b c

a b c

ד''בס

12/19/2015 40

SHARC instruction set

SHARC programming model.

SHARC assembly language.

SHARC data operations.

SHARC flow of control.

DSP lab 18

ד''בס

SHARC programming model

Register files:

R0-R15 (aliased as F0-F15 for floating point)

Status registers.

Loop registers.

Data address generator registers.

ד''בס

12/19/2015 42

SHARC assembly language

R1=DM(M0,I0), R2=PM(M8,I8); // comment

label: R3=R1+R2;

data memory access program memory access

Algebraic notation terminated by semicolon:

DSP lab 19

ד''בס

12/19/2015 43

Simple ALU Instructions

Rn = Rx + Ry Fn = Fx + Fy

Rx = Rx – Ry Fn = Fx - Fy

Rn = Rx + Ry + CI (Carry In) Fn = ABS(Fx + Fy)

Rn = Rx - Ry + CI - 1 Fn = ABS(Fx – Fy)

Rn = (Rx + Ry)/2 Fn = (Fx + Fy)/2

COMP(Rx, Ry) COMP(Fx, Fy)

Rn = Rx + CI – 1 Fn = - Fx

Rn = Rx + 1 Fn= ABS Fx

Rn = Rx – 1 Fn= PASS Fx

Rn = -Rx Fn = RND Fx

Rn = ABS Rx Fn = SCALB Fx BY Ry

Rn = PASS Rx Rn = MANT Fx

Rn = Rx AND Ry Rn = LOGB Fx

Rn = Rx OR Ry Rn = FIX Fx BY Ry

Rn = NOT Rx Fn = FLOAT Rx BY Ry

Rn = MIN(Rx, Ry) Rn = TRUNC Fx

Rn = MAX(Rx, Ry) Fn = RECIPS Fx

ד''בס

12/19/2015 44

MAC instructions - mainly INTEGER

Multiply and Accumulate

Rn = Rx * Ry MRF = Rx * Ry

MRB = Rx * Ry Rn = MRF + Rx * Ry

Rn = MRB + Rx * Ry MRF = MRF + Rx * Ry

MRB = MRB + Rx * Ry Rn = MRF – Rx * Ry

Rn = MRB – Rx * Ry MRF = MRF – Rx * Ry

MRB = MRB – Rx * Ry Rn = SAT MRF

Rn = SAT MRB MRF = SAT MRF

MRB = SAT MRB Rn = RND MRF

Rn = RND MRB MRF = RND MRF

MRB = RND MRB MR = Rn

Rn = MR FLOAT – Fx * Fy

DSP lab 20

ד''בס

SHARC DSP Architecture

The machine supports both memory parallelism and operation parallelism.

Reduce the number of instructions required for common operations.

For example, the basic operation in a dot product loop can be performed in one cycle that performs two fetches, a multiplication, and an addition.

ד''בס

12/19/2015 49

Example Multi-Function Instruction

In a SingleCycle the SHARC Performs:

1(2) Multiply

1 (2) Addition

1 (2) Subtraction

1 (2) Memory Read

1 (2) Memory Write

2 Address Pointer Updates

Plus the I/O Processor Performs:

Active Serial Port Channels (2 Transmit, 2 Receive)

Active Link Ports (6)

Memory DMA

2 DMA Pointer Updates

f11=f1*f7, f3=f9+f14, f9=f9-f14, dm(i2,m0)=f13,

f7=pm(i8,m8);

DSP lab 21

ד''בס

Parallelism Restrictions on the sources of the operands when

operations are combined.

The operands going to the multiplier must come from R0 through R7 (or in the case of floating-point operands, F0 to F7), with one input coming from RO-R3/FO-F3 and the other from R4-R7/f0-f7.

The ALU operands must come from R8-R15/f8-fl5, with one operand coming from R8-Rll/f8-fll and the other from R12-R15/fl2-fl5.

performs three operations: R6 = R0 * R4, R9 = R8 + R12, RI0 = R8 - R12

ד''בס

12/19/2015 51

SHARC load/store

Load/store architecture: no memory-direct operations.

Two data address generators (DAGs): data memory.

program memory;

Must set up DAG registers to control loads/stores.

Provide indexed, modulo, bit-reverse indexing. – Bit-reversal addressing can be performed only in I0 and I8, as

controlled by the BR0 and BR8 bits in the MODE1 register.

DSP lab 22

ד''בס

12/19/2015 52

BASIC addressing

Immediate value: r0 = DM(0x20000000);

Direct load: r0 = DM(_a); // Loads contents of _a

Direct store: DM(_a)= r0; // Stores R0 at _a

Base-Plus-Offset

R0 = DM(M1, I0); // Loads from location I0 + M1

//M and I are registers from the DAG register file

ד''בס

12/19/2015 53


Circular buffer

DSP lab 23

ד''בס

Circular buffer

A circular buffer is an array of n elements; when the n + 1th element is referenced, the reference goes to buffer location 0, wrapping around from the end to the beginning of the buffer.

12/19/2015 54

ד''בס

12/19/2015 55

DAGs registers

I0

I1

I2

I3

I4

I5

I6

I7

M0

M1

M2

M3

M4

M5

M6

M7

L0

L1

L2

L3

L4

L5

L6

L7

B0

B1

B2

B3

B4

B5

B6

B7

DSP lab 24

ד''בס

12/19/2015 56


I register holds start address.

M register/immediate holds modifier value. r0 = DM(I3,M3) // Load

DM(I2,1) = r1 // Store

Circular buffer: I register is buffer start index, B is buffer base address.

Allows transmission two values of data to/from memory per cycle:

f0 = DM(I0,M0), f1 = PM(I9,M8);

Compiler allows to programmer to define which memory values are stored in.

ד''בס

12/19/2015 57


M6 = 1;

R0 = dm(I4, M6); // post-modify

// means: R0 = dm(I4), and then I4 = I4 + M6

// However:

R0 = dm(M6, I4); // offset index only

// means: R0 = dm(M6 +I4), and still keeps I4 = I4

DSP lab 25

ד''בס

12/19/2015 58


B4 = 4000;

L4 = 0; // set to 0

I4 = 4002;

M6 = 1;



// means R0 = dm(4002 + 1) and R1 = dm(4002 + 1)

// with I4 = 4002 still unchanged at the end of the

code



// means R0 = dm(4002) and R1 = dm(4003)

// with I4 = 4004 at the end of the code

Post-incrementing and Offset

ד''בס

12/19/2015 59


B4 = 4000;

L4 = 3;

I4 = 4002;

M6 = 1;



// means R0 = dm(4002 + 1) and R1 = dm(4002 + 1)

// with I4 = 4002 still

R0 = dm(I4, M6); // post-increment

R1 = dm(I4, M6); // post-increment

// means R0 = dm(4002) with I4 = 4003,

// however R1 = dm(4000) {4003 – 3} with I4 = 4001

Circular buffer implementation

DSP lab 26

ד''בס

12/19/2015 60

Example: C assignments

C: x = (a + b) - c;

Assembler: r0 = DM(_a) // Load a

r1 = DM(_b); // Load b

r3 = r0+r1;

r2 = DM(_c); // Load c

r3 = r3-r2;

DM(_x) = r3; // Store result in x

ד''בס

12/19/2015 61


C: y = a*(b+c);

Assembler: r1 = DM(_b); // Load b

r2 = DM(_c); // Load c

r2 = r1 + r2;

r0 = DM(_a); // Load a

r2 = r2*r0;

DM(_y) = r2; // Store result in y

DSP lab 27

ד''בס

12/19/2015 62


Shorter version using pointers: // Load b, c

r2 = DM(I1,M5), r1 = PM(I8,M13);

// load a in parallel with multiplication

r0 = r2+r1, r12 = DM(I0,M5);

r8 = r12*r0;

DM(I0,M5)= r8; // Store in y

ד''בס

12/19/2015 63


C: z = (a << 2) | (b & 15);

Assembler: r0 = DM(_a); // Load a

r0 = LSHIFT r0 by 2; // Left shift

r1 = DM(_b), r3 = 15;// Load immediate

r1 = r1 AND r3;

r0 = r1 OR r0;

DM(_z) = r0;

DSP lab 28

ד''בס

12/19/2015 64

SHARC jump

Unconditional flow of control change:

JUMP label;

Three addressing modes:

– Direct (specifies a 24-bit address in immediate );

– Indirect (supply by DAG2 data address generator);

– PC-relative (specifies an immediate value that is added to the

current PC).

All Instructions may be executed conditionally

– if EQ r1=pm(i15,0x11);

– if LE r0 = LSHIFT r0 by 2;

Conditions come from:

– arithmetic status (ASTAT)

– mode control 1 (MODE1)

– loop register

ד''בס

12/19/2015 65

Example: C if statement

C: if (a > b)

y = c + d;

else y = c - d;

Assembler: // if condition

r0 = DM(_a);

r1 = DM(_b);

COMP(r0,r1); // Compare

IF GT JUMP label;

// False block

r0 = DM(_c);

r1 = DM(_d);

r1 = r0 - r1;

DM(_y)= r1;

JUMP other; // Skip false block

// True block

label: r0 = DM(_c);

r1 = DM(_d);

r1 = r0 + r1;

DM(_y) = r1;

other: // Code after if

True version

EQ

LT

LE

AC

AV

TF

Description

ALU = 0

ALU<0

ALU≤0

ALU carry

ALU overflow

Bit test flag

Complement version

NE

GE

GT

NOT AC

NOT AV

NOT TF

DSP lab 29

ד''בס

12/19/2015 66

The best if implementation

C: if (a > b)

y = c + d;

else y = c - d;

Assembler: // Load values

r1 = DM(_a), r2 = PM(_b);

r3 = DM(_c), r4 = PM(_d);

// Compute both sum and difference

r12 = r3 + r4, r0 = r3 - r4;

// Choose which one to save

comp(r2,r1);

if GT r0 = r12;

dm(_y) = r0 // Write to y

ד''בס

DO UNTIL loops

DO UNTIL instruction provides efficient looping: LCNTR = 30, DO label UNTIL LCE;

r0 = DM(I0,M0), f2 = PM(I8,M8);

r1 = r0 - r15;

label: f4 = f2 + f3;

Loop length (16 bit) Last instruction in loop

Termination condition

The SHARC processor allows up to six nested loops

Another version of loop:

DO label UNTIL EQ;

R0 = R0-1;

label: comp(R0,R1);

LCE

NOT LCE

Loop counter expired

Loop counter not expired

DSP lab 30

ד''בס

12/19/2015 68

Example: FIR filter C:

for (n=0; n<N-M; n++){

acc = 0;

for (i=0; i<M; i++)

acc += a[i]*x[n+i];

y[n] = acc;}

ד''בס

12/19/2015 69

FIR filter assembler

// setup

I0 = _a; I8 = _x;// a[0] (DAG0), x[0] (DAG1)

r12 = 0; // f = 0;

M0 = 1; M8 = 1; // Set up increments

// Loop body

LCNTR = N, DO loopend UNTIL LCE;

// Use post-increment mode

r1 = DM(I0,M0), r2 = PM(I8,M8);

r8 = r1 * r2 (uui);

loopend: r12 = r12 + r8;

DSP lab 31

ד''בס

12/19/2015 70

Example: C main + ASM function

C: int dm c[4] = {1,2,3,4};

int pm x[7] = {1,2,3,4,5,6,7};

int dm y;

extern int fir(int dm *,int pm *);

//main

void main()

{

y = fir(c,x);

}

ד''בס

12/19/2015 71

Example: C main + ASM

function Assembler: #include <asm_sprt.h>

.SEGMENT/PM seg_pmco;

.global _fir;

.extern _c, _x, _y;

_fir: entry;

// setup

I0=_c; I8=_x; // c[0](DAG0),x[0](DAG1)

// or I0 = r4, I8 = r8

r12 = 0; // f = 0;

M0=1; M8=1; // Set up increments

// Loop body

LCNTR = 4, DO loopend UNTIL LCE;

r1 = DM(I0,M0), r2 = PM(I8,M8);

r3 = r1 * r2 (ssi);

loopend: r12 = r12 + r3;

r0 = r12; // or dm(_y)=r12;

exit;

_fir.end:

.endseg;

DSP lab 32

ד''בס

12/19/2015 72

Example: Using MAC operation

Assembler: #include <asm_sprt.h>


.global _fir;

.extern _c, _x, _y;

_fir: entry;

//setup

I0=_c; I8=_x; // c[0](DAG0),x[0](DAG1)

//or I0 = r4, I8 = r8

r12 = 0; // f = 0;

M0=1; M8=1; // Set up increments

//Loop body

LCNTR = 4, DO loopend UNTIL LCE;

r1 = DM(I0,M0), r2 = PM(I8,M8);

loopend: MRF = MRF + r1 * r2 (ssi);

r0 = MR0F;

exit;

_fir.end:

.endseg;

ד''בס

12/19/2015 73


(work with STACK)

int a,b,c,d,e,f;

extern int asm_proc( int a, int b, int c,

int d, int e );

void main()

{

a = 0xAAAAAA;

b = 0xBBBBBB;

c = 0xCCCCCC;

d = 0xDDDDDD;

e = 0xEEEEEE;

f = asm_proc(a,b,c,d,e);

}

DSP lab 33

ד''בס

12/19/2015 74


(work with STACK)

int a,b,c,d,e,f;

extern int asm_proc( int a, int b, int c,

int d, int e );

void main()

{

a = 0xAAAAAA;

b = 0xBBBBBB;

c = 0xCCCCCC;

d = 0xDDDDDD;

e = 0xEEEEEE;

f = asm_proc(a,b,c,d,e);

}

ד''בס

12/19/2015 75


(work with STACK) #include "asm_sprt.h"


.GLOBAL _asm_proc;

_asm_proc:

start:

// m7 = -1 (compiler definition)

// m6 = 1 (compiler definition)

r15 = i6;

// i6 - save C sp (stack pointer)

// i7 - asm sp (stack pointer)

i2 = r15;

modify(i2,m6);

r0 = r4;

r1 = r8;

r2 = r12;

r3 = dm(i2,m6);

// C sp + 2 (fourth argument place)

r4 = dm(i2,m6);

// C sp + 3 (fifth argument place)

r5 =0x555555;

r0 = r0 + r5;

// r0 = return()

_asm_proc.end:

.endseg;

exit;

DSP lab 34

ד''בס

Important programming reminders

Registers for C function parameters transfer: r4,r8,r12

C return with r0;

Interrupt does not occur until 2 instructions after delayed branch (needs 2 NOPs);

Some DAG register transfers are disallowed in assembler routine;

Compiler definition m7 = -1, m6 = 1

C stack pointer - i6,asm stack pointer - i7

ד''בס

SOS filters creation

% The cut off frequency normalization with fs/2. Wn=[200 1000]/(fs/2); % calculate coeffitions for Bandpass filter N = 5 [b, a] = butter(N, Wn); Convertion zero-pole-gain filter parameters to second-

order sections form [z,p,k]=butter(N, Wn); [sos,g] = zp2sos(z,p,k);

12/19/2015 80

H(z) H1(z) H2(z) H3(z) x(t) x(t) y(t) x(t) z1(t) z2(t)

DSP lab 35

ד''בס

Fixed-Point Design

Digital signal processing algorithms

– Often developed in floating point

– Later mapped into fixed point for digital hardware realization

Fixed-point digital hardware

– Lower area

– Lower power

– Lower per unit production cost

Idea

Floating-Point Algorithm

Quantization

Fixed-Point Algorithm

Code Generation

Target System

Alg

orith

m L

evel

Imple

menta

tion

Level

Range Estimation

ד''בס

Fixed point data preparation

To find minimum of a, b, input and to

normalize to minimum

To round a, b, input

Save a, b and input to files as described in

ee-83

Define buffers as integers in C code.

Prepare Fixed point simulation and compere

that with flouting point

Implement that in VisualDSP++ 12/19/2015 84

DSP lab 36

ד''בס

12/19/2015 85


DSP processors with fixed and floating point.

Fixed versus Floating:

Fixed point arithmetic operations are more simple for hardware realization;

Floating point DSP processor has more data types and commands;

Floating point advantages:

Increases accuracy;

Wide dynamic range;

Doesn’t have problem with data overflow;

Friendly for C compiler.

Fixed point advantages:

Cheaper;

Compact.

ד''בס

12/19/2015 86

Getting started

DSP lab 37

ד''בס

12/19/2015 87

ד''בס

12/19/2015 88

DSP lab 38

ד''בס

12/19/2015 89

ד''בס

12/19/2015 90

DSP lab 39

ד''בס

12/19/2015 91

ד''בס

12/19/2015 92

DSP lab 40

ד''בס

12/19/2015 93

ד''בס

12/19/2015 94

DSP lab 41

ד''בס

12/19/2015 95

ד''בס

12/19/2015 96

DSP lab 42

ד''בס

12/19/2015 97

ד''בס

12/19/2015 98

DSP lab 43

ד''בס

12/19/2015 99

ד''בס

12/19/2015 100

DSP lab 44

ד''בס

12/19/2015 101

ד''בס

12/19/2015 102

DSP lab 45

ד''בס

12/19/2015 103

ד''בס

12/19/2015 104

DSP lab 46

ד''בס

12/19/2015 105

ד''בס

12/19/2015 106

DSP lab 47

ד''בס

12/19/2015 107

ד''בס

12/19/2015 108

DSP lab 48

ד''בס

12/19/2015 109

ד''בס

12/19/2015 110

DSP lab 49

ד''בס

12/19/2015 111

ד''בס

12/19/2015 112

DSP lab 50

ד''בס

12/19/2015 113

ד''בס

12/19/2015 114

DSP lab 51

ד''בס

12/19/2015 115

ד''בס

12/19/2015 116

DSP lab 52

ד''בס

12/19/2015 117

ד''בס

12/19/2015 118

DSP lab 53

ד''בס

12/19/2015 119

ד''בס

12/19/2015 120

DSP lab 54

ד''בס

12/19/2015 121

ד''בס

12/19/2015 122

DSP lab 55

ד''בס

12/19/2015 123

ד''בס

12/19/2015 124

DSP lab 56

ד''בס

12/19/2015 125

ד''בס

12/19/2015 126

DSP lab 57

ד''בס

12/19/2015 127

ד''בס

12/19/2015 128

Paths to examples

C:\Program Files\Analog Devices\VisualDSP\211xx\examples

C:\Program Files\Analog Devices\VisualDSP\21k\examples

or

D:\Program Files\Analog Devices\VisualDSP\211xx\examples

D:\Program Files\Analog Devices\VisualDSP\21k\examples

DSP lab 58

ד''בס

Situations there can occur

overflow

Addition and Subtraction

– Overflow with twos complement integers occurs when the result of an addition or subtraction is larger the largest integer that can be represented, or smaller than the smallest integer. In fixed point representation, the largest or smallest value depends on the format of the number. I will assume Q31 in a 32 bit register for any examples that follow. In this case, a CPU with saturation arithmetic would set the result to -1 or (just below) +1 on an overflow, corresponding to the integer values 0x80000000 and 0x7FFFFFFF.

– Overflow in addition can only occur when the sign of the two numbers being added is the same. Overflow in subtraction can occur only when a negative number is subtracted from a positive number, or when a positive number is subtracted from a negative number.

12/19/2015 129

ד''בס


overflow

Arithmetic Shift

Overflow can occur when shifting a number left by 1 to n bits. In fixed point computations, left shifting is used to multiply a fixed point value by a power of two, or to change the format of a number (Q15 to Q31 for example). Again, many CPUs have saturation modes to set the output to the minimum or maximum 32 bit integer (depending on whether the original number was positive or negative). Furthermore, a common feature is an instruction that counts the number of leading ones or zeros in a number. This helps the programmer avoid overflow since the number of leading sign bits determines how large a shift can be done without causing overflow.

Overflow will not occur when right shifting a number.

12/19/2015 130

DSP lab 59

ד''בס


overflow Multiplication

– Overflow doesn’t really occur during multiplication if the result register has enough bits (32 bits if two 16 bit numbers are multiplied). But it is partly a matter of interpretation. When multiplying a fixed point value of -1 by -1 (0x8000 by 0x8000 using Q15 numbers), the result is +1. If the result is interpreted as a Q1.30 number (one integer bit and 30 fractional bits) then there is no problem. If the result is to be a Q30 number (no integer bits) then an overflow condition has occurred. And if the number was to be converted to Q31 (by shifting the result left by 1) then an overflow would occur during the left shift. The overall affect would be that -1 times -1 equals -1.

– Some CPUs have a multiplication mode that shifts the product left by one bit after a multiply operation. The reason for doing so is to create a Q31 result when two Q15 numbers are multiplied. Then if a Q15 result is desired, it can be found by storing the upper 16 bits of the result register (if the register is only 32 bits). The saturating mode automatically sets the result to 0x7FFFFFFF when the number 0x8000 is multiplied by itself, and the “shift left by one” multiplication mode is enabled.

– A very often used operation in DSP algorithms is the “multiply accumulate” or “MAC”, where a series of numbers is multiplied and added to a running sum. It is recommend not using the “left in shift by one” mode if possible when doing MACs, since this only increases the chance for overflow. A better technique is to keep the result as Q1.30, and then handle overflow if converting the final result to Q31 or Q15 (or whatever). This is also a good technique to use on CPUs without saturation modes, since the number of overflow checks can be greatly reduced some cases.

12/19/2015 131

ד''בס


overflow Division

– Overflow in division can occur when the result would have more bits than was calculated. For example, if the magnitude of the numerator is several times larger than that of the denominator, than the result must have enough bits to represent numbers larger than one. Overflow can be avoided by carefully considering the range of numbers being operated on, and calculating enough bits for the result. I have not seen a CPU that implements a saturation mode for division.

Negation

– There is one case where negation of a number causes an overflow condition. When the smallest negative number is negated, there is no way to represent the corresponding positive value in twos complement. For example, the value -1 in Q31 is 0x80000000. When this number is negated (flip the bits and add one) the result is again -1. If the saturation mode is set, then the CPU will set the result to 0x7FFFFFFF (just less than +1).

12/19/2015 132

DSP lab 60

ד''בס

Optimum Wordlength

Longer wordlength – May improve application

performance – Increases hardware cost

Shorter wordlength – May increase quantization errors

and overflows – Reduces hardware cost

Optimum wordlength – Maximize application performance

or minimize quantization error – Minimize hardware cost

Wordlength (w)

Cost c(w) Distortion d(w)

[1/performance]

Optimum

wordlength

ד''בס

Filter Implementation

Finite word-length effects (fixed point implementation)

– Coefficient quantization

– Overflow & quantization in arithmetic operations

• scaling to prevent overflow

• quantization noise statistical modeling

• limit cycle oscillations

DSP lab 61

ד''בס

Coefficient Quantization

Coefficient quantization effect on pole locations :

1. tightly spaced poles (e.g. for narrow band filters) imply high sensitivity of pole locations to coefficient quantization

2. hence preference for low-order systems (parallel/cascade)

Example: Implementation of a band-pass IIR 12-order filter

Cascade structure with 16-bit coeff. Direct form with 16-bit coeff.

ד''בס

12/19/2015 164

Introduction to DSP processors

The END

introduction to dsp processors - המחלקה להנדסת …adcomplab/introduction to dsp...

Documents