introduction to dsp processors - המחלקה להנדסת …adcomplab/introduction to dsp...
TRANSCRIPT
DSP lab 1
ד''בס
12/19/2015 1
Introduction to DSP processors
Presented by:
ד''בס
12/19/2015 2
Contents:
The modern processor’s classification;
The digital signal processing methods &
algorithms;
The D(igital) S(ignal) P(rocessing) algorithms implementation
The SHARC processor - architecture; data types & formats, C & Assembler;
Getting started.
DSP lab 2
ד''בס
12/19/2015 3
FPGA & EPLD
The modern processor’s classification.
Today chips are distributed into three groups:
ASIC’s (Application
Specific Integrated Circuits)
Chips with hardware realization
of data processing algorithms
(microprocessors & microcontrollers)
ד''בס
12/19/2015 4
The modern processor’s classification.
Microprocessors & microcontrollers
DSP microprocessors.
The processors are intended for Real-Time
Digital Signal Processing systems.
General-purpose microprocessors.
This kind of processor is intended for
computer systems: PC, workstation &
parallel supercomputer.
Microcontrollers
Very especial processors are intended for embedded systems
and in different
household devices.
DSP lab 3
ד''בס
12/19/2015 5
Review: Processor Classes
General Purpose - high performance – Pentiums, Alpha's, SPARC
– 64-128 bit word size
– Used for general purpose software
– Heavy weight OS - UNIX, NT
– Multiply layers of cache memory
– Workstations, PC's
Embedded processors and processor cores – ARM9, ARC, 486SX, Hitachi
SH7000, NEC V800
– 32 bit word size
– Single program
– Lightweight, often real-time OS
– Code and Data memory cache, DSP support
– Cellular phones, consumer electronics (e. g. CD players)
DSP processors – SHARC, BlackFin, TMS320C55x,
TMS320C67x, TMS320C64x
– 16-32 bit word size
– Single program
– Lightweight, often real-time OS
– Super Harvard Architecture Support, MAC, Circular buffer, Dual-Port RAM
– Audio, Image and Video processing, Coding and Decoding, Cellular Base Station, Adaptive Filtering, Real Time operations
Microcontrollers – PIC, AVR, HC11, ARM7, 8051,80251
– Extremely cost sensitive
– Small word size - 8 bit common
– Highest volume processors by far
– Automobiles, toasters, thermostats, ...
ד''בס
12/19/2015 7
The digital signal processing
methods & algorithms.
The analog signal processing example:
fC
fjwR
iR
fR
x(t)y(t)
11
y(t)
+
-
Ri
Rf
Cf
x(t)
f fc
DSP lab 4
ד''בס
12/19/2015 8
The digital signal processing
methods & algorithms.
The digital signal processing system:
x(t)
Anti-aliasing filter
x’(t) A/D
x’(n)
D/A y’(t)
Smoothing filter y(t)
Digital filter or digital
transform
y’(n)
N
kknxk
kCny
0
fc
|H(f)|
ד''בס
The DSP vs. ASP
5th Order Analog Sallen Key Low-Pass Filter
5th Order Digital Filter of Direct Form II
12/19/2015 9
DSP lab 5
ד''בס
12/19/2015 10
The digital signal processing
methods & algorithms.
Analog signal processing:
Cheaper;
More compact;
Power dissipation.
Digital signal processing:
More accurate;
More stable for different environments.
Analog signal processing versus Digital signal processing
ד''בס
12/19/2015 11
The digital signal processing
methods & algorithms.
Time sampling: Amplitude quantization:
The basis concepts of DSP:
D
Df
TT
f1
or 1
FTf D
2
1or 2F
x(nT) – one sample at time T;
T – sample rate (time);
fD - sampling frequency:
Niquest frequency:
F – the highest frequency of signal.
x(nT) ~ x(n) ;
Resolution:
eQ quantization error:
NR 2
12max N
Qe
N– bit number.
DSP lab 6
ד''בס
12/19/2015 12
The digital signal processing
methods & algorithms. The basis concepts of DSP:
time.algorithm
time;sample
a
T
T
a del
.lg
;
;
1
del
orithmns in af operatio- number oN
uencyr clk freq- processof
timeoperation than
fNT
CLKCPU
op
CLKCPU
opopaa
ד''בס
12/19/2015 13
The D(igital) S(ignal) P(rocessing)
algorithms implementation
The major tasks in DSP:
Filter design (Linear Filtering);
Speech detection, image recognition (Spectral Analysis);
Image & Speech compression (Timing-Frequency Analysis);
Image & signal processing (Adaptation Filtering);
Coding, median filters (Non-Linear processing);
Interpolation & decimation (Multi Speed Processing).
DSP lab 7
ד''בס
12/19/2015 14
The D(igital) S(ignal) P(rocessing)
algorithms implementation
FIR filter
IIR filter
FFT
STD
Convolution
Correlation
The very usable DSP algorithms.
ד''בס
12/19/2015 15
The D(igital) S(ignal) P(rocessing)
algorithms implementation
FIR – filter
(Finite Impulse Response)
1
0
N
i
i inxbny
+
x(n)
+
Z-1 Z-1
+
+
y(n)
b2 b1 b0
DSP lab 8
ד''בס
12/19/2015 16
The D(igital) S(ignal) P(rocessing)
algorithms implementation
IIR – filter (Infinite Impulse Response)
1
0
1
1
N
i
M
k
ki knyainxbny
x(n)
y(n)
–
+
–
+
Z-1 Z-1
+
+
b2 b1 b0
a1 a0
ד''בס
The D(igital) S(ignal) P(rocessing)
algorithms implementation
12/19/2015 18
Discrete correlation
MmmngmfN
nyM
m
0 1
0
Discrete autocorrelation
MmmnfmfN
nyM
m
0 1
0
DSP lab 9
ד''בס
The D(igital) S(ignal) P(rocessing)
algorithms implementation
12/19/2015 20
Discrete convolution
M
Mm
M
Mm
mgmnfmngmfngnfny *
Circular discrete convolution
mngkMmfnyM
m
N
Nk
1
0
STD – standard deviation
1
0
21 N
i
iN xxN
S
ד''בס
12/19/2015 21
The D(igital) S(ignal) P(rocessing)
algorithms implementation
1,...,0,1
)(
21
0
NkenxN
kX N
iknN
n
Discrete Fourier Transform
N
ikn
kn
N eW
2
DSP lab 10
ד''בס
12/19/2015 23
The D(igital) S(ignal) P(rocessing)
algorithms implementation
X(0) = x(0)W80 + x(1)W8
0 + x(2)W80 + x(3)W8
0 + x(4)W80 + x(5)W8
0 + x(6)W80 + x(7)W8
0
X(1) = x(0)W80 + x(1)W8
1 + x(2)W82 + x(3)W8
3 + x(4)W84 + x(5)W8
5 + x(6)W86 + x(7)W8
7
X(2) = x(0)W80 + x(1)W8
2 + x(2)W84 + x(3)W8
6 + x(4)W88 + x(5)W8
10 + x(6)W812 + x(7)W8
14
X(3) = x(0)W80 + x(1)W8
3 + x(2)W86 + x(3)W8
9 + x(4)W812 + x(5)W8
15 + x(6)W818 + x(7)W8
21
X(4) = x(0)W80 + x(1)W8
4 + x(2)W88 + x(3)W8
12 + x(4)W816 + x(5)W8
20 + x(6)W824 + x(7)W8
28
X(5) = x(0)W80 + x(1)W8
5 + x(2)W810 + x(3)W8
15 + x(4)W820 + x(5)W8
25 + x(6)W830 + x(7)W8
35
X(6) = x(0)W80 + x(1)W8
6 + x(2)W812 + x(3)W8
18 + x(4)W824 + x(5)W8
30 + x(6)W836 + x(7)W8
42
X(7) = x(0)W80 + x(1)W8
7 + x(2)W814 + x(3)W8
21 + x(4)W828 + x(5)W8
35 + x(6)W842 + x(7)W8
49
1
0
)(1
)(N
n
nk
NWnxN
kXTHE 8-POINT DFT:
ד''בס
12/19/2015 24
The D(igital) S(ignal) P(rocessing)
algorithms implementation
Direct computation of the DFT is basically
inefficient because it does not exploit the symmetry
and periodicity properties of the phase factor WN. In
particular, these two properties are:
Symmetry property:
where
Periodicity property:
k
N
Nk
N WW 2/
k
N
Nk
N WW
X(7) = x(0)W80+x(1)W8
7+x(2)W86+x(3)W8
5+x(4)W84+x(5)W8
3+x(6)W82+x(7)W8
1
X(7) = x(0)W88+x(1)W8
7+x(2)W814+x(3)W8
21+x(4)W828+x(5)W8
35+x(6)W842+x(7)W8
49
N
ikn
kn
N eW
2
DSP lab 11
ד''בס
12/19/2015 25
The D(igital) S(ignal) P(rocessing)
algorithms implementation
x(7) WN
4
x(3)
x(5) WN
4
x(1)
X(7)
X(6)
X(5)
X(4)
WN0
WN0
WN6
WN4
WN2
WN0
x(6) WN
4
x(2)
x(4) WN
4
x(0)
X(3)
X(2)
X(1)
X(0)
WN0
WN0
WN6
WN4
WN2
WN0
WN7
WN6
WN5
WN4
WN3
WN2
WN1
WN0
X(7) = x(0)W80 + x(1)W8
7 + x(2)W86 + x(3)W8
13 + x(4)W84 + x(5)W8
11 + x(6)W810 + x(7)W8
17
X(7) = x(0)W88 + x(1)W8
7 + x(2)W86 + x(3)W8
5 + x(4)W84 + x(5)W8
3 + x(6)W82 + x(7)W8
1
(W88= W8
0, W813= W8
5, W811= W8
3, W810= W8
2, W817= W8
1)
X(0) + X(4)WN4
X(2) + X(6)WN4
X(0) + X(4)WN4+
X(2)WN6+ X(6)WN
10
X(1) + X(5)WN4
X(3) + X(7)WN4
X(1) + X(5)WN4+
X(3)WN6+ X(7)WN
10
ד''בס
12/19/2015 26
The D(igital) S(ignal) P(rocessing)
algorithms implementation
x(7) WN
4
x(3)
x(5) WN
4
x(1)
X(7)
X(6)
X(5)
X(4)
WN0
WN0
WN6
WN4
WN2
WN0
x(6) WN
4
x(2)
x(4) WN
4
x(0)
X(3)
X(2)
X(1)
X(0)
WN0
WN0
WN6
WN4
WN2
WN0
WN7
WN6
WN5
WN4
WN3
WN2
WN1
WN0
X(2) = x(0)W80 + x(1)W8
2 + x(2)W84 + x(3)W8
6 + x(4)W88 + x(5)W8
10 + x(6)W812 + x(7)W8
14
X(2) = x(0)W80 + x(1)W8
2 + x(2)W84 + x(3)W8
6 + x(4)W80 + x(5)W8
2 + x(6)W84 + x(7)W8
6
(W88= W8
0, W810= W8
2, W812= W8
4, W814= W8
6)
X(0) + X(4)WN0
X(2) + X(6)WN0
X(0) + X(4)WN0+
X(2)WN4+ X(6)WN
4
X(1) + X(5)WN0
X(3) + X(7)WN0
X(1) + X(5)WN0+
X(3)WN4+ X(7)WN
4
000 – 000
100 – 001
010 – 010
110 – 011
001 – 100
101 – 101
011 – 110
111 - 111
Bit reverse operations
DSP lab 12
ד''בס
12/19/2015 27
The DSP processor ‘s architecture
Requirement for DSP processors:
1. High speed input data, different interface devices;
2. Input data wide dynamic range;
3. ADD, MULT & SHIFT hardware implementation. Parallel processing;
4. Flexible processing (possibility to “jump” from one process to another);
5. Algorithm’s regularity (Operation “come back”);
DSP processors features
1. Various interface highspeed ports and timers
2. Parallel access memory architecture;
3. Three mathematical units: ALU, barrel Shifter and Multiplier with fast MAC operation (MBR = MBR + Rx * Ry);
4. Cycles, branches & interrupt fast handling. Addressing special modes;
5. Circular buffer.
ד''בס
12/19/2015 28
Data types & formats.
Data types in DSP processors algorithms:
Integer (cycles, coefficients and arrays numbers);
Real (input & output data);
Complex (applications in frequency domain);
Logic (bitwise operation).
Data format in DSP processors :
Byte – 8 bit;
Short word – 16 bit;
Normal word – 32 bit;
Instruction word – 48 bit;
Extended normal word – 40 bit;
Long word – 64 bit.
DSP lab 13
ד''בס
C vs. Assembly
DSP programs are different from traditional software tasks in two important respects. – First, the programs are usually much shorter, say, one-
hundred lines versus ten-thousand lines.
– Second, the execution speed is often a critical part of the application.
If assembly is used at all, it is restricted to short subroutines that must run with the utmost speed.
ד''בס
C vs. Assembly
Programs in C are more flexible and quick to develop.
Programs in assembly often have better performance , they run faster and use less memory, resulting in lower cost.
12/19/2015 31
DSP lab 14
ד''בס
C vs. Assembly
Which language is best for your application?
– If you need flexibility and fast development, choose C.
– use assembly if you need the best possible performance.
How complicated is the program? – If it is large and intricate use C. – If it is small and simple, assembly
may be a good choice.
Are you pushing the maximum speed of the DSP?
– If so, assembly will give you the last drop of performance from the device.
– For less demanding applications, assembly has little advantage, and you should consider using C.
How many programmers will be working together?
– If the project is large enough for more than one programmer, lean toward C and use in-line assembly only for time critical segments.
Which is more important, product cost or development cost?
– If it is product cost, choose assembly;
– If it is development cost, choose C.
What is your background? – If you are experienced in assembly
(on other microprocessors), choose assembly for your DSP.
– If your previous work is in C, choose C for your DSP.
ד''בס
12/19/2015 33
The DSP processor ‘s architecture.
“Traditional” fon Neiman architecture
Harvard architecture
CPU Memory
data & instruction
Address bus
Data bus
Program Memory instruction
only PM data bus
Data Memory
data only DM data bus CPU
DM address bus PM address bus
DSP lab 15
ד''בס
12/19/2015 34
The DSP processor ‘s architecture.
Super Harvard architecture
I/O Controller
Data
Program Memory instruction
only PM data bus
Data Memory
data only DM data bus
CPU DM address bus PM address bus
Instruction Cache
This is SHARC DSP processor structure
ד''בס
12/19/2015 36
The DSP processor ‘s architecture.
The ADSP-21160 hardware structure.
SERIAL PORTS
(2)
LINK PORTS
(6)
DMA
CONTROLLER
ADDR BUS
MUX
IOD
64
IOA
18
IOP
REGISTERS
6
6
6x10
4
Dual-Ported SRAM
External Port
I/O Processor
PROCESSOR
PORT I/O
PORT ADDR DATA ADDR DATA
Two Independent,
Dual-Ported Memory
Blocks
ADDR DATA ADDR DATA
MULTIPROCESSOR
32
64
HOST PORT
INTERFACE
PM Address Bus 32
DM Address Bus 32
PM Data Bus 16/32/40/48/64
DM Data Bus 32/40 64
INSTRUCTION
CACHE 32 x 48-Bit
DA G 2 8 x 4 x 32
DA G 1 8 x 4 x 32
Core Processor
PROGRAM
SEQUENCER
TIMER
Connect
Bus
(PX)
7 JTAG
Test &
Emulation
P M D
D M D
E P D
I O D
BL
OC
K 0
BL
OC
K 1
DATA BUS
MUX
MULTIPLIER BARREL
SHIFTER ALU
DATA
REGISTER
FILE
16 x 40-Bit
DSP lab 16
ד''בס
12/19/2015 37
The DSP processor ‘s architecture.
100 MHz - 600 MFLOPS- SIMD Core
1024 point, complex FFT benchmark: 90 us
4 Mbits on chip SRAM
14 zero overhead DMA channels
Sustained 700 Mbyte/sec over IOP bus
Two 50 mbit/sec Synchronous Serial Ports
Six 100 Mbyte/sec link ports
64 bit synchronous external port
Cluster multiprocessing support
ADSP-21160 Features
ד''בס
12/19/2015 38
Peak (technical) performance of microprocessor: Maximum theoretical microprocessor’s speed in ideal conditions. It’s
defined by number of calculating operation which had done in some time.
Real (sustained) performance of microprocessor: Real microprocessor’s speed in real conditions. The real performance is
calculated by execution of some popular programs. (like FIR,IIR or FFT).
The methods for computer performance measurement
The digital signal processing
methods & algorithms.
DSP lab 17
ד''בס
12/19/2015 39
The DSP processor ‘s architecture.
Pipe-Line command execution:
Instruction fetching (a);
Decoding (b);
Execution (c).
n-1 operation
n operation
n+1 operation
a b c
a b c
a b c
ד''בס
12/19/2015 40
SHARC instruction set
SHARC programming model.
SHARC assembly language.
SHARC data operations.
SHARC flow of control.
DSP lab 18
ד''בס
SHARC programming model
Register files:
R0-R15 (aliased as F0-F15 for floating point)
Status registers.
Loop registers.
Data address generator registers.
ד''בס
12/19/2015 42
SHARC assembly language
R1=DM(M0,I0), R2=PM(M8,I8); // comment
label: R3=R1+R2;
data memory access program memory access
Algebraic notation terminated by semicolon:
DSP lab 19
ד''בס
12/19/2015 43
Simple ALU Instructions
Rn = Rx + Ry Fn = Fx + Fy
Rx = Rx – Ry Fn = Fx - Fy
Rn = Rx + Ry + CI (Carry In) Fn = ABS(Fx + Fy)
Rn = Rx - Ry + CI - 1 Fn = ABS(Fx – Fy)
Rn = (Rx + Ry)/2 Fn = (Fx + Fy)/2
COMP(Rx, Ry) COMP(Fx, Fy)
Rn = Rx + CI – 1 Fn = - Fx
Rn = Rx + 1 Fn= ABS Fx
Rn = Rx – 1 Fn= PASS Fx
Rn = -Rx Fn = RND Fx
Rn = ABS Rx Fn = SCALB Fx BY Ry
Rn = PASS Rx Rn = MANT Fx
Rn = Rx AND Ry Rn = LOGB Fx
Rn = Rx OR Ry Rn = FIX Fx BY Ry
Rn = NOT Rx Fn = FLOAT Rx BY Ry
Rn = MIN(Rx, Ry) Rn = TRUNC Fx
Rn = MAX(Rx, Ry) Fn = RECIPS Fx
ד''בס
12/19/2015 44
MAC instructions - mainly INTEGER
Multiply and Accumulate
Rn = Rx * Ry MRF = Rx * Ry
MRB = Rx * Ry Rn = MRF + Rx * Ry
Rn = MRB + Rx * Ry MRF = MRF + Rx * Ry
MRB = MRB + Rx * Ry Rn = MRF – Rx * Ry
Rn = MRB – Rx * Ry MRF = MRF – Rx * Ry
MRB = MRB – Rx * Ry Rn = SAT MRF
Rn = SAT MRB MRF = SAT MRF
MRB = SAT MRB Rn = RND MRF
Rn = RND MRB MRF = RND MRF
MRB = RND MRB MR = Rn
Rn = MR FLOAT – Fx * Fy
DSP lab 20
ד''בס
SHARC DSP Architecture
The machine supports both memory parallelism and operation parallelism.
Reduce the number of instructions required for common operations.
For example, the basic operation in a dot product loop can be performed in one cycle that performs two fetches, a multiplication, and an addition.
ד''בס
12/19/2015 49
Example Multi-Function Instruction
In a SingleCycle the SHARC Performs:
1(2) Multiply
1 (2) Addition
1 (2) Subtraction
1 (2) Memory Read
1 (2) Memory Write
2 Address Pointer Updates
Plus the I/O Processor Performs:
Active Serial Port Channels (2 Transmit, 2 Receive)
Active Link Ports (6)
Memory DMA
2 DMA Pointer Updates
f11=f1*f7, f3=f9+f14, f9=f9-f14, dm(i2,m0)=f13,
f7=pm(i8,m8);
DSP lab 21
ד''בס
Parallelism Restrictions on the sources of the operands when
operations are combined.
The operands going to the multiplier must come from R0 through R7 (or in the case of floating-point operands, F0 to F7), with one input coming from RO-R3/FO-F3 and the other from R4-R7/f0-f7.
The ALU operands must come from R8-R15/f8-fl5, with one operand coming from R8-Rll/f8-fll and the other from R12-R15/fl2-fl5.
performs three operations: R6 = R0 * R4, R9 = R8 + R12, RI0 = R8 - R12
ד''בס
12/19/2015 51
SHARC load/store
Load/store architecture: no memory-direct operations.
Two data address generators (DAGs): data memory.
program memory;
Must set up DAG registers to control loads/stores.
Provide indexed, modulo, bit-reverse indexing. – Bit-reversal addressing can be performed only in I0 and I8, as
controlled by the BR0 and BR8 bits in the MODE1 register.
DSP lab 22
ד''בס
12/19/2015 52
BASIC addressing
Immediate value: r0 = DM(0x20000000);
Direct load: r0 = DM(_a); // Loads contents of _a
Direct store: DM(_a)= r0; // Stores R0 at _a
Base-Plus-Offset
R0 = DM(M1, I0); // Loads from location I0 + M1
//M and I are registers from the DAG register file
ד''בס
12/19/2015 53
The DSP processor ‘s architecture.
Circular buffer
DSP lab 23
ד''בס
Circular buffer
A circular buffer is an array of n elements; when the n + 1th element is referenced, the reference goes to buffer location 0, wrapping around from the end to the beginning of the buffer.
12/19/2015 54
ד''בס
12/19/2015 55
DAGs registers
I0
I1
I2
I3
I4
I5
I6
I7
M0
M1
M2
M3
M4
M5
M6
M7
L0
L1
L2
L3
L4
L5
L6
L7
B0
B1
B2
B3
B4
B5
B6
B7
DSP lab 24
ד''בס
12/19/2015 56
SHARC assembly language
I register holds start address.
M register/immediate holds modifier value. r0 = DM(I3,M3) // Load
DM(I2,1) = r1 // Store
Circular buffer: I register is buffer start index, B is buffer base address.
Allows transmission two values of data to/from memory per cycle:
f0 = DM(I0,M0), f1 = PM(I9,M8);
Compiler allows to programmer to define which memory values are stored in.
ד''בס
12/19/2015 57
SHARC assembly language
M6 = 1;
R0 = dm(I4, M6); // post-modify
// means: R0 = dm(I4), and then I4 = I4 + M6
// However:
R0 = dm(M6, I4); // offset index only
// means: R0 = dm(M6 +I4), and still keeps I4 = I4
DSP lab 25
ד''בס
12/19/2015 58
SHARC assembly language
B4 = 4000;
L4 = 0; // set to 0
I4 = 4002;
M6 = 1;
R0 = dm(M6, I4); // offset index only
R1 = dm(M6, I4); // offset index only
// means R0 = dm(4002 + 1) and R1 = dm(4002 + 1)
// with I4 = 4002 still unchanged at the end of the
code
R0 = dm(I4, M6); // post-modify
R1 = dm(I4, M6); // post-modify
// means R0 = dm(4002) and R1 = dm(4003)
// with I4 = 4004 at the end of the code
Post-incrementing and Offset
ד''בס
12/19/2015 59
SHARC assembly language
B4 = 4000;
L4 = 3;
I4 = 4002;
M6 = 1;
R0 = dm(M6, I4); // offset index only
R1 = dm(M6, I4); // offset index only
// means R0 = dm(4002 + 1) and R1 = dm(4002 + 1)
// with I4 = 4002 still
R0 = dm(I4, M6); // post-increment
R1 = dm(I4, M6); // post-increment
// means R0 = dm(4002) with I4 = 4003,
// however R1 = dm(4000) {4003 – 3} with I4 = 4001
Circular buffer implementation
DSP lab 26
ד''בס
12/19/2015 60
Example: C assignments
C: x = (a + b) - c;
Assembler: r0 = DM(_a) // Load a
r1 = DM(_b); // Load b
r3 = r0+r1;
r2 = DM(_c); // Load c
r3 = r3-r2;
DM(_x) = r3; // Store result in x
ד''בס
12/19/2015 61
Example: C assignments
C: y = a*(b+c);
Assembler: r1 = DM(_b); // Load b
r2 = DM(_c); // Load c
r2 = r1 + r2;
r0 = DM(_a); // Load a
r2 = r2*r0;
DM(_y) = r2; // Store result in y
DSP lab 27
ד''בס
12/19/2015 62
Example: C assignments
Shorter version using pointers: // Load b, c
r2 = DM(I1,M5), r1 = PM(I8,M13);
// load a in parallel with multiplication
r0 = r2+r1, r12 = DM(I0,M5);
r8 = r12*r0;
DM(I0,M5)= r8; // Store in y
ד''בס
12/19/2015 63
Example: C assignments
C: z = (a << 2) | (b & 15);
Assembler: r0 = DM(_a); // Load a
r0 = LSHIFT r0 by 2; // Left shift
r1 = DM(_b), r3 = 15;// Load immediate
r1 = r1 AND r3;
r0 = r1 OR r0;
DM(_z) = r0;
DSP lab 28
ד''בס
12/19/2015 64
SHARC jump
Unconditional flow of control change:
JUMP label;
Three addressing modes:
– Direct (specifies a 24-bit address in immediate );
– Indirect (supply by DAG2 data address generator);
– PC-relative (specifies an immediate value that is added to the
current PC).
All Instructions may be executed conditionally
– if EQ r1=pm(i15,0x11);
– if LE r0 = LSHIFT r0 by 2;
Conditions come from:
– arithmetic status (ASTAT)
– mode control 1 (MODE1)
– loop register
ד''בס
12/19/2015 65
Example: C if statement
C: if (a > b)
y = c + d;
else y = c - d;
Assembler: // if condition
r0 = DM(_a);
r1 = DM(_b);
COMP(r0,r1); // Compare
IF GT JUMP label;
// False block
r0 = DM(_c);
r1 = DM(_d);
r1 = r0 - r1;
DM(_y)= r1;
JUMP other; // Skip false block
// True block
label: r0 = DM(_c);
r1 = DM(_d);
r1 = r0 + r1;
DM(_y) = r1;
other: // Code after if
True version
EQ
LT
LE
AC
AV
TF
Description
ALU = 0
ALU<0
ALU≤0
ALU carry
ALU overflow
Bit test flag
Complement version
NE
GE
GT
NOT AC
NOT AV
NOT TF
DSP lab 29
ד''בס
12/19/2015 66
The best if implementation
C: if (a > b)
y = c + d;
else y = c - d;
Assembler: // Load values
r1 = DM(_a), r2 = PM(_b);
r3 = DM(_c), r4 = PM(_d);
// Compute both sum and difference
r12 = r3 + r4, r0 = r3 - r4;
// Choose which one to save
comp(r2,r1);
if GT r0 = r12;
dm(_y) = r0 // Write to y
ד''בס
DO UNTIL loops
DO UNTIL instruction provides efficient looping: LCNTR = 30, DO label UNTIL LCE;
r0 = DM(I0,M0), f2 = PM(I8,M8);
r1 = r0 - r15;
label: f4 = f2 + f3;
Loop length (16 bit) Last instruction in loop
Termination condition
The SHARC processor allows up to six nested loops
Another version of loop:
DO label UNTIL EQ;
R0 = R0-1;
label: comp(R0,R1);
LCE
NOT LCE
Loop counter expired
Loop counter not expired
DSP lab 30
ד''בס
12/19/2015 68
Example: FIR filter C:
for (n=0; n<N-M; n++){
acc = 0;
for (i=0; i<M; i++)
acc += a[i]*x[n+i];
y[n] = acc;}
ד''בס
12/19/2015 69
FIR filter assembler
// setup
I0 = _a; I8 = _x;// a[0] (DAG0), x[0] (DAG1)
r12 = 0; // f = 0;
M0 = 1; M8 = 1; // Set up increments
// Loop body
LCNTR = N, DO loopend UNTIL LCE;
// Use post-increment mode
r1 = DM(I0,M0), r2 = PM(I8,M8);
r8 = r1 * r2 (uui);
loopend: r12 = r12 + r8;
DSP lab 31
ד''בס
12/19/2015 70
Example: C main + ASM function
C: int dm c[4] = {1,2,3,4};
int pm x[7] = {1,2,3,4,5,6,7};
int dm y;
extern int fir(int dm *,int pm *);
//main
void main()
{
y = fir(c,x);
}
ד''בס
12/19/2015 71
Example: C main + ASM
function Assembler: #include <asm_sprt.h>
.SEGMENT/PM seg_pmco;
.global _fir;
.extern _c, _x, _y;
_fir: entry;
// setup
I0=_c; I8=_x; // c[0](DAG0),x[0](DAG1)
// or I0 = r4, I8 = r8
r12 = 0; // f = 0;
M0=1; M8=1; // Set up increments
// Loop body
LCNTR = 4, DO loopend UNTIL LCE;
r1 = DM(I0,M0), r2 = PM(I8,M8);
r3 = r1 * r2 (ssi);
loopend: r12 = r12 + r3;
r0 = r12; // or dm(_y)=r12;
exit;
_fir.end:
.endseg;
DSP lab 32
ד''בס
12/19/2015 72
Example: Using MAC operation
Assembler: #include <asm_sprt.h>
.SEGMENT/PM seg_pmco;
.global _fir;
.extern _c, _x, _y;
_fir: entry;
//setup
I0=_c; I8=_x; // c[0](DAG0),x[0](DAG1)
//or I0 = r4, I8 = r8
r12 = 0; // f = 0;
M0=1; M8=1; // Set up increments
//Loop body
LCNTR = 4, DO loopend UNTIL LCE;
r1 = DM(I0,M0), r2 = PM(I8,M8);
loopend: MRF = MRF + r1 * r2 (ssi);
r0 = MR0F;
exit;
_fir.end:
.endseg;
ד''בס
12/19/2015 73
Example: C main + ASM function
(work with STACK)
int a,b,c,d,e,f;
extern int asm_proc( int a, int b, int c,
int d, int e );
void main()
{
a = 0xAAAAAA;
b = 0xBBBBBB;
c = 0xCCCCCC;
d = 0xDDDDDD;
e = 0xEEEEEE;
f = asm_proc(a,b,c,d,e);
}
DSP lab 33
ד''בס
12/19/2015 74
Example: C main + ASM function
(work with STACK)
int a,b,c,d,e,f;
extern int asm_proc( int a, int b, int c,
int d, int e );
void main()
{
a = 0xAAAAAA;
b = 0xBBBBBB;
c = 0xCCCCCC;
d = 0xDDDDDD;
e = 0xEEEEEE;
f = asm_proc(a,b,c,d,e);
}
ד''בס
12/19/2015 75
Example: C main + ASM function
(work with STACK) #include "asm_sprt.h"
.SEGMENT/PM seg_pmco;
.GLOBAL _asm_proc;
_asm_proc:
start:
// m7 = -1 (compiler definition)
// m6 = 1 (compiler definition)
r15 = i6;
// i6 - save C sp (stack pointer)
// i7 - asm sp (stack pointer)
i2 = r15;
modify(i2,m6);
r0 = r4;
r1 = r8;
r2 = r12;
r3 = dm(i2,m6);
// C sp + 2 (fourth argument place)
r4 = dm(i2,m6);
// C sp + 3 (fifth argument place)
r5 =0x555555;
r0 = r0 + r5;
// r0 = return()
_asm_proc.end:
.endseg;
exit;
DSP lab 34
ד''בס
Important programming reminders
Registers for C function parameters transfer: r4,r8,r12
C return with r0;
Interrupt does not occur until 2 instructions after delayed branch (needs 2 NOPs);
Some DAG register transfers are disallowed in assembler routine;
Compiler definition m7 = -1, m6 = 1
C stack pointer - i6,asm stack pointer - i7
ד''בס
SOS filters creation
% The cut off frequency normalization with fs/2. Wn=[200 1000]/(fs/2); % calculate coeffitions for Bandpass filter N = 5 [b, a] = butter(N, Wn); Convertion zero-pole-gain filter parameters to second-
order sections form [z,p,k]=butter(N, Wn); [sos,g] = zp2sos(z,p,k);
12/19/2015 80
H(z) H1(z) H2(z) H3(z) x(t) x(t) y(t) x(t) z1(t) z2(t)
DSP lab 35
ד''בס
Fixed-Point Design
Digital signal processing algorithms
– Often developed in floating point
– Later mapped into fixed point for digital hardware realization
Fixed-point digital hardware
– Lower area
– Lower power
– Lower per unit production cost
Idea
Floating-Point Algorithm
Quantization
Fixed-Point Algorithm
Code Generation
Target System
Alg
orith
m L
evel
Imple
menta
tion
Level
Range Estimation
ד''בס
Fixed point data preparation
To find minimum of a, b, input and to
normalize to minimum
To round a, b, input
Save a, b and input to files as described in
ee-83
Define buffers as integers in C code.
Prepare Fixed point simulation and compere
that with flouting point
Implement that in VisualDSP++ 12/19/2015 84
DSP lab 36
ד''בס
12/19/2015 85
The DSP processor ‘s architecture.
DSP processors with fixed and floating point.
Fixed versus Floating:
Fixed point arithmetic operations are more simple for hardware realization;
Floating point DSP processor has more data types and commands;
Floating point advantages:
Increases accuracy;
Wide dynamic range;
Doesn’t have problem with data overflow;
Friendly for C compiler.
Fixed point advantages:
Cheaper;
Compact.
ד''בס
12/19/2015 86
Getting started
DSP lab 37
ד''בס
12/19/2015 87
ד''בס
12/19/2015 88
DSP lab 38
ד''בס
12/19/2015 89
ד''בס
12/19/2015 90
DSP lab 39
ד''בס
12/19/2015 91
ד''בס
12/19/2015 92
DSP lab 40
ד''בס
12/19/2015 93
ד''בס
12/19/2015 94
DSP lab 41
ד''בס
12/19/2015 95
ד''בס
12/19/2015 96
DSP lab 42
ד''בס
12/19/2015 97
ד''בס
12/19/2015 98
DSP lab 43
ד''בס
12/19/2015 99
ד''בס
12/19/2015 100
DSP lab 44
ד''בס
12/19/2015 101
ד''בס
12/19/2015 102
DSP lab 45
ד''בס
12/19/2015 103
ד''בס
12/19/2015 104
DSP lab 46
ד''בס
12/19/2015 105
ד''בס
12/19/2015 106
DSP lab 47
ד''בס
12/19/2015 107
ד''בס
12/19/2015 108
DSP lab 48
ד''בס
12/19/2015 109
ד''בס
12/19/2015 110
DSP lab 49
ד''בס
12/19/2015 111
ד''בס
12/19/2015 112
DSP lab 50
ד''בס
12/19/2015 113
ד''בס
12/19/2015 114
DSP lab 51
ד''בס
12/19/2015 115
ד''בס
12/19/2015 116
DSP lab 52
ד''בס
12/19/2015 117
ד''בס
12/19/2015 118
DSP lab 53
ד''בס
12/19/2015 119
ד''בס
12/19/2015 120
DSP lab 54
ד''בס
12/19/2015 121
ד''בס
12/19/2015 122
DSP lab 55
ד''בס
12/19/2015 123
ד''בס
12/19/2015 124
DSP lab 56
ד''בס
12/19/2015 125
ד''בס
12/19/2015 126
DSP lab 57
ד''בס
12/19/2015 127
ד''בס
12/19/2015 128
Paths to examples
C:\Program Files\Analog Devices\VisualDSP\211xx\examples
C:\Program Files\Analog Devices\VisualDSP\21k\examples
or
D:\Program Files\Analog Devices\VisualDSP\211xx\examples
D:\Program Files\Analog Devices\VisualDSP\21k\examples
DSP lab 58
ד''בס
Situations there can occur
overflow
Addition and Subtraction
– Overflow with twos complement integers occurs when the result of an addition or subtraction is larger the largest integer that can be represented, or smaller than the smallest integer. In fixed point representation, the largest or smallest value depends on the format of the number. I will assume Q31 in a 32 bit register for any examples that follow. In this case, a CPU with saturation arithmetic would set the result to -1 or (just below) +1 on an overflow, corresponding to the integer values 0x80000000 and 0x7FFFFFFF.
– Overflow in addition can only occur when the sign of the two numbers being added is the same. Overflow in subtraction can occur only when a negative number is subtracted from a positive number, or when a positive number is subtracted from a negative number.
12/19/2015 129
ד''בס
Situations there can occur
overflow
Arithmetic Shift
Overflow can occur when shifting a number left by 1 to n bits. In fixed point computations, left shifting is used to multiply a fixed point value by a power of two, or to change the format of a number (Q15 to Q31 for example). Again, many CPUs have saturation modes to set the output to the minimum or maximum 32 bit integer (depending on whether the original number was positive or negative). Furthermore, a common feature is an instruction that counts the number of leading ones or zeros in a number. This helps the programmer avoid overflow since the number of leading sign bits determines how large a shift can be done without causing overflow.
Overflow will not occur when right shifting a number.
12/19/2015 130
DSP lab 59
ד''בס
Situations there can occur
overflow Multiplication
– Overflow doesn’t really occur during multiplication if the result register has enough bits (32 bits if two 16 bit numbers are multiplied). But it is partly a matter of interpretation. When multiplying a fixed point value of -1 by -1 (0x8000 by 0x8000 using Q15 numbers), the result is +1. If the result is interpreted as a Q1.30 number (one integer bit and 30 fractional bits) then there is no problem. If the result is to be a Q30 number (no integer bits) then an overflow condition has occurred. And if the number was to be converted to Q31 (by shifting the result left by 1) then an overflow would occur during the left shift. The overall affect would be that -1 times -1 equals -1.
– Some CPUs have a multiplication mode that shifts the product left by one bit after a multiply operation. The reason for doing so is to create a Q31 result when two Q15 numbers are multiplied. Then if a Q15 result is desired, it can be found by storing the upper 16 bits of the result register (if the register is only 32 bits). The saturating mode automatically sets the result to 0x7FFFFFFF when the number 0x8000 is multiplied by itself, and the “shift left by one” multiplication mode is enabled.
– A very often used operation in DSP algorithms is the “multiply accumulate” or “MAC”, where a series of numbers is multiplied and added to a running sum. It is recommend not using the “left in shift by one” mode if possible when doing MACs, since this only increases the chance for overflow. A better technique is to keep the result as Q1.30, and then handle overflow if converting the final result to Q31 or Q15 (or whatever). This is also a good technique to use on CPUs without saturation modes, since the number of overflow checks can be greatly reduced some cases.
12/19/2015 131
ד''בס
Situations there can occur
overflow Division
– Overflow in division can occur when the result would have more bits than was calculated. For example, if the magnitude of the numerator is several times larger than that of the denominator, than the result must have enough bits to represent numbers larger than one. Overflow can be avoided by carefully considering the range of numbers being operated on, and calculating enough bits for the result. I have not seen a CPU that implements a saturation mode for division.
Negation
– There is one case where negation of a number causes an overflow condition. When the smallest negative number is negated, there is no way to represent the corresponding positive value in twos complement. For example, the value -1 in Q31 is 0x80000000. When this number is negated (flip the bits and add one) the result is again -1. If the saturation mode is set, then the CPU will set the result to 0x7FFFFFFF (just less than +1).
12/19/2015 132
DSP lab 60
ד''בס
Optimum Wordlength
Longer wordlength – May improve application
performance – Increases hardware cost
Shorter wordlength – May increase quantization errors
and overflows – Reduces hardware cost
Optimum wordlength – Maximize application performance
or minimize quantization error – Minimize hardware cost
Wordlength (w)
Cost c(w) Distortion d(w)
[1/performance]
Optimum
wordlength
ד''בס
Filter Implementation
Finite word-length effects (fixed point implementation)
– Coefficient quantization
– Overflow & quantization in arithmetic operations
• scaling to prevent overflow
• quantization noise statistical modeling
• limit cycle oscillations
DSP lab 61
ד''בס
Coefficient Quantization
Coefficient quantization effect on pole locations :
1. tightly spaced poles (e.g. for narrow band filters) imply high sensitivity of pole locations to coefficient quantization
2. hence preference for low-order systems (parallel/cascade)
Example: Implementation of a band-pass IIR 12-order filter
Cascade structure with 16-bit coeff. Direct form with 16-bit coeff.
ד''בס
12/19/2015 164
Introduction to DSP processors
The END