
ECSE 436


DSP architecture

Review of basic computer architecture concepts

C6000 architecture:
• VLIW Principle and Scheduling
• Addressing
• Assembly and linear assembly
• Pipelining


Instruction Set Architecture (ISA)

Computers run programs made of simple operations called “instructions”

The list of instructions offered by the machine is the “instruction set”

The instruction set is what is visible to the programmer (really the compiler, although humans can directly program in “assembly language”)

Many different DSPs can share the same ISA but have different hardware (i.e. the implementation of the ISA is different)


Instructions

Two kinds of information in a computer:
• instructions
• data

Instructions are stored as numbers, just like data

Instructions and data are stored in the memory


Basic Computer Organization

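[Figure: a CPU containing registers, a program counter (PC) and an instruction register (IR, holding OPCODE and OPERANDS), connected to memory by load and store paths.]

Registers: a limited number of fast registers for temporary storage

Memory: a large amount of slow memory, arranged as an array of bytes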

Instructions are loaded into an instruction register (IR) from the address pointed to by the program counter (PC). The PC is incremented by the instruction size (in bytes) for each new instruction, e.g. PC ← PC + 4


Load/Store Architecture (Reg-Reg)

[Figure: CPU with registers, PC and IR, connected to memory by load and store paths.]

• Instructions can ONLY get their data and store their data from/to registers.

• The register numbers are specified in the operand fields of the instruction

• Since data is stored in memory, we need special “load” and “store” instructions for transfers between registers and memory. These two instructions are the ONLY ones allowed to access memory
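The load/store rule can be illustrated with a toy register machine. This is a minimal Python sketch, not a model of any real DSP: the opcode names (`load`, `store`, `add`), the tuple encoding and the `run` helper are assumptions made for illustration, and the program counter counts instructions rather than bytes.

```python
def run(program, memory, num_regs=4):
    """Execute a toy load/store program and return (registers, memory)."""
    regs = [0] * num_regs
    mem = list(memory)
    pc = 0  # program counter, counted in instructions (not bytes) for simplicity
    while pc < len(program):
        op, *operands = program[pc]  # plays the role of the IR: opcode + operands
        if op == "load":             # load rd, addr: memory -> register
            rd, addr = operands
            regs[rd] = mem[addr]
        elif op == "store":          # store rs, addr: register -> memory
            rs, addr = operands
            mem[addr] = regs[rs]
        elif op == "add":            # add rs1, rs2, rd: touches registers only
            rs1, rs2, rd = operands
            regs[rd] = regs[rs1] + regs[rs2]
        else:
            raise ValueError(f"unknown opcode: {op}")
        pc += 1
    return regs, mem


# mem[2] = mem[0] + mem[1], done entirely through registers
program = [
    ("load", 0, 0),
    ("load", 1, 1),
    ("add", 0, 1, 2),
    ("store", 2, 2),
]
registers, memory = run(program, [5, 7, 0])
```

Only `load` and `store` index `mem`; the arithmetic instruction sees register numbers and nothing else.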


C6000 Architecture

TMS320C62x/C64x 16-bit fixed point DSP

TMS320C67x 32-bit floating point DSP
• Instruction set is a superset of the C62x

VLIW architecture: Very Long Instruction Word


VLIW

VLIW is an architecture that exploits instruction level parallelism (ILP) in the code

What is ILP?

An instruction is dependent on another if it uses (produces) a value produced (used) by the other instruction


Example

    add  c,d,e
    mult b,e,a

The mult instruction must wait for the add instruction to finish before it can execute (sequential data flow)

[Figure: data-flow graph in which the add result e feeds the mult.]


Example

    add a,b,e
    add c,d,f
    add e,f,g

The first two adds have no data dependency and could even be switched in the code with no effect on the correctness of the answer

The first two adds could be executed in parallel if we had the hardware to do it (two adders)

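[Figure: data-flow graph with three adders. Two adders compute e from a, b and f from c, d; a third adder computes g from e, f.]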
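The dependence test behind these examples can be written down directly. The sketch below is in Python and assumes the operand order used on the slides (two sources followed by one destination); the `parse` and `is_dependent` helpers are hypothetical names introduced for illustration.

```python
def parse(text):
    """Split 'add a,b,e' into ('add', ['a', 'b'], 'e'): sources first, destination last."""
    op, operands = text.split()
    *sources, destination = operands.split(",")
    return op, sources, destination


def is_dependent(second, first):
    """True if `second` uses a value `first` produces, or produces a value `first` uses."""
    _, first_sources, first_dest = parse(first)
    _, second_sources, second_dest = parse(second)
    uses_produced = first_dest in second_sources   # second reads what first writes
    produces_used = second_dest in first_sources   # second overwrites what first reads
    return uses_produced or produces_used


# The mult must wait for the add (it reads e).
mult_waits = is_dependent("mult b,e,a", "add c,d,e")
# The first two adds are independent and could run in parallel.
adds_independent = not is_dependent("add c,d,f", "add a,b,e")
```

Pairs for which `is_dependent` is false in both directions are the candidates for parallel execution.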


Scheduling

Given a set of hardware resources (functional units), e.g. a number of adders, multipliers, etc., the process of determining which instructions can be executed in parallel and which functional units to use on any given clock cycle is called instruction scheduling
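A very small scheduler makes the definition concrete. This Python sketch is a greedy, cycle-by-cycle scheduler under strong simplifying assumptions that are not from the slides: every instruction takes one cycle, each value is written by exactly one instruction, and only "reads a produced value" dependences are tracked. The tuple layout and the `schedule` name are hypothetical.

```python
def schedule(instructions, units):
    """Greedy schedule.

    instructions: list of (name, unit_kind, sources, destination) in program order
    units:        dict mapping unit_kind -> number of such units available per cycle
    Returns a list of cycles; each cycle is the list of instruction names issued in it.
    """
    produced_by = {dest: name for name, _, _, dest in instructions}
    finished = set()              # instructions whose results are available
    pending = list(instructions)
    cycles = []
    while pending:
        free = dict(units)        # every unit is free at the start of a cycle
        issued = []
        for instr in pending:
            name, kind, sources, _ = instr
            producers = {produced_by[s] for s in sources if s in produced_by}
            if producers <= finished and free.get(kind, 0) > 0:
                free[kind] -= 1
                issued.append(instr)
        if not issued:
            raise ValueError("nothing can issue: missing unit or cyclic dependence")
        for instr in issued:
            pending.remove(instr)
        finished.update(name for name, _, _, _ in issued)
        cycles.append([name for name, _, _, _ in issued])
    return cycles


program = [
    ("add1", "adder", ["a", "b"], "e"),
    ("add2", "adder", ["c", "d"], "f"),
    ("add3", "adder", ["e", "f"], "g"),
]
two_adders = schedule(program, {"adder": 2})
one_adder = schedule(program, {"adder": 1})
```

With two adders the first two adds share a cycle; with one adder the same program needs three cycles.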


VLIW

VLIW is an architecture that depends on the user (compiler) to do the scheduling

Instructions are packed into a very long instruction word (256 bits)

There is no scheduling hardware on the chip, unlike a Pentium 4, which uses hardware (dynamic) scheduling

Benefits
• simple hardware

Drawbacks
• requires sophisticated compilers
• code compatibility: need to recompile if you use a different DSP, even one with the same ISA


C6713 Architecture


Maximum Performance

C6713
• 8 functional units, two MACs per cycle
• 8 units × 225 MHz = 1800 MIPS
• 6 of the 8 units are floating point: 6 × 225 MHz = 1350 MFLOPS


Addressing Modes

Load/Store
• must load registers from memory, process data, store back to memory

Linear (indirect addressing)
• 32 registers A0-A15, B0-B15 can act as pointers
• *R: register R contains the address of the memory location where a data value is stored


Linear Addressing

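*R++(d): R contains the address. After R is used, it is postincremented by displacement d (default is d = 1); *R--(d) postdecrements

*++R(d): R is preincremented by d before it is used as the address; *--R(d) predecrements

*+R(d): the address is R + d, but R itself is not modified; *-R(d) uses R - d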
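The difference between these modes is only when (and whether) the pointer register changes. The Python sketch below models that behaviour; it assumes memory is a list indexed by element and that d counts elements rather than bytes, and the `access` helper and its mode strings are hypothetical names for illustration.

```python
def access(regs, memory, mode, reg, d=1):
    """Return the value addressed by `mode`, updating regs[reg] as the mode requires."""
    if mode == "*R":        # plain indirect: R is the address
        addr = regs[reg]
    elif mode == "*R++":    # use R, then R = R + d
        addr = regs[reg]
        regs[reg] += d
    elif mode == "*R--":    # use R, then R = R - d
        addr = regs[reg]
        regs[reg] -= d
    elif mode == "*++R":    # R = R + d, then use R
        regs[reg] += d
        addr = regs[reg]
    elif mode == "*--R":    # R = R - d, then use R
        regs[reg] -= d
        addr = regs[reg]
    elif mode == "*+R":     # use R + d, R unchanged
        addr = regs[reg] + d
    elif mode == "*-R":     # use R - d, R unchanged
        addr = regs[reg] - d
    else:
        raise ValueError(f"unknown addressing mode: {mode}")
    return memory[addr]


regs = {"A2": 1}
memory = [10, 20, 30, 40]
first = access(regs, memory, "*R++", "A2")       # reads memory[1], then A2 becomes 2
second = access(regs, memory, "*++R", "A2")      # A2 becomes 3, then reads memory[3]
third = access(regs, memory, "*-R", "A2", d=2)   # reads memory[1], A2 stays 3
```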


Circular Addressing


Address Mode Register (AMR)
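The idea of circular addressing is that a pointer which steps past the end of a block wraps back to its start. The Python sketch below shows only that wrap-around arithmetic; it does not model the AMR bit fields, and the `circular_increment` helper, the base address and the block size are hypothetical values chosen for illustration.

```python
def circular_increment(pointer, step, base, block_size):
    """Advance `pointer` by `step`, keeping it inside [base, base + block_size)."""
    offset = (pointer - base + step) % block_size
    return base + offset


# Walk six steps through a 4-element block that starts at address 100.
pointer = 100
visited = []
for _ in range(6):
    visited.append(pointer)
    pointer = circular_increment(pointer, 1, base=100, block_size=4)
```

The same helper handles negative steps, because Python's `%` returns a non-negative result for a positive modulus.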


TMS320 Assembly Language

[label][:] mnemonic [operand list] [; comment]

[x] means that x is optional

label
• symbolic name for the address of the program line

mnemonic
• instruction, assembler directive, or macro
• cannot start in column 1

operands
• constants: binary (e.g. 010101b), decimal, hexadecimal (e.g. 0x9f or 9fh)
• register names
• symbols defined by assembler directives


Assembler Directives

The assembler produces COFF (Common Object File Format) files

COFF files are divided into sections that contain instructions or data

Assembler directives
• are instructions to the assembler on how to manipulate these sections or to define constants
• are not machine instructions
• see Section 4.1 in the text for more details


C6000 ISA

[Figure: an annotated C6000 instruction, with labels for parallel execution, conditional execution, and the functional unit.]


Instruction Packing

       Instruction 1   ; instructions 1 and 2
       Instruction 2   ; are executed sequentially
       Instruction 3   ; instructions 3, 4, and 5
    || Instruction 4   ; are executed in parallel
    || Instruction 5

VelociTI: 1 to 8 execute packets in a fetch packet
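The grouping rule implied by the `||` notation can be sketched in a few lines of Python: a line that starts with `||` joins the execute packet of the line before it, and any other line starts a new packet. The `execute_packets` helper is a hypothetical name, and the sketch looks only at the source text, not at the encoded instruction words.

```python
def execute_packets(lines):
    """Group assembly lines into execute packets using the leading '||' marker."""
    packets = []
    for line in lines:
        text = line.split(";")[0].strip()   # drop the comment and surrounding spaces
        if not text:
            continue                        # skip blank or comment-only lines
        if text.startswith("||"):
            if not packets:
                raise ValueError("'||' cannot appear on the first instruction")
            packets[-1].append(text[2:].strip())
        else:
            packets.append([text])
    return packets


listing = [
    "   Instruction 1 ; instructions 1 and 2",
    "   Instruction 2 ; are executed sequentially",
    "   Instruction 3 ; instructions 3, 4, and 5",
    "|| Instruction 4 ; are executed in parallel",
    "|| Instruction 5",
]
packets = execute_packets(listing)
```

For this listing the five instructions form three execute packets.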


Sample Instructions

ADD .L1 A3,A7,A7 ;add A3+A7->A7

SUB .S1 A1,1,A1 ;subtract 1 from A1

   MPY  .M2 A7,B7,B6 ; mult 16LSBs of A7,B7->B6
|| MPYH .M1 A7,B7,A6 ; mult 16MSBs of A7,B7->A6

   LDH .D2 *B2++,B7 ; load (B2) -> B7, inc B2
|| LDH .D1 *A2++,A7 ; load (A2) -> A7, inc A2


Sample Instructions

Loop    MVKL .S1 x,A4      ; move 16 LSBs of x addr->A4
        MVKH .S1 x,A4      ; move 16 MSBs of x addr->A4
        SUB  .S1 A1,1,A1   ; decrement A1
 [A1]   B    .S2 Loop      ; branch to Loop if A1 != 0
        NOP  5             ; 5 NOP instructions
        STW  .D1 A3,*A7    ; store A3 into (A7)


Linear Assembly

To effectively program a DSP using assembly language, you need to do the scheduling by hand!

Need to account for the number of clock cycles each functional unit takes, etc…

This is difficult, so TI has linear assembly
• you don’t have to schedule it, the compiler does it for you
• you can use CPU resources without worrying about scheduling, register allocation, etc.


Pipelining

Key technique to make fast CPUs

Multiple instructions are overlapped in execution

E.g. Automotive assembly line


Pipelining: principle

Body (B): 1 hour

Paint (P): 1 hour

Wheels (W): 1 hour


Pipelining: principle (II)

[Figure: timeline over hours 0 to 6 for a single worker, Bob, who does B1, P1, W1 and then B2, P2, W2.]

2 cars / 6 hours = 1/3 car / hour


Pipelining: principle (III)

[Figure: timeline over hours 0 to 6 for three workers (Bob, Alice, Bill), one per stage. Bodies B1 to B6, paint jobs P1 to P5 and wheel fittings W1 to W4 overlap in time.]

1 car / hour (3× speedup)
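The numbers on the two assembly-line slides follow from a simple count. The Python sketch below assumes, as on the slides, that every stage takes exactly one hour and that the line never stalls; the `total_hours` helper is a hypothetical name.

```python
def total_hours(cars, stages=3, pipelined=False):
    """Hours needed to finish `cars` cars when each of `stages` steps takes one hour."""
    if cars == 0:
        return 0
    if pipelined:
        # Fill the line once (`stages` hours), then one more car finishes every hour.
        return stages + (cars - 1)
    # One worker does every step of a car before starting the next car.
    return stages * cars


single_worker = total_hours(2)               # Bob alone: 2 cars
three_workers = total_hours(4, pipelined=True)  # three workers: 4 cars
```

Both cases take six hours, but the pipelined line finishes four cars in that time instead of two, and for long runs its throughput approaches one car per hour.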


Pipelining: principle (IV)

[Figure: a block of combinational logic (COMB. LOGIC) and its cycle time, shown without and with pipelining.]


Performance Gain

Pipelining a datapath m times can result in up to m times improvement in cycle time
• E.g. a 5-stage pipelined processor is potentially 5 times faster than an unpipelined processor

In reality, this is limited to less than m because of restrictions in overlapping instructions
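One common textbook approximation for this limit is speedup = m / (1 + stall cycles per instruction). It is not taken from the slides and assumes perfectly balanced stages and an ideal rate of one instruction per cycle when nothing stalls. A minimal Python sketch, with `pipeline_speedup` as a hypothetical helper name:

```python
def pipeline_speedup(stages, stall_cycles_per_instruction=0.0):
    """Approximate speedup of an m-stage pipeline over an unpipelined datapath."""
    if stages < 1:
        raise ValueError("a pipeline needs at least one stage")
    if stall_cycles_per_instruction < 0:
        raise ValueError("stall cycles cannot be negative")
    return stages / (1.0 + stall_cycles_per_instruction)


ideal = pipeline_speedup(5)            # no stalls: the full factor of m
with_stalls = pipeline_speedup(5, 0.25)  # one stall every four instructions
```

Even an average of a quarter of a stall cycle per instruction reduces a 5-stage pipeline from a 5× to a 4× gain.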


5-Stage RISC Pipeline


16-Stage C6713 Pipeline

Fetch (4 stages)
• calculate address, send address, wait, receive

Decode (2 stages)
• separate fetch packets into execute packets

Execute (10 stages)
• different instructions require different numbers of cycles to execute


Software and I/O

Code efficiency and programming techniques
• Loop unrolling
• Software pipelining

I/O considerations
• Interrupts
• DMA
• Block processing


Code Efficiency

Intrinsic functions
• e.g. _add2, _mpy, _sadd
• see the TMS320C62x/C67x Programmer's Guide

Packed data
• use word accesses to operate on 16-bit data stored in the high and low parts of a 32-bit register
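The packed-data idea can be emulated in Python to show what an intrinsic such as `_add2` computes: the two 16-bit halves of each 32-bit word are added independently, and a carry out of the low half does not reach the high half. This is an emulation for illustration only; `pack2` and `add2` are hypothetical helper names, and the inputs are assumed to be unsigned 32-bit values.

```python
MASK16 = 0xFFFF


def pack2(high, low):
    """Pack two 16-bit values into one 32-bit word (high half, low half)."""
    return ((high & MASK16) << 16) | (low & MASK16)


def add2(a, b):
    """Add the high halves and the low halves of two 32-bit words independently."""
    low = ((a & MASK16) + (b & MASK16)) & MASK16                   # carry out is discarded
    high = (((a >> 16) & MASK16) + ((b >> 16) & MASK16)) & MASK16
    return (high << 16) | low


x = pack2(1, 0xFFFF)
y = pack2(2, 0x0001)
packed_sum = add2(x, y)            # high: 1 + 2 = 3, low: 0xFFFF + 1 wraps to 0
plain_sum = (x + y) & 0xFFFFFFFF   # ordinary 32-bit add lets the carry leak upward
```

One word access plus one packed add therefore handles two 16-bit samples at once.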


Loop Unrolling

A loop is a compact way of representing a repetitive sequence of instructions, but…

The loop condition test is overhead

To remove the loop overhead, unroll the loop (make copies of the loop code)
• key way of exposing parallelism!
• the compiler can now look across loop iterations to find parallel instructions
• parallelism is increased, but so is code size


Example

; program A: code without unrolling
        MVK  4,B0
loop:   LDH  *A5++,A0
||      LDH  *A6++,A1
        ADD  A0,A1,A2      ; add 4 times
        …
        SUB  B0,1,B0
 [B0]   B    loop


Example

; program B: code with unrolling once
        MVK  2,B0
loop:   LDH  *A5++,A0
||      LDH  *A6++,A1      ; add first 2 numbers
        ADD  A0,A1,A2
        …
        LDH  *A5++,A0
||      LDH  *A6++,A1      ; add other 2 numbers
        ADD  A0,A1,A2
        …
        SUB  B0,1,B0
 [B0]   B    loop
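The effect of unrolling on loop overhead can be counted in a high-level sketch. The Python below mirrors the structure of programs A and B (a down-counter tested at the bottom of the loop) and counts how many times the loop test runs. It is an illustration rather than a translation of the assembly: the function names are hypothetical, the sums are collected in a list, and the inputs are assumed to be non-empty (and of even length for the unrolled version).

```python
def add_rolled(x, y):
    """Program A style: one addition per pass, one loop test per pass."""
    out, tests = [], 0
    count, i = len(x), 0          # plays the role of MVK 4,B0 when len(x) == 4
    while True:
        out.append(x[i] + y[i])
        i += 1
        count -= 1                # SUB B0,1,B0
        tests += 1                # [B0] B loop
        if count == 0:
            break
    return out, tests


def add_unrolled_once(x, y):
    """Program B style: two additions per pass, so half as many loop tests."""
    if len(x) % 2:
        raise ValueError("length must be even for this unrolling")
    out, tests = [], 0
    count, i = len(x) // 2, 0     # plays the role of MVK 2,B0 when len(x) == 4
    while True:
        out.append(x[i] + y[i])           # first copy of the loop body
        out.append(x[i + 1] + y[i + 1])   # second copy of the loop body
        i += 2
        count -= 1
        tests += 1
        if count == 0:
            break
    return out, tests


rolled = add_rolled([1, 2, 3, 4], [10, 20, 30, 40])
unrolled = add_unrolled_once([1, 2, 3, 4], [10, 20, 30, 40])
```

Both versions produce the same sums; the unrolled one executes the decrement-and-branch pair twice instead of four times, at the cost of a longer loop body.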


Software Pipelining

Software pipelining
• compiler technique (don’t confuse with h/w pipelining)
• schedule multiple iterations of a loop together to fill any empty cycles and maximize functional unit usage

Compiler options: -O2, -O3


Software Pipelining

The general idea of this optimization is to uncover long sequences of statements without branch statements

Reorganize loops to interleave instructions from different iterations
• Dependent instructions within a single loop iteration are then separated from one another by an entire loop body

Increases possibilities of scheduling
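The interleaving can be made visible by printing which iteration each stage works on in every cycle. The Python sketch below assumes a loop body made of three dependent one-cycle stages that use different functional units; the stage names (`load`, `mul`, `store`) and the `software_pipeline` helper are hypothetical.

```python
def software_pipeline(num_iterations, stages=("load", "mul", "store")):
    """Return one list per cycle of the (stage, iteration) pairs issued together."""
    depth = len(stages)
    cycles = []
    for cycle in range(num_iterations + depth - 1):
        slot = []
        for position, stage in enumerate(stages):
            iteration = cycle - position   # later stages work on older iterations
            if 0 <= iteration < num_iterations:
                slot.append((stage, iteration))
        cycles.append(slot)
    return cycles


schedule = software_pipeline(4)
sequential_cycles = 4 * 3        # same loop with no overlap between iterations
```

Cycles 0 and 1 fill the pipeline, cycles 2 and 3 are the steady state in which all three units are busy on three different iterations, and the last two cycles drain it. Four iterations finish in 6 cycles instead of 12.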


Software Pipelining

[Figure: iterations 0 to 4 of the original loop overlapped in time, with one software-pipelined iteration marked across them.]


Software Pipelining

Advantage: yields shorter code than loop unrolling and uses fewer registers

Software pipelining is crucial for VLIW processors

Often, both software pipelining and loop unrolling are used


Interrupts

A signal that causes the processor to suspend its current program and execute a special subroutine, the interrupt service routine (ISR)

Sources
• On-chip peripherals: timers, serial ports
• External: resets, external peripherals
• Software interrupts: arithmetic exceptions (divide by zero, overflow)


Direct Memory Access

Data transfer without intervention of the processor
• memory and CPU
• peripherals and CPU

DMA channel:
• source address
• destination address
• element count in a frame
• number of frames in a block
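The four channel parameters are enough to describe a simple block copy. The Python sketch below treats memory as a list and assumes contiguous, incrementing, non-overlapping source and destination regions; the `DmaChannel` class, its field names and `transfer_block` are hypothetical and do not correspond to any real register layout.

```python
from dataclasses import dataclass


@dataclass
class DmaChannel:
    source: int               # source address (index into the memory list)
    destination: int          # destination address
    elements_per_frame: int   # element count in a frame
    frames_per_block: int     # number of frames in a block


def transfer_block(memory, channel):
    """Copy one block, element by element, and return the number of elements moved."""
    count = channel.elements_per_frame * channel.frames_per_block
    for i in range(count):
        memory[channel.destination + i] = memory[channel.source + i]
    return count


memory = [1, 2, 3, 4, 5, 6, 0, 0, 0, 0, 0, 0]
channel = DmaChannel(source=0, destination=6, elements_per_frame=3, frames_per_block=2)
moved = transfer_block(memory, channel)
```

On real hardware this copy proceeds without the CPU executing a loop; the function body stands in for what the DMA controller does on its own.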


Block Processing


Ping-Pong Buffering

Ping-pong buffer (double buffer)

DMA channel delivers N samples of data in and out of buffers while the DSP operates on data in the current buffer

For the next block, the roles of the two buffers are swapped
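The buffer swap can be simulated sequentially. In the Python sketch below, a slice assignment stands in for the DMA filling one buffer, and a user-supplied `process` function stands in for the DSP working on the other; only the input direction is modelled, and a trailing partial block is ignored. The `ping_pong` helper and its parameters are hypothetical.

```python
def ping_pong(samples, n, process):
    """Process `samples` in blocks of `n` using two alternating buffers."""
    buffers = [[0] * n, [0] * n]
    fill = 0          # index of the buffer currently owned by the "DMA"
    outputs = []
    blocks = [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]
    for index, block in enumerate(blocks):
        buffers[fill][:] = block                        # DMA fills one buffer ...
        if index > 0:
            outputs.append(process(buffers[1 - fill]))  # ... while the DSP uses the other
        fill = 1 - fill                                 # swap roles for the next block
    if blocks:
        outputs.append(process(buffers[1 - fill]))      # the last filled block
    return outputs


block_sums = ping_pong([1, 2, 3, 4, 5, 6], 2, sum)
```

Each block is processed one block period after it arrives, which is the latency paid for letting transfer and computation overlap.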
