module 2. syllabus fixed and floating point formats code improvement constraints tms 320c64x cpu...

MODULE 2

Syllabus

Fixed and floating point formats, code improvement, constraints, TMS320C64x CPU, simple programming examples using C/assembly.

Fixed point numbers

• Fast and inexpensive implementation
• Limited in the range of numbers
• Susceptible to problems of overflow
• Fixed-point numbers and their data types are characterized by their word size in bits and by whether they are signed or unsigned. The ranges below are for a 16-bit word.
• Unsigned integer: the stored number can take on any integer value from 0 to 65,535.
• Signed integer: uses two's complement, which allows negative numbers; it ranges from -32,768 to 32,767.
• Unsigned fraction: 65,536 levels spread uniformly between 0 and 1.
• Signed fraction: allows negative numbers, with levels equally spaced between -1 and 1.
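The fraction formats above can be illustrated with a small conversion sketch. It assumes the 16-bit signed fraction is stored as an integer scaled by 2^15 (commonly called Q15) and the unsigned fraction as an integer scaled by 2^16; the function names are hypothetical and are not part of any TI library.

```c
#include <stdint.h>

/* Convert a real value to a 16-bit signed fraction (scaled by 2^15).
   Values outside the representable range are saturated. */
int16_t double_to_q15(double x)
{
    double scaled = x * 32768.0;
    if (scaled >= 32767.0)  return INT16_MAX;
    if (scaled <= -32768.0) return INT16_MIN;
    /* round to nearest before truncating */
    return (int16_t)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
}

/* Convert a 16-bit signed fraction back to a real value in [-1, 1). */
double q15_to_double(int16_t q)
{
    return q / 32768.0;
}

/* Convert a 16-bit unsigned fraction (65,536 levels) to a real value in [0, 1). */
double uq16_to_double(uint16_t u)
{
    return u / 65536.0;
}
```

Note that +1.0 itself is not representable in the signed fraction format, so it saturates to the largest level, 32767/32768.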

Carry and Overflow

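• Carry applies to unsigned numbers: when an addition or subtraction produces a carry out of the most significant bit, the stored result is incorrect.

• Overflow applies to signed numbers: when an addition or subtraction produces a value outside the representable range, the sign bit is corrupted and the stored result is incorrect.

Examples:

  Overflow (signed, 5 bits)      Carry (unsigned, 3 bits)

     01111                           100
   + 00111                         + 111
   -------                         -----
     10110                          1011
     ^ sign bit                     ^ carry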
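As a rough illustration of the two conditions, the sketch below adds 16-bit values in a wider type and checks whether the result still fits. The function names are hypothetical. The final narrowing cast relies on two's complement wraparound, which is what common compilers do but is an assumption here.

```c
#include <stdint.h>
#include <stdbool.h>

/* Carry: the unsigned 16-bit sum does not fit in 16 bits. */
bool add_u16_carry(uint16_t a, uint16_t b, uint16_t *sum)
{
    uint32_t wide = (uint32_t)a + (uint32_t)b;
    *sum = (uint16_t)wide;              /* low 16 bits, as the hardware would keep */
    return wide > 0xFFFFu;
}

/* Overflow: the signed 16-bit sum falls outside -32768..32767. */
bool add_s16_overflow(int16_t a, int16_t b, int16_t *sum)
{
    int32_t wide = (int32_t)a + (int32_t)b;
    *sum = (int16_t)(uint16_t)wide;     /* assumed two's complement wraparound */
    return wide > INT16_MAX || wide < INT16_MIN;
}
```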

Data types

1. short: 16 bits, represented as 2's complement, with a range from -2^15 to (2^15 - 1).
2. int or signed int: 32 bits, represented as 2's complement, with a range from -2^31 to (2^31 - 1).
3. float: 32 bits, represented as IEEE 32-bit, with a range from 2^-126 (1.175494 x 10^-38) to 2^+128 (3.40282346 x 10^38).
4. double: 64 bits, represented as IEEE 64-bit, with a range from 2^-1022 (2.22507385 x 10^-308) to 2^+1024 (1.79769313 x 10^308).

Floating-point representation

• The advantage over fixed-point representation is that it can support a much wider range of values.
• The floating-point format needs slightly more storage.
• The speed of floating-point operations is measured in FLOPS.

General format of a floating-point number:

X = M x b^e

where M is the value of the significand (mantissa), b is the base, and e is the exponent.

Mantissa determines the accuracy of the number

Exponent determines the range of numbers that can be represented

Floating point numbers can be represented as:

Single precision:
• called "float" in the C language family
• a binary format that occupies 32 bits
• its significand has a precision of 24 bits

Double precision:
• called "double" in the C language family
• a binary format that occupies 64 bits
• its significand has a precision of 53 bits

Single Precision (SP):

• Bit 31 is the sign bit (s)
• Bits 23 to 30 are the exponent bits (e)
• Bits 0 to 22 are the fractional bits (f)
• Numbers as small as 10^-38 and as large as 10^38 can be represented

  |  s  |    e     |    f    |
    31    30 .. 23   22 .. 0
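The single-precision bit layout can be checked on any machine whose float is IEEE 32-bit. The sketch below, with a hypothetical struct and function name, copies the bits of a float into an integer and splits them into the three fields; the exponent field is stored with the IEEE bias of 127.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t sign;      /* bit 31 */
    uint32_t exponent;  /* bits 30..23, biased by 127 */
    uint32_t fraction;  /* bits 22..0 */
} sp_fields;

/* Split a single-precision value into sign, exponent and fraction fields. */
sp_fields sp_unpack(float x)
{
    uint32_t bits;
    sp_fields f;
    memcpy(&bits, &x, sizeof bits);     /* reinterpret the 32 bits safely */
    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}
```

For example, -2.5 is -1.25 x 2^1, so its sign is 1, its stored exponent is 128 and its fraction field is 0x200000 (0.25 x 2^23).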

Double precision (DP):

• With 64 bits, more exponent and fractional bits are available
• A pair of registers is used
• Bits 0 to 31 of the first register are fractional bits
• Bits 0 to 19 of the second register are also fractional bits
• Bits 20 to 30 of the second register are exponent bits
• Bit 31 of the second register is the sign bit
• Numbers as small as 10^-308 and as large as 10^+308 can be represented

  Second register: |  s  |    e     |    f    |     First register: |    f    |
                     31    30 .. 20   19 .. 0                         31 .. 0

• Instructions ending in SP or DP operate on single-precision and double-precision values, respectively.
• Some floating-point instructions have more latency than fixed-point instructions.
  Eg: MPY requires one delay slot
      MPYSP requires three delay slots
      MPYDP requires nine delay slots
• A single-precision floating-point value can be loaded into a single register, whereas a double-precision value needs a pair of registers: A1:A0, A3:A2, ..., B1:B0, B3:B2, ...
• The C6711 processor has a single-precision reciprocal instruction, RCPSP, for performing division.

Code improvement

Code written in assembly (ASM) is processor-specific.

C code can readily be ported from one platform to another.

Optimized ASM code runs faster than C and requires less memory space.

Before optimizing, make sure that the code is functional and yields correct results.

After optimizing, the code can be so reorganized and resequenced that it becomes difficult to follow.

If a C coded algorithm is functional and its execution speed is satisfactory, there is no need to optimize further.

If the performance of the code is not adequate, use different compiler options to enable software pipelining, reduce redundant loops, and so on.


If the desired performance is still not achieved, you can use loop unrolling to avoid overhead in branching. This generally improves the execution speed but increases code size.

You can also use word-wide optimization by loading/accessing 32-bit word (int) data rather than 16-bit half-word (short) data. You can then process the lower and upper 16-bit data independently.

If performance is still not satisfactory, you can rewrite the time-critical section of the code in linear assembly, which can be optimized by the assembler optimizer.

The profiler can be used to determine the specific function(s) that need to be optimized further.

Optimization Steps

If the performance and results of your code are satisfactory after any particular step, you are done.

1. Program in C. Build your project without optimization.
2. Use intrinsic functions when appropriate, as well as the various optimization levels.
3. Use the profiler to identify the functions that may need to be further optimized. Then convert these functions to linear ASM.
4. Optimize code in ASM.

Compiler options

A C-coded program is first passed through a parser that performs preprocessing functions and generates an intermediate file (.if), which becomes the input to an optimizer.

The optimizer generates an (.opt) file, which becomes the input to a code generator that performs further optimization and generates an ASM file.

C code -> Parser -> (.if) -> Optimizer -> (.opt) -> Code generator -> ASM

The options for optimization levels:

1. -o0 optimizes the use of registers.
2. -o1 performs local optimization in addition to the optimization done by -o0.
3. -o2 performs global optimization in addition to the optimizations done by -o0 and -o1.
4. -o3 performs file optimization in addition to the optimizations done by -o0, -o1 and -o2.

-o2 and -o3 attempt to do software pipelining.

Intrinsic C functions:

• Similar to run-time support library functions.
• C intrinsic functions are used to increase the efficiency of code.

1. int _mpy() has an equivalent ASM instruction MPY, which multiplies the 16 LSBs of a number by the 16 LSBs of another number.
2. int _mpyh() has an equivalent ASM instruction MPYH, which multiplies the 16 MSBs of a number by the 16 MSBs of another number.
3. int _mpylh() has an equivalent ASM instruction MPYLH, which multiplies the 16 LSBs of a number by the 16 MSBs of another.
4. int _mpyhl() has an equivalent ASM instruction MPYHL, which multiplies the 16 MSBs of a number by the 16 LSBs of another.
5. void _nassert(int) generates no code. It tells the compiler that the expression declared with the assert function is true.
6. uint _lo(double) and uint _hi(double) obtain the low and high 32 bits of a double word.
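The behaviour described for the four multiply intrinsics can be mimicked in portable C when no TI compiler is at hand. The sketch below is only an emulation under the descriptions in the list: the `_emul` names are hypothetical, the operands are treated as signed 16-bit halves, and the narrowing casts assume two's complement wraparound.

```c
#include <stdint.h>

/* Lower and upper signed 16-bit halves of a 32-bit word. */
static int16_t lo16(int32_t w) { return (int16_t)(uint16_t)((uint32_t)w & 0xFFFFu); }
static int16_t hi16(int32_t w) { return (int16_t)(uint16_t)((uint32_t)w >> 16); }

/* 16 LSBs x 16 LSBs, as described for _mpy */
int32_t mpy_emul(int32_t a, int32_t b)   { return (int32_t)lo16(a) * lo16(b); }

/* 16 MSBs x 16 MSBs, as described for _mpyh */
int32_t mpyh_emul(int32_t a, int32_t b)  { return (int32_t)hi16(a) * hi16(b); }

/* 16 LSBs of a x 16 MSBs of b, as described for _mpylh */
int32_t mpylh_emul(int32_t a, int32_t b) { return (int32_t)lo16(a) * hi16(b); }

/* 16 MSBs of a x 16 LSBs of b, as described for _mpyhl */
int32_t mpyhl_emul(int32_t a, int32_t b) { return (int32_t)hi16(a) * lo16(b); }
```

With a = 0x00030002 (upper half 3, lower half 2) and b = 0x0005FFFF (upper half 5, lower half -1), the four functions return -2, 15, 10 and -3 respectively.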

PROCEDURE FOR CODE OPTIMIZATION

1. Use instructions in parallel so that multiple functional units can be operated within the same cycle.
2. Eliminate NOPs or delay slots, placing code where the NOPs are.
3. Unroll the loop to avoid overhead with branching.
4. Use word-wide data to access a 32-bit word (int) in lieu of a 16-bit half-word (short).
5. Use software pipelining.

PROGRAMMING EXAMPLES USING CODE OPTIMIZATION TECHNIQUES

Sum of Products with Word-Wide Data Access for Fixed-Point Implementation Using C Code

//twosum.c Sum of Products with separate accumulation of even/odd terms
//with word-wide data for fixed-point implementation
int dotp (short a[ ], short b [ ])
{
    int suml, sumh, sum, i;
    suml = 0;
    sumh = 0;
    sum = 0;

    for (i = 0; i < 200; i += 2)
    {
        suml += a[i] * b[i];         //sum of products of even terms
        sumh += a[i + 1] * b[i + 1]; //sum of products of odd terms
    }
    sum = suml + sumh;               //final sum of odd and even terms
    return (sum);
}

//dotpintrinsic.c Sum of products with C intrinsic functions using C
//(fragment: a[i] and b[i] are read here as 32-bit words, each holding two 16-bit values)

    for (i = 0; i < 100; i++)
    {
        suml = suml + _mpy(a[i], b[i]);   //lower 16-bit halves
        sumh = sumh + _mpyh(a[i], b[i]);  //upper 16-bit halves
    }
    return (suml + sumh);
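The word-wide idea in the two listings above can be tried in portable C. The sketch below is an illustration, not the original program: `dotp_wordwide` is a hypothetical name, the element count `n` is assumed to be even, and each 32-bit load is done with `memcpy` so that it is valid on any host. Which half of the word holds the even-indexed element depends on endianness, but the final sum does not.

```c
#include <stdint.h>
#include <string.h>

/* Dot product that loads two 16-bit samples per 32-bit word and keeps
   separate accumulators for the lower and upper halves. n must be even. */
int32_t dotp_wordwide(const int16_t *a, const int16_t *b, int n)
{
    int32_t suml = 0, sumh = 0;
    for (int i = 0; i < n; i += 2) {
        uint32_t aw, bw;
        memcpy(&aw, &a[i], sizeof aw);   /* one 32-bit load = two shorts */
        memcpy(&bw, &b[i], sizeof bw);
        /* narrowing casts assume two's complement wraparound */
        suml += (int32_t)(int16_t)(aw & 0xFFFFu) * (int16_t)(bw & 0xFFFFu);
        sumh += (int32_t)(int16_t)(aw >> 16) * (int16_t)(bw >> 16);
    }
    return suml + sumh;
}
```

The loop runs n/2 times, which is the same halving of the trip count that the listings obtain by stepping i by 2 or by iterating 100 times over 200 samples.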

Sum of Products with Word-Wide Access for Fixed-Point Implementation Using Linear ASM Code

;Sum of Products. Separate accum of even/odd terms
;With word-wide data for fixed-point implementation using linear ASM

loop:    LDW    *aptr++, ai          ;32-bit word ai
         LDW    *bptr++, bi          ;32-bit word bi
         MPY    ai, bi, prodl        ;lower 16-bit product
         MPYH   ai, bi, prodh        ;higher 16-bit product
         ADD    prodl, suml, suml    ;accum even terms
         ADD    prodh, sumh, sumh    ;accum odd terms
         SUB    count, 1, count      ;decrement count
 [count] B      loop                 ;branch to loop

;dotpnp.asm ASM Code with no-parallel instructions for fixed-point

         MVK   .S1   200, A1       ;count into A1
         ZERO  .L1   A7            ;init A7 for accum
LOOP     LDH   .D1   *A4++,A2      ;A2=16-bit data pointed by A4
         LDH   .D1   *A8++,A3      ;A3=16-bit data pointed by A8
         NOP   4                   ;4 delay slots for LDH
         MPY   .M1   A2,A3,A6      ;product in A6
         NOP                       ;1 delay slot for MPY
         ADD   .L1   A6,A7,A7      ;accum in A7
         SUB   .S1   A1,1,A1       ;decrement count
 [A1]    B     .S2   LOOP          ;branch to LOOP
         NOP   5                   ;5 delay slots for B

Dot Product with Parallel Instructions for Fixed-Point Implementation Using ASM Code

;twosumfix.asm ASM code for two sums of products with word-wide data for fixed-point implementation

         MVK   .S1   100, A1       ;count/2 into A1
     ||  ZERO  .L1   A7            ;init A7 for accum of even terms
     ||  ZERO  .L2   B7            ;init B7 for accum of odd terms
LOOP     LDW   .D1   *A4++,A2      ;A2=32-bit data pointed by A4
     ||  LDW   .D2   *B4++,B2      ;B2=32-bit data pointed by B4
         SUB   .S1   A1,1,A1       ;decrement count
 [A1]    B     .S1   LOOP          ;branch to LOOP (after ADD)
         NOP   2                   ;delay slots for both LDW and B
         MPY   .M1x  A2,B2,A6      ;lower 16-bit product in A6
     ||  MPYH  .M2x  A2,B2,B6      ;upper 16-bit product in B6
         NOP                       ;1 delay slot for MPY/MPYH
         ADD   .L1   A6,A7,A7      ;accum even terms in A7
     ||  ADD   .L2   B6,B7,B7      ;accum odd terms in B7
                                   ;branch occurs here

Trip directive for loop count:

Linear assembly directive (.trip) is used to specify the number of times a loop iterates.

If the exact number is known and used, redundant loops are not generated, which can improve both code size and execution time.

Software pipelining

• Software pipelining is a scheme which uses the available resources to obtain efficient pipelined code.
• The aim is to use all eight functional units within one cycle.
• Optimization levels -o2 and -o3 enable code generation to generate (or attempt to generate) software-pipelined code.

There are three stages:

1. Prolog (warm-up): this stage contains the instructions needed to build up the loop kernel cycle.
2. Loop kernel (cycle): within this loop, all instructions are executed in parallel. The entire loop is executed in one cycle.
3. Epilog (cool-off): this stage contains the instructions necessary to complete all iterations.

Procedure for hand-coded software pipelining:

1. Draw the dependency graph
2. Set up a scheduling table
3. Obtain code from the scheduling table.

Dependency graph: (Procedure)

1. Draw the nodes and paths

2. Write the number of cycles to complete an instruction

3. Assign functional units associated with each code

4. Separate the data paths, so that the maximum number of units are utilized.

Dependency Graph dot product (a) initial stage (b) Final stage

• A node has one or more data paths going in and/or out of the node

• The numbers next to each node represent the number of cycles required to complete the associated instruction.

• A parent node contains an instruction that writes to a variable;

• Child node contains an instruction that reads a variable written by the parent.

• The LDH instructions are considered to be the parents of the MPY instruction since the results of the two load instructions are used to perform the MPY instruction.

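Dependency graph: (Eg. Two sum of product)

[Figure: dependency graph for the two-sums-of-products loop. Nodes, with functional unit and number of cycles to complete:
  Side A: LDW ai (.D1, 5), MPY prodl (.M1x, 2), ADD suml (.L1, 1), SUB count (.S1, 1)
  Side B: LDW bi (.D2, 5), MPYH prodh (.M2x, 2), ADD sumh (.L2, 1), B loop (.S2, 1)]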

Scheduling table:

1. LDW starts in cycle 1.
2. MPY and MPYH must start five cycles after LDW, due to the four delay slots of LDW. Therefore MPY/MPYH start in cycle 6.
3. ADD must start two cycles after MPY/MPYH, due to the one delay slot of MPY/MPYH. Therefore ADD starts in cycle 8.
4. B has five delay slots and starts in cycle 3, so that branching occurs in cycle 9, after the ADD instructions.
5. SUB must start one cycle before the branch instruction, since the loop count is decremented before branching occurs. Therefore SUB starts in cycle 2.

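Schedule table before software pipelining:

Units | 1,9,17.. | 2,10,18.. | 3,11,.. | 4,12,.. | 5,13,.. | 6,14,.. | 7,15,.. | 8,16,..
.D1   | LDW      |           |         |         |         |         |         |
.D2   | LDW      |           |         |         |         |         |         |
.M1   |          |           |         |         |         | MPY     |         |
.M2   |          |           |         |         |         | MPYH    |         |
.L1   |          |           |         |         |         |         |         | ADD
.L2   |          |           |         |         |         |         |         | ADD
.S1   |          | SUB       |         |         |         |         |         |
.S2   |          |           | B       |         |         |         |         |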

• Instructions within prolog stage (cycles 1-7) are repeated until and including loop kernel (cycle 8).

• Instructions in the epilog stage (cycles 9,10…) are to complete the functionality of the code.

Schedule table after software pipelining:

Loop Kernel

• Within the loop kernel (cycle 8), multiple iterations of the loop execute in parallel, i.e., different iterations are processed at the same time.

eg: in cycle 8,
  ADDs add data for iteration 1
  MPY/MPYH multiply data for iteration 3
  LDWs load data for iteration 8
  SUB decrements the counter for iteration 7
  B branches for iteration 6

• i.e., the values being multiplied are loaded into registers five cycles before the cycle in which they are actually multiplied. Before the first multiplication occurs, the fifth load has just completed.
• This software pipelining is 8 iterations deep.
• If the loop count is 100 (200 numbers):

Cycle 1: LDW, LDW (also initialization of count and accumulators A7 and B7)
Cycle 2: LDW, LDW, SUB
Cycle 3-5: LDW, LDW, SUB, B
Cycle 6-7: LDW, LDW, MPY, MPYH, SUB, B
Cycle 8-107: LDW, LDW, MPY, MPYH, ADD, ADD, SUB, B
Cycle 108: LDW, LDW, MPY, MPYH, ADD, ADD, SUB, B

• Prolog section is within cycles 1-7
• Loop kernel is in cycle 8
• Epilog section is in cycle 108.
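The overlap of iterations in the kernel follows directly from the start cycles in the scheduling rules (LDW in cycle 1, SUB in 2, B in 3, MPY/MPYH in 6, ADD in 8). The sketch below is a simplified model with hypothetical names: it assumes one new iteration is started every cycle and ignores the end of the loop.

```c
/* Start cycle of each instruction within a single iteration,
   taken from the scheduling rules. */
enum { LDW_START = 1, SUB_START = 2, B_START = 3, MPY_START = 6, ADD_START = 8 };

/* Iteration that an instruction is working on during 'cycle',
   or 0 if that instruction has not been issued yet. */
int iteration_at(int cycle, int start_cycle)
{
    return cycle >= start_cycle ? cycle - start_cycle + 1 : 0;
}
```

Evaluated at cycle 8, the model gives iteration 1 for ADD, 3 for MPY/MPYH, 8 for LDW, 7 for SUB and 6 for B, matching the list given for the loop kernel.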

Execution Cycles:

Number of cycles (with software pipelining):

Fixed point = 7 + (N/2) + 1
eg: N = 200; 7 + 100 + 1 = 108

Floating point = 9 + (N/2) + 15

Optimization               | Fixed Point            | Floating Point
No optimization            | 2 + (16 x 200) = 3202  | 2 + (18 x 200) = 3602
With parallel instructions | 1 + (8 x 200) = 1601   | 1 + (10 x 200) = 2001
Two sums per iteration     | 1 + (8 x 100) = 801    | 1 + (10 x 100) + 7 = 1008
With S/W pipelining        | 7 + (200/2) + 1 = 108  | 9 + (200/2) + 15 = 124
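The fixed-point column of the table can be reproduced for any even N with a few one-line functions. This is only a restatement of the formulas in the table; the function names are hypothetical.

```c
/* Fixed-point cycle counts for a dot product of n samples (n even),
   following the formulas in the table. */
int cycles_no_opt(int n)       { return 2 + 16 * n; }
int cycles_parallel(int n)     { return 1 + 8 * n; }
int cycles_two_sums(int n)     { return 1 + 8 * (n / 2); }
int cycles_sw_pipelined(int n) { return 7 + n / 2 + 1; }
```

For n = 200 the software-pipelined version needs 108 cycles against 3202 without optimization, roughly a 30-fold reduction.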

Memory Constraints:

• Internal memory is arranged through various banks of memory so that loads and stores can occur simultaneously.

• Since banks are single ported, only one access to each bank is performed per cycle.

• Two memory access per cycle can be performed if they do not access the same bank.

• If multiple access is performed to the same bank, pipeline will stall.

Cross Path Constraints:

• Since there is one cross path on each side of the two data paths, at most two instructions per cycle can use a cross path.

eg: Valid code segment (because both available cross paths are utilized)

         ADD   .L1X   A1, B1, A0
     ||  MPY   .M2X   A2, B2, B3

eg: Not valid (because one cross path is used for both instructions)

         ADD   .L1X   A1, B1, A0
     ||  MPY   .M1X   A2, B2, A3

Load/store constraints:

• The address register to be used must be on the same side as the .D unit.

eg: Valid code:
         LDW   .D1   *A1, A2
     ||  LDW   .D2   *B1, B2

eg: Invalid code:
         LDW   .D1   *A1, A2
     ||  LDW   .D2   *A3, B2

• Loading and storing cannot be from the same register file. A load (or store) using one register file in parallel with another load (or store) must use a different register file.

eg: Valid code:
         LDW   .D1   *A0, B1
     ||  STW   .D2   A1, *B2

eg: Invalid code:
         LDW   .D1   *A0, A1
     ||  STW   .D2   A2, *B2

TMS320C64x

• TMS320C64x is a family of 32-bit Very Long Instruction Word (VLIW) DSPs from Texas Instruments

• Fixed point processor

• At clock rates of up to 1 GHz, C64x DSPs can process information at rates up to 8000 MIPS

• C64x DSPs can do more work each cycle with built-in extensions.

• They can process all C62x object code unmodified (but not vice-versa)

Applications for the C64x

TMS320C64x can be used as a CPU in the following devices:

Wireless local base stations;

Remote access server (RAS);

Digital subscriber loop (DSL) systems;

Cable modems;

Multichannel telephony systems;

Pooled modems;

Block diagram

[Figure: C64x block diagram. Labelled blocks: C64x CPU core; L1 program cache (direct-mapped, 16 Kbytes total); L1 data cache (2-way set-associative, 16 Kbytes total); L2 memory (1024 Kbytes); enhanced DMA controller (64-channel); EMIF A and EMIF B; external ZBT RAM, SDRAM, SBSRAM, FIFO, SRAM and I/O devices.]

Features of TMS320C6413

- Based on the second-generation high-performance, advanced VelociTI
− 500-MHz Clock Rate
− 4000 MIPS
− Eight 32-Bit Instructions/Cycle

DSP Core
− Eight Highly Independent Functional Units
− Six ALUs (32-/40-Bit), Each Supports Single 32-Bit, Dual 16-Bit, or Quad 8-Bit Arithmetic per Clock Cycle
− Two Multipliers Support Four 16 x 16-Bit Multiplies (32-Bit Results) per Clock Cycle or Eight 8 x 8-Bit Multiplies (16-Bit Results) per Clock Cycle
− Load-Store Architecture With Non-Aligned Support
− 64 32-Bit General-Purpose Registers

Instruction Set Features
− Byte-Addressable (8-/16-/32-/64-Bit data)
− 8-Bit Overflow Protection
− Bit-Field Extract, Set, Clear
− Normalization, Saturation, Bit-Counting
− Increased Orthogonality

L1/L2 Memory Architecture
− 128K-Bit (16K-Byte) L1P Program Cache (Direct Mapped)
− 128K-Bit (16K-Byte) L1D Data Cache (2-Way Set-Associative)
− 2M-Bit (256K-Byte) L2 Unified Mapped RAM/Cache [C6413] (Flexible RAM/Cache Allocation)

Endianness: Little Endian, Big Endian

32-Bit External Memory Interface (EMIF)
− Glueless Interface to Asynchronous Memories (SRAM and EPROM) and Synchronous Memories (SDRAM, SBSRAM, ZBT SRAM, and FIFO)
− 512M-Byte Total Addressable External Memory Space

− Enhanced Direct-Memory-Access (EDMA) Controller (64 Independent Channels)
− Host-Port Interface (HPI) [32-/16-Bit]
− Two Multichannel Audio Serial Ports (McASPs), with Six Serial Data Pins each
− Two Multichannel Buffered Serial Ports
− Three 32-Bit General-Purpose Timers
− Sixteen General-Purpose I/O (GPIO) Pins
− Flexible PLL Clock Generator

New enhancements

• Register file enhancements

• Data path extensions

• Quad 8-bit and dual 16-bit extensions with data flow enhancements

• Additional functional unit hardware

• Increased orthogonality of the instruction set

• Additional instructions that reduce code size and increase register flexibility

Register file enhancements

• The ’C64x register file has twice as many general-purpose registers as the ’C62x/’C67x cores

• There are 32 32-bit registers per data path A0-A31 for file A and B0-B31 for file B

• In all ’C6000 devices, registers A4-A7 and B4-B7 can be used for circular addressing.

Packed data processing

• The ’C64x register file supports all the ’C62x data types and extends this by additionally supporting packed 8-bit types and 64-bit fixed-point data types.

• Instructions operate directly on packed data to streamline data flow and increase instruction set efficiency.

• Packed data types store either four 8-bit values or two 16-bit values in a single 32-bit register or four 16-bit values in a 64-bit register pair.

• Besides being able to perform all the ’C62x instructions, the ’C64x also contains many 8–bit and 16–bit extensions to the instruction set.

Eg: MPYU4 instruction performs four 8x8 unsigned multiplies with a single instruction on a .M unit.
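To make the packed-data idea concrete, the sketch below emulates four 8x8 unsigned multiplies on two 32-bit words in portable C. The function name is hypothetical, and the placement of the four 16-bit products in the 64-bit result (byte lane 0 in the lowest 16 bits, lane 3 in the highest) is an assumption for illustration rather than a statement of the exact MPYU4 result format.

```c
#include <stdint.h>

/* Multiply the four unsigned bytes of a by the corresponding bytes of b.
   Each 8x8 product fits in 16 bits; the four products are packed into
   one 64-bit value, lane 0 in the lowest 16 bits (assumed ordering). */
uint64_t mpyu4_emul(uint32_t a, uint32_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        result |= (uint64_t)(x * y) << (16 * lane);
    }
    return result;
}
```

On the DSP the four products are produced by one instruction on one .M unit, whereas this emulation needs a loop of four scalar multiplies.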

Data path extensions

• On the ’C64x, all eight of the functional units have access to the register file on the opposite side via a cross path.

• On the ’C62x/’C67x, only six functional units have access to the register file on the opposite side via a cross path; the .D units do not have a data cross path.

• The ’C64x pipelines data cross path accesses allowing multiple units per side to read the same cross path source simultaneously.

• In ’C62x/’C67x, only one functional unit per data path per execute packet could get an operand from the opposite register file.

• The ’C64x supports double-word loads and stores.

• There are four 32-bit paths for loading data from memory to the register file.

• For side A, LD1a is the load path for the 32 LSBs;

LD1b is the load path for the 32 MSBs.

• For side B, LD2a is the load path for the 32 LSBs;

LD2b is the load path for the 32 MSBs.

• There are also four 32-bit paths for storing register values from the register files to memory.

• ST1a is the write path for the 32 LSBs on side A;

ST1b is the write path for the 32 MSBs for side A.

• For side B, ST2a is the write path for the 32 LSBs and

ST2b is the write path for the 32 MSBs.

• The ’C64x can also access words and double words at any byte boundary using non-aligned loads and stores.

• As a result, word and double-word data does not always need alignment to 32-bit or 64-bit boundaries as in the ’C62x/’C67x

Additional Functional Unit Hardware

• The .L units can perform byte shifts and the .M units can perform bi-directional variable shifts in addition to the .S unit’s ability to do shifts.

• The .L units can now perform quad 8-bit subtracts with absolute value. This absolute difference instruction greatly aids motion estimation algorithms.

• Special communication-specific instructions, such as SHFL, DEAL and GMPY4, have been added to the .M unit to address common operations in error-correcting codes.

• Bit-count and rotate hardware on the .M unit extends support for bit-level algorithms such as binary morphology, image metric calculations and encryption algorithms.

Increased Orthogonality

• The .D unit can now perform 32-bit logical instructions in addition to the .S and .L units.

• Also, the .D unit now directly supports load and store instructions for double-word data values

• The ’C62x/’C67x allows up to four reads of a given register in a given clock cycle.

• The ’C64x allows any number of reads of a given register in a given clock cycle.

• On the ’C62x/’C67x, one long source and one long result per data path could occur every clock cycle.

• On the ’C64x, up to two long sources and two long results can be accessed on each data path every clock cycle.

General-Purpose Register Files

The C64x register file contains 32 32-bit registers (A0-A31 for file A and B0-B31 for file B);

can be used for data, pointers or conditions

Values larger than 32 bits (40-bit long and 64-bit float quantities) are stored in register pairs.

Packed data types are: four 8-bit values or two 16-bit values in a single 32-bit register, four 16-bit values in a 64-bit register pair.

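[Figure: a 40-bit long value in a register pair. The even register holds bits 31-0; the odd register holds bits 39-32, and its remaining bits are zero filled.]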

Pipeline

The C64x pipeline has the following features:

• 11 phases divided into Fetch, Decode and Execute;
• Fetch has four phases for all instructions, and decode has two phases for all instructions;
• The execute stage of the pipeline requires a varying number of phases, depending on the type of the instruction.

In the C64x, instructions are fetched from the instruction memory in groups of eight instructions, called fetch packets (FPs).

Each FP can be split into one to eight execute packets (EPs). Each EP contains only instructions that can execute in parallel. Each instruction in an EP executes in an independent functional unit.

The C64x pipeline is most effective when it is kept as full as possible by organizing instructions.

Pipeline Stages

The stages of the fixed-point pipeline are:

  Fetch: PG PS PW PR     Decode: DP DC     Execute: E1 E2 E3 E4 E5

Execute Pipeline Stages

• E1: Execute stage 1
  – Single cycle instructions are completed
  – For all instructions, conditions are evaluated and operands are read
  – For load/store, address generation is performed, and address modifications are written to the register file
  – For branch instructions, the branch fetch packet in the PG phase is affected
  – For single cycle instructions, results are written to the register file

• E2: Execute stage 2
  – Multiply instructions are completed
  – Load instructions send the address to memory
  – Store instructions send the address and data to memory
  – The SAT bit in the control status register (CSR) is set if a single cycle instruction saturated its result
  – Single 16x16 multiply instruction results are written to the register file
  – .M unit non-multiply instruction results are written to the register file

• E3: Execute stage 3
  – Store instructions are completed
  – Data memory accesses are performed
  – The SAT bit in the control status register (CSR) is set for multiply instructions that saturate

• E4: Execute stage 4
  – Multiply extension instructions are completed
  – Load instructions bring the data to the CPU
  – Multiply extension instruction (MPY2, MPY4, DOTPx2, DOTPU4, MPYHIx, MPYLIx and MVD) results are written to the register file

• E5: Execute stage 5
  – Load instructions are completed
  – Load instruction data is written to the register file

Pipeline summary

Stage: Program fetch

Symbol | Phase                    | During This Phase
PG     | Program address generate | The address of the fetch packet is determined
PS     | Program address send     | The address of the fetch packet is sent to memory
PW     | Program wait             | A program memory access is performed
PR     | Program data receive     | The fetch packet is at the CPU boundary

Stage: Program decode

Symbol | Phase    | During This Phase
DP     | Dispatch | The next execute packet in the fetch packet is determined and sent to the appropriate functional units to be decoded
DC     | Decode   | Instructions are decoded in functional units

Stage: Execute

Symbol | Phase     | During This Phase
E1     | Execute 1 | For all instruction types, the conditions for the instructions are evaluated and operands are read. For load and store instructions, address generation is performed and address modifications are written to a register file. For branch instructions, the branch fetch packet in the PG phase is affected. For single-cycle instructions, results are written to a register file.
E2     | Execute 2 | For load instructions, the address is sent to memory. For store instructions, the address and data are sent to memory. Single-cycle instructions that saturate results set the SAT bit in the control status register (CSR) if saturation occurs.
E3     | Execute 3 | Data memory accesses are performed. Any multiply instruction that saturates results sets the SAT bit in the control status register (CSR) if saturation occurs.
E4     | Execute 4 | For load instructions, data is brought to the CPU boundary. The results of multiply extensions are written to a register file.
E5     | Execute 5 | For load instructions, data is written into a register.

Delay Slots

• Delay slots are the number of CPU cycles that come between the current instruction and the point at which its results can be used by another instruction.
• Single-cycle instructions: 0 delay slots
• 16x16 single multiply and .M unit non-multiply instructions: 1 delay slot
• Store: 0 delay slots
  – If a load occurs before a store (either in parallel or not), then the old data is loaded from memory before the new data is stored.
  – If a load occurs after a store (either in parallel or not), then the new data is stored before the data is loaded.
• C64x multiply extensions: 3 delay slots
• Load: 4 delay slots
• Branch: 5 delay slots
  – The branch target is in the PG phase when the branch condition is determined in E1. There are 5 delay slots between PG and E1 before the branch target begins executing useful code again.
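The delay-slot counts above translate into a simple rule for when a dependent instruction may be issued. The sketch below uses hypothetical names for the constants and the helper; it simply adds the delay slots plus one to the issue cycle.

```c
/* Delay slots per instruction class, as listed above. */
enum {
    DS_SINGLE_CYCLE = 0,
    DS_MPY16        = 1,   /* 16x16 multiply and .M unit non-multiply */
    DS_STORE        = 0,
    DS_MPY_EXT      = 3,   /* C64x multiply extensions */
    DS_LOAD         = 4,
    DS_BRANCH       = 5
};

/* Earliest cycle in which an instruction that depends on the result
   (or, for a branch, the branch target) can execute. */
int first_use_cycle(int issue_cycle, int delay_slots)
{
    return issue_cycle + delay_slots + 1;
}
```

A load issued in cycle 1 can feed a multiply in cycle 6, a multiply issued in cycle 6 can feed an add in cycle 8, and a branch issued in cycle 3 takes effect in cycle 9.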

Memory

The C64x has different spaces for program and data memory and uses a two-level cache memory scheme.

Internal Memory

The C64x has a 32-bit byte-addressable memory with the following features:

Separate data and program address spaces;

Large on chip RAM, up to 7MB;

2-level cache;

Single internal program memory port with an instruction-fetch bandwidth of 256 bits;

Two 64-bit internal data memory ports;

Memory Map (Internal and External Memory)

• Level 1 Program Cache is 128 Kbit direct mapped

• Level 1 Data cache is 128Kbit 2-way set-associative

• Shared Level 2 Program/Data Memory/Cache of 4Mbit
  – Can be configured as mapped memory
  – Cache (up to 256 Kbytes)
  – Combination of the two

Memory Buses

• Instruction fetch using 32-bit address bus and 256-bit data bus

• two 64-bit load buses (LD1 and LD2)

• two 64-bit store buses (ST1 and ST2)

Peripheral Set

• 2 multichannel audio serial ports (McASPs)
• 2 inter-integrated circuit bus modules (I2Cs)
• 2 multichannel buffered serial ports (McBSPs)
• 3 32-bit general-purpose timers
• 1 user-configurable 16-bit or 32-bit host-port interface (HPI16/HPI32)
• 1 16-pin general-purpose input/output port (GP0) with programmable interrupt/event generation modes
• 1 32-bit glueless external memory interface (EMIFA), capable of interfacing to synchronous and asynchronous memories and peripherals.

ZBT RAM

• Zero Bus Turnaround (ZBT) is a synchronous SRAM architecture optimized for networking and telecommunications applications.

• It can increase the internal bandwidth of a switch fabric when compared to standard SyncBurst SRAM.

• The ZBT architecture is optimized for switching and other applications with highly random READs and WRITEs.

• ZBT SRAMs eliminate all idle cycles when turning the data bus around from a WRITE operation to a READ operation

Interfacing C and Assembly Language

When an assembly function is called from C, the values passed to the function are stored in specific registers.

The first 10 arguments passed to an assembly function are stored in registers A4, B4, A6, B6, A8, B8, A10, B10, A12 and B12. Any additional arguments are stored on the stack.

The even registers are used when 32 bits of data (or less) are being passed in each register. When a 64-bit value (a double-precision floating-point number) is passed to a function, it is stored in a register pair (e.g. A5:A4, B5:B4, A7:A6, etc.)

Upon returning from a called function, only one value may be returned. By convention, the value in register A4 is returned.
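The argument-passing order can be written down as a small lookup. This is an illustrative helper with a hypothetical name, covering only the simple case described above where every argument is 32 bits or less.

```c
#include <stddef.h>

/* Register that receives the argument at position 'index' (0-based)
   when each argument is 32 bits or less. Returns NULL for arguments
   beyond the tenth, which are passed on the stack. */
const char *arg_register(int index)
{
    static const char *const regs[10] = {
        "A4", "B4", "A6", "B6", "A8", "B8", "A10", "B10", "A12", "B12"
    };
    return (index >= 0 && index < 10) ? regs[index] : NULL;
}
```

For a C call such as DotP(m, n, count), this ordering places m in A4, n in B4 and count in A6.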

Sum of products example

C code:

int DotP(short* m, short* n, int count)
{
    int i, product, sum = 0;
    for (i = 0; i < count; i++)
    {
        product = m[i] * n[i];
        sum += product;
    }
    return(sum);
}

TI TMS C64x code:

LOOP:
   [A0]   SUB   .L1    A0, 1, A0
|| [!A0]  ADD   .S1    A6, A5, A5
||        MPY   .M1X   B4, A4, A6
|| [B0]   BDEC  .S2    LOOP, B0
          LDH   .D1T1  *A3++, A4
          LDH   .D2T2  *B5++, B4

Another code example

MIPS:

loop:  LW    R1, 0(R11)
       MUL   R2, R1, R10
       SW    R2, 0(R12)
       ADDI  R12, R12, #-4
       ADDI  R11, R11, #-4
       BGTZ  R12, loop

TI TMS C64x:

       ADDK .S1 #-4,A11 || LDW .D1 A1,0(A11) || MVK .S2 #-4,B1
       ADDK .S1 #-4,A11 || LDW .D1 A1,0(A11) || MUL .M1 A1,A10,A2 || ADDK .S2 #-12,B12
loop:  ADDK .S1 #-4,A11 || LDW .D1 A1,0(A11) || MUL .M1 A1,A10,A2 || STW .D2x A2,0(B12) || ADD .L2 B12,B1,B12 || BGTZ .S2 B12, loop
       ADD .L2 B12,B1,B12 || MUL .M1 A1,A10,A2 || STW .D2x A2,0(B12)
       ADD .L2 B12,B1,B12 || STW .D2x A2,0(B12)

Special purpose instructions

Instruction    | Description                        | Example Application
BITC4          | Bit counter                        | Machine vision
GMPY4          | Galois Field MPY                   | Reed Solomon support
SHFL           | Bit interleaving                   | Convolution encoder
DEAL           | Bit de-interleaving                | Cable modem
SWAP4          | Byte swap                          | Endian swap
XPNDx          | Bit expansion                      | Graphics
MPYHIx, MPYLIx | Extended precision 16x32 MPYs      | Audio
AVGx           | Quad 8-bit, Dual 16-bit average    | Motion compensation
SUBABS4        | Quad 8-bit Absolute of differences | Motion estimation
SSHVL, SSHVR   | Signed variable shift              | GSM

//Factorial.c Finds factorial of n. Calls function factfunc.asm

#include <stdio.h>                     //for print statement

short factfunc(short n);               //assembly function defined in Factfunc.asm

void main()
{
    short n = 7;                       //set value
    short result;                      //result from asm function

    result = factfunc(n);              //call assembly function factfunc
    printf("factorial = %d", result);  //print result from asm function
}

;Factfunc.asm Assembly function called from C to find factorial

            .def   _factfunc     ;asm function called from C
_factfunc:  MV     A4,A1         ;setup loop count in A1
            SUB    A1,1,A1       ;decrement loop count
LOOP:       MPY    A4,A1,A4      ;accumulate in A4
            NOP                  ;for 1 delay slot with MPY
            SUB    A1,1,A1       ;decrement for next multiply
   [A1]     B      LOOP          ;branch to LOOP if A1 != 0
            NOP    5             ;five NOPs for delay slots
            B      B3            ;return to calling routine
            NOP    5             ;five NOPs for delay slots
            .end
