Unit IV: DSP Processor

Uploaded by joshua-duffy, 05-Nov-2015


Code Optimization

Word-Wide Optimization

Mixing C and Assembly

To mix C and assembly, it is necessary to know the register convention used by the compiler to pass arguments. This convention is illustrated in the figure. DP, the base pointer, points to the beginning of the .bss section, which contains all global and static variables. SP, the stack pointer, points to local variables. The stack grows from higher memory to lower memory, as indicated in the figure. Adjacent even/odd register pairs are used when passing 40-bit or 64-bit values.
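As a sketch of how this convention is used, consider a trivial C-callable assembly routine. This is an illustration only; it assumes the standard C6000 calling convention (first argument in A4, second in B4, third in A6, return value in A4, return address in B3) rather than reproducing the book's listing.

```asm
; Sketch: a C-callable routine  int sum_two(int a, int b);
; Assumed convention: a -> A4, b -> B4, result -> A4, return address in B3.
        .def  _sum_two
_sum_two:
        ADD   .L1x  A4, B4, A4    ; result placed in A4, the return register
        B     .S2   B3            ; branch back to the caller via B3
        NOP   5                   ; fill the five branch delay slots
```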

Software Pipelining

Software pipelining is a technique for writing highly efficient assembly loop code on the C6x processor. Using this technique, all functional units on the processor are fully utilized within one cycle. However, writing hand-coded software-pipelined assembly requires a fair amount of coding effort, due to the complexity and number of steps involved. In particular, for the complex algorithms encountered in many communications and signal/image processing applications, hand-coded software pipelining considerably increases coding time. The C compiler at optimization levels 2 and 3 (-o2 and -o3) performs software pipelining to some degree. Compared with linear assembly, the increase in code efficiency from hand-coded software pipelining is relatively slight.

Linear Assembly

Linear assembly is a coding scheme that allows one to write efficient code (compared with C) with less coding effort (compared with hand-coded software-pipelined assembly). The assembly optimizer is the software tool that parallelizes linear assembly code across the eight functional units. It attempts to achieve a good compromise between code efficiency and coding effort. In linear assembly code, it is not required to specify functional units, registers, or NOPs. The directives .proc and .endproc define the beginning and end, respectively, of the linear assembly procedure. The symbolic names p_m, p_n, m, n, count, prod, and sum are defined by the .reg directive. The names p_m, p_n, and count are associated with the registers A4, B4, and A6 by using the MV (move) instruction.
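A linear assembly sketch of the dot-product loop, using the symbolic names and directives mentioned above, might look like the following. This is an illustrative sketch, not the book's listing; exact directive syntax varies with the code-generation tool version.

```asm
; Linear assembly sketch of a dot-product loop (assembly optimizer input).
; Note: no functional units, registers, or NOPs are specified.
_dotp:  .proc   A4, B4, A6
        .reg    p_m, p_n, m, n, count, prod, sum
        MV      A4, p_m           ; pointer to first array (first argument)
        MV      B4, p_n           ; pointer to second array (second argument)
        MV      A6, count         ; number of samples (third argument)
        ZERO    sum
loop:   LDH     *p_m++, m         ; load 16-bit sample
        LDH     *p_n++, n
        MPY     m, n, prod        ; prod = m * n
        ADD     prod, sum, sum    ; accumulate
  [count] SUB   count, 1, count   ; decrement loop counter
  [count] B     loop              ; branch while count != 0
        MV      sum, A4           ; return value goes back in A4
        .endproc
```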

Hand-Coded Software Pipelining

First let us review the pipeline concept. As can be seen from the figure, the functional units in the non-pipelined version are not fully utilized, leading to more cycles than the pipelined version. A pipelined code has three stages, named prolog, loop kernel, and epilog. The prolog corresponds to instructions that are needed to build up the loop kernel (loop cycle), and the epilog to instructions that are needed to complete all loop iterations. Once the loop kernel is established, the entire loop is done in one cycle via one parallel instruction using the maximum number of functional units. This parallelism is what causes the reduction in the number of cycles.
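The three stages can be visualized with a schematic single-cycle dot-product kernel. This is a sketch only; the unit assignments and operand timing are illustrative assumptions, not the text's exact code.

```asm
; Schematic pipelined dot-product loop (illustrative sketch).
        ; --- prolog: staggered LDH/MPY instructions prime the pipeline ---
        ; ...
loop:                             ; --- loop kernel: one parallel instruction ---
        LDH   .D1   *A4++, A2     ; load next m sample
||      LDH   .D2   *B4++, B2     ; load next n sample
||      MPY   .M1x  A2, B2, A6    ; multiply samples loaded in earlier cycles
||      ADD   .L1   A6, A7, A7    ; accumulate an earlier product into sum
|| [A1] SUB   .S1   A1, 1, A1     ; decrement loop counter
|| [A1] B     .S2   loop          ; branch while counter is nonzero
        ; --- epilog: remaining MPY/ADD instructions drain the pipeline ---
        ; ...
```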

Three steps are needed to produce hand-coded software-pipelined code from a linear assembly loop: (a) drawing a dependency graph, (b) setting up a scheduling table, and (c) deriving the pipelined code from the scheduling table. In a dependency graph, the nodes denote instructions and symbolic variable names. The paths show the flow of data and are annotated with the latencies of their parent nodes. To draw a dependency graph for the loop part of the dot-product code, we start by drawing nodes for the instructions and symbolic variable names.

After the basic dependency graph is drawn, a functional unit is assigned to each node or instruction. Then, a line is drawn to split the workload between the A- and B-side data paths as equally as possible. It is apparent that one load should be done on each side, so this provides a good starting point. From there, the rest of the instructions need to be assigned in such a way that the workload is equally divided between the A- and B-side functional units. The dependency graph for the dot-product example is shown in the figure.

The next step in handwriting a pipelined code is to set up a scheduling table. To do so, the longest path must be identified in order to determine how long the table should be. Counting the latencies of each side, we see that the longest path is 8. This means that 7 prolog columns are required before entering the loop kernel. Thus, as shown in the table, the scheduling table consists of 15 columns (7 for prolog, 1 for the loop kernel, 7 for epilog) and eight rows (one row for each functional unit). The epilog and prolog are of the same length. Next, the code is handwritten directly from the scheduling table.

C64x Improvements

This section shows how the additional features of the C64x DSP can be used to further optimize the dot-product example. Figure (b) shows the C64x version of the dot-product loop kernel for multiplying two 16-bit values. The equivalent C code appears in Figure (a).

As shown in Figure 7-19(a), in C this can be achieved by using the intrinsic _dotp2() and by casting shorts as integers. The equivalent loop kernel code generated by the compiler is shown in Figure 7-19(b), which is a double-cycle loop containing four 16 × 16 multiplications. The instruction LDW is used to bring in the required 32-bit values.

Considering that the C64x can bring in 64-bit data values by using the double-word loading instruction LDDW, the foregoing code can be further improved by performing four 16 × 16 multiplications via two DOTP2 instructions within a single-cycle loop, as shown in Figure (b). This way the number of cycles is reduced four-fold, since four 16 × 16 multiplications are done per cycle. To do this in C, we need to cast short data types as doubles and to specify which 32 bits of the 64-bit data a DOTP2 is supposed to operate on. This is done by using the _lo() and _hi() intrinsics to specify the lower and upper 32 bits of the 64-bit data, respectively. Figure (a) shows the equivalent C code.

Circular Buffering

In many DSP algorithms, such as filtering, adaptive filtering, or spectral analysis, we need to shift data or update samples (i.e., we need to deal with a moving window). The direct method of shifting data is inefficient and uses many cycles. Circular buffering is an addressing mode by which a moving-window effect can be created without the overhead associated with data shifting. In a circular buffer, if a pointer pointing to the last element of the buffer is incremented, it is automatically wrapped around to point back to the first element of the buffer. This provides an easy mechanism to exclude the oldest sample while including the newest sample, creating the moving-window effect illustrated in the figure.

Some DSPs have dedicated hardware for doing this type of addressing. On the C6x processor, the arithmetic logic unit has the circular addressing mode capability built into it. To use circular buffering, the circular buffer sizes first need to be written into the BK0 and BK1 block size fields of the Address Mode Register (AMR), as shown in the figure. The C6x allows two independent circular buffers whose sizes are powers of 2. Buffer size is specified as 2^(N+1) bytes, where N indicates the value written to the BK0 and BK1 block size fields.

Then, the register to be used as the circular buffer pointer needs to be specified by setting the appropriate bits of the AMR to 1. For example, as shown in the figure, to use A4 as a circular buffer pointer, bit 0 or bit 1 is set to 1 (selecting block size BK0 or BK1, respectively). Of the 32 registers on the C6x, 8 can be used as circular buffer pointers: A4 through A7 and B4 through B7. Note that linear addressing is the default addressing mode for these registers.

Adaptive Filtering

Adaptive filtering is used in many applications, ranging from noise cancellation to system identification. In most cases, the coefficients of an FIR filter are modified according to an error signal in order to adapt to a desired signal. In this lab, a system identification example is implemented wherein an adaptive FIR filter is used to adapt to the output of a seventh-order IIR bandpass filter. The IIR filter is designed in MATLAB and implemented in C. The adaptive FIR filter is first implemented in C and later in assembly using circular buffering.

In system identification, the behavior of an unknown system is modeled by accessing its input and output. An adaptive FIR filter can be used to adapt to the output of the system based on the same input. The difference between the output of the system, d[n], and the output of the adaptive filter, y[n], constitutes the error term e[n], which is used to update the coefficients of the FIR filter.

The error term, calculated from the difference of the outputs of the two systems, is used to update each coefficient of the FIR filter according to the formula (least mean square (LMS) algorithm [1]):

h_k[n+1] = h_k[n] + 2δ·e[n]·x[n−k]

where the h_k's denote the unit sample response or FIR filter coefficients and x[n] the input. The output y[n] is required to approach d[n]. The term δ indicates the step size. A small step size will ensure convergence but results in a slow adaptation rate. A large step size, though faster, may lead to skipping over the solution.