
UNC Project Report

COMP SCI 265

A PROGRAMMABLE VLIW DIGITAL SIGNAL PROCESSOR

for

MPEG-2 VIDEO DECODING

for

Dr. Frederick P. Brooks Jr.

by:

John Glossner

Department of Computer Science

The University of North Carolina at Chapel Hill

[email protected]

[email protected]

The mVLIW MPEG Digital Signal Processor Comp 265

Introduction January 10, 2003 Page 0 of 43

1.0 Introduction

This project builds on work done at IBM in Research Triangle Park by extending the functionality of an IBM parallel embedded computer designed by the IBM Mwave group [1]. The goal of the project was to provide a uni-processor solution for the IBM Mfast family that is capable of decoding MPEG-2 video streams. Video was the focal point since the current generation Mwave processor (MSP1.0) is fully capable of decoding the audio stream.

In summary, this project looks at the characteristics of the MPEG-2 application, including the market, performance requirements, and unique functions, and at how the mVLIW can speed up MPEG video decoding. The results obtained show that the unoptimized inverse Discrete Cosine Transform (IDCT) code meets the requirements of MPEG video decoding. Optimizing the code should allow the entire MPEG-2 decompression to run very efficiently on the mVLIW uni-processor.

2.0 Application Characteristics

2.1 Market Analysis

This project is being pursued due to the large opportunity in video compression and decompression. Forward Concepts estimated that the PC JPEG/MPEG decoder market would grow at a compound annual growth rate of 180% from 1992 to 1997 (footnote 1). Forward Concepts also points out the tremendous growth and market opportunity for the DSP market in general and the video decoding market in particular. The latest estimates continue to show that general Programmable DSP and FASIC chips should grow at a compound annual growth rate of 35% from 1994 to 1999, reaching a net value of $4.5 million and $6.1 million respectively (footnote 2). Additionally, within the DSP IC market, the ISDN and Multimedia / Videophone segments have the greatest market value. Finally, Forward Concepts points out that FASIC and Programmable DSPs account for by far the majority of all DSPs designed.

2.2 Clarification of Definitions

The terms general purpose digital signal processor and programmable digital signal processor will generally refer to the same design style. Although called a general purpose DSP, such a device is not typically a general micro-processor. The distinction between the two precludes the use of most programmable DSPs as a host processor in a general PC environment. Additionally, the distinction between FASIC DSP and programmable DSP is not always clear. Many FASIC DSPs contain some level of programmability.

For the purposes of this project, the term programmable DSP will be used to denote a DSP intended for a specific market segment (MPEG decompression in this case) that has the capability of executing stored instructions but does not contain the general mechanisms required to support general purpose PC applications.

2.3 MPEG-2 Decompression Characteristics

The computational complexity required to decode MPEG-1 streams can be satisfied by more powerful host processors (footnote 3). However, because the computational requirements for MPEG-2 are much higher, it is anticipated that special purpose processors will fulfill this role even with new, more powerful Alpha, PowerPC, and Intel P6 processors becoming available (footnote 4).

MPEG-2 essentially takes a 200-300 Mbit/sec, 30 frame/sec video stream and compresses it to 2-10 Mbit/sec. This asymmetrical compression technique makes real-time compression feasible only on the most powerful processors or dedicated hardware, but allows real-time decompression to be achieved with inexpensive special purpose processors such as the one proposed for this project.

The key application features designed for the MPEG standard are (footnote 5):

1. Forward Concepts - DSP Strategies for the 90’s: The Compression Imperative.
2. EE Times, January 23, 1995, quoting the latest Forward Concepts data.
3. EE Times, January 30, 1995. Softer Compression Ahead, pg 24.
4. EE Times, February 20, 1995. NSP Challenges DSP in PC Architecture, pg 22.
5. Le Gall, Didier J, Signal Processing: Image Communication 4 (1992) 129-140, Elsevier Science Publishers, B.V.



1. Random access time of approximately 1/2 second for CD type media. This requires that a compressed video stream be accessible in its middle and that any frame of video be decodable within this time.

2. Fast forward / reverse searches. This is a more demanding form of random access.

3. Reverse playback.

4. Audio-visual synchronization.

5. Robustness to errors.

6. Coding / Decoding delay of under 150ms in order to maintain conversational, “face-to-face” communications.

7. An acceptable level of editability in compressed form.

8. Format flexibility of raster sizes.

2.4 MPEG Decoder System

Figure 1, "MPEG Decoder" is based upon [11].

Figure 1. MPEG Decoder

2.5 MPEG-2 Decompression Analysis

The following characteristics are based on an extensive review of an MPEG-2 software decoder written in C which was obtained from IBM in Endicott, New York but is based on code generally available over the Internet. The decoder is an implementation of the ISO/IEC DIS 13818-2 decoder. It also has the capability of decoding MPEG-1 streams based on the ISO/IEC IS 11172-2 specification.

Based on an analysis of the C code where all of the subroutines were reverse engineered into flow diagrams, the following results emerged from the code:

1. The motion_vectors() routine performs scaling by powers of 2 making a shift instruction attractive.

2. The predominant mode of motion_vectors() is index and computation arithmetic.

3. reconstruct() predominantly uses shifts and adds.

[Figure 1 block labels: MPEG-compressed bitstream, Header Decode, Huffman Decode, Inverse Quantization, IDCT, Scale Factor, Motion Vectors, Motion Compensation, adder (+), Display.]

4. The inverse 8x8 2D DCT requirements are extensive: each 8-point one-dimensional pass requires 11 integer multiplies and 29 integer adds, and the 16 passes over an 8x8 block total 176 multiplies and 464 adds.
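The per-block totals in item 4 can be checked with a few lines of C. This is only a sketch: it assumes the usual row-column decomposition of a separable 2D transform (8 row passes plus 8 column passes), which the report does not state explicitly but which is consistent with 176 / 11 = 464 / 29 = 16. The constant and function names are illustrative and do not come from the decoder source.

```c
/* Operation counts for one 8-point 1-D IDCT pass, as quoted in the text. */
enum {
    BLOCK_DIM   = 8,              /* 8x8 block                              */
    MPY_PER_1D  = 11,             /* integer multiplies per 1-D pass        */
    ADD_PER_1D  = 29,             /* integer adds per 1-D pass              */
    PASSES_2D   = 2 * BLOCK_DIM   /* assumed: 8 row passes + 8 column passes */
};

/* Total integer multiplies for one 8x8 block. */
int idct_block_multiplies(void)
{
    return PASSES_2D * MPY_PER_1D;
}

/* Total integer adds for one 8x8 block. */
int idct_block_adds(void)
{
    return PASSES_2D * ADD_PER_1D;
}
```

Under that assumption the functions return 176 and 464, matching the idct() row of Table 1.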

In addition to the above results, it is known (footnote 1) that the IDCT can be rearranged into a more symmetric form which reduces the total number of operations but increases the number of multiplies required. For many general purpose processors, where multiplication can take many cycles, this is not a good trade-off. In DSP applications, it is preferable to accept more multiplications in exchange for fewer total cycles because most DSPs implement single cycle multiplication. Given the computationally intensive inner loop of the 2D 8x8 idct(), a single cycle multiply is essential to enabling real-time MPEG-2 decompression.

One further characteristic of the MPEG-2 decoding application is movement of data into and out of the processor to enable the inner loop to run continuously. In a typical system, an analog video chip breaks up the analog signal into a video stream which is then encoded and transmitted digitally to the receiving decoder. As stated previously, the MPEG-2 specification allows for bit streams in the range of 2-10 Mbits/second. To maintain a real-time 30 frame/sec decoding capability, the processor must be able to read and store data while doing the necessary inner loop computations. Therefore, a key design criterion based on the MPEG-2 application is that Load, Multiply, and Store operations be able to run concurrently.

Considering the above analysis, the processing requirements of the MPEG-2 application are estimated as follows:

1. Pennebaker, William B. JPEG Still Image Data Compression Standard, Van Nostrand Reinhold, New York, 1993.

Subroutine [a]        Load        Add         Mpy    Store
fill_buffer() *       8k bytes    indx 8kB    0      22
idct() * (8x8 blk)    64          464         176    64
out_buffer()          22          indx 8kB    0      8k bytes

a. * denotes actual counts per subroutine call based on manual interpretation of the C code.

TABLE 1. MPEG-2 Instruction Frequencies

[Figure labels: MPEG-2 INNER LOOP, comprising motion_vectors(), reconstruct(), and idct().]

2.6 MPEG Execution Time Distribution

The execution time distribution in Table 2 was obtained from [11].

The mVLIW focuses on improving the execution time of the IDCT given that it is the predominant computation in the decoder.

3.0 DSP Characteristics

In contrast to general purpose processors, DSPs have certain characteristics that specialize them for computationally intensive co-processing tasks. Below is a summary of the key differences:

1. DSPs typically perform inner loop computations efficiently by supplying a single cycle multiply.

2. DSPs typically have a shorter pipeline (3 to 4 phases) for low latency computations.

3. DSPs typically have rudimentary memory management and a fully visible memory (in contrast with a cache) which the programmer is required to manage.

4. Forward Concepts points out that the majority of DSPs are fixed point. In contrast, nearly all general purpose micro-processors have floating point capability.

5. DSPs excel at compound instructions. Instead of superscalar architectures, DSPs typically embed separate control fields in the instruction to allow concurrent address generation, ALU computation, and multiplication.

6. DSP code is often programmed in assembler. Nearly all general purpose PC code is written in a high level language.

7. DSPs are driven by speed, speed and more speed in inner loop computations. General purpose processors must accept a wide range of programming and user interaction situations.

Based on data from Hennessy and Patterson (footnote 1), a typical instruction mix is shown below. For comparison, the anticipated MPEG-2 decompression numbers are shown in the last columns.

Function               Execution Time
IDCT                   38.7%
Display                33%
Motion Compensation    18.3%
Huffman Decode         7.5%
Inverse Quantize       2.4%
Header Decode          0.1%

TABLE 2. MPEG Decoder Execution Time Distributions

1. Hennessy, John L., and Patterson, David A, Computer Architecture: A Quantitative Approach, 2nd edition Beta copy, Morgan Kaufmann Publishers, Inc., San Francisco, CA, 1994, pg 143, modified to condense equivalent operations.

Type        gcc x86    gcc DLX    spice x86    spice DLX    MPEG-2 est [a]    MPEG-2 idct-act [b]
Fx L/S      42%        43%        34%          23%          40%               16%
Fx Arith    27%        26%        33%          33%          27%               67%
Fx Mpy      1%         n/a        1%           n/a          13%               17%
Fx Logic    7%         10%        4%           5%           5%                n/a
Fx Ctl      23%        17%        14%          11%          10%               n/a
Float Pt    0%         0%         13%          13%          n/a               n/a
Misc.       0%         4%         0%           5%           5%

a. Pre-project estimate.
b. Based on C-Model of mVLIW for an unrolled loop, assuming data in data memory when needed.

TABLE 3. General Purpose Instruction Mix Vs. MPEG-2



3.1 DSP Applicability To MPEG-2 Decompression

Based on the above analysis, a special purpose digital signal processor should excel at MPEG-2 decompression. Comparing the arithmetic portion of the general purpose instruction mix with that of MPEG decompression makes the need for a special purpose multiplier evident. In the general purpose case, multiplies account for about 1% of the instruction mix. In the MPEG-2 estimate, multiplies account for about 33% of the arithmetic computations (13% multiplies against 27% other arithmetic in Table 3).

For these reasons, a special purpose programmable DSP is especially well suited to the MPEG-2 video decompression market. In addition, a VLIW capability will allow for programmable concurrent Load, Store, and Execute capability to effectively implement the MPEG-2 decompression standard in real-time.
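As a rough check on the comparison in this section, the multiply share of arithmetic work can be recomputed from the Table 3 percentages. The sketch below assumes that the "Fx Arith" row excludes multiplies, so the arithmetic total is the sum of the two rows; the function name is illustrative only.

```c
/* Share of multiplies among all fixed point arithmetic, in percent.
 * arith_pct: "Fx Arith" entry from Table 3 (assumed to exclude multiplies)
 * mpy_pct:   "Fx Mpy" entry from Table 3                                   */
double mpy_share_of_arith(double arith_pct, double mpy_pct)
{
    return 100.0 * mpy_pct / (arith_pct + mpy_pct);
}
```

With the Table 3 values this gives about 3.6% for gcc on x86 (27% and 1%), 32.5% for the pre-project MPEG-2 estimate (27% and 13%), and about 20.2% for the measured IDCT loop (67% and 17%).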

4.0 Notes On C-Code Model

The C-code shown in the design book is significantly modified based upon early code that was written to perform architectural trade-offs and performance analysis on the design. All of the code was written for this project. In many cases, the new architectural features that have been added as a result of performance analysis are only sketched in the C-code (footnote 1). The model shown in this document has not been verified. The original code can be executed from glossner/Classes/265/Mpeg. It requires an IBM AIX machine to run the binaries without recompiling. It will give performance analysis results for the IDCT (cycles executed, percents of arithmetic operations, load/store counts, etc.). For the purposes of this databook, illustrative portions of the original C-code have been modified to conform more to the notation of B&B. The original mVLIW model was only intended to measure work loads and computational performance. Modifications were made to the mVLIW architecture based on the original results.

1. An example of this is CASE( sp[0].ctlRegs.SASR[1] == 0 ) means do all the proper things if a shift needs to be done. Another example is CASE( size, p ) means decode all the things required to support byte/half word/word/double word sizes and align them based upon the p bit.

Term          Definition
sp[0]         Sequence processor 0. A system may have multiple mVLIW processors. A structure of them facilitates naming.
SP            A structure grouping items normally found in a sequence processor (memory, registers, etc.)
RegsSP        A structure that allows various types (unsigned, int, etc.) to be mapped to an SP memory/register name.
CtlRegsSP     A structure that allows various control registers to be mapped to names. It also allows specifications of architecturally invisible registers.
rtClk         A global variable that all instructions update to keep track of execution time.
iFile         A file that conditionally outputs the instruction stream executing on the processor for later analysis.
mfast1x0.h    The include file that has the macro instruction definitions.
mfregs.h      The include file that has the register name mappings and control structures.
mf2dec.c      Main C file that contains global variables and mVLIW initialization.
idct.c        Inverse Discrete Cosine C-code with mVLIW instructions being simulated.

TABLE 4. mVLIW C-Model Definitions



5.0 Highlights mVLIW DSP

5.1 History

Although its name resembles that of the IBM Mwave family of digital signal processors, the mVLIW is a radical departure from traditional IBM DSP architectures. The mVLIW is IBM’s minimal implementation of an entire family of scalable embedded digital signal multi-processors descended from the Mfast family.

Architects: John Glossner architected the mVLIW uni-processor addressing modes, interrupts, protected instructions, and I/O subsystem. Jerry Pechanek architected the original Mfast (Mwave Folded Array Signal Transform) parallel compute engine, the surrogate programmable VLIW concept, and the instruction sequence control processor. The original MSP1.0 was architected by Gardner Jones and Larry Larson. Ross Ogilvie, Bart Blanar, and Paul Stabler enhanced the address space to 32 bits in the MSP1.1 architecture. John Glossner implemented the entire Mfast parallel compute engine and surrogate concepts in VHDL.

Dates: The original Mfast compute engine was architected in 1993. System addressing, control functions, and uni-processor operations were architected in 1995 (for this project). The original IBM Mwave multimedia MSP1.0 was first shipped in 1992. The MSP1.1 architecture was first shipped in 1995.

Family Tree: The Mwave Signal Processor 1.0 (MSP1.0) follows the long line of IBM Digital Signal Processors by providing single cycle multiplies and eight 16b general purpose registers. The MSP1.1 (1995) extends the MSP1.0 processor by enhancing the memory address from 32kW to a general 32bit address space. The mVLIW processor breaks the IBM MSP tradition by providing parallelism through a programmable VLIW versus an explicit compound instruction set.

5.2 Noteworthy

The mVLIW processor is the first processor to incorporate the concept of a programmable VLIW instruction memory [6]. These memories are constructed during program execution and then referenced in later instructions to provide a system where all instruction fetches are 32 bits but can indirectly access up to a 256 bit instruction word. This allows for up to 8 execution units. Since the same instruction set is used in IBM’s advanced parallel processors, the mVLIW allows a low cost entry point to a wide range of parallel IBM multi-processors.

Address capacity: The use of a 32 bit byte addressed memory allows all processors in the system to execute from the same address space or separate spaces with memory embedding in a common global space.

8/16/32/64-bit Data, 32-bit addresses: The data size while variable is based upon a 32-bit standard length. The common 32-bit data length allows for sharing of the arithmetic units and address calculations. Even with the common data length, the mVLIW provides two independent address generators for concurrent load/store capability.

Harvard Architecture: Separate Instruction and Data memories are provided to ensure the compute engine does not stall while fetching instructions.

Programmable VLIW Capability: The mVLIW does not store instructions in main memory as 256 bit words. Rather, the mVLIW builds the VLIW instructions during execution by using special segment delimiter meta-instructions. The VLIW instructions are then accessed by indirect surrogate instructions. Optionally, commonly used functions can be placed in ROMs on the chip for 0 latency VLIW execution.

Butterfly Operations. The mVLIW provides two special butterfly operations that are used in signal processing applications such as fast Fourier transforms (FFTs), discrete cosine transforms (DCTs), and their inverses.
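For readers unfamiliar with the term, a butterfly in its simplest radix-2 form produces the sum and the difference of two inputs in one step. The C sketch below shows only that generic form; it is not the definition of the mVLIW type 1 or type 2 butterfly instructions, which this section does not specify, and the type and function names are hypothetical.

```c
#include <stdint.h>

/* Result pair of a generic radix-2 butterfly. */
typedef struct {
    int32_t sum;   /* a + b */
    int32_t diff;  /* a - b */
} butterfly_t;

/* Generic radix-2 butterfly (illustrative, not the mVLIW definition).
 * Inputs are assumed small enough that a + b and a - b do not overflow. */
butterfly_t butterfly(int32_t a, int32_t b)
{
    butterfly_t r;
    r.sum  = a + b;
    r.diff = a - b;
    return r;
}
```

Because FFT and DCT flow graphs are built almost entirely from such pairs, producing both outputs in a single operation halves the number of instructions issued for this step.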



5.3 Peculiarities

Little/Big Endian Addressing Modes. The byte labelling scheme is a controversial decision even among IBM architects who typically prefer Big Endian. It is avoided somewhat by providing both modes for data transfers and a Byte Swap operation for characters. Character transfers are the responsibility of the programmer. The inherent mVLIW processor is a Little Endian machine due to the intended x86 coprocessor market it is targeted for. Big Endian transfers are specified by setting a bit in the global mode register.
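The effect of a byte swap on a 32-bit word can be sketched in C as follows. This is a generic illustration of reversing byte order between Little Endian and Big Endian layouts; the function name is hypothetical, and the operand sizes supported by the actual mVLIW Byte Swap operation are not stated here.

```c
#include <stdint.h>

/* Reverse the order of the four bytes in a 32-bit word. */
uint32_t byte_swap32(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) |
           ((x & 0x0000FF00u) <<  8) |
           ((x & 0x00FF0000u) >>  8) |
           ((x & 0xFF000000u) >> 24);
}
```

Applying the swap twice returns the original word, so the same operation converts in either direction.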

No Floating Point. Because the MPEG market is very cost sensitive and because there are well known fixed point MPEG decompression algorithms, floating point instructions are not provided for. While it is arguably easy to specify floating point instructions, the implications to a special purpose single cycle multiply are significant for the size of the implementation.

6.0 Machine Language

6.1 Language Level

Design Philosophy. The mVLIW is designed as a compute engine. VLIW instructions are intended to be programmed relatively infrequently and then executed many times. Multiple functional units perform computationally intensive tasks. Software pipelining is facilitated by the single cycle execute which characterizes most DSP processors.

6.2 Unit System

The basic unit system is the 8-bit byte. Operations are provided for 16-bit half words, dual 16-bit half words, 32-bit words, dual 32-bit words, and 64-bit double words.

Spaces. The memory space is linear and segmented into four distinct segments: Global Memory, Local Instruction Store, Local Data Store, and Local I/O specified by the two most significant bits of the address as shown in Figure 2, "mVLIW Address Space Description."

Figure 2. mVLIW Address Space Description.

[Figure 2 content: 32-bit linear address space, selected by address bits A31,30. 00: Local Data Store; 01: SDRAM memory; 10: Local I/O; 11: Local Instruction Store. The figure also marks Ring 0, Ring 1, Ring 2, and Ring 3 access regions and the I/O host. Notes in the figure: Ring 0 has access privilege to Ring 1 but Ring 1 does not have access privilege to Ring 0, etc. All external devices that are not memory mapped I/O must communicate through I/O space at Ring 0 privilege.]
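The segment selection described for Figure 2 can be sketched in C as a decode of the two most significant address bits. The encoding (00 Local Data Store, 01 SDRAM, 10 Local I/O, 11 Local Instruction Store) is taken from the figure; the type and function names are hypothetical and are not part of the mVLIW C model.

```c
#include <stdint.h>

/* Segment codes, numbered to match address bits A31,30 in Figure 2. */
typedef enum {
    SEG_LOCAL_DATA  = 0,  /* A31,30 = 00 */
    SEG_SDRAM       = 1,  /* A31,30 = 01 */
    SEG_LOCAL_IO    = 2,  /* A31,30 = 10 */
    SEG_LOCAL_INSTR = 3   /* A31,30 = 11 */
} segment_t;

/* Select the segment from the two most significant address bits. */
segment_t segment_of(uint32_t addr)
{
    return (segment_t)(addr >> 30);
}

/* Byte offset within the selected segment (low 30 bits). */
uint32_t segment_offset(uint32_t addr)
{
    return addr & 0x3FFFFFFFu;
}
```

For example, address 0xC0000004 decodes to the Local Instruction Store at offset 4.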



Configuration. The machine described has 4MB of SDRAM installed, 5 execution units (Load, Store, Alu, Multiply, Data Selector) implying a 160-bit surrogate instruction memory word, 16 surrogate address locations, 1MB of I/O memory (as configured in Code Listing 2), and 4kB each of instruction and data memory.

/*********************************************************************************/
/* FormatVLIW(): mVLIW information units                                         */
/*********************************************************************************/
FormatVLIW()
{
    radix = 2;
    byte = 8;
    hword = 16;
    word = 32;                  /* double word */
    dword = 64;                 /* quad word */
    adrsize = 32;
    adrcap = radix**adrsize;    /* address capacity */

    simunits = 8;               /* 8 execution units */
    unit1 = "Store";
    unit2 = "ALU";
    unit3 = "MAU";              /* multiplier */
    unit4 = "DSU";              /* data selector unit */
    unit5 = "Load";

    simsize = 11;               /* 2k max surrogates */
    simcap = radix**simsize;
}

Code Listing 1. FormatVLIW(): mVLIW information units / formats

/*********************************************************************************/
/* ConfigureVLIW(): an implementation of the mVLIW architecture                  */
/*********************************************************************************/
ConfigureVLIW()
{
    dstoreCap = radix**12;      /* 4kB dstore */
    istoreCap = radix**12;      /* 4kB istore */
    simCap = radix**4;          /* 16 surrogate instruction memory locations */
    sdramCap = radix**22;       /* 4MB synchronous dram */
    iocap = radix**20;          /* 1MB I/O installed memory */
}

Code Listing 2. ConfigureVLIW(): mVLIW configuration
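The capacities in Code Listing 2 are written in the radix**n notation of the design book. As a quick check, the C sketch below evaluates the same powers of two and the surrogate word width. It assumes radix = 2 and one 32-bit instruction slot per execution unit, which is consistent with the 160-bit word quoted for 5 units; the function names are illustrative only.

```c
#include <stdint.h>

/* 2 raised to the given power, mirroring radix**bits with radix = 2.
 * Valid for bits < 64. */
uint64_t capacity(unsigned bits)
{
    return (uint64_t)1 << bits;
}

/* Width in bits of one surrogate VLIW word, assuming one 32-bit
 * instruction slot per execution unit. */
unsigned vliw_width_bits(unsigned units)
{
    return units * 32u;
}
```

capacity(12) is 4096 (4kB), capacity(22) is 4194304 (4MB), capacity(20) is 1048576 (1MB), and capacity(4) is 16. vliw_width_bits(5) is 160 for the configured machine and vliw_width_bits(8) is 256 for the architectural maximum.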

/*********************************************************************************/
/* InitiateVLIW(): initiation of the mVLIW processor                             */
/*********************************************************************************/
InitiateVLIW()
{
    FormatVLIW();
    ConfigureVLIW();
    SpaceVLIW();
    NameVLIW();
    ControlVLIW();
}

Code Listing 3. InitiateVLIW(): mVLIW initiation

Memory name-space. The Local instruction store name space supplies the processor with a protected mechanism to avoid malicious self-modifying user code or errant processes from interfering with normal instruction sequencing. This space is normally Read Only from the mVLIW perspective. Only Ring 0 devices may access memory in this range. All reads and writes from non-Ring0 programs are suppressed and an error is signaled.
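The protection rule can be sketched as two small predicates. This is an interpretation rather than the architected check: it assumes that a lower ring number means more privilege (Ring 0 may access Ring 1 resources, but not the reverse) and that the Local Instruction Store is treated as a Ring 0 resource. The function names are hypothetical.

```c
#include <stdbool.h>

/* True if a task running at requester_ring may access a resource
 * owned by target_ring. Assumes lower ring numbers are more privileged. */
bool ring_may_access(unsigned requester_ring, unsigned target_ring)
{
    return requester_ring <= target_ring;
}

/* True if the requester may read or write the Local Instruction Store,
 * which only Ring 0 is allowed to access. */
bool may_access_instruction_store(unsigned requester_ring)
{
    return requester_ring == 0;
}
```

In the processor itself, a failed check suppresses the read or write and signals an error, as the paragraph above describes.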

The Local I/O space provides a convenient mechanism to isolate external peripheral devices without the requirement of supplying an I/O subsystem to the processor. The host x86 is required to service intensive I/O tasks. The isolated memory space facilitates NTSC to digital type operations by directly mapping the output into the mVLIW processor’s memory space.

The SDRAM memory space is for fast level 2 memory. It can be embedded in the global system address space by memory mapped I/O and can be accessed directly by mVLIW address references. This allows frame buffers and graphics addresses to be conveniently accessed by the host processor and for the mVLIW chip to directly access it.



The Local Data Store space provides an isolated area for intermediate results.

/*********************************************************************************/
/* SpaceVLIW(): mVLIW spaces                                                     */
/*********************************************************************************/
SpaceVLIW()
{
    Memory();        /* Memory Space */
    Gprs();          /* Working store */
    Sprs();          /* 65k Special Purpose Registers: Some are named (ie: base regs) */
    BaseRegs();      /* Base Registers */
    Sim();           /* Surrogate Instruction Memory */
    Indicators();    /* Indicators */
    Status();        /* Processor Status */
    ControlReg();    /* Machine Control Register */
    Halt();          /* Machine Stop */
    Wait();          /* Machine Wait */
}

Code Listing 4. SpaceVLIW(): mVLIW spaces

Working store. 32 architected general purpose registers are used as source and destination registers for operations and as index registers in addressing. Non 32-bit operations do not affect non-specified bits in the data registers.
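The rule that narrower operations leave the unspecified register bits untouched can be sketched as a masked merge. The sketch assumes the narrow result lands in the low-order bits of the register; placement under the p (position) bit of the instruction formats is not modeled, and the function name is hypothetical.

```c
#include <stdint.h>

/* Merge the low `bits` bits of value into reg, preserving the rest.
 * bits is the operand width: 8, 16, or 32 (any value >= 32 replaces
 * the whole register). */
uint32_t write_partial(uint32_t reg, uint32_t value, unsigned bits)
{
    uint32_t mask = (bits >= 32) ? 0xFFFFFFFFu : ((1u << bits) - 1u);
    return (reg & ~mask) | (value & mask);
}
```

For example, a byte write of 0x11 into a register holding 0xAABBCCDD yields 0xAABBCC11.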

Control Store. In addition to the instruction address, there are registers for memory protection, status, error reporting, and start/stop. These hold up to 64kWords of special registers.

Embedding. The mVLIW Local Instruction, Data, SDRAM and I/O storage are all embedded in the local memory map. Additionally, the Special Purpose Registers are located in a separate I/O memory map addressable to 64kW. The Special Purpose I/O space contains the processor status information such as cycle counts, status conditions, base registers, and other mVLIW control information. These embeddings are primarily to facilitate policing of addresses.

/*********************************************************************************/
/* NameVLIW(): mVLIW space names                                                 */
/*********************************************************************************/
NameVLIW()
{
    SDRAMMemory();              /* SDRAM Memory */
    LocalDataStore();           /* Local Data Store */
    LocalInstructionStore();    /* Local Instruction Store */
    LocalIO();                  /* Local I/O */
    Special();                  /* Processor Status & Control Embedding */
}

Code Listing 5. NameVLIW(): mVLIW space names

Programming Model. The programming model is shown in Figure 3, "mVLIW Programming Model", on page 9. Instructions are fetched from the multi-port instruction memory. They are quickly decoded to determine if a VLIW operation is in progress. If a VLIW needs to be executed, the Surrogate Instruction Memory Address is decoded and the compound 160-bit instruction is retrieved for parallel decode and execution. If the initial instruction fetched from Instruction store is a simplex (non-VLIW) instruction, it is sent to the decode unit for decode and execution.
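The fetch and dispatch decision described above can be sketched in C. This is a simplified illustration, not the mVLIW C model: the fetched_t structure, its is_vliw_execute flag, and the dispatch() function are hypothetical stand-ins for the real opcode decode, and the sizes (5 unit slots, 16 surrogate entries) follow the configuration described earlier.

```c
#include <stddef.h>
#include <stdint.h>

enum { UNITS = 5, SIM_ENTRIES = 16 };

/* One surrogate (VLIW) word: one 32-bit slot per unit, 160 bits total. */
typedef struct {
    uint32_t slot[UNITS];
} vliw_word_t;

/* A fetched 32-bit instruction after a quick pre-decode (hypothetical). */
typedef struct {
    int      is_vliw_execute;  /* nonzero if this is a VLIW execute        */
    unsigned sim_addr;         /* surrogate instruction memory address     */
    uint32_t raw;              /* the simplex instruction word otherwise   */
} fetched_t;

/* Copy the operations to issue this cycle into out[] and return how
 * many there are: all unit slots for a VLIW execute, one for a simplex
 * instruction. */
size_t dispatch(const fetched_t *f, const vliw_word_t *sim,
                uint32_t out[UNITS])
{
    if (f->is_vliw_execute) {
        const vliw_word_t *w = &sim[f->sim_addr % SIM_ENTRIES];
        for (size_t i = 0; i < UNITS; i++)
            out[i] = w->slot[i];
        return UNITS;
    }
    out[0] = f->raw;
    return 1;
}
```

The point of the indirection is that the instruction stream stays 32 bits wide while a single fetched word can trigger every execution unit at once.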



Figure 3. mVLIW Programming Model

6.3 Operand Specification

Number of addresses. All arithmetic instructions are abbreviated 3 address instructions with source and destinations being registers as in the traditional RISC style. Loads and stores specify one register address and one memory address.

Address Phrase. Operand addresses have byte resolution. All addresses are direct and formed from a 32-bit base register, a 32-bit offset, and a concatenated dual 16-bit offset/displacement register as shown in Figure 5, "mVLIW Addressing", on page 15. Branch addresses use 16-bit relative or 32-bit absolute resolution.

6.4 Operation Specification

Mnemonics. The compute mnemonics are expanded versions of the original Mfast compute machine. The address, sequence, and control mnemonics are my own. Unused operation codes are policed as invalidities.

[Figure 3 block labels: 64k Special Regs mapped to I/O space; 32 Arith Regs; Load Addr Gen; Store Addr Gen; Branch Unit; MAU Unit; ALU Unit; DSU Unit; 4 Port data Memory with Bus Ctl and External Connections (mVLIW Data In, mVLIW Data Out, Load/Store Addr); Loop Ctl; Intr Ctl; IAR; 4 Port instruction Memory with Bus Ctl and External Connections; Instr Dec; Surrogate Instr Memory (Sim Addr, Multi-Ported); Multi Decode; Execute Controls.]

Types of Control. The control information consists of instructions, indicators, the machine control register (MCR) and the PSW.

7.0 Instruction Structure

Machine Language Syntax. The machine language syntax started out simple but became more complex as the operations were specified. It is kept as simple as possible reflecting the fact that most DSP code is still written largely in assembly language. The language is general and allows for 32-bit and 64-bit data path implementations. There are 9 main syntactic patterns as shown in Figure 4, "mVLIW Instruction Formats", on page 11.

Instruction Formats. Some of the fields have multiple names depending upon the specific opcode. The first two bits of the instruction and the Reserved fields ensure that code written for the mVLIW machine will execute properly on the IBM Mfast parallel processors. All critical fields are aligned to aid in implementation performance. Opcode extension fields are used heavily to reduce the number of opcodes required. The Rsv field is reserved for compatibility with the IBM Mfast parallel processors.



Figure 4. mVLIW Instruction Formats

[Figure 4 content, fields listed in extraction order across bits 31 to 0. Every format begins with the two bits 1 0 followed by Opcode.
a: size, p, Rx, Ry, Rsv, Rt, Opx
b: size, p, Rx, Ry, Rsv, Rt, Opx
c: size, p, Rx, Imm5, Rsv, Rt, Opx
d: size, p, Immediate 16b, Rt
e: Opx, 16b Special Register Address, Rt
f: s, r, Cond, Rbase, Opx, 11b Branch Address (2k)
g: s, r, Rbranch, Rmask, Cond, Rbase, Opx
h: size, p, Rx, Ry, Cond, Rt, Opx
i: Opx, Surrogate Addr (2k), Exec Ctl (8b), Cnt
Group labels in the figure: Arithmetic, Dsu Move Imm16, Special Regs, Branches, Conditional Move, VLIW.
Legend: size = 00 Byte, 01 Half, 10 Word, 11 DWord; p = position (1 = all units, 0 = aligned); s = 1: signed operation; r = 1: relative branch; Opx = Opcode Extension; Rsv = Reserved.]

Indicators. The indicators summarize machine exceptions and special supervisor calls. They are listed in Code Listing 6.

/*********************************************************************************/
/* IndicatorVLIW(): mVLIW indicators                                             */
/*********************************************************************************/
IndicatorVLIW()
{
    machineCheck = 0;
    programCheck = 0;
    invalidOperation = 0;
    priviledgedOperation = 0;
    execError = 0;
    protectionViolation = 0;
    invalidAddress = 0;
    invalidSpecification = 0;
    invalidDataPathOp = 0;      /* For machines that don’t support certain accesses */
    supervisorCall = 0;
}

Code Listing 6. IndicatorVLIW(): mVLIW indicators

Status Format. The PSW contains the mode bits, interrupt mask bit, interrupt identifier, and priority ring of the task executing. The ALU_PSW contains alu status (overflow, carry, underflow, saturation) and branch information. The MAU_PSW contains multiplier status (overflow, carry, underflow, saturation) and branch information. In case of conflicting results, the ALU conditions are serviced first, then the multiplier is serviced.
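The status words above record overflow and saturation, and the AluSat and MauSat controls in Code Listing 7 select a saturated result. The report does not define the exact behavior, so the C sketch below shows only the common fixed point convention of clamping a 32-bit signed result to the representable range and flagging that it was clamped. The function name and the flag parameter are hypothetical.

```c
#include <stdint.h>

/* 32-bit signed add with saturation (illustrative convention).
 * On overflow the result is clamped to INT32_MAX or INT32_MIN and
 * *saturated is set to 1; otherwise *saturated is set to 0. */
int32_t sat_add32(int32_t a, int32_t b, int *saturated)
{
    int64_t wide = (int64_t)a + (int64_t)b;  /* cannot overflow in 64 bits */

    *saturated = 0;
    if (wide > INT32_MAX) {
        *saturated = 1;
        return INT32_MAX;
    }
    if (wide < INT32_MIN) {
        *saturated = 1;
        return INT32_MIN;
    }
    return (int32_t)wide;
}
```

Saturation matters for video data because a wrapped result shows up as a bright or dark pixel artifact, while a clamped result is usually visually benign.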

/*********************************************************************************/
/* PswVLIW: mVLIW Processor Status Word Structure                                */
/*********************************************************************************/
struct PswVLIW                      /* max 2**16 entries */
{
    dataBaseRegs[32];               /* 32, 32-bit base regs for data store access */
    instBaseRegs[32];               /* 32, 32-bit base regs for instruction store access */
    IndicatorVLIW;                  /* Indicators embedded in I/O space */
    struct status {                 /* critical status information */
        instrLinkRegLocked;
        exclusiveModeCnt[32];       /* Exclusive Mode Timeout Counter */
        ring[2];                    /* ring privilege (0-3) of task */
        origin[32];                 /* task origin address */
        limit[32];                  /* task address limit */
        dataPathSize;               /* 32-bit or 64-bit task */
        interruptMask[4];           /* interrupt masks */
        interruptId[4];             /* unit that caused interrupt */
        IAR[32];                    /* Instruction Address Register */
        ILR[32];                    /* Instruction Link Register */
        BALCtl[32];                 /* Branch and Link Control Word */
        SVA[32];                    /* Supervisor Call Address */
        EXA[32];                    /* Execute Address */
        XJA[32];                    /* Exchange Jump Address */
    }
    struct machineCtl {             /* machine control register */
        initmVLIW;                  /* ipl */
        enableInterrupts[];         /* allow interrupts */
        enableExtInterrupts[];      /* allow external interrupts from host */
        exclusiveMode;              /* set/clr exclusive mode */
        foreGroundMode;             /* set/clr foreground/background mode */
        forceInstrIndexZero;        /* Force Rindex to zero for inst addresses */
        bigEndianMode;              /* Set Big Endian Mode */
        setRingPriv[];              /* Set Ring Privilege Level */
    }
    InterruptAddr();                /* address to branch to for interrupt */
    struct AluStatus {
        AluPsw[];                   /* Alu Status: Zero, Carry, Oflow, Sign */
        AluCondition[];             /* 0, >, <, etc. */
    }
    struct MauStatus {
        MauPsw[];                   /* Mau Status: Zero, Carry, Oflow, Sign */
        MauCondition[];             /* 0, >, <, etc. */
    }
    struct AluCtl {
        AluScale[];                 /* Alu Scaling Factor */
        AluSat;                     /* Saturate Result */
    }
    struct MauCtl {
        MauScale[];                 /* Mau Scaling Factors */
        MauRnd[];                   /* round 0, round 1, random, truncate */
        MauSat;                     /* Saturate Result */
    }
    struct LoadCtl {
        LdEnableModulo;             /* Enable Modulo Addressing */
        LdEnableSign;               /* Consider address offsets as signed */
        LdIndexEnable;              /* Force Index to 0 or enable */
    }
    struct LoadStatus {
        LdModulo[32];               /* Load Modulo Value */
        LdIncrement[32];            /* Load Increment Value */
    }
    struct StoreCtl {
        StEnableModulo;             /* Enable Modulo Addressing */
        StEnableSign;               /* Consider address offsets as signed */
        StIndexEnable;              /* Force Index to 0 or enable */
    }
    struct StoreStatus {
        StModulo[32];               /* Store Modulo Value */
        StIncrement[32];            /* Store Increment Value */
    }
}

Code Listing 7. PswVLIW(): mVLIW Processor Status Words

Instruction List. The instruction list is shown in Table 3, “mVLIW Instruction List,” on page 10.

mVLIW Instruction List By Opcode

Multiplier                                  Alu
MP   Multiplier Product                     AL   Alu Logic
MA   Multiplier Arithmetic                  AA   Alu Arithmetic
MB   Multiplier Butterfly                   AB   Alu Butterfly
MPI  Multiplier Product Immediate
MAI  Multiplier Arithmetic Immediate        AAI  Alu Arithmetic Immediate
MC   Multiplier Comparison                  AC   Alu Comparison

Data Selector Unit                          Branch Unit
DS   Dsu Shift & Rotate                     BD   Branch Direct on Condition
DF   Dsu Special Functions                  BI   Branch Indirect Condition/bit
DM   Dsu Move Register                      BS   Branch Service Routine
DI   Dsu Move Imm16

Load Unit                                   Store Unit
LB   Load Base/Offset/Displacement          SB   Store Base/Offset/Displacement

Programmable VLIW                           Supervision / Misc
VX   VLIW Execute                           SV   Supervisory Instructions
VD   VLIW Delimiter Instructions            SM   Supervisory Move Special Regs To Arith.
VC   VLIW Conditional Move                  IO   External I/O: non-memory mapped

TABLE 5. mVLIW Instruction List By Opcode

mVLIW Instruction List By Mnemonics

Multiplier                                          Alu
MPY   Multiply (a)
MAU   Multiply Accumulate
MAC   Multiply Accumulate with Carry
MPYI  Multiply Immediate
MAUI  Multiply Immediate & Accumulate
MACI  Multiply Immediate & Accumulate w/ Carry      ACM   Alu Compare
MAD   Multiplier Add (b)                            AAD   Alu Add
MAC   Multiplier Add with Carry                     AAC   Alu Add with Carry
MSU   Multiplier Subtract                           ASU   Alu Subtract
MSB   Multiplier Subtract with Borrow               ASB   Alu Subtract with Borrow
MAI   Multiplier Add Immediate                      AAI   Alu Add Immediate
MAIC  Multiplier Add Immediate with Carry           AAIC  Alu Add Immediate with Carry
MSI   Multiplier Subtract Immediate                 ASI   Alu Subtract Immediate
MSIB  Multiplier Subtract Immediate w/ Borrow       ASIB  Alu Subtract Immediate w/ Borrow
MB1   Multiplier Butterfly type 1                   AB1   Alu Butterfly type 1
MB2   Multiplier Butterfly type 2                   ALC   Alu Complement
                                                    ALA   Alu And
Data Selector Unit                                  ALO   Alu Or
DASL  Dsu Arithmetic Shift Left (c)                 ALNO  Alu Nor
DASR  Dsu Arithmetic Shift Right                    ALX   Alu Exclusive Or
DLSL  Dsu Logical Shift Left                        ALNA  Alu Nand
DLSR  Dsu Logical Shift Right                       ALAnB Alu And (A and not(B))
DRL   Dsu Rotate Left                               ALOnB Alu Or (A or not(B))
DRR   Dsu Rotate Right
DRLC  Dsu Rotate Left thru Carry                    Branch Unit

TABLE 6. mVLIW Instruction List By Mnemonics



8.0 Addressing

Direct Addressing. All addressing is direct. Memory read and write is dependent upon the position (p) bit in the instruction field and the size field. The p bit describes whether 1 or all units are transferred. If p=0, a single unit specified by the size field (byte/half word/word/double word) is aligned. All addresses are checked for validity and memory protection by an origin and bound extent register.

8.1 Address Mapping

The mVLIW requires address mapping. A task can access up to 32 base addresses, or “relocation registers” (BR0 to BR31), which in general are modifiable only by the supervisor or controlling host system. This allows subroutines and data to be shared and dynamically relocated. The basic address mode is shown in Figure 5, "mVLIW Addressing", on page 15.

mVLIW Instruction List By Mnemonics (continued)

Data Selector Unit (continued)                      Branch Unit (continued)
DRRC  Dsu Rotate Right thru Carry                   BDMA  Br. Direct Mpy Cond. Absolute (d)
DMV   Dsu Move Arithmetic Register (e)              BDMR  Br. Direct Mpy Cond. Relative
DMI   Dsu Move Immediate 16-bits                    BDAA  Br. Direct Alu Cond. Absolute
DABS  Dsu Absolute Value                            BDAR  Br. Direct Alu Cond. Relative
DSN   Dsu Signum                                    BIMA  Br. Indirect Mpy Cond Absolute (f)
DALG  Dsu 1 + Log(|Ry|)                             BIMR  Br. Indirect Mpy Cond Relative
DLG   Dsu 1 + Log(Ry)                               BIAA  Br. Indirect Alu Cond Absolute
DLZ   Dsu Count Leading Zeros                       BIAR  Br. Indirect Alu Cond Relative
                                                    BIMBA Br. Indirect Mpy Bit Absolute
Load Unit                                           BIMBR Br. Indirect Mpy Bit Relative
LBXOD Load Base/Index/Offset/Disp (g)               BIABA Br. Indirect Alu Bit Absolute
LBXO  Load Base/Index/Offset                        BIABR Br. Indirect Alu Bit Relative
                                                    BAL   Branch and Link (Call)
Store Unit                                          RTN   Return
SBXOD Store Base/Index/Offset/Disp
SBXO  Store Base/Index/Offset                       Supervision
                                                    SMSA  Supervisor Move Special Reg to Arith
Programmable VLIW                                   SMAS  Supervisor Move Arithmetic to Special
VXE   VLIW Execute                                  SMBA  Supervisor Move Base Reg to Arithmetic
VDI   VLIW Delimiter Instructions                   SMAB  Supervisor Move Arithmetic Reg to Base
VCM   VLIW Conditional Move                         BSV   Branch Supervisor
                                                    BEX   Branch Execute
                                                    BEJ   Branch Exchange Jump
External I/O                                        RTI   Return From Interrupt
IOR   External I/O Read                             RTS   Return From Supervisor
IOW   External I/O Write                            RTJ   Return From Exchange Jump
IOS   External I/O Start
IOX   External I/O Stop

a. A typical instruction would be MPY H0(Rt,Rx,Ry): Multiply half word, position = single half word.
b. Shading denotes functions replicated across execution units.
c. For arithmetic shifts, the sign bit participates if the dsu signed mode is set.
d. 16 Branch conditions are specified including branch unconditional.
e. Multiple combinations exist for Dsu Move Register Types.
f. 16 Branch conditions are specified including branch unconditional.
g. A number of auto incrementing modes are available for Roff,Rdisp. Additionally, modulo addressing is specified by a mode reg.

TABLE 6. mVLIW Instruction List By Mnemonics (continued)



Figure 5. mVLIW Addressing

Limit Registers. Limit registers are provided to indicate the extent of the allocated area and prevent errant memory accesses.¹
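The relocation and limit checks described in Sections 8.0 and 8.1 can be sketched in C as follows. This is a minimal illustration and not the report's simulator code: the type `task_map_t`, the function `translate`, and the convention that each limit is a byte count measured from its base register are all assumptions introduced here.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BASE_REGS 32  /* BR0..BR31, as stated in the text */

/* Hypothetical per-task mapping state: one limit register per base register. */
typedef struct {
    uint32_t base[NUM_BASE_REGS];   /* relocation (base) registers */
    uint32_t limit[NUM_BASE_REGS];  /* assumed: size in bytes of each region */
} task_map_t;

/* Relocate a task-relative offset through base register `br`.
 * Returns true and writes the relocated address on success; returns false
 * when the access would fall outside the allocated extent. */
bool translate(const task_map_t *m, unsigned br, uint32_t offset,
               uint32_t size, uint32_t *phys)
{
    if (br >= NUM_BASE_REGS) return false;
    if (size == 0 || size > m->limit[br]) return false;
    if (offset > m->limit[br] - size) return false;  /* overflow-safe bound check */
    *phys = m->base[br] + offset;
    return true;
}
```

With a base of 0x1000 and a limit of 0x100 in one register, a 4-byte access at offset 0x10 relocates to 0x1010, while the same access at offset 0xFE is rejected because it would run past the end of the region.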

8.2 Address Modification

Addresses are modified with base, index, offset, and displacement. Offset and displacement are 16-bit values concatenated from Rx. A mode register for enabling modulo addressing of Roff/disp is available. A mode register also controls whether the index register is active or forced to zero. The base indices are writable only by the supervisor when supervisor mode is enabled. Since up to 32 base registers (relocation registers) can be specified, common subroutine libraries can be easily linked by any program. Furthermore, common shared data areas can be used to pass data between multiprocessors.
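The base/index/offset/displacement calculation can be illustrated with a short C sketch. The struct `addr_state_t` and the function `effective_address` are hypothetical names chosen for this example, and the sketch assumes that the concatenated offset and displacement are added to the base as one 32-bit value and that a modulo value of zero means "no wrap"; the report itself leaves both details to the implementation.

```c
#include <stdint.h>

/* Hypothetical address-unit state; field names are illustrative only. */
typedef struct {
    uint16_t off, disp;       /* Roff (high half) and Rdisp (low half) */
    uint16_t inc_hi, inc_lo;  /* post-increments applied to off and disp */
    uint16_t mod_hi, mod_lo;  /* modulo values, used when modulo_on != 0 */
    int      index_on;        /* 0 forces the index to zero */
    int      modulo_on;       /* 1 enables modulo addressing */
} addr_state_t;

/* Form EA = Rbase + Rindex + (Roff,Rdisp), then post-increment Roff and Rdisp. */
uint32_t effective_address(addr_state_t *s, uint32_t base, uint32_t index)
{
    uint32_t offdisp = ((uint32_t)s->off << 16) | s->disp;
    uint32_t ea = base + (s->index_on ? index : 0u) + offdisp;

    uint32_t next_off  = (uint32_t)s->off  + s->inc_hi;
    uint32_t next_disp = (uint32_t)s->disp + s->inc_lo;
    if (s->modulo_on && s->mod_hi != 0) next_off  %= s->mod_hi;
    if (s->modulo_on && s->mod_lo != 0) next_disp %= s->mod_lo;
    s->off  = (uint16_t)next_off;
    s->disp = (uint16_t)next_disp;
    return ea;
}
```

Stepping a displacement of 6 by 4 with a modulo of 8 yields displacements 6, 2, 6, and so on, which is the circular-buffer access pattern that modulo addressing is meant to support.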

8.3 Index Arithmetic

Index Operations. The use of general purpose registers for both data and addresses eliminates the requirement for separate index instructions. However, many signal processing applications require parallel generation of the address to perform high-speed load/store operations. The mVLIW provides a separate Load Unit and a separate Store Unit to facilitate concurrent address generation during execution.²

1. An implementation should provide as many limit registers as base registers.
2. Implementation Note: Depending upon the pipeline implementation, it may be necessary to ensure the address is valid at the end of decode. This implies that the use of a register for indexing immediately after an arithmetic execute requires forwarding to the appropriate unit for the index to be properly interpreted.

[Figure 5 content: Base/Index/Offset/Displacement 32-bit Addressing. Diagram labels: Rt; Rx; Ry (index); base/relocation; offset and displacement; an adder producing a 16 / 32-bit address; post increment of offset and displacement from an increment register.]

Options:
EA = Rbase + Rindex + Roff,Rdisp
EA = Rbase + Rindex + Roff
EA = Rbase + Rindex
Modulo Addressing is Provided



8.4 Address Level

All addresses are direct as described in Section 8.0, "Addressing" on page 14.

Indirect Addressing. There is no indirect addressing.

Immediate Addressing. There is no immediate addressing. It should be noted that this could be added using instruction format c with an 8-bit immediate field. I have chosen not to use 2 opcodes on this option.

9.0 Data

There is no explicit provision for character strings.

9.1 Logical

Logical Formats. Logical vectors are provided. The size is specified by the size field in conjunction with the p-bit, which selects either all units of that size or a single unit. Results can be tested for equal to 0, not equal to 0, and bit values set.



9.2 Fixed-Point Numbers

Figure 6. mVLIW Fixed-point Data Types

Notation and allocation. The mVLIW provides binary radix-complement encoding. Results can be tested for up to 16 conditions. Additionally, maskable interrupt handlers can be invoked based upon overflow or underflow. As with all mVLIW datatypes, the fixed-point numbers can be treated as bytes, half-words, words, or double words depending upon the size field. Additionally, the p-bit describes the position and alignment of multiple units.

9.3 Floating-Point Numbers

There is no explicit provision for floating-point numbers.

[Figure 6 content: register layouts selected by the size field and the p-bit.]

Fixed-point Data Types: 32-bit Datapath (bits A31..A0)
size  p   Register contents
B     0   x...x | 8b val
B     1   8b val | 8b val | 8b val | 8b val
H     0   x...x | 16b val
H     1   16b val | 16b val
W     0   32b val
W     1   Undefined: Signal Error
D     0   32b val High, 32b val Low (implied odd/even register pair)
D     1   Undefined: Signal Error

Fixed-point Data Types: 64-bit Datapath (bits A63..A0)
size  p   Register contents
B     0   x...x | 8b
B     1   8b | 8b | 8b | 8b | 8b | 8b | 8b | 8b
H     0   x...x | 16b
H     1   16b | 16b | 16b | 16b
W     0   x...x | 32b val
W     1   32b val | 32b val
D     0   64b val
D     1   64b val High, 64b val Low (implied odd/even register pair)

Note: A mode register specifies a 64-bit machine running a 32-bit task.



10.0 Operations

10.1 Data Handling

There are no memory-to-memory move operations. All transfers to and from memory are by register access. In keeping with the RISC philosophy, this facilitates single-cycle local memory accesses. The length of the operands is specified by the size field in conjunction with the p-bit. Either all units or a single unit participate in the operation.

/* ----------------------------------------------------------------------------- */
/* DMV: dsu move arithmetic register                                             */
/*   b  Size p  Opx                                                              */
/*   b  B    1  001    Byte Swap (LE to BE conversion)                           */
/*   b  H    0  Hx Hy  Move half word Hx to half word Hy                         */
/*   b  H    1  001    Half word swap                                            */
/*   b  W    0  Wx Wy  Move word Wx to word Wy                                   */
/*   b  W    1  001    Word swap                                                 */
/*   b  D    1  000    Move Rx to Ry (copy)                                      */
/* ----------------------------------------------------------------------------- */
/*********************************************************************************/
/* DMVH0():  Dsu Move Arithmetic Register Single Half Word                       */
/* DMVH01(): Dsu Move Arithmetic Register Low Half Word to High Half Word        */
/* DMVSH1(): Dsu Move Arithmetic Register Swap Half Word                         */
/*********************************************************************************/
#define DMVH0(rt,rx) \
{ \
    sp[0].ctlRegs.hidden[0] = rt & 0xFFFF0000; /* keep high half of target */ \
    rt = 0x0000FFFF & rx; \
    rt = sp[0].ctlRegs.hidden[0] | rt; \
    DMVH0cnt++; \
    rtClk++; \
}
#define DMVH01(rt,rx) \
{ \
    sp[0].ctlRegs.hidden[0] = rt; \
    rt = (0x0000FFFF & rx) << 16; /* low to high */ \
    sp[0].ctlRegs.hidden[0] = sp[0].ctlRegs.hidden[0] & 0x0000FFFF; \
    rt = rt | sp[0].ctlRegs.hidden[0]; \
    DMVH01cnt += 2; \
    rtClk++; \
}
#define DMVSH1(rt,rx) \
{ \
    sp[0].ctlRegs.hidden[0] = rt; \
    rt = (0x0000FFFF & rx) << 16; /* low to high */ \
    sp[0].ctlRegs.hidden[0] = sp[0].ctlRegs.hidden[0] >> 16; \
    sp[0].ctlRegs.hidden[0] = sp[0].ctlRegs.hidden[0] & 0x0000FFFF; \
    rt = rt | sp[0].ctlRegs.hidden[0]; \
    DMVSH1cnt += 2; \
    rtClk++; \
}

Code Listing 8. mVLIW move instructions
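The swap and move variants of DMV can also be written as plain C functions, which may be easier to check than the simulator macros. The following is a sketch on a 32-bit word only; the function names `byte_swap32`, `half_swap32`, and `move_low_half` are made up for this example and are not part of the mVLIW simulator.

```c
#include <stdint.h>

/* Reverse the four bytes of a 32-bit word (little-endian <-> big-endian). */
uint32_t byte_swap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Exchange the high and low half words of a 32-bit word. */
uint32_t half_swap32(uint32_t x)
{
    return (x << 16) | (x >> 16);
}

/* Copy the low half word of src into the low half of dst,
 * leaving the high half of dst unchanged. */
uint32_t move_low_half(uint32_t dst, uint32_t src)
{
    return (dst & 0xFFFF0000u) | (src & 0x0000FFFFu);
}
```

For instance, `byte_swap32(0x11223344)` gives 0x44332211 and `half_swap32(0x11223344)` gives 0x33441122.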

10.2 Logic

Connectives. A rich set of connectives is provided as shown in Code Listing 9, "mVLIW Logic Instructions", on page 19.

Data Selector Unit
DASL  Dsu Arithmetic Shift Left             DRR   Dsu Rotate Right
DASR  Dsu Arithmetic Shift Right            DRLC  Dsu Rotate Left thru Carry
DLSL  Dsu Logical Shift Left                DRRC  Dsu Rotate Right thru Carry
DLSR  Dsu Logical Shift Right               DMV   Dsu Move Arithmetic Register
DRL   Dsu Rotate Left                       DMI   Dsu Move Immediate 16-bits

TABLE 7. mVLIW Data Handling Instructions



All operations are available for all supported sizes and are interpreted based on the size field and the p-bit.

/* ----------------------------------------------------------------------------- */
/* AL: alu logic operations     Fcn set by Opx Field                             */
/*     C:  complement           X:   XOR                                         */
/*     A:  AND                  NA:  NAND                                        */
/*     O:  OR                   AnB: AND NOT B                                   */
/*     NO: NOR                  OnB: OR NOT B                                    */
/*                                                                               */
/*   b CB: complement byte                                                       */
/*   b CH: complement half word                                                  */
/*   b CW: complement word                                                       */
/*   b CD: complement double word                                                */
/*   b AB: AND byte                                                              */
/*   b AH: AND half word                                                         */
/*   b AW: AND word                                                              */
/*   b AD: AND double word                                                       */
/*   etc. for other logic ops                                                    */
/* ----------------------------------------------------------------------------- */
/*********************************************************************************/
/* ALAB0(): Alu AND Single Byte                                                  */
/* ALAB1(): Alu AND All Bytes                                                    */
/* ALAW0(): Alu AND Single Word                                                  */
/* ALAW1(): Alu AND All Words (=ALAW0 in 32-bit machine)                         */
/*********************************************************************************/
#define ALAB0(rt,rx,ry) \
{ \
    sp[0].ctlRegs.hidden[0] = 0x000000FF & rx & ry; \
    rt = (rt & 0xFFFFFF00) | sp[0].ctlRegs.hidden[0]; \
    ALAB0cnt++; \
    rtClk++; \
}
#define ALAB1(rt,rx,ry) \
{ \
    rt = rx & ry; \
    ALAB1cnt += 4; /* four byte-wide ANDs in one cycle */ \
    rtClk++; \
}
#define ALAW0(rt,rx,ry) \
{ \
    rt = rx & ry; /* Bit AND */ \
    ALAW0cnt++; \
    rtClk++; \
}
#define ALAW1(rt,rx,ry) \
{ \
    rt = rx & ry; /* only 1 word in 32-bit implementation */ \
    ALAW1cnt++; \
    rtClk++; \
}

Code Listing 9. mVLIW Logic Instructions

Shift / Rotate. A full set of shift and rotate instructions is provided, including arithmetic and logical shifts. Signed arithmetic is also provided.

Logical Connectives
C    Complement            X     Xor
A    And                   NA    Nand
O    Or                    AnB   A And not(B)
NO   Nor                   OnB   A Or not(B)

TABLE 8. mVLIW Connectives

/* ----------------------------------------------------------------------------- */
/* DSR: dsu shift & rotate                                                       */
/*   b Opx: ASR   000   Arithmetic Shift Right                                   */
/*   b      ASL   001   Arithmetic Shift Left                                    */
/*   b      LSR   010   Logical Shift Right                                      */
/*   b      LSL   011   Logical Shift Left                                       */
/*   b      ROR   100   Rotate Right                                             */
/*   b      ROL   101   Rotate Left                                              */
/*   b      RORC  110   Rotate Right thru Carry                                  */
/*   b      ROLC  111   Rotate Left thru Carry                                   */
/* ----------------------------------------------------------------------------- */
/*********************************************************************************/
/* DLSRW0(): Dsu Logical Shift Right Single Word                                 */
/*********************************************************************************/
#define DLSRW0(rt, rx, ry) \
{ \
    rt = rx >> ry; \
    DLSRW0cnt++; \
    rtClk++; \
}

Code Listing 10. mVLIW Shift and Rotate Instructions.

Figure 7. mVLIW Logical Shift Example With Byte Selections

10.3 Fixed-Point Arithmetic

Fixed-point arithmetic is register to register. The memory operand for load/store varies based on the size field and the position bit (p). Sign is available at the Instruction level. Saturation, rounding, and carry control are embedded in special registers. A large variety of conditions are detected including overflow, underflow, and zero. Add, subtract, and butterfly type-1 are available in both the Multiplier unit and the ALU. Most conditional branches are performed in the Alu. It is possible to branch based upon the multiplier output as well. In the case of conflicting indicators, the Alu interrupts are serviced first.

mVLIW Fixed Point Instructions

Multiplier                                          Alu
MPY   Multiply (a)
MAU   Multiply Accumulate
MAC   Multiply Accumulate with Carry
MPYI  Multiply Immediate
MAUI  Multiply Immediate & Accumulate
MACI  Multiply Immediate & Accumulate w/Carry       ACM   Alu Compare
MAD   Multiplier Add (b)                            AAD   Alu Add
MAC   Multiplier Add with Carry                     AAC   Alu Add with Carry
MSU   Multiplier Subtract                           ASU   Alu Subtract
MSB   Multiplier Subtract with Borrow               ASB   Alu Subtract with Borrow
MAI   Multiplier Add Immediate                      AAI   Alu Add Immediate
MAIC  Multiplier Add Immediate with Carry           AAIC  Alu Add Immediate with Carry
MSI   Multiplier Subtract Immediate                 ASI   Alu Subtract Immediate
MSIB  Multiplier Subtract Immediate w/Borrow        ASIB  Alu Subtract Immediate w/Borrow
MB1   Multiplier Butterfly type 1                   AB1   Alu Butterfly type 1
MB2   Multiplier Butterfly type 2                   ALC   Alu Complement

TABLE 9. mVLIW Fixed-point Instructions

[Figure 7 content: Rotate Left Example, size = Byte.]

p = 0 (single byte)
Rx  shift operand      data     10010001
Ry  shift value        x...x    00000011
Rt  rotated result     data     10001100

p = 1 (4 independent shifted results)
Rx  shift operand      10010001 10010001 10010001 10010001
Ry  shift value        00000001 00000010 00000011 00000100
Rt  result             00100011 01000110 10001100 00011001
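The byte-selection behaviour in the rotate-left example can be reproduced with a few lines of C. This is only a sketch of the semantics shown in the figure: the function names are invented for the example, and it assumes that with p = 0 the upper three bytes of the target register are left unchanged, as the ALAB0() macro does for the byte AND.

```c
#include <stdint.h>

/* Rotate an 8-bit value left by n bit positions (n taken modulo 8). */
static uint8_t rotl8(uint8_t v, unsigned n)
{
    n &= 7u;
    return (uint8_t)((v << n) | (v >> ((8u - n) & 7u)));
}

/* p = 0: rotate only the low byte of rx by the low byte of ry;
 * the other bytes of the target rt are kept (an assumption). */
uint32_t drl_byte_single(uint32_t rt, uint32_t rx, uint32_t ry)
{
    return (rt & 0xFFFFFF00u) | rotl8((uint8_t)rx, (unsigned)(ry & 0xFFu));
}

/* p = 1: rotate each byte of rx by the count in the matching byte of ry. */
uint32_t drl_byte_all(uint32_t rx, uint32_t ry)
{
    uint32_t out = 0;
    for (int i = 0; i < 4; i++) {
        uint8_t  v = (uint8_t)(rx >> (8 * i));
        unsigned n = (unsigned)((ry >> (8 * i)) & 0xFFu);
        out |= (uint32_t)rotl8(v, n) << (8 * i);
    }
    return out;
}
```

Using the figure's operands, `drl_byte_all(0x91919191, 0x01020304)` returns 0x23468C19, which matches the four result bytes 00100011, 01000110, 10001100, and 00011001.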



Load and Store. Transmission is between memory and registers.

mVLIW Fixed Point Instructions (continued)

                                                    ALA   Alu And
Data Selector Unit                                  ALO   Alu Or
DABS  Dsu Absolute Value                            ALNO  Alu Nor
DSN   Dsu Signum                                    ALX   Alu Exclusive Or
DALG  Dsu 1 + Log(|Ry|)                             ALNA  Alu Nand
DLG   Dsu 1 + Log(Ry)                               ALAnB Alu And (A and not(B))
DLZ   Dsu Count Leading Zeros                       ALOnB Alu Or (A or not(B))

Load Unit                                           Store Unit
LBXOD Load Base/Index/Offset/Disp (c)               SBXOD Store Base/Index/Offset/Disp
LBXO  Load Base/Index/Offset                        SBXO  Store Base/Index/Offset

a. A typical instruction would be MPYH0(Rt,Rx,Ry): Multiply half word, position = single half word.
b. Shading denotes functions replicated across execution units.
c. A number of auto incrementing modes are available for Roff,Rdisp. Additionally, modulo addressing is specified by a mode reg.

TABLE 9. mVLIW Fixed-point Instructions (continued)

/* ----------------------------------------------------------------------------- */
/* Load / Store:                                                                 */
/*                                                                               */
/* LB: load base/index/offset/displacement (requires 3-bit adder)                */
/*     Rindex is special register                                                */
/*     Rx = Rbase   Roff = Ry(h1)   Rdisp = Ry(h0)                               */
/*                                                                               */
/* modes: modulo[1]: 0 no modulo addressing                                      */
/*                   1 modulo addressing enabled                                 */
/*                     - Uses SPRmodulo[2]                                       */
/*                     - If modulo set, increment is by modulo reg               */
/*        index[1]:  0 Force *Rindex = 0                                         */
/*        sign[1]:   0 Treat Roffset Rdisplacement as unsigned                   */
/*                   1 Treat Roffset Rdisplacement as signed                     */
/*                                                                               */
/* Opx                                                                           */
/* 000  EA = Rbase + Rindex + Roff,Rdisp                                         */
/*      Roff  = Roff + Rinc(h1)                                                  */
/*      Rdisp = Rdisp + Rinc(h0)                                                 */
/*                                                                               */
/* 001  EA = Rbase + Rindex + Roff                                               */
/*      Roff = Roff + Rinc                                                       */
/*                                                                               */
/* 010  EA = Rbase + Rindex + Roff,Rdisp                                         */
/*      Rdisp = Rdisp + Rinc(h0)                                                 */
/*      if( Rdisp rolls over )                                                   */
/*          Roff = Roff + Rinc(h1)                                               */
/*                                                                               */
/* 011  EA = Rbase + Rindex + Roff,Rdisp                                         */
/*      Roff = Roff + Rinc(h1)                                                   */
/*      if( Roff rolls over )                                                    */
/*          Rdisp = Rdisp + Rinc(h0)                                             */
/*                                                                               */
/*   b B:  byte                                                                  */
/*   b HW: half word                                                             */
/*   b W:  word                                                                  */
/*   b DW: double word                                                           */
/* ----------------------------------------------------------------------------- */
/*********************************************************************************/
/* LB0B0(): Opx = 0, 1 Byte                                                      */
/*********************************************************************************/
#define LB0B0(rt,rx,ry) /* rx = base, ry = off/disp, rt = target */ \
{                       /* Rinc & Rindex = special regs */ \
    if( sp[0].ctlRegs.indexCtl == 0 ) Rindex = 0; \
    sp[0].ctlRegs.addrLoad = rx + ry + Rindex; /* WARNING: No Sign check yet */ \
    ADDRESS_CHECK(sp[0].ctlRegs.addrLoad);     /* invalid address signal */ \
    rt = sp[0].dMem[ sp[0].ctlRegs.addrLoad ]; /* Get Data */ \
    BYTE_ENABLES( 0b0001 ); /* only latch low byte (B0 type instr) */ \
    /* Update Address */ \
    if( sp[0].ctlRegs.moduloCtl == 0 ) /* no modulo */ \
    { \
        Roff = Roff + RincHigh; \
        Rdisp = Rdisp + RincLow; \
    } \
    else if( sp[0].ctlRegs.moduloCtl == 1 ) \
    { \
        Roff = (Roff + RincHigh) % RmodHigh; \
        Rdisp = (Rdisp + RincLow) % RmodLow; \
    } \
    ry = (Roff & 0xFFFF0000) | (Rdisp & 0x0000FFFF); \
    LB0B0cnt += 5; /* Load, Add Roff, Add Rdisp, Mod Roff, Mod Rdisp */ \
    rtClk++; \
}

Code Listing 11. mVLIW Load/Store Instructions.

Add and Subtract. Fixed-point add and subtract are provided for all forms.

/* ----------------------------------------------------------------------------- */
/* MAD: multiplier add                                                           */
/* MAC: multiplier add with carry                                                */
/* MSU: multiplier subtract                                                      */
/* MSB: multiplier subtract with borrow                                          */
/*   a B:  Add byte (b0|b1|b2|b3)                                                */
/*   a HW: Add half word                                                         */
/*   a W:  Add words                                                             */
/*   a DW: Add double word                                                       */
/*                                                                               */
/* AAD: alu add                                                                  */
/* AAC: alu add with carry                                                       */
/* ASU: alu subtract                                                             */
/* ASB: alu subtract with borrow                                                 */
/* ACM: alu compare (only affects flags not register contents)                   */
/*   a B:  Add byte                                                              */
/*   a HW: Add half word                                                         */
/*   a W:  Add word                                                              */
/*   a DW: Add double word                                                       */
/* ----------------------------------------------------------------------------- */
/* MADI: multiplier add immediate                                                */
/* MACI: multiplier add immediate with carry                                     */
/* MSUI: multiplier subtract immediate                                           */
/* MSBI: multiplier subtract immediate with borrow                               */
/*   c B:  Add byte (b0|b1|b2|b3)                                                */
/*   c HW: Add half word                                                         */
/*   c W:  Add words                                                             */
/*   c DW: Add double word                                                       */
/*                                                                               */
/* AADI: alu add immediate                                                       */
/* AACI: alu add immediate with carry                                            */
/* ASUI: alu subtract immediate                                                  */
/* ASBI: alu subtract immediate with borrow                                      */
/*   c B:  Add byte (b0|b1|b2|b3)                                                */
/*   c HW: Add half word                                                         */
/*   c W:  Add words                                                             */
/*   c DW: Add double word                                                       */
/*                                                                               */
/* Scaling modes are set in the SASR (Sequencer Alu Scaling Register)            */
/*********************************************************************************/
/* AADW0(): Alu Add Single Word                                                  */
/*********************************************************************************/
#define AADW0(rt,rx,ry) \
{ \
    CASE( sign, saturation, round );     /* arithmetic type */ \
    SHIFT( sp[0].ctlRegs.SASR[1] == 0 ); /* shift values */ \
    rt = (rx + ry) << shiftVal; \
    AADW0cnt++; \
    if( shiftVal > 0 ) exOpCnt++; \
    rtClk++; \
}

Code Listing 12. mVLIW Add/Subtract Instructions

Multiplication and Add. The mVLIW provides a multiply and accumulate in addition to a multiply only option. The multiply accumulate option is a key feature of inner product signal processing applications. Additionally, a scaling factor can be applied based upon the Multiplier Control Register.
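To show why multiply accumulate matters for inner products, here is a minimal C sketch of a dot product built from repeated Rx*Ry + Rm steps. The function `inner_product` and its `scale` parameter are assumptions made for illustration: `scale` stands in for the scaling factor that the text places in the Multiplier Control Register, and the final saturation to 32 bits is one possible policy rather than the documented mVLIW behaviour.

```c
#include <stddef.h>
#include <stdint.h>

/* Inner product of two 16-bit vectors, one multiply-accumulate per element.
 * `scale` (assumed < 32) divides the final sum by 2^scale. */
int32_t inner_product(const int16_t *x, const int16_t *y, size_t n, unsigned scale)
{
    int64_t acc = 0;                            /* wide accumulator avoids overflow */
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * (int32_t)y[i];   /* one MAU step: Rx*Ry + Rm */

    acc /= ((int64_t)1 << scale);               /* scaling; truncates toward zero */
    if (acc > INT32_MAX) acc = INT32_MAX;       /* saturate instead of wrapping */
    if (acc < INT32_MIN) acc = INT32_MIN;
    return (int32_t)acc;
}
```

On a machine with a single-cycle multiply accumulate, each loop iteration corresponds to one MAU instruction, so an n-element inner product costs roughly n multiplier cycles plus the loads that feed it.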

/* ----------------------------------------------------------------------------- */
/* MPY: multiply only. No accumulate            RxRy                             */
/* MAU: multiply accumulate                     RxRy + Rm                        */
/* MAC: multiply and accumulate with carry                                       */
/*   a B:  Multiply byte                                                         */
/*   a HW: Multiply half words                                                   */
/*   a W:  Multiply words                                                        */
/*   a DW: Multiply Double words (only valid on 64-bit machines)                 */
/* ----------------------------------------------------------------------------- */
/* MPYI: multiply immediate only. No accumulate RxRy                             */
/* MAUI: multiply immediate & accumulate        RxRy + Rm                        */
/* MACI: multiply immediate & accumulate with carry                              */
/*   c B: Multiply byte                                                          */
/*   c H: Multiply half words                                                    */
/*   c W: Multiply words                                                         */
/*   c D: Multiply Double words (only valid on 64-bit machines)                  */
/*********************************************************************************/
/* MAUB0(): Multiply Accumulate a Byte                                           */
/*********************************************************************************/
#define MAUB0(rt,rx,ry) \
{ \
    CASE( sign, saturation, round );           /* arithmetic type */ \
    SHIFT( sp[0].ctlRegs.SMSR[1] == 0 );       /* shift values */ \
    sp[0].ctlRegs.hidden[0] = rt & 0xFFFFFF00; \
    rt = ((short)(rx*ry + rm)) << shiftVal; \
    SIGNAL( ByteOverFlow );                    /* signal if oflow */ \
    rt = rt & 0x000000FF;                      /* make it a byte val */ \
    rt = sp[0].ctlRegs.hidden[0] | rt;         /* restore other bytes */ \
    MAUWcnt++; exOpCnt++; \
    if(sp[0].ctlRegs.SMSR[0] > 0) exOpCnt++; \
    rtClk++; \
}

Code Listing 13. mVLIW Multiply/Multiply Accumulate Instructions.

Compare. A full set of compare functions is provided in the ALU. They are specified in Code Listing 12, "mVLIW Add/Subtract Instructions", on page 22. There are no restrictions on usage; the only difference between subtract and compare is that writing the result to the target register is suppressed for compare.

Special Functions. Log(x), count leading zeros, and single-cycle multiplies are provided. Additionally, a special butterfly option is provided for signal processing applications. The butterfly instruction is particularly useful for Fourier Transform techniques.

/* ----------------------------------------------------------------------------- */
/* MB1: multiplier Butterfly type 1: (Rx+Ry,Rx-Ry)                               */
/* MB2: multiplier Butterfly type 2: (Rm+RxRy,Rm-RxRy)                           */
/*   a B:  Butterfly byte                                                        */
/*   a HW: Butterfly half word                                                   */
/*   a W:  Butterfly word                                                        */
/*   a DW: Butterfly double word                                                 */
/*                                                                               */
/* AB1: alu Butterfly type 1: (Rx+Ry,Rx-Ry)                                      */
/* AB2: Alu Butterfly type 2: (Rm+RxRy,Rm-RxRy) not supported                    */
/*   a B:  Butterfly byte                                                        */
/*   a HW: Butterfly half word                                                   */
/*   a W:  Butterfly word                                                        */
/*   a DW: Butterfly double word                                                 */
/* ----------------------------------------------------------------------------- */
/* DFN: dsu special functions                                                    */
/*   b CLZ 0000  |Ry|                                                            */
/*   b CLZ 0001  Signum(Ry)                                                      */
/*   b CLZ 0010  1 + Log(|Ry|)                                                   */
/*   b CLZ 0011  1 + Log(Ry)                                                     */
/*   b CLZ 0100  Count Leading Zeros                                             */
/*   B:  byte                                                                    */
/*   HW: half word                                                               */
/*   W:  word                                                                    */
/*   DW: double word                                                             */
/* ----------------------------------------------------------------------------- */

Code Listing 14. mVLIW Special Fixed-point Compute Instructions
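The two butterfly forms named in the listing, (Rx+Ry, Rx-Ry) and (Rm+RxRy, Rm-RxRy), can be sketched in C as shown here. The type `butterfly_t` and the function names are hypothetical, the sketch works on whole 32-bit words only, and it assumes inputs small enough that the sums and products do not overflow; scaling and saturation are left out.

```c
#include <stdint.h>

/* Pair of results produced by one butterfly operation. */
typedef struct {
    int32_t sum;
    int32_t diff;
} butterfly_t;

/* Type 1: (Rx + Ry, Rx - Ry) */
butterfly_t butterfly1(int32_t rx, int32_t ry)
{
    butterfly_t r = { rx + ry, rx - ry };
    return r;
}

/* Type 2: (Rm + Rx*Ry, Rm - Rx*Ry), the multiply is shared by both outputs */
butterfly_t butterfly2(int32_t rm, int32_t rx, int32_t ry)
{
    int32_t p = rx * ry;
    butterfly_t r = { rm + p, rm - p };
    return r;
}
```

Type 2 has the shape of a radix-2 transform step in which one input is multiplied by a coefficient before the add and subtract, which is why producing both outputs in one instruction helps Fourier-style transforms.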

Parallel Constructs. Three special VLIW constructs are provided to allow concurrent operation and programmable VLIW capabilities.

mVLIW Parallel Constructs
VXE  VLIW Execute                           VCM  VLIW Conditional Move
VDI  VLIW Delimiter Instruction

TABLE 10. mVLIW Parallel Constructs.



/*********************************************************************************/
/* VXE: VLIW Execute                                                             */
/* VDI: VLIW Segment Delimiter Instruction                                       */
/*        Opx                                                                    */
/*   i    000   VXE                                                              */
/*   i    000   VDI - Warning: Field names are same but meaning is different     */
/*********************************************************************************/
/* VCM: VLIW Conditional Move                                                    */
/*   h                                                                           */
/*********************************************************************************/
/* VXE(): VLIW Execute                                                           */
/* VDI(): VLIW Delimiter Instruction                                             */
/* VCM(): VLIW Conditional Move                                                  */
/*********************************************************************************/
#define VXE(simAddr, cnt, execCtl) \
{ \
    while( !cnt ) /* Repeat Control */ \
    { \
        for( i=0; i<execution_units; i++ ) \
        { \
            GetInstruction( simAddr[i] ); /* Bring back each exec unit's instr */ \
            if( !execCtl[i] ) /* Force a NOP? */ \
                ExecuteInstruction(); /* Warning, Loads must go first */ \
        } \
    } \
    VXEcnt += numRealInstructionsExecutedInParallel; \
    rtClk++; \
}
#define VDI(addr, cnt, execCtl) \
{ \
    while( !cnt ) /* for every instruction in the list */ \
    { \
        PutInstruction( simAddr[addr][decodedSlot] ); /* write 32-bits */ \
        if( execCtl == executeIt ) /* can execute while writing */ \
            ExecuteInstruction(); \
    } \
    VDIcnt++; \
    rtClk++; \
}
#define VCMW0(Opx, rt, rx, ry) \
{ \
    if( ExecUnit[Opx] == cond ) \
        rt = rx; \
    else \
        rt = ry; \
    VCMW0cnt++; \
    rtClk++; \
}

Code Listing 15. mVLIW Parallel (VLIW) Instructions.

11.0 Instruction Sequencing

Possible interrupts are taken before each instruction. When an external interrupt occurs, the pending interrupts are registered, but are not executed.

/*********************************************************************************/
/* BasicCycle(): mVLIW Basic Cycle                                               */
/*********************************************************************************/
BasicCycleVLIW()
{
    while( !wait || !halt )
    {
        InterruptVLIW();        /* Interrupt */
        InstructionFetchVLIW(); /* Instruction Fetch */
    }
}

Code Listing 16. BasicCycleVLIW(): mVLIW basic cycle.

/*********************************************************************************/
/* InstructionFetchVLIW(): mVLIW Instruction Fetch                               */
/*********************************************************************************/
InstructionFetchVLIW()
{
    /* Generate Next Instruction Address */
    /* If invalid, signal error */
}



Code Listing 17. InstructionFetchVLIW(): mVLIW instruction fetch.

Next Instruction. The instructions are assumed to be word aligned on 32-bit boundaries. Instructions are fetched from Instruction Store every cycle. A Harvard architecture assures concurrent instruction and data fetches.

Sequencing Instructions. The sequencing instructions come in two basic types: branch direct and branch indirect. In either case, the branch can be specified relative to the program counter or as an absolute address. A rich set of branch conditions is provided. To ease implementation concerns, only logical operations result in “hot branches”.¹ Additionally, all implementations must adhere to filling only one delay slot before the branch is executed. The compiler or assembler must issue warnings and fill subsequent delay slots with NOPs.²

11.1 Linear Sequence

Unconditional Branch. An unconditional branch is a case of the Branch Direct and Branch Indirect instructions.

11.2 Decision

Conditional Branching. The branch condition is selected based on the Condition bits in the opcode. The execution unit that caused the branch is selected from the Opcode Extension field. Branching is effected by changing the Instruction Address Register. Any implementation must only assume that one delay slot is available and that it will always execute. Conditional branches may be relative, direct, or indirect. A separate Opcode is provided for system service calls. An Instruction Base Register offset is added to each branch address before the branch is taken.

/*********************************************************************************/
/* BDR: Branch Direct                                                            */
/*   r s                                                                         */
/*   0 x   Branch to Direct Address                                              */
/*   1 0   Branch Relative with positive offset                                  */
/*   1 1   Branch Relative with negative offset                                  */
/*                                                                               */
/* Conditions:                                                                   */
/*   U = Set of all 32 output bits                                               */
/*   Z = U is all zeros                                                          */
/*   S = Sign bit of U                                                           */
/*   Y = Carry out of execute unit                                               */
/*   V = Overflow detected                                                       */
/*   A = Input A for A-B operation                                               */
/*   B = Input B for A-B operation                                               */
/*                                                                               */
/*   GT,GE,LE,LT apply to signed numbers                                         */
/*   ie: LT = S xor V    L = not(Carry)                                          */
/*                                                                               */
/*            Opx  Cond                                                          */
/*   f M_P    000  0000   Positive       U > 0                                   */
/*   f M_NN   000  0001   Not Negative   U >= 0                                  */
/*   f M_NP   000  0010   Not Positive   U <= 0                                  */
/*   f M_N    000  0011   Negative       U < 0                                   */

Branch Unit
BDMA  Br. Direct Mpy Cond. Absolute (a)             BIMBA Br. Indirect Mpy Bit Absolute
BDMR  Br. Direct Mpy Cond. Relative                 BIMBR Br. Indirect Mpy Bit Relative
BDAA  Br. Direct Alu Cond. Absolute                 BIABA Br. Indirect Alu Bit Absolute
BDAR  Br. Direct Alu Cond. Relative                 BIABR Br. Indirect Alu Bit Relative
BIMA  Br. Indirect Mpy Cond Absolute (b)            BAL   Branch and Link (Call)
BIMR  Br. Indirect Mpy Cond Relative                BSV   Branch Supervisor
BIAA  Br. Indirect Alu Cond Absolute                BEX   Branch Execute
BIAR  Br. Indirect Alu Cond Relative                BEJ   Branch Exchange Jump

a. 16 Branch conditions are specified including branch unconditional for the Direct Branches.
b. 16 + 8 logical branch conditions are specified including branch unconditional for Indirect Branches.

TABLE 11. mVLIW Branch Instructions

1. A hot-branch is a branch in which the branch address is determined during execute and taken during the same cycle. Preferably, branch addresses are calculated during decode.
2. This restriction allows the same binary code to run on multiple implementations.



/*   f M_H    000  0100   Higher         A > B                                   */
/*   f M_NL   000  0101   Not Less Than  A >= B                                  */
/*   f M_NH   000  0110   Not Higher     A <= B                                  */
/*   f M_L    000  0111   Less than      A < B                                   */
/*   f M_NZ   000  1000   Not Zero       U <> 0                                  */
/*   f M_UC   000  1001   Unconditional                                          */
/*   f M_Z    000  1010   Zero           U = 0                                   */
/*   f M_V    000  1011   Overflow                                               */
/*   f M_GT   000  1100   Greater Than   A > B                                   */
/*   f M_GE   000  1101   Greater Equal  A >= B                                  */
/*   f M_LE   000  1110   Less Equal     A <= B                                  */
/*   f M_LT   000  1111   Less Than      A < B                                   */
/*                                                                               */
/*   e A_P    001  0000   Positive       U > 0                                   */
/*   etc. for rest of ALU                                                        */
/* ----------------------------------------------------------------------------- */
/* BI: Branch Indirect                                                           */
/*   r s                                                                         */
/*   0 x   Branch to Address pointed to in Rbr                                   */
/*   1 0   Branch Relative with positive offset in Rbr                           */
/*   1 1   Branch Relative with negative offset in Rbr                           */
/*                                                                               */
/*   Opx = 000 or 001 (same conditions as BD) except type f instruction          */
/*   Opx = 010: The only hot branches are logical                                */
/*              Opx  Cond                                                        */
/*   g M_BANZ   010  0000   And Not Zero    Rx & Rmask <> 0                      */
/*   g M_BAZ    010  0001   And Zero        Rx & Rmask = 0                       */
/*   g M_BONZ   010  0010   Or Not Zero     Rx | Rmask <> 0                      */
/*   g M_BOZ    010  0011   Or Zero         Rx | Rmask = 0                       */
/*   g M_BXNZ   010  0100   Xor Not Zero    Rx xor Rmask <> 0                    */
/*   g M_BXZ    010  0101   Xor Zero        Rx xor Rmask = 0                     */
/*   g M_BXNNZ  010  0110   Xnor Not Zero   Rx xnor Rmask <> 0                   */
/*   g M_BXNZ   010  0111   Xnor Zero       Rx xnor Rmask = 0                    */
/*                                                                               */
/*   g A_BANZ   011  0000   etc. for rest of ALU                                 */
/* ----------------------------------------------------------------------------- */
/*********************************************************************************/
/* BDAA0():  Branch Direct Alu Condition Absolute Unsigned                       */
/* BDAR0():  Branch Direct Alu Condition Relative Unsigned                       */
/* BIABA0(): Branch Indirect Alu Bit Absolute Unsigned                           */
/*********************************************************************************/
#define BDAA0(addr, cond, rbase) \
{ \
    if( aluResult == cond ) \
        sp[0].ctlRegs.Iaddr = rbase + addr; \
    BDAA0cnt++; \
    rtClk++; \
}
#define BDAR0(addr, cond, rbase) \
{ \
    if( aluResult == cond ) \
        sp[0].ctlRegs.Iaddr = sp[0].ctlRegs.Iaddr + rbase + addr; \
    BDAR0cnt++; \
    rtClk++; \
}
#define BIABA0(rbranch, rmask, cond, rbase) \
{ \
    if( cond == 0b000 ) \
        if( (aluResult & rmask) != 0x00000000 ) \
            sp[0].ctlRegs.Iaddr = sp[0].ctlRegs.Iaddr + rbase + rbranch; \
    if( cond == etc. ) ... \
    BIABA0cnt += 2; \
    rtClk++; \
}

Code Listing 18. mVLIW Conditional Branching.

11.3 Iteration

Incrementation and Termination. Iteration is by comparison in the Alu or Multiplier. There are no special instructions required.
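A minimal sketch of what this looks like in a C emulation follows. It assumes a counted loop built from an ALU subtract and a conditional branch on the ALU result. The names `Machine`, `alu_sub_imm`, `branch_alu_not_zero` and `run_counted_loop` are hypothetical and are not part of the mVLIW macro set; each emulated instruction adds one to a cycle counter in the same spirit as rtClk.

```c
#include <assert.h>

/* Hypothetical emulator state, for illustration only. */
typedef struct {
    int  alu_result;   /* result of the most recent ALU operation */
    long clk;          /* emulated instruction count (cf. rtClk)  */
} Machine;

/* ALU subtract-immediate: reg -= imm, and latch the result as the ALU condition. */
static void alu_sub_imm(Machine *m, int *reg, int imm)
{
    *reg -= imm;
    m->alu_result = *reg;
    m->clk++;
}

/* Conditional branch on "ALU result not zero"; returns 1 when the branch is taken. */
static int branch_alu_not_zero(Machine *m)
{
    m->clk++;
    return m->alu_result != 0;
}

/* Runs an empty loop body `count` times and returns the number of passes. */
static int run_counted_loop(Machine *m, int count)
{
    int passes = 0;
    assert(count > 0);                 /* the loop tests the count after the body */
    do {
        passes++;                      /* loop body would go here */
        alu_sub_imm(m, &count, 1);     /* incrementation          */
    } while (branch_alu_not_zero(m));  /* termination             */
    return passes;
}
```

With a count of 8 the loop makes 8 passes and issues 16 emulated instructions: one subtract and one branch per pass.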



11.4 Delegation

/*********************************************************************/
/* BS: Branch Service Routine                                        */
/* g  BAL/RTN  r00  Branch and Link    r=0: Branch  r=1: Return      */
/*********************************************************************/
/* BAL(): Branch and Link                                            */
/*********************************************************************/
#define BAL(rbranch, rmask, rbase) \
{ \
    SAVE_STATUS(); \
    sp[0].ctlRegs.Iaddr = rbase + rbranch; \
    sp[0].ctlRegs.passInfo = rmask; /* control word */ \
    BALcnt++; \
    rtClk++; \
}

Code Listing 19. mVLIW Delegation Instructions

Call. Special service routines are provided to allow for operating system support. Branch and Link gives subroutine support to small modules that do not need to be shared. Routines that require access to I/O space and other protected memory locations must use Branch Supervisor calls.

Stack. There is no stack provided. The ramifications of this approach are discussed in Section 14.0, "Discussion" on page 29.
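The consequence of having no stack can be sketched in C as follows. This is an illustration under a simplifying assumption: the return address lives in a single link register, and the `Cpu` type and its functions are hypothetical stand-ins rather than the BAL macro itself. A nested call overwrites the link register, so the calling routine has to save and restore it in software.

```c
#include <assert.h>

/* Hypothetical model: one instruction address and one link register. */
typedef struct {
    unsigned iaddr;   /* current instruction address          */
    unsigned link;    /* return address set by Branch and Link */
} Cpu;

/* Branch and Link: remember where to resume, then jump. */
static void branch_and_link(Cpu *c, unsigned target)
{
    c->link  = c->iaddr + 1u;
    c->iaddr = target;
}

/* Return: resume at the saved address. */
static void ret(Cpu *c)
{
    c->iaddr = c->link;
}

/* Two nested calls with a software save of the link register. */
static unsigned nested_call_demo(void)
{
    Cpu c = { 100u, 0u };
    unsigned saved_link;

    branch_and_link(&c, 200u);   /* outer call: link = 101         */
    saved_link = c.link;         /* software save, no hardware stack */
    branch_and_link(&c, 300u);   /* inner call: link = 201         */
    ret(&c);                     /* back in the outer routine      */
    assert(c.iaddr == 201u);
    c.link = saved_link;         /* software restore               */
    ret(&c);                     /* back in the original caller    */
    return c.iaddr;
}
```

Without the save and restore, the second return would go back to address 201 again instead of 101.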

12.0 Supervision

12.1 Concurrency

Any number of I/O channels may be in concurrent operation with the mVLIW processor.

Processor Interconnection. Concurrent operation of mVLIW machines is obtained by connecting multiple processors to a common bus. Parallel operation may also be achieved by upgrading to an nxm IBM Mfast parallel processing solution1.

12.2 Process Interaction

All process interaction is by shared memory.

Critical Section. A process with the correct Ring access level may request exclusive control of the processor to perform the equivalent of a Test and Set instruction. When Exclusive Mode is entered, all I/O writes to the requested memory blocks are disabled. This ensures that common memory locations can only be modified by one processor in the system. A timer is enabled when Exclusive Mode is entered that limits the time the processor may remain in Exclusive Mode. This avoids I/O errors (time-outs) in the system. Exclusive Mode is entered by loading the Machine Control Register Exclusive Mode bit with a 1. I/O operations and interrupts are automatically disabled. Issuing I/O commands or interrupts during Exclusive Mode operation results in a machine check.
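The mechanism can be sketched in C as shown here. This is an illustrative model only: the `Core` type, the function names, and the timer limit of 16 cycles are assumptions, and an external write that arrives during Exclusive Mode is simply reported as not performed.

```c
#include <assert.h>

enum { EXCLUSIVE_LIMIT = 16 };   /* hypothetical timer limit, in cycles */

typedef struct {
    int exclusive;   /* models the Machine Control Register Exclusive Mode bit */
    int timer;       /* cycles left before Exclusive Mode is forced off        */
} Core;

static void enter_exclusive(Core *c) { c->exclusive = 1; c->timer = EXCLUSIVE_LIMIT; }
static void leave_exclusive(Core *c) { c->exclusive = 0; c->timer = 0; }

/* One emulated cycle: the timer bounds the time spent in Exclusive Mode. */
static void tick(Core *c)
{
    if (c->exclusive && --c->timer == 0)
        c->exclusive = 0;
}

/* External write to shared memory; returns 1 only if the write was performed. */
static int external_write(const Core *c, int *word, int value)
{
    if (c->exclusive)
        return 0;
    *word = value;
    return 1;
}

/* Equivalent of Test and Set: read the old value and set the lock
 * while no other agent can modify the word. */
static int test_and_set(Core *c, int *lock)
{
    int old;
    enter_exclusive(c);
    old = *lock;  tick(c);   /* test */
    *lock = 1;    tick(c);   /* set  */
    leave_exclusive(c);
    return old;
}
```

The first caller sees 0 and owns the lock; any later caller sees 1 until the owner clears the word.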

1. Note that the mVLIW can be considered to be a 1x0 Mfast processor.



12.3 Integrity

/*********************************************************************/
/* BS: Branch Service Routine                                        */
/* g  BSV/RTS  r01  Branch Supervisor      r=0: Branch               */
/* g  BEX/RTX  r10  Branch Execute         r=1: Return               */
/* g  BEJ/RTJ  r11  Branch Exchange Jump                             */
/*********************************************************************/
/* SMSA: Supervisor Move Special Reg to Arithmetic Reg               */
/* SMAS: Supervisor Move Arithmetic to Special Reg                   */
/* SMBA: Supervisor Move Base Reg to Arithmetic                      */
/* SMAB: Supervisor Move Arithmetic Reg to Base Reg                  */
/* e                                                                 */
/* These instructions only provide a mnemonic name mapping from I/O  */
/* space to the physical address. In all cases, a register           */
/* containing the correct value must be available                    */
/*********************************************************************/
#define SMBA( rt, addr ) \
{ \
    ADDRESS_CHECK(); /* is it in base reg I/O range? */ \
    rt = SpecialRegs(addr); \
    SMBAcnt++; \
    rtClk++; \
}

Code Listing 20. mVLIW Supervisory Instructions

Suppression. Various instructions are suppressed if they attempt to modify restricted registers without the appropriate Ring authority. Instructions are also suppressed if they execute in an invalid mode or if they access memory outside a valid range.

Protection. Integrity is maintained by providing Operating Rings for various tasks. Ring 0 is the highest priority. A user task runs in Ring 3 (the lowest level).

Privileged Operations. Privileged operations are provided for Ring 0 tasks.
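Suppression by ring level can be sketched as below. This is an assumption-laden illustration: `RingCpu`, `supervisor_move`, the four-entry register file and the `suppressed` counter are hypothetical, and the only rules modelled are that a lower ring number means more authority and that an out-of-range register is rejected.

```c
#include <assert.h>

enum { RING_KERNEL = 0, RING_USER = 3 };   /* Ring 0 highest, Ring 3 lowest */
enum { NUM_SPECIAL = 4 };                  /* hypothetical register count   */

typedef struct {
    int      ring;                   /* ring of the running task              */
    unsigned special[NUM_SPECIAL];   /* stand-in for restricted registers     */
    long     suppressed;             /* number of suppressed instructions     */
} RingCpu;

/* Move a value into a restricted register. The instruction executes only if
 * the task's ring is at least as privileged as required_ring and the register
 * index is valid; otherwise it is suppressed. Returns 1 if executed. */
static int supervisor_move(RingCpu *c, int required_ring, unsigned idx, unsigned value)
{
    if (c->ring > required_ring || idx >= (unsigned)NUM_SPECIAL) {
        c->suppressed++;
        return 0;
    }
    c->special[idx] = value;
    return 1;
}
```

A Ring 3 task attempting the move leaves the register unchanged; the same call from Ring 0 succeeds.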

12.4 Control Switching

Interruption. Vectored internal interrupts are provided to handle exceptions within the processor. Interrupt location vectors are mapped into I/O space based on the level of interrupt triggered. During an interrupt, the instruction just fetched is forced to a NOP and the mVLIW is forced into foreground mode, locking the return address into the Instruction Link Register. When a Return From Interrupt is executed, the processor returns to a background state and resumes normal operation. The mVLIW can also initiate an external or internal interrupt by executing the Interrupt instruction. The latter may facilitate debugging.

Supervision
SMSA   Supervisor Move Special Reg to Arith
SMAS   Supervisor Move Arith. to Special Reg
SMBA   Supervisor Move Base Reg to Arithmetic
SMAB   Supervisor Move Arithmetic Reg to Base
INT    Throw Interrupt
BSV    Branch Supervisor
BEX    Branch Execute
BEJ    Branch Exchange Jump
RTI    Return From Interrupt
RTS    Return From Supervisor
RTJ    Return From Exchange Jump

TABLE 12. mVLIW Supervisory Instructions

Privileged Operations
IOR    External I/O Read
IOW    External I/O Write
IOS    External I/O Start
IOX    External I/O Stop
L/S    Load Stores to key memory areas
RTI    Return From Interrupt
RTS    Return From Supervisor
RTJ    Return From Exchange Jump
SMSA   Supervisor Move Special Reg to Arith
SMAS   Supervisor Move Arith. to Special Reg
SMBA   Supervisor Move Base Reg to Arithmetic
SMAB   Supervisor Move Arithmetic Reg to Base
INT    Throw Interrupt
BEX    Branch Execute
BEJ    Branch Exchange Jump

TABLE 13. mVLIW Privileged Operations

Humble Access. The mVLIW provides humble access by the Branch Supervisor instruction. The Rmask field points to a control field for the operating system.

Dispatching. Suspended and new programs can be loaded by the host processor or by the Kernel with Ring Level 0 privilege.

12.5 State Preservation

Context Switching. All contexts must be saved by programming.

Initial Program Load. The mVLIW is designed to function as a co-processor in a PC-based system. The host processor is responsible for loading the Instruction Store with the Ring 0 Kernel and initiating a start-up sequence.

13.0 Input/Output

General I/O is handled by the host x86 processor. Local devices may be memory mapped into the Local I/O space, to which Ring 1 tasks have access. Additionally, a simple channel mechanism is provided to allow the mVLIW processor to communicate with shared memory on the x86 host processor system.

These channels are intended to be primarily DMA channels to an off-chip PCI (Peripheral Component Interconnect) peripheral device.

14.0 Discussion

In designing the mVLIW processor the overriding consideration was performance. It was apparent that the inverse DCT was compute limited, based on analysis of an early mVLIW model running both an MPEG-1 stream (comet-anim2.mpg) and an MPEG-2 stream (test.m2g). The computational complexity of the iDCT strongly encouraged the use of a single-cycle multiplier. The other interesting feature that emerged from the initial model was that many of the calculations were independent of one another. This suggested that the iDCT could be computed in parallel. Therefore, the mVLIW incorporated VLIW concepts to allow multiple execution units to operate in parallel. Additionally, to keep the computational engine balanced, both a load and a store unit were incorporated into the VLIW surrogate instruction. Finally, the surrogate instruction concept was based on the fact that most VLIW devices are expensive because they must have large instruction busses into the chip. To avoid this expense, we decided to store all instructions as 32-bit quantities and program the VLIWs on the fly. This turned out to work very well. For SIMD applications that are embarrassingly parallel, there is a small latency to build the VLIW word, but the word can then be executed many times. In practice the latency is not significant, since the VLIW instructions can be executed as they are programmed. In effect, you fill the pipeline while you are loading the VLIW instructions and then execute them many times.
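The load-then-repeat idea can be sketched in C as follows. This is a toy model under stated assumptions, not the mVLIW's actual surrogate mechanism: the sizes, the `Vliw` type and the `vliw_load` and `vliw_execute` functions are hypothetical, each 32-bit instruction is represented by a function pointer, loading a slot also executes that instruction once, and executing a word runs all of its programmed slots in one emulated cycle.

```c
#include <assert.h>
#include <stddef.h>

enum { SLOTS = 2, SIM_WORDS = 4 };   /* hypothetical sizes */

typedef void (*Op)(int *regs);       /* stand-in for one 32-bit instruction */

typedef struct {
    Op   sim[SIM_WORDS][SLOTS];      /* surrogate instruction memory */
    long clk;                        /* emulated cycles              */
} Vliw;

/* Two independent operations that could share one VLIW word. */
static void op_r0_add1(int *regs) { regs[0] += 1; }
static void op_r1_add2(int *regs) { regs[1] += 2; }

/* Program one slot of a VLIW word; the instruction executes as it is loaded. */
static void vliw_load(Vliw *v, int word, int slot, Op op, int *regs)
{
    assert(word >= 0 && word < SIM_WORDS && slot >= 0 && slot < SLOTS);
    v->sim[word][slot] = op;
    op(regs);
    v->clk++;
}

/* Execute every programmed slot of one VLIW word in a single emulated cycle. */
static void vliw_execute(Vliw *v, int word, int *regs)
{
    int s;
    assert(word >= 0 && word < SIM_WORDS);
    for (s = 0; s < SLOTS; s++)
        if (v->sim[word][s] != NULL)
            v->sim[word][s](regs);
    v->clk++;
}
```

Under these assumptions, four passes over the two operations cost 5 emulated cycles (2 loads plus 3 executes) instead of the 8 cycles needed when each operation is issued on its own.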

The following topics are discussed in a loosely related order. These are the primary issues I dealt with when designing the mVLIW processor.

Interrupts
RTI    Return From Interrupt
INT    Throw Interrupt

TABLE 14. mVLIW Interrupt Instructions

External I/O
IOR    External I/O Read
IOW    External I/O Write
IOS    External I/O Start
IOX    External I/O Stop

TABLE 15. mVLIW External I/O Instructions.



14.1 Addressing

Addressing Mode. There is only one addressing mode defined for the mVLIW processor and it is fairly expensive to implement. Since this is a co-processor, it may not be apparent that this addressing mode is warranted. It should be pointed out that the intended goal of the processor is to offload all processing from the host to the mVLIW processor. Given this assumption, multiple tasks will be running in the system1. With this level of complexity, it is a requirement that common subroutines be capable of being shared. The facility provided allows the most general means of binding the subroutines across processes. Additionally, as multiple mVLIW processors are placed in the same system, this method of addressing allows multiple processors to reside in the same memory map and share memory variables without concern for page boundary crossings or segment limitations.

One other point about the model: the iDCT Load / Store counts were only 16% of the total iDCT operations. This would suggest that both a Load and a Store unit are not required. There are a few reasons for including them. First, the mVLIW MPEG model assumes that the data magically arrives in the mVLIW’s data memory. In reality, the data will need to be block moved there from a DMA device or by the mVLIW processor. Second, the heavy cost of VLIW Surrogate Instruction Memory (SIM) encourages software pipelines to be built. If you need to flush the pipeline to store because only one unit is available, performance will be decreased. Finally, when forming the pipeline you could use twice the number of surrogates (one with a Load and one with a Store) to avoid pipeline flushes, but that is expensive in SIM memory. Therefore, this design adopted both a Load and a Store unit and made them base/index/offset/displacement addressing to allow multiple concurrent processors to share instructions and data.

Displacement Value. Good design practice would dictate that the displacement value be a field in the instruction. For the mVLIW this would make programming VLIW repeat instructions very difficult. Therefore the displacement values have been placed in registers so that they can update automatically every cycle.
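A sketch of why a register-held displacement helps repeated instructions is given below. It rests on two assumptions that this section does not spell out: the effective address is taken to be the plain sum of base, index, offset and displacement, and the displacement register is post-incremented by a fixed stride. The `AddrRegs` type, `stride` field and function names are hypothetical.

```c
#include <assert.h>

typedef struct {
    unsigned base;           /* protected base register                     */
    unsigned index;          /* index register (0 when index mode disabled) */
    unsigned offset;         /* offset component                            */
    unsigned displacement;   /* displacement register, updated per access   */
    unsigned stride;         /* hypothetical per-access update amount       */
} AddrRegs;

/* Returns the effective address, then post-updates the displacement. */
static unsigned effective_address(AddrRegs *r)
{
    unsigned ea = r->base + r->index + r->offset + r->displacement;
    r->displacement += r->stride;
    return ea;
}

/* Loads the next word of a block. The same call can be repeated unchanged,
 * which is what a repeated VLIW load needs. */
static int load_next(const int *mem, unsigned mem_words, AddrRegs *r)
{
    unsigned ea = effective_address(r);
    assert(ea < mem_words);
    return mem[ea];
}
```

Because the instruction itself carries no displacement, the identical load walks through memory on successive executions, and changing only the base register relocates the whole access pattern.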

Base Registers In mVLIW I/O Space. The mVLIW processor has multiple memory spaces. In particular, the base registers (as well as other control functions) are in a separate 64kW memory space which is direct mapped. This design choice allowed for independent instruction and data base registers to be specified. It also allowed for a 16-bit immediate address to all important registers to be specified. These are mapped by an external host interface to one location in the PC’s (x86) I/O space. Since it is intended that there will not be significant host access to these registers except for initial program load, multiple host cycles from the x86 side to write these registers are feasible. This helps to alleviate the already limited, over-crowded host I/O space.

It should be noted that even though the special registers exist in I/O space, special mnemonics can be created for the assembler that map the names to the I/O space. This way, common operations such as LoadPSW can be programmed in assembler and automatically mapped to the correct mVLIW I/O address. Alternatively, the actual PSW address can be written directly in a MoveArithmeticToSpecialRegister instruction.

No Stack Provided. Subroutine calls within a program are provided for by the Branch And Link instruction. All other calls must be made through humble access. I suspect a stack will eventually need to be provided but it was not yet architecturally stable for this processor so it was not included in this project. Currently, all calling routines must save their state if they are to be recursively called.

Indexed Registers. Indexed Registers would provide a means of transferring a block of data to registers from data memory in the same manner that the calculated displacement field facilitates block loads and stores to data memory. This was not included because it is directly in the critical performance path of the processor. Additionally, the need is alleviated somewhat by the concurrent Move in the DSU.

Multi-port Data and Instruction Memories. These were included to facilitate high bandwidth operations and multi-processing capabilities (two processors accessing the same physical memory). They are not too expensive in 0.5um CMOS technology and the size is relatively small (4kB in this case).

1. For example, the Mpeg-2 C-code comprises 105 pages of 6pt font.



Modulo Addressing. Modulo addressing and offset/displacement updates are fairly cumbersome. This is directly the result of going to a protected base address. There were not enough bits to specify any other locations so these were placed in a special mode register. This is a common practice but it is still not architecturally satisfying.

14.2 Branching/Control

Supervision Is Assumed. Based on the size of the MPEG-2 decode C-code, some operating system must be resident to manage the multiple tasks. Given that an operating system must exist, it is prudent to allow certain operations to be performed only by the supervisor. For the mVLIW, the notion of Rings was adopted with extensive checking built into the processor to ensure conformance. This also led to the creation of a separate opcode for the control registers. Even though the opcode space is very limited (only 5 bits), this allowed the register space to be very large and protected at the same time.

Exclusive Mode. A Ring 0 (kernel) program may place the processor in exclusive mode. During this period all external requests for access to mVLIW I/O registers or address space are postponed. This allows test and set semaphores to be atomic when multiple processors are accessing the same memory. A similar mechanism called Foreground/Background mode facilitates interrupt processing.

14.3 Execution

Execution Unit Specific Operations. By providing for dual half word operations, implementation constraints preclude providing all arithmetic computations in all execution units. This manifests itself by allowing for 64-bit additions in the multiplier which has an inherent 64-bit accumulator but only allowing for 32-bit operations in the alu.

Double Word Implied Register Specification. A lack of bits in the 32-bit instruction necessitated the use of an implied destination register pair for 64-bit double word operations. This is an example of impropriety where the performance benefits were too great to leave the operation out of the machine.

5-bit Immediate Execute Field. I was tempted to leave immediate execution out of the architecture. I felt the 5-bit field was too small. However, looking at the iDCT code convinced me it would still be useful since many operations incremented by 1 to 8.

14.4 Very Long Word Instructions

Branching. Branch instructions may not be included as VLIW execution slots. They are executed in simplex style. I could find no architecturally clean way of doing this that did not cause machine races. To avoid some of the MIMD nature of certain applications, a conditional move has been provided.
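As an illustration of how a conditional move can stand in for a short branch, the sketch below clamps a value into a range using only compares and conditional moves. The function names are hypothetical, and clamping is used only as a familiar example; this section does not say which code uses the conditional move.

```c
#include <assert.h>

/* Conditional move: the result is rs when cond is non-zero, otherwise rd.
 * No change of instruction address is involved. */
static int cond_move(int cond, int rd, int rs)
{
    return cond ? rs : rd;
}

/* Clamp v into [lo, hi] with two compares and two conditional moves. */
static int clamp(int v, int lo, int hi)
{
    assert(lo <= hi);
    v = cond_move(v < lo, v, lo);
    v = cond_move(v > hi, v, hi);
    return v;
}
```

Every input follows the same instruction sequence, so the operation fits a fixed VLIW word where a data-dependent branch would not.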

Conditional Execute. A generalized form of the Conditional Move is the Conditional Execute Instruction. This was not included because it affects the critical path for every instruction. Further research may warrant including this instruction on future machines based upon branch frequencies of key algorithms. The MPEG-2 iDCT is totally unrolled and so it did not need this elaborate capability.

14.5 Miscellaneous

Heritage. This processor is the result of taking a scalable parallel embedded processor and attempting to make a code-compatible uni-processor. This objective has been achieved; however, significant redesign of the original control sequencer for the IBM Mfast processor is required. In addition, a reserved field is required in the mVLIW instruction, which made bit specification much more difficult than it would have been for a single-processor architecture.



I/O Channel. A simple DMA channel mechanism has been provided. A separate processor did not seem to be warranted since an easy solution is to provide a memory mapped Local I/O for a peripheral interface to place the digitized MPEG-2 data. Alternatively, the host PC can DMA the data from a hard-drive to the mVLIW Local I/O memory mapped locations for decompression and display.

Expansion. The mVLIW Instruction Opcode field is very small; however, many instructions are still possible. A good portion of the opcodes are already used, with only 6 remaining. This may make it difficult to generally specify alternate execution units (especially if immediate execution is provided). However, one obvious enhancement that could be specified is an additional Load and Store unit. These require only 1 opcode each and would provide twice the bandwidth for the machine. With these and an additional ALU, only one additional opcode specification is possible but all eight units can be specified.

15.0 Calculated IDCT Performance

Table 16, “MPEG Decoder Execution Time Distributions,” on page 32 is calculated based upon the sample code given in Section 16.4, "MPEG-2 Inverse Discrete Cosine Inner Loop" on page 33.

16.0 Code Examples

To characterize the mVLIW performance on MPEG-2 decode, a C version of the algorithm was obtained from the Internet. This code was then modified by creating C-macros that emulated many of the mVLIW instructions. Instruction counting was performed by incrementing a counter variable in the C-code every time an instruction was emulated.
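The counting technique can be sketched as follows. The macro bodies and counter names here are simplified stand-ins chosen for illustration; the emulation macros actually used in this project are defined in a header that is not reproduced in this report.

```c
#include <assert.h>

/* Stand-in counters: one per emulated instruction plus a cycle count. */
static long ADDWcnt, MPYWcnt, rtClk;

/* Each macro performs the operation in C and records that it was issued. */
#define ADDW(rt, ra, rb) do { (rt) = (ra) + (rb); ADDWcnt++; rtClk++; } while (0)
#define MPYW(rt, ra, rb) do { (rt) = (ra) * (rb); MPYWcnt++; rtClk++; } while (0)

/* x8 = w7 * (x4 + x5), written with the emulation macros. */
static int scaled_sum(int w7, int x4, int x5)
{
    int tmp, x8;
    ADDW(tmp, x4, x5);   /* 32b ALU add    */
    MPYW(x8, w7, tmp);   /* 32x32 multiply */
    return x8;
}
```

After one call, the counters show one add, one multiply and two emulated cycles, which is the kind of tally the execution time distributions are built from.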

16.1 Simple Subroutine Call

/************************************************************************/
/* Simple Subroutine Call                                               */
/************************************************************************/
BranchAndLink($VectorReduceVLIW);   /* Call same file scope subroutine */

/* See Section 16.3, "Inner Product (Reduce 2 vectors) VLIW Version" on page 33 */
$VectorReduceVLIW:
    /* insert code here */

Code Listing 21. Simple Subroutine Call

16.2 Inner Product (Reduce 2 vectors)

/************************************************************************/
/* nVector Times a nVector                                              */
/* - We're the supervisor for this example                              */
/************************************************************************/
SupervisorMoveArithmeticToSpecial(DisableIndexMode);               /* force Rindex to 0 */
MoveImmediate16(RegCount, VectorLength);                           /* branch count */
$loop:
LoadBaseIndirectWord0(RbaseV1, RoffsetDisplacementV1, RegVec1);    /* get first vector element V1 */
LoadBaseIndirectWord0(RbaseV2, RoffsetDisplacementV2, RegVec2);    /* get first vector element V2 */
MultiplyAccumulate(ResultReg, RegVec1, RegVec2);                   /* multiply */
AluSubtractImmediate(RegCount, 1);                                 /* Subtract 1 from count */
BranchDirectAluRelative( AluNotZero, $loop);                       /* assembler can figure out offset */
StoreBaseIndirectWord0(RbaseV3, RoffsetDisplacementV3, ResultReg);
Return;

Code Listing 22. Inner Product Reduce 2 Vectors

IDCT Function    MPEG-2a
Alu              49%
Dsu              21%
Mau              15%
Load/Store       16%
MOPS             127

a. Based on file test.m2v

TABLE 16. MPEG Decoder Execution Time Distributions

16.3 Inner Product (Reduce 2 vectors) VLIW Version

/************************************************************************/
/* nVector Times a nVector: VLIW Version                                */
/* - We're the supervisor for this example                              */
/************************************************************************/
SupervisorMoveArithmeticToSpecial(DisableIndexMode);   /* force Rindex to 0 */
VliwDelimiterInstruction( ExecuteAll, Addr=0 )
{
    LoadBaseIndirectWord0(RbaseV1, RoffsetDisplacementV1, RegVec1);   /* get first vector element V1 */
    LoadBaseIndirectWord0(RbaseV2, RoffsetDisplacementV2, RegVec2);   /* get first vector element V2 */
}   /* Second one remains */
VliwDelimiterInstruction( ExecuteAll, Addr=1 )
{
    LoadBaseIndirectWord0(RbaseV1, RoffsetDisplacementV1, RegVec1);   /* get first vector element V1 */
    MultiplyAccumulate(ResultReg, RegVec1, RegVec2);                  /* multiply */
}
#define UNROLL_LOOP(3)   /* pre-processor loop unroller */
{
    VliwExecute( Address1 );   /* unroll for n elements */
    VliwExecute( Address0 );   /* note: index registers would be nice here */
}
StoreBaseIndirectWord0(RbaseV3, RoffsetDisplacementV3, ResultReg);   /* can branch loop on VLIW */
Return;

Code Listing 23. Inner Product Reduce 2 Vectors VLIW Version
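For reference, the computation that both inner product listings perform can be written in portable C as below. This is a plain reference version, not mVLIW code; the function name is hypothetical, and the 64-bit accumulator is chosen to mirror the multiplier's 64-bit accumulator mentioned in Section 14.3.

```c
#include <assert.h>
#include <stddef.h>

/* Reference inner product: the sum of v1[i] * v2[i] over n elements,
 * accumulated in a 64-bit value. */
static long long inner_product(const int *v1, const int *v2, size_t n)
{
    long long acc = 0;
    size_t i;
    assert(v1 != NULL && v2 != NULL);
    for (i = 0; i < n; i++)
        acc += (long long)v1[i] * v2[i];   /* multiply-accumulate */
    return acc;
}
```

A reference of this kind is useful for checking the result left in ResultReg by an emulated run.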

16.4 MPEG-2 Inverse Discrete Cosine Inner Loop

The code shown in Figure 24, "Original idct code for Architecture Verification" is one of many versions that were created to investigate the performance of the architecture. This particular version has some instructions that no longer exist, and it is still in mnemonic form, but it is representative of the types of features that were tried.
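One detail of the listing that follows is worth isolating: the first stage computes each pair of outputs with three multiplies instead of four by sharing the product x8. The sketch below checks that identity in plain C. The function names are hypothetical, and the four-multiply form is written out only for comparison.

```c
#include <assert.h>

/* Three-multiply form, as used in stage 1:
 *   x8 = w7*(x4+x5);  x4' = x8 + (w1-w7)*x4;  x5' = x8 - (w1+w7)*x5; */
static void rotate3(int w1, int w7, int *x4, int *x5)
{
    int x8 = w7 * (*x4 + *x5);
    int a  = x8 + (w1 - w7) * (*x4);   /* equals w1*x4 + w7*x5 */
    int b  = x8 - (w1 + w7) * (*x5);   /* equals w7*x4 - w1*x5 */
    *x4 = a;
    *x5 = b;
}

/* Direct four-multiply form of the same pair of outputs. */
static void rotate4(int w1, int w7, int *x4, int *x5)
{
    int a = w1 * (*x4) + w7 * (*x5);
    int b = w7 * (*x4) - w1 * (*x5);
    *x4 = a;
    *x5 = b;
}
```

The two forms give identical integer results, since the terms in w7*x4 and w7*x5 cancel exactly; w1 - w7 and w1 + w7 are differences and sums of constants.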

/*****************************************************************************/
/* idct.c: Mfast IDCT - Sequencer only version                               */
/*****************************************************************************/
/* Author: John Glossner                                                     */
/*         IBM Mwave                                                         */
/*         Dept. D50                                                         */
/*         IBM RTP                                                           */
/*****************************************************************************/
/* NOTES:                                                                    */
/*   reference: MPEG Software Simulation Group idct.c                        */
/*****************************************************************************/
/* STRUCTS DEFINED:                                                          */
/*****************************************************************************/
/* 03/10/95: Original Coding JG                                              */
/*****************************************************************************/

/**********************************************************/
/* implemented in Mfast assembler by John Glossner        */
/**********************************************************/
/* inverse two dimensional DCT, Chen-Wang algorithm       */
/* (cf. IEEE ASSP-32, pp. 803-816, Aug. 1984)             */
/* 32-bit integer arithmetic (8 bit coefficients)         */
/* 11 mults, 29 adds per DCT                              */
/*                                      sE, 18.8.91       */
/**********************************************************/
/* coefficients extended to 12 bit for IEEE1180-1990      */
/* compliance                           sE,  2.1.94       */
/**********************************************************/

/* comment out to avoid comparisons */
#define COMPARE_REF
/* #define USE_REF_IDCT */

/* W1 = 2048*sqrt(2)*cos(1*pi/16) */
/* W2 = 2048*sqrt(2)*cos(2*pi/16) */
/* W3 = 2048*sqrt(2)*cos(3*pi/16) */
/* W5 = 2048*sqrt(2)*cos(5*pi/16) */
/* W6 = 2048*sqrt(2)*cos(6*pi/16) */
/* W7 = 2048*sqrt(2)*cos(7*pi/16) */
#define W1 2841
#define W2 2676
#define W3 2408
#define W5 1609
#define W6 1108
#define W7 565

#define COMPARE_ROWS_STAGE1
#define COMPARE_ROWS_STAGE2
#define COMPARE_ROWS_STAGE3

#define COMPARE_COLS_STAGE1
#define COMPARE_COLS_STAGE2
#define COMPARE_COLS_STAGE3

#define MASK 0xFFFFFFFE /* may want to mask LSB = FFFFFFFE */

/* this code assumes >> to be a two's-complement arithmetic */
/* right shift: (-2)>>1 == -1 , (-3)>>1 == -2               */

#include "config.h"
#include "mf1x0.h" /* mfast sequencer include */

/* global declarations */
void init_idct _ANSI_ARGS_((void));
void idct _ANSI_ARGS_((short *block));

/* private data */
static short iclip[1024]; /* clipping table */
static short *iclp;

/* private prototypes */
static void idctrow _ANSI_ARGS_((short *blk));
static void idctcol _ANSI_ARGS_((short *blk));

/*\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\*/
/* row (horizontal) IDCT                                                     */
/*\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\*/
/*****************************************************************************/
/*                                                                           */
/*           7                       pi         1                            */
/* dst[k] = sum c[l] * src[l] * cos( -- * ( k + - ) * l )                    */
/*          l=0                      8          2                            */
/*                                                                           */
/* where: c[0]    = 128                                                      */
/*        c[1..7] = 128*sqrt(2)                                              */
/*****************************************************************************/
static void idctrow(blk)
short *blk;
{
#ifdef COMPARE_REF
  int x0, x1, x2, x3, x4, x5, x6, x7, x8;
  if (!((x1 = blk[4]<<11) | (x2 = blk[6]) | (x3 = blk[2]) |
        (x4 = blk[1]) | (x5 = blk[7]) | (x6 = blk[5]) | (x7 = blk[3])))
  {
/*  blk[0]=blk[1]=blk[2]=blk[3]=blk[4]=blk[5]=blk[6]=blk[7]=blk[0]<<3;
    return; */
  }
#endif

  /* shortcut */
  if (!((sp[0].regs.x[1] = blk[4]<<11) | (sp[0].regs.x[2] = blk[6]) |
        (sp[0].regs.x[3] = blk[2]) | (sp[0].regs.x[4] = blk[1]) |
        (sp[0].regs.x[5] = blk[7]) | (sp[0].regs.x[6] = blk[5]) |
        (sp[0].regs.x[7] = blk[3])))
  {
    blk[0]=blk[1]=blk[2]=blk[3]=blk[4]=blk[5]=blk[6]=blk[7]=blk[0]<<3;
    return;
  }

  /* setup */
  SSASR(0,0);
  SSMSR(0);

  sp[0].regs.x[0] = (blk[0]<<11) + 128; /* for proper rounding in the fourth stage */

/**************************/
/* S T A G E 1            */
/**************************/
#ifdef COMPARE_REF
  x0 = (blk[0]<<11) + 128; /* for proper rounding in the fourth stage */
  /* first stage */
  x8 = W7*(x4+x5);
  x4 = x8 + (W1-W7)*x4;
  x5 = x8 - (W1+W7)*x5;
  x8 = W3*(x6+x7);
  x6 = x8 - (W3-W5)*x6;
  x7 = x8 - (W3+W5)*x7;
#endif

/*------------------------------*/
/* x8 = W7*(x4+x5);             */
/*------------------------------*/
/* sp[0].regs.x[8] = sp[0].regs.c7*(sp[0].regs.x[4]+sp[0].regs.x[5]); */
  ADDW(sp[0].regs.tmp, sp[0].regs.x[4], sp[0].regs.x[5]);  /* 32b ALU Add word */
  MPYW(sp[0].regs.x[8], sp[0].regs.c7, sp[0].regs.tmp);    /* 32x32 multiply */

/*------------------------------*/
/* x4 = x8 + (W1-W7)*x4;        */
/*------------------------------*/
  SUBW(sp[0].regs.tmp, sp[0].regs.c1, sp[0].regs.c7);      /* 32b ALU subtract */
  MPYW(sp[0].regs.tmp, sp[0].regs.x[4], sp[0].regs.tmp);   /* 32x32 multiply */
  ADDW(sp[0].regs.x[4], sp[0].regs.x[8], sp[0].regs.tmp);  /* 32b ALU Add word */

/*------------------------------*/
/* x5 = x8 - (W1+W7)*x5;        */
/*------------------------------*/
  ADDW(sp[0].regs.tmp, sp[0].regs.c1, sp[0].regs.c7);
  MPYW(sp[0].regs.tmp, sp[0].regs.tmp, sp[0].regs.x[5]);
  SUBW(sp[0].regs.x[5], sp[0].regs.x[8], sp[0].regs.tmp);

/*------------------------------*/
/* x8 = W3*(x6+x7);             */
/*------------------------------*/
  ADDW(sp[0].regs.tmp, sp[0].regs.x[6], sp[0].regs.x[7]);
  MPYW(sp[0].regs.x[8], sp[0].regs.c3, sp[0].regs.tmp);

/*------------------------------*/
/* x6 = x8 - (W3-W5)*x6;        */
/*------------------------------*/
  SUBW(sp[0].regs.tmp, sp[0].regs.c3, sp[0].regs.c5);
  MPYW(sp[0].regs.tmp, sp[0].regs.tmp, sp[0].regs.x[6]);
  SUBW(sp[0].regs.x[6], sp[0].regs.x[8], sp[0].regs.tmp);


#ifdef COMPARE_ROWS_STAGE1
  /* bitwise & (not logical &&) so that only the masked LSB is ignored */
  if( (x0 & MASK) != (sp[0].regs.x[0] & MASK) ) printf("\nERROR: IDCT Rows, Stage1 -> x0 != x[0]");
  if( (x1 & MASK) != (sp[0].regs.x[1] & MASK) ) printf("\nERROR: IDCT Rows, Stage1 -> x1 != x[1]");
  if( (x2 & MASK) != (sp[0].regs.x[2] & MASK) ) printf("\nERROR: IDCT Rows, Stage1 -> x2 != x[2]");
  if( (x3 & MASK) != (sp[0].regs.x[3] & MASK) ) printf("\nERROR: IDCT Rows, Stage1 -> x3 != x[3]");
  if( (x4 & MASK) != (sp[0].regs.x[4] & MASK) ) printf("\nERROR: IDCT Rows, Stage1 -> x4 != x[4]");
  if( (x5 & MASK) != (sp[0].regs.x[5] & MASK) ) printf("\nERROR: IDCT Rows, Stage1 -> x5 != x[5]");
  if( (x6 & MASK) != (sp[0].regs.x[6] & MASK) ) printf("\nERROR: IDCT Rows, Stage1 -> x6 != x[6]");
  if( (x7 & MASK) != (sp[0].regs.x[7] & MASK) ) printf("\nERROR: IDCT Rows, Stage1 -> x7 != x[7]");
  if( (x8 & MASK) != (sp[0].regs.x[8] & MASK) ) printf("\nERROR: IDCT Rows, Stage1 -> x8 != x[8]");
#endif

/**************************/
/* S T A G E 2            */
/**************************/
#ifdef COMPARE_REF
  x8 = x0 + x1;
  x0 -= x1;
  x1 = W6*(x3+x2);
  x2 = x1 - (W2+W6)*x2;
  x3 = x1 + (W2-W6)*x3;
  x1 = x4 + x6;
  x4 -= x6;
  x6 = x5 + x7;
  x5 -= x7;
#endif

/*------------------------------*/
/* x8 = x0 + x1;                */
/*------------------------------*/
  ADDW(sp[0].regs.x[8], sp[0].regs.x[0], sp[0].regs.x[1]);

/*------------------------------*/
/* x0 -= x1;                    */
/*------------------------------*/
  SUBW(sp[0].regs.x[0], sp[0].regs.x[0], sp[0].regs.x[1]);

/*------------------------------*/
/* x1 = W6*(x3+x2);             */
/*------------------------------*/
  ADDW(sp[0].regs.tmp, sp[0].regs.x[3], sp[0].regs.x[2]);
  MPYW(sp[0].regs.x[1], sp[0].regs.c6, sp[0].regs.tmp);

/*------------------------------*/
/* x3 = x1 + (W2-W6)*x3;        */
/*------------------------------*/
  SUBW(sp[0].regs.tmp, sp[0].regs.c2, sp[0].regs.c6);
  MPYW(sp[0].regs.tmp, sp[0].regs.tmp, sp[0].regs.x[3]);
  ADDW(sp[0].regs.x[3], sp[0].regs.x[1], sp[0].regs.tmp);






/**************************/
/* S T A G E 3            */
/**************************/
#ifdef COMPARE_REF
  x7 = x8 + x3;
  x8 -= x3;
  x3 = x0 + x2;
  x0 -= x2;
  x2 = (181*(x4+x5)+128)>>8;
  x4 = (181*(x4-x5)+128)>>8;
#endif

/*------------------------------*/
/* x2 = (181*(x4+x5)+128)>>8;   */
/*------------------------------*/
  ADDW(sp[0].regs.tmp, sp[0].regs.x[4], sp[0].regs.x[5]);
  MPY0I(sp[0].regs.tmp, sp[0].regs.tmp, 181);
  SSASR(8,1);                                     /* scale by 8, shift right */
  ADDW0I(sp[0].regs.x[2], sp[0].regs.tmp, 128);   /* add 0 extended 8b imm */

/*------------------------------*/
/* x4 = (181*(x4-x5)+128)>>8;   */
/*------------------------------*/
  SSASR(0,0);
  SUBW(sp[0].regs.tmp, sp[0].regs.x[4], sp[0].regs.x[5]);
  MPY0I(sp[0].regs.tmp, sp[0].regs.tmp, 181);
  SSASR(8,1);                                     /* scale by 8 */
  ADDW0I(sp[0].regs.x[4], sp[0].regs.tmp, 128);   /* add 0 extended 8b imm */


/**************************/
/* S T A G E 4            */
/**************************/
#ifdef USE_REF_IDCT
  blk[0] = (x7+x1)>>8;
  blk[1] = (x3+x2)>>8;
  blk[2] = (x0+x4)>>8;
  blk[3] = (x8+x6)>>8;
  blk[4] = (x8-x6)>>8;
  blk[5] = (x0-x4)>>8;
  blk[6] = (x3-x2)>>8;
  blk[7] = (x7-x1)>>8;

#else
/*------------------------------*/
/* blk[0] = (x7+x1)>>8;         */
/*------------------------------*/
  ADDW(sp[0].regs.tmp, sp[0].regs.x[7], sp[0].regs.x[1]);
  blk[0] = sp[0].regs.tmp;

/*------------------------------*/
/* blk[1] = (x3+x2)>>8;         */
/*------------------------------*/
  ADDW(sp[0].regs.tmp, sp[0].regs.x[3], sp[0].regs.x[2]);
  blk[1] = sp[0].regs.tmp;

/*------------------------------*/
/* blk[2] = (x0+x4)>>8;         */
/*------------------------------*/
  ADDW(sp[0].regs.tmp, sp[0].regs.x[0], sp[0].regs.x[4]);
  blk[2] = sp[0].regs.tmp;

/*------------------------------*/
/* blk[3] = (x8+x6)>>8;         */
/*------------------------------*/
  ADDW(sp[0].regs.tmp, sp[0].regs.x[8], sp[0].regs.x[6]);
  blk[3] = sp[0].regs.tmp;

/*------------------------------*/
/* blk[4] = (x8-x6)>>8;         */
/*------------------------------*/
  SUBW(sp[0].regs.tmp, sp[0].regs.x[8], sp[0].regs.x[6]);
  blk[4] = sp[0].regs.tmp;

  /* clean up code */
  SSASR(0,0);
#endif

}

/*\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\*//* col (vertical) IDCT *//*\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\*//*****************************************************************************//* *//* 7 pi 1 *//* dst[8*k] = sum c[l] * src[8*l] * cos( -- * ( k + - ) * l ) *//* l=0 8 2 *//* *//* where: c[0] = 1/1024 *//* c[1..7] = (1/1024)*sqrt(2) *//*****************************************************************************/static void idctcol(blk)short *blk;{#ifdef COMPARE_REF int x0, x1, x2, x3, x4, x5, x6, x7, x8;

/* shortcut */ if (!((x1 = (blk[8*4]<<8)) | (x2 = blk[8*6]) | (x3 = blk[8*2]) | (x4 = blk[8*1]) | (x5 = blk[8*7]) | (x6 = blk[8*5]) | (x7 = blk[8*3]))) {/* blk[8*0]=blk[8*1]=blk[8*2]=blk[8*3]=blk[8*4]=blk[8*5]=blk[8*6]=blk[8*7]= iclp[(blk[8*0]+32)>>6]; return;*/ }#endif

  /* shortcut */
  if (!((sp[0].regs.x[1] = (blk[8*4]<<8)) | (sp[0].regs.x[2] = blk[8*6]) |
        (sp[0].regs.x[3] = blk[8*2]) | (sp[0].regs.x[4] = blk[8*1]) |
        (sp[0].regs.x[5] = blk[8*7]) | (sp[0].regs.x[6] = blk[8*5]) |
        (sp[0].regs.x[7] = blk[8*3])))
  {
    blk[8*0]=blk[8*1]=blk[8*2]=blk[8*3]=blk[8*4]=blk[8*5]=blk[8*6]=blk[8*7]=
      iclp[(blk[8*0]+32)>>6];
    return;
  }

sp[0].regs.x[0] = (blk[8*0]<<8) + 8192;

  /* set up code */
  SSASR(0,0);
  SSMSR(0);

/**************************/
/* S T A G E 1 */
/**************************/
#ifdef COMPARE_REF
  x0 = (blk[8*0]<<8) + 8192;
  x8 = W7*(x4+x5) + 4;
  x4 = (x8+(W1-W7)*x4)>>3;
  x5 = (x8-(W1+W7)*x5)>>3;
  x8 = W3*(x6+x7) + 4;
  x6 = (x8-(W3-W5)*x6)>>3;
  x7 = (x8-(W3+W5)*x7)>>3;
#endif

  /*------------------------------*/
  /* x8 = W7*(x4+x5) + 4; */
  /*------------------------------*/
  ADDW(sp[0].regs.tmp, sp[0].regs.x[4], sp[0].regs.x[5]);
  MAUWI(sp[0].regs.x[8], sp[0].regs.tmp, sp[0].regs.c7, 4);

  /*------------------------------*/
  /* x4 = (x8+(W1-W7)*x4)>>3; */
  /*------------------------------*/
  SUBW(sp[0].regs.tmp, sp[0].regs.c1, sp[0].regs.c7);
  MPYW(sp[0].regs.tmp, sp[0].regs.tmp, sp[0].regs.x[4]);
  SSASR(3,1); /* shift right 3 */
  ADDW(sp[0].regs.x[4], sp[0].regs.x[8], sp[0].regs.tmp);
  SSASR(0,0);

  /*------------------------------*/
  /* x5 = (x8-(W1+W7)*x5)>>3; */
  /*------------------------------*/
  ADDW(sp[0].regs.tmp, sp[0].regs.c1, sp[0].regs.c7);
  MPYW(sp[0].regs.tmp, sp[0].regs.tmp, sp[0].regs.x[5]);
  SSASR(3,1); /* shift right 3 */
  SUBW(sp[0].regs.x[5], sp[0].regs.x[8], sp[0].regs.tmp);
  SSASR(0,0);


  /*------------------------------*/
  /* x6 = (x8-(W3-W5)*x6)>>3; */
  /*------------------------------*/
  SUBW(sp[0].regs.tmp, sp[0].regs.c3, sp[0].regs.c5);
  MPYW(sp[0].regs.tmp, sp[0].regs.tmp, sp[0].regs.x[6]);
  SSASR(3,1); /* shift right 3 */
  SUBW(sp[0].regs.x[6], sp[0].regs.x[8], sp[0].regs.tmp);
  SSASR(0,0);


#ifdef COMPARE_COLS_STAGE1
  /* compare the reference values with the emulated registers under MASK */
  if( (x0 & MASK) != (sp[0].regs.x[0] & MASK) )
    printf("\nERROR: IDCT Cols, Stage1 -> x0 != x[0]");
  if( (x1 & MASK) != (sp[0].regs.x[1] & MASK) )
    printf("\nERROR: IDCT Cols, Stage1 -> x1 != x[1]");
  if( (x2 & MASK) != (sp[0].regs.x[2] & MASK) )
    printf("\nERROR: IDCT Cols, Stage1 -> x2 != x[2]");
  if( (x3 & MASK) != (sp[0].regs.x[3] & MASK) )
    printf("\nERROR: IDCT Cols, Stage1 -> x3 != x[3]");
  if( (x4 & MASK) != (sp[0].regs.x[4] & MASK) )
    printf("\nERROR: IDCT Cols, Stage1 -> x4 != x[4]");
  if( (x5 & MASK) != (sp[0].regs.x[5] & MASK) )
    printf("\nERROR: IDCT Cols, Stage1 -> x5 != x[5]");
  if( (x6 & MASK) != (sp[0].regs.x[6] & MASK) )
    printf("\nERROR: IDCT Cols, Stage1 -> x6 != x[6]");
  if( (x7 & MASK) != (sp[0].regs.x[7] & MASK) )
    printf("\nERROR: IDCT Cols, Stage1 -> x7 != x[7]");
  if( (x8 & MASK) != (sp[0].regs.x[8] & MASK) )
    printf("\nERROR: IDCT Cols, Stage1 -> x8 != x[8]");
#endif

/**************************/
/* S T A G E 2 */
/**************************/
#ifdef COMPARE_REF
  x8 = x0 + x1;
  x0 -= x1;
  x1 = W6*(x3+x2) + 4;
  x2 = (x1-(W2+W6)*x2)>>3;
  x3 = (x1+(W2-W6)*x3)>>3;
  x1 = x4 + x6;
  x4 -= x6;
  x6 = x5 + x7;
  x5 -= x7;
#endif





  /*------------------------------*/
  /* x3 = (x1+(W2-W6)*x3)>>3; */
  /*------------------------------*/
  SUBW(sp[0].regs.tmp, sp[0].regs.c2, sp[0].regs.c6);
  MPYW(sp[0].regs.tmp, sp[0].regs.tmp, sp[0].regs.x[3]);
  SSASR(3,1); /* shift right 3 */
  ADDW(sp[0].regs.x[3], sp[0].regs.x[1], sp[0].regs.tmp);
  SSASR(0,0);






/**************************/
/* S T A G E 3 */
/**************************/
#ifdef COMPARE_REF
  x7 = x8 + x3;
  x8 -= x3;
  x3 = x0 + x2;
  x0 -= x2;
  x2 = (181*(x4+x5)+128)>>8;
  x4 = (181*(x4-x5)+128)>>8;
#endif





  /*------------------------------*/
  /* x2 = (181*(x4+x5)+128)>>8; */
  /*------------------------------*/
  ADDW(sp[0].regs.tmp, sp[0].regs.x[4], sp[0].regs.x[5]);
  MPY0I(sp[0].regs.tmp, sp[0].regs.tmp, 181);
  SSASR(8,1);
  ADDW0I(sp[0].regs.x[2], sp[0].regs.tmp, 128);
  SSASR(0,0);

  /*------------------------------*/
  /* x4 = (181*(x4-x5)+128)>>8; */
  /*------------------------------*/
  SUBW(sp[0].regs.tmp, sp[0].regs.x[4], sp[0].regs.x[5]);
  MPY0I(sp[0].regs.tmp, sp[0].regs.tmp, 181);
  SSASR(8,1);
  ADDW0I(sp[0].regs.x[4], sp[0].regs.tmp, 128);


  /* fourth stage */
#ifdef USE_REF_IDCT
  blk[8*0] = iclp[(x7+x1)>>14];
  blk[8*1] = iclp[(x3+x2)>>14];
  blk[8*2] = iclp[(x0+x4)>>14];
  blk[8*3] = iclp[(x8+x6)>>14];
  blk[8*4] = iclp[(x8-x6)>>14];
  blk[8*5] = iclp[(x0-x4)>>14];
  blk[8*6] = iclp[(x3-x2)>>14];
  blk[8*7] = iclp[(x7-x1)>>14];
#else
  /*--------------------------------*/
  /* blk[8*0] = iclp[(x7+x1)>>14]; */
  /*--------------------------------*/
  SSASR(14,1);
  ADDW(sp[0].regs.tmp, sp[0].regs.x[7], sp[0].regs.x[1]);
  blk[8*0] = iclp[ sp[0].regs.tmp ];

  /*--------------------------------*/
  /* blk[8*1] = iclp[(x3+x2)>>14]; */
  /*--------------------------------*/
  ADDW(sp[0].regs.tmp, sp[0].regs.x[3], sp[0].regs.x[2]);
  blk[8*1] = iclp[ sp[0].regs.tmp ];

  /*--------------------------------*/
  /* blk[8*2] = iclp[(x0+x4)>>14]; */
  /*--------------------------------*/
  ADDW(sp[0].regs.tmp, sp[0].regs.x[0], sp[0].regs.x[4]);
  blk[8*2] = iclp[ sp[0].regs.tmp ];

  /*--------------------------------*/
  /* blk[8*3] = iclp[(x8+x6)>>14]; */
  /*--------------------------------*/
  ADDW(sp[0].regs.tmp, sp[0].regs.x[8], sp[0].regs.x[6]);
  blk[8*3] = iclp[ sp[0].regs.tmp ];

  /*--------------------------------*/
  /* blk[8*4] = iclp[(x8-x6)>>14]; */
  /*--------------------------------*/
  SUBW(sp[0].regs.tmp, sp[0].regs.x[8], sp[0].regs.x[6]);
  blk[8*4] = iclp[ sp[0].regs.tmp ];




#endif

/* clean up */ SSASR(0,0);

}

/* two dimensional inverse discrete cosine transform */
void idct(block)
short *block;
{
  int i;

  /* row pass over the 8 rows, then column pass over the 8 columns */
  for (i=0; i<8; i++)
    idctrow(block+8*i);

  for (i=0; i<8; i++)
    idctcol(block+i);
}

void init_idct()
{
  int i;

  /* clip table: iclp[-512..511] saturates to the range -256..255 */
  iclp = iclip+512;
  for (i= -512; i<512; i++)
    iclp[i] = (i<-256) ? -256 : ((i>255) ? 255 : i);

  /* load cosine table coeffs: Wk = 2048*sqrt(2)*cos(k*pi/16), rounded */
  LIP0(sp[0].regs.c1, 2841);
  LIP0(sp[0].regs.c2, 2676);
  LIP0(sp[0].regs.c3, 2408);
  LIP0(sp[0].regs.c5, 1609);
  LIP0(sp[0].regs.c6, 1108);
  LIP0(sp[0].regs.c7, 565);

  SSASR(0,0); /* Sequencer Alu Scaling Reg = 0 */
  SSMSR(0);   /* Sequencer Mau Scaling Reg = 0 */
}

Code Listing 24. Original idct code for Architecture Verification
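
The clip table set up in init_idct and the DC-only shortcut at the top of idctcol can be exercised without the emulator macros. The sketch below is a minimal illustration and not part of the original listing; the helper names init_clip, dc_full_path and dc_only_col are hypothetical. It assumes that >> on a negative int is an arithmetic shift, as the listing itself does. With all AC inputs of a column equal to zero, the four stages reduce to x0>>14 with x0 = (blk[8*0]<<8) + 8192, which is the same value as (blk[8*0]+32)>>6 used by the shortcut.

```c
#include <assert.h>

/* Clip table laid out as in init_idct(): iclp[-512..511] saturates to -256..255. */
static short iclip[1024];
static short *iclp;

/* Hypothetical helper: only the table setup from init_idct(), without the
   LIP0/SSASR/SSMSR emulator calls. */
static void init_clip(void)
{
  int i;

  iclp = iclip + 512;
  for (i = -512; i < 512; i++)
    iclp[i] = (short)((i < -256) ? -256 : ((i > 255) ? 255 : i));
}

/* Hypothetical helper: the value the four stages produce for a column whose
   AC inputs are all zero, i.e. x0 >> 14 with x0 = (b << 8) + 8192.
   Written with a multiply to avoid left-shifting a negative value. */
static int dc_full_path(int b)
{
  return (b * 256 + 8192) >> 14;
}

/* Hypothetical helper: the DC-only shortcut of idctcol() in plain C.
   Returns 1 and fills the column when blk[8*1..8*7] are all zero,
   otherwise returns 0 and leaves the column untouched. */
static int dc_only_col(short *blk)
{
  int k, idx;

  for (k = 1; k < 8; k++)
    if (blk[8*k] != 0)
      return 0;

  idx = (blk[8*0] + 32) >> 6;
  assert(idx >= -512 && idx < 512);   /* stay inside the clip table */
  for (k = 0; k < 8; k++)
    blk[8*k] = iclp[idx];
  return 1;
}
```

After init_clip(), a column holding blk[8*0] = 640 and zeros elsewhere is filled with 10, since (640+32)>>6 = 10, and dc_full_path(640) gives the same result. Inputs far outside the pixel range saturate through the table to -256 or 255.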



17.0 Responsibilities
This section delineates the work completed. I performed all of the performance analysis and made all of the modifications to the original Mfast architecture. The modified architecture is very different from the parallel compute engine. Because of the extensive modifications, it will not run on any of our current tools: the instruction encodings are unique, as are the implied interpretation fields. The processor is being considered as an enhancement to the control sequencer on the IBM Mfast parallel engine. I received substantial advice from the Mfast lead architect, Dr. Gerald Pechanek, to whom I am deeply indebted. Other than Dr. Pechanek's very relevant advice, all work and modelling is my own.

18.0 References
[1] G. A. Blaauw and F. P. Brooks, Computer Architecture, Spring 1995 draft of unpublished manuscript, The University of North Carolina at Chapel Hill.

[2] G. G. Pechanek, C. J. Glossner, W. F. Lawless, D. H. McCabe, C. H. L. Moller, S. J. Walsh, “A Machine Organization and Architecture for Highly Parallel, Scalable, Single Chip DSPs,” internal manuscript to appear at DSP-x in May 1995.

[3] G. D. Jones and L. D. Larsen, “Selecting Predecoded Instructions with a Surrogate,” IBM draft document obtainable from G. Pechanek, D50A, RTP, N.C.

[4] G. D. Jones, L. D. Larsen, C. R. Ogilvie, P. C. Stabler, and B. Blaner, “The Mwave Signal Processor (MSP) Architecture Level 2.0 Release 1.02,” IBM Architecture Specification.

[5] G. G. Pechanek, S. Vassiliadis, J. G. Delgado-Frias, “Massively Parallel Multiple-Folded Clustered Processor Mesh Array,” IBM Corporation Internal Technical Report - TR 29.1655, pp. 1-41, Research Triangle Park, N.C., May 1993.

[6] G. G. Pechanek, S. Vassiliadis, L. D. Larsen, C. J. Glossner, “Parallel Processing System and Method using Surrogate Instructions,” IBM Patent application RA8940014.

[7] J. P. Nussbaumer, “Transport and Synchronization of Multimedia Streams: A Tutorial to MPEG-II Systems,” IBM Internal Report, June 1994.

[8] A. Puri, “Video Coding Using the MPEG-2 Compression Standard,” AT&T Bell Labs, SPIE Vol 2094, pp. 1701-1713, 1993.

[9] D. J. Le Gall, “The MPEG video compression algorithm,” Signal Processing: Image Communication 4, pp. 129-140, Elsevier, 1992.

[10] M. M. Stojancic and C. Ngai, “VLSI Implementation of a Fully Compliant MPEG-2:MP@ML Video Decoder,” IBM internal pre-print of an SMPTE 94 presentation given in Sydney, Australia, July 1994.

[11] R. B. Lee, “Accelerating Multimedia with Enhanced Microprocessors,” IEEE Micro, April 1995.
