Tampereen teknillinen yliopisto. Julkaisu 483 Tampere University of Technology. Publication 483 Christian Panis Scalable DSP Core Architecture Addressing Compiler Requirements Thesis for the degree of Doctor of Technology to be presented with due permission for public examination and criticism in Tietotalo Building, Auditorium TB104, at Tampere University of Technology, on the 13th of August 2004, at 12 o'clock noon. Tampereen teknillinen yliopisto - Tampere University of Technology Tampere 2004




ISBN 952-15-1205-9 ISSN 1459-2045


Abstract

This thesis considers the definition and design of an embedded configurable DSP (Digital Signal Processor) core architecture and addresses the requirements for developing an optimizing high-level language compiler. The introduction provides an overview of typical DSP core architectural features, briefly discusses the currently available DSP cores, and summarizes the architectural aspects which have to be considered when developing an optimizing high-level language compiler. The introduction is followed by a total of 12 publications which outline the research work carried out, providing a detailed description of the main core features and the design space exploration methodology.

Most of the research work focuses on architectural aspects of the configurable RISC (Reduced Instruction Set Computer) DSP core based on a modified Dual-Harvard load-store architecture. Due to increasing application code size and the associated configuration aspect, the use of automatic code generation by a high-level language compiler is required. Generating code efficiently requires that the architectural aspects be considered as early as the definition stage. This results in an orthogonal instruction set architecture with simple issue rules.

Architectural features are introduced to reduce area consumption and power dissipation, to fulfill the requirements of SoC (System-on-Chip) and SiP (System-in-Package) applications, and to close the gap between dedicated hardware implementations and software-based system solutions. Code density has a significant influence on the area of the DSP sub-system; thus xLIW (a scalable Long Instruction Word) is introduced. An instruction buffer allows power dissipation to be reduced during the execution of loop-centric DSP algorithms. Simple issue rules and an exhaustive predicated execution feature enable cycle- and power-efficient execution of control code.

The scalable DSP core architecture introduced herein allows parameterization of the main architectural features to application-specific requirements. To make use of this feature it is necessary to analyze the requirements of the application. This thesis introduces a design space exploration methodology based on a C-compiler and a cycle-true instruction set simulator. A unique XML-based configuration file is used to reduce the implementation and validation effort for configuring the tool-chain, updating the documentation, and automatically generating parts of the VHDL-RTL core description.
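To illustrate the role of such a single configuration source, the fragment below is a purely hypothetical sketch: the element and attribute names are invented for this example and do not reproduce the actual xDSPcore configuration schema.

```xml
<!-- Hypothetical sketch: element and attribute names are invented
     for illustration; they are not the actual xDSPcore schema. -->
<core name="example_config">
  <!-- Scalable architectural parameters of the kind discussed above -->
  <registerfile data_regs="16" address_regs="8" width="40"/>
  <memory banks="2" data_width="32"/>
  <pipeline depth="5"/>
  <instruction_buffer entries="64"/>
</core>
```

Keeping all scalable parameters in one machine-readable file is what allows the C-compiler, the cycle-true instruction set simulator, the documentation, and the generated parts of the VHDL-RTL description to remain consistent from a single point of change.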


Preface

The research work described in this thesis was carried out during 1999-2004 at Infineon Technologies Austria and at the Institute of Digital and Computer Systems of Tampere University of Technology in Tampere, Finland.

I would like to express my deepest gratitude to my thesis advisor, Prof. Jari Nurmi, who introduced me to the scientific world and guided me carefully through it. Jari hosted me during my stays at the university in Tampere and warmed the cockles of my heart in the sometimes cold Finland. Prof. Jarmo Takala, as head of the Institute of Digital and Computer Systems, supported my study work and, along with Lasse Harju and Timo Rintakoski, ensured a warm and pleasant working environment during my time at TUT.

A note of gratitude goes out to Prof. Lars Wanhammar and Dr. Mika Kuulusa for reviewing my thesis and supporting me with imperative feedback.

Defining and developing a new DSP core when considering the approach of Hardware and Software Co-Definition can only be done with a competent and enthusiastic team. Therefore I would like to express my deepest thanks to the xDSPcore team which contributed excellent work during the long period.

Many thanks to Prof. Andreas Krall from Vienna University of Technology, who influenced the xDSPcore architecture by considering the aspects relevant to developing an optimizing C-compiler, and to Karl Vögler and Ulrich Hirnschrott, who developed the main parts of the C-compiler backend and supported my thesis by contributing benchmarks and analysis results alongside many productive discussions.

Many thanks also to the internship and master's students who contributed to the xDSPcore research project, including Pierre Elbischger, Gunther Laure, Wolfgang Lazian, Raimund Leitner, Michael Bramberger and many more.

During my time at Infineon Technologies I had the pleasure of meeting many amazing people, which led to a plethora of fruitful discussions. Representative of them all, many thanks to Herbert Zojer, who supported the development of an innovative DSP core architecture, to Prof. Lajos Gazsi, Fellow of Infineon Technologies, and to Dr. Söhnke Mehrgardt, the CTO of Infineon Technologies, who guided the development team inside the company.

I would like to express my thanks to Dr. Franz Dielacher, Manfred Haas and Reinhard Petschacher at Infineon Technologies Austria, and to Prof. Herbert Grünbacher and Erwin Ofner at Carinthian Tech Institute, who enabled me to finalize my research work.


In addition I would like to express my thanks to Prof. Tobias G. Noll and Volker Gierenz at RWTH Aachen who assisted the project from the beginning with their technical expertise.

The research was financially supported by Infineon Technologies Austria, by the European Commission through the project SoC-Mobinet (IST-2000-30094), and by the Carinthian Tech Institute, which hosted me during the last two years. Many thanks.

Most of all I would like to express my deepest gratitude to my parents Maria and Herbert Panis and brother Peter who supported me unrelentingly throughout the long time period with their love. Only through their support was it possible for me to complete my studies in Tampere, Finland.

Tampere, August 2004

Christian Panis


Table of Contents

1 Introduction..................................................................................................................... 2

1.1 Motivation............................................................................................................... 2

1.2 Methodology........................................................................................................... 3

1.3 Goals ....................................................................................................................... 4

1.4 Outline of Thesis..................................................................................................... 4

2 DSP Specific Features..................................................................................................... 7

2.1 Introduction............................................................................................................. 7

2.2 Saturation ................................................................................................................ 7

2.3 Rounding................................................................................................................. 9

2.4 Fixed-Point, Floating-Point................................................................................... 10

2.5 Hardware Loops.................................................................................................... 12

2.6 Addressing Modes ................................................................................................ 13

2.7 Multiple Memory Banks ....................................................................................... 18

2.8 CISC Instruction Sets............................................................................................ 19

2.9 Orthogonality ........................................................................................................ 20

2.10 Real-Time Requirements ...................................................................................... 21

3 DSP cores...................................................................................................................... 23

3.1 Design Space......................................................................................................... 23

3.2 Architectural Alternatives..................................................................................... 31

3.3 Available DSP Core Architectures ....................................................................... 37

3.4 xDSPcore .............................................................................................................. 49

4 High Level Language Compiler Issues......................................................................... 51

4.1 Coding Practices in DSPs .................................................................... 51


4.2 Compiler Overview............................................................................................... 59

4.3 Requirements ........................................................................................................ 62

4.4 HLL-Compiler Friendly Core Architecture .......................................................... 69

5 Summary of Publications.............................................................................................. 73

5.1 Architectural Aspects of Scalable DSP Core........................................................ 73

5.2 Design Space Exploration..................................................................................... 76

5.3 Author's Contribution to Published Work............................................................ 77

6 Conclusion .................................................................................................................... 81

6.1 Main Results ......................................................................................................... 81

6.2 Future Research .................................................................................................... 84

7 References..................................................................................................................... 89


List of Publications

This thesis is split into two parts: the first contains an introduction to Digital Signal Processor architectures, and the second a reprint of the publications listed below.

[P1] C. Panis, J. Nurmi, "xDSPcore - a Configurable DSP Core", Technical Report 1-2004, Tampere University of Technology, Institute of Digital and Computer Systems, Tampere, Finland, May 2004.

[P2] C. Panis, R. Leitner, H. Grünbacher, J. Nurmi, "xLIW – a Scaleable Long Instruction Word", in Proceedings of the 2003 IEEE International Symposium on Circuits and Systems (ISCAS 2003), Bangkok, Thailand, May 25-28, 2003, pp. V69-V72.

[P3] C. Panis, R. Leitner, H. Grünbacher, J. Nurmi, "Align Unit for a Configurable DSP Core", in Proceedings of the IASTED International Conference on Circuits, Signals and Systems (CSS 2003), Cancun, Mexico, May 19-21, 2003, pp. 247-252.

[P4] C. Panis, M. Bramberger, H. Grünbacher, J. Nurmi, "A Scaleable Instruction Buffer for a Configurable DSP Core", in Proceedings of the 29th European Solid-State Circuits Conference (ESSCIRC 2003), Estoril, Portugal, September 16-18, 2003, pp. 49-52.

[P5] C. Panis, H. Grünbacher, J. Nurmi, "A Scaleable Instruction Buffer and Align Unit for xDSPcore", IEEE Journal of Solid-State Circuits, Volume 35, Number 7, July 2004, pp. 1094-1100.

[P6] C. Panis, U. Hirnschrott, A. Krall, G. Laure, W. Lazian, J. Nurmi, "FSEL – Selective Predicated Execution for a Configurable DSP Core", in Proceedings of the IEEE Annual Symposium on VLSI (ISVLSI-04), Lafayette, Louisiana, USA, February 19-20, 2004, pp. 317-320.

[P7] C. Panis, G. Laure, W. Lazian, H. Grünbacher, J. Nurmi, "A Branch File for a Configurable DSP Core", in Proceedings of the International Conference on VLSI (VLSI'03), Las Vegas, Nevada, USA, June 23-26, 2003, pp. 7-12.

[P8] C. Panis, R. Leitner, J. Nurmi, "A Scaleable Shadow Stack for a Configurable DSP Concept", in Proceedings of the 3rd IEEE International Workshop on System-on-Chip for Real-Time Applications (IWSOC), Calgary, Canada, June 30 - July 2, 2003, pp. 222-227.


[P9] C. Panis, J. Hohl, H. Grünbacher, J. Nurmi, "xICU - a Scaleable Interrupt Unit for a Configurable DSP Core", in Proceedings of the 2003 International Symposium on System-on-Chip (SOC'03), Tampere, Finland, November 19-21, 2003, pp. 75-78.

[P10] C. Panis, G. Laure, W. Lazian, A. Krall, H. Grünbacher, J. Nurmi, "DSPxPlore – Design Space Exploration for a Configurable DSP Core", in Proceedings of the International Signal Processing Conference (GSPx), Dallas, Texas, USA, March 31 - April 3, 2003, CD-ROM.

[P11] C. Panis, U. Hirnschrott, G. Laure, W. Lazian, J. Nurmi, "DSPxPlore – Design Space Exploration Methodology for an Embedded DSP Core", in Proceedings of the 2004 ACM Symposium on Applied Computing (SAC 04), Nicosia, Cyprus, March 14-17, 2004, pp. 876-883.

[P12] C. Panis, A. Schilke, H. Habiger, J. Nurmi, "An Automatic Decoder Generator for a Scaleable DSP Architecture", in Proceedings of the 20th Norchip Conference (Norchip'02), Copenhagen, Denmark, November 11-12, 2002, pp. 127-132.


List of Figures

Figure 1: Chosen Methodology for Definition of the Core Architecture. .............................. 3

Figure 2: Principle of Saturation............................................................................................. 8

Figure 3: Two's Complement Rounding (Motorola 56000 family). ....................................... 9

Figure 4: Convergent Rounding (Motorola 56000 family)................................................... 10

Figure 5: Integer versus Fractional Data Representation...................................................... 11

Figure 6: Fractional Multiplication Including Left Shift. ..................................................... 12

Figure 7: Assembly Code Example for Finite Impulse Response (FIR) Filter. ................... 12

Figure 8: Example for Implied Addressing: Multiply Operation (Lucent 16xx).................. 13

Figure 9: Example for Implied Addressing: MAX2VIT Instruction (Starcore SC140). ...... 13

Figure 10: Example for Immediate Data Addressing: MOVC instruction (xDSPcore). ...... 13

Figure 11: Example for Register Direct Addressing: Subtraction (TI C62x). ...................... 14

Figure 12: Principle of Register Indirect Addressing. .......................................................... 14

Figure 13: Principle of Pre/Post Operation Mode................................................................. 15

Figure 14: Assembly Code Example for Pre/Post Increment Instructions (xDSPcore). ...... 15

Figure 15: Principle of Using a Modulo Buffer for Address Generation. ............................ 16

Figure 16: Principle of the Bit Reversal Addressing Scheme............................................... 17

Figure 17: Assembly Code Example for Short Immediate Data (xDSPcore). ..................... 17

Figure 18: Processor Architectures: von Neumann, Harvard, modified Dual-Harvard. ...... 18

Figure 19: Example for Interleaved Memory Addressing (SC 140)..................................... 19

Figure 20: Example for CISC Instructions: Multiply and Accumulate (MAC). .................. 20

Figure 21: Influence of Binary Coding on Application Code Density (using the same ISA). ............ 24

Figure 22: Principle of Define-in-use Dependency. ............................................................. 26

Figure 23: Principle of Load-in-use Dependency. ................................................................ 27

Figure 24: Example for Data Memory Bandwidth Limitations (Starcore SC140). .............. 27

Figure 25: Architectural Alternatives: Issue Rates for Available DSP Core Architectures. 31

Figure 26: Architectural Alternatives: RISC versus CISC Pipeline. .................................... 34

Figure 27: Architectural Alternatives: Pipeline Depth of Available DSP Cores.................. 34

Figure 28: Architectural Alternatives: Direct Memory versus Load-Store. ......................... 35

Figure 29: Architectural Alternatives: Mode Dependent Limitations during Instruction Scheduling. ............ 36


Figure 30: Architectural Overview: OAKDSP Core. ........................................................... 38

Figure 31: Architectural Overview: Motorola 56300. .......................................................... 38

Figure 32: Architectural Overview: TI C54x........................................................................ 41

Figure 33: Architectural Overview: ZSP400. ....................................................................... 42

Figure 34: Architectural Overview: Carmel. ........................................................................ 44

Figure 35: Architectural Overview: TI C6xx........................................................................ 46

Figure 36: Architectural Overview: Starcore SC140............................................................ 47

Figure 37: Architectural Overview: Blackfin. ...................................................................... 48

Figure 38: Architectural Overview: xDSPcore..................................................................... 50

Figure 39: Principle of Software Pipelining. ........................................................................ 52

Figure 40: Data Flow Graph of an Example Issuing Summation of two Data Values. ........ 53

Figure 41: Example for Assembler Code Implementation including Software Pipelining (xDSPcore). ............ 54

Figure 42: Data Flow Graph for Maximum Search Example............................................... 55

Figure 43: C-Code Example for Illustration of Software Pipelining. ................................... 56

Figure 44: Generated Assembler Code without Software Pipelining (xDSPcore). .............. 56

Figure 45: Generated Assembler Code including Software Pipelining (xDSPcore). ........... 56

Figure 46: Principle of Loop Unrolling. ............................................................................... 57

Figure 47: Principle of Predicated Execution using Loop Flags. ......................................... 58

Figure 48: General High-level Language Compiler Structure.............................................. 59

Figure 49: Example for banked Register Files (TI C62x). ................................................... 63

Figure 50: Limitations during Instruction Scheduling caused by Processor Modes. ........... 64

Figure 51: Example for Address Generation Unit (Motorola 56300)................................... 65

Figure 52: Example for not Orthogonal Instructions: MAX2VIT D4,D2 (Starcore SC140). ............ 65

Figure 53: Example for Mode Dependent Instruction Sets: ARM Thumb Decompression Logic. ............ 67

Figure 54: Example for Address Generation Unit (Starcore SC140). .................................. 68

Figure 55: Configurable Long Instruction Word (CLIW of Carmel DSP Core). ................. 69

Figure 56: xDSPcore Core Overview. .................................................................................. 70

Figure 57: Orthogonal Register File. .................................................................................... 70

Figure 58: Issuing Rules for xDSPcore Architecture. .......................................................... 71

Figure 59: Results for Dhrystone Benchmarks generated by C-Compiler. .......................... 72

Figure 60: Results for EFR Benchmarks generated by C-Compiler..................................... 72


Figure 61: xDSPcore Overview. ........................................................................................... 73

Figure 62: DSPxPlore Overview. ......................................................................................... 76

Figure 63: Screenshot of xSIM............................................................................................. 82

Figure 64: DSPxPlore Design Flow...................................................................................... 83


List of Tables

Table 1: Principle of Resource Allocation Table.................................................................. 53

Table 2: Resource Allocation Table including Software Pipeline Technology for increased Usage of Core Resources. ............ 54


List of Abbreviations

AGU Address Generation Unit

ALU Arithmetic Logic Unit

ANSI American National Standard Institute

ASIC Application Specific Integrated Circuit

ASIP Application Specific Instruction Set Processor

BMU Bit Manipulation Unit

CISC Complex Instruction Set Computer

CLIW Configurable Long Instruction Word

CMOS Complementary Metal Oxide Semiconductor

CPU Central Processing Unit

DMA Direct Memory Access

DPG Data Path Generator

DRAM Dynamic Random Access Memory

DRM Digital Radio Mondiale

DSP Digital Signal Processor

FFT Fast Fourier Transformation

FIR Finite Impulse Response

FPGA Field Programmable Gate Array

FSM Finite State Machine

GOPS Giga Operations Per Second

GPP General Purpose Processor

HDL Hardware Description Language

HLL High-Level Language

IC Integrated Circuit


ICU Interrupt Control Unit

IEEE Institute of Electrical and Electronics Engineers

ILP Instruction Level Parallelism

IR Intermediate Representation

ISA Instruction Set Architecture

ISO International Organization for Standardization

ISR Interrupt Service Routine

ISS Instruction Set Simulator

LCP Loop Carry Path

LSB Least Significant Bit

MAC Multiply and Accumulate

MSB Most Significant Bit

MII Minimum Initiation Interval

MIMD Multiple Instruction Multiple Data

MIPS Million Instructions Per Second

MMACS Million MAC Instructions Per Second

MOPS Million Operations Per Second

MTCMOS Multi-Threshold CMOS

NMI Non Maskable Interrupt

NOP No Operation

OCE Open Compiler Environment

OS Operating System

PCU Program Control Unit

RAM Random Access Memory

RISC Reduced Instruction Set Computer

RTOS Real-Time Operating System

SIMD Single Instruction Multiple Data


SiP System in Package

SJP Split Join Path

SMT Simultaneous Multithreading

SoC System on Chip

SSA Static Single Assignment

TLB Translation Lookaside Buffer

TLP Task Level Parallelism

VHDL VHSIC Hardware Description Language

VHSIC Very High Speed Integrated Circuit

VLES Variable Length Execution Set

VLIW Very Long Instruction Word

WCET Worst Case Execution Time

xICU Scaleable Interrupt Control Unit

xLIW Scaleable Long Instruction Word


Part I

INTRODUCTION


1 Introduction

The introduction begins with a short description of why defining and developing a new DSP core architecture was chosen as the topic of this thesis. This is followed by a brief introduction of the chosen methodology. A few sentences then illustrate the goals of the development project carried out for this thesis, before the outline of the thesis is provided.

1.1 Motivation

The increasing complexity of System-on-Chip (SoC) applications increases the demand for powerful embedded cores. The flexibility provided by software-programmable cores quite often leads to increased silicon area and power dissipation, and therefore dedicated hardware is favored over software-based platform solutions. The picture is changing, however, due to the significantly increasing mask costs of advanced process technologies and the difficulty of bringing to the heterogeneous market high-volume products that would justify the high non-recurring cost. Together these elements increase the pressure for developing product platforms. Such a platform serves a group of applications, so that software executed on programmable core architectures can be used to differentiate the products.

General purpose processors with a fixed Instruction Set Architecture (ISA) are less well suited for integration into platforms. To close the gap between dedicated hardware implementations and software-based solutions requires core architectures which enable platform-specific and application-specific adaptations.

For embedded Digital Signal Processors (DSP) an additional problem exists. Non-orthogonal core architectures are preferred for their increased performance and lower area consumption when mapping DSP algorithms onto a processor, and therefore DSPs are still programmed manually in assembly language [162]. The price of this better usage of the available processor resources is an architecture-dependent description of the algorithms, which makes changes in the core architecture difficult and costly (due to compatibility issues) and prohibits application-specific adaptations [113]. Therefore products based on a programmable core architecture remain with the same architecture for a long time, even when it is no longer state-of-the-art.

A consequence of using assembly language is long development cycles [174]. Ten years ago, algorithms executed on DSP cores consisted of several hundred lines of code; manual coding was reasonable, even if minor changes in the application code required several weeks of coding and verification. Today's DSP cores are more powerful and enable the execution of large programs consisting of several hundred thousand lines of code. DSP cores are no longer used only for filtering operations; most notably in low-cost products, where no more than one core is reasonable, the control code is also executed on the DSP core.

To increase the performance of DSP subsystems, a high degree of parallelism and deep pipeline structures are introduced. Unfortunately, manually programming highly parallelized DSP core architectures with deep pipelines, while resolving data and control dependencies, is of limited feasibility or even impossible. Therefore the motivation for using assembly code to increase the utilization of the available resources is no longer valid.

1.2 Methodology

To arrive at the definition of a DSP core that is programmable in a high-level language, and not just another DSP core, the methodology for defining the core architecture has to be changed. The technical reason, alongside many commercial ones, why efficient high-level language programming of DSP cores is still not feasible is that orthogonality, the major requirement for an optimizing compiler, has been traded away for improved efficiency in terms of area and power consumption. Considering early DSP cores as programmable filter structures, the major driving factor for their architectural features has been the algorithms executed on the cores. Further constraints on available core architectures, such as banked register files, mode registers and complex instruction sets, have been caused by what could be implemented in hardware with reasonable core performance.

Figure 1 outlines the design methodology for the core architecture introduced in this thesis. The development of an optimizing high-level language compiler has been considered already during the definition of the feature set and the main architectural concepts. In this respect the methodology differs from that of existing core architectures.

Figure 1: Chosen Methodology for Definition of the Core Architecture.

Before an instruction is added to the instruction set architecture (ISA), its suitability is verified with respect to three aspects: the algorithms to be executed, the hardware implementation, and the software tools.


1.3 Goals

To close the gap between dedicated hardware implementations and software-based solutions, a paradigm change is required. The main architectural features of the core subsystem have to be scalable to enable the definition of an application-specific optimum in terms of area consumption and power dissipation. To reach this goal it is necessary to consider the whole DSP subsystem instead of focusing only on the core architecture. To overcome the software compatibility issues caused by the scalable core features, programming has to take place in a high-level language (HLL) like C, which enables an architecture-independent application description. Moreover, HLL compilers reduce software development effort and maintenance costs. Enabling the development of an optimizing HLL compiler that generates efficient code (efficient meaning less than 10% overhead compared with manual coding) requires restricting the design space of the core architecture. The goal for the core architecture can be summarized as follows: a scalable DSP core architecture that meets area and power targets competitive with hardwired implementations, is suited as a target for an optimizing C compiler, and is designated for efficient execution of both control code and loop-centric DSP-specific algorithms.

The proposed approach is to provide an application-specific scalable DSP core architecture. To gain the advantage of this approach it is strictly required to understand the application-specific requirements on the core architecture. For this purpose a design space exploration methodology is introduced to analyze the influence of different core configurations on area consumption (and later also on power dissipation) for specific application code.

Flexibility and scalability increase verification and validation effort. To keep this effort reasonably low, a single configuration file is introduced. When parameters are changed, the current core configuration propagates automatically to the software tools, to the VHDL-RTL description used for generating silicon, and to the documentation, which is updated automatically.

1.4 Outline of Thesis

This thesis consists of two parts: an introductory Part I, structured as outlined below, followed by Part II, which illustrates the main research results in 12 publications.

Chapter two introduces DSP-specific architectural features as well as system aspects like worst-case execution time. The first part of the third chapter briefly discusses the design space of core subsystems, considering area consumption, performance and power dissipation, followed by architectural alternatives and their suitability for use in DSP core architectures. The third chapter ends with an introduction of some commercially available DSP core architectures and a brief illustration of xDSPcore, the configurable DSP core architecture introduced in this thesis. The fourth chapter discusses issues concerning high-level language compilers, starting with typical coding practices used during the implementation of algorithms in the field of digital signal processing, followed by a short introduction to the structure of high-level language compilers. The fourth chapter ends with a discussion of the requirements of a high-level language compiler and summarizes the architectural requirements for obtaining efficient compilation results. In the fifth chapter a summary of the publications provides an overview of the research work and summarizes the author's contribution. The sixth and final chapter concludes with a summary of the results of the project and provides an overview of future research topics.


2 DSP Specific Features

This section illustrates DSP-specific architectural features which differentiate DSPs from traditional microcontroller architectures. The architectural features are introduced and the motivation for choosing them is analyzed.

Some of these features also exist in microcontroller architectures, where they are used to increase the performance of the core when executing algorithms in the field of digital signal processing [4][7][95].

2.1 Introduction

"DSP is an embedded microprocessor specifically designed to handle signal processing algorithms cost effectively", where cost effectiveness means low silicon area and low power dissipation [102]. To reach this target under the specific requirements of digital signal processing, algorithm-specific hardware is utilized to meet the performance, power and area goals. Orthogonality, being in contradiction with these goals, is sacrificed.

The consequence of ignoring orthogonal structure is highly specialized core architectures programmed manually by experts in assembly language. Developing a high-level language compiler for such specialized features is costly and requires so-called compiler-known functions and intrinsics to invoke the specialized hardware efficiently [37]. The result is algorithm descriptions which are not easily portable to different core architectures.

The alternative is to use pure ANSI-C [109] and forgo the specialized features, which decreases the achievable performance on the core architecture. In 2003 a first draft of a standard for the Embedded C language was introduced [169]. Based on ANSI-C, it introduces standardized enhancements for the special features required to implement digital signal processing algorithms efficiently [24]. The advantage of standardized intrinsics is that compilers for different core architectures are able to compile the same algorithmic source.

2.2 Saturation

When the result of an operation exceeds the size of the destination register, an overflow or underflow takes place. In signed two's complement number representation this changes the sign bit and therefore leads to a significant error. As illustrated in Figure 2, the envelope of a signal is significantly altered, so the error caused by the overflow or underflow is crucial.

To overcome this problem traditional DSP architectures provide saturation circuits. If the value of a result exceeds the data range of the destination, the highest or lowest value which can still be represented correctly is used instead of the calculated result. The error introduced by the saturation circuit, as illustrated in Figure 2, is thus minimal.
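The behavior of such a saturation circuit can be sketched in a few lines. This is an illustrative model with hypothetical helper names, not the implementation of any particular core; it contrasts saturation with the two's complement wrap-around that occurs without it.

```python
# Hypothetical sketch of a 16-bit saturation circuit versus plain wrap-around.
INT16_MAX = 2**15 - 1   # 32767
INT16_MIN = -2**15      # -32768

def saturate16(value):
    """Clamp a result to the 16-bit signed range instead of letting it wrap."""
    if value > INT16_MAX:
        return INT16_MAX
    if value < INT16_MIN:
        return INT16_MIN
    return value

def wrap16(value):
    """Without saturation: two's complement wrap-around (the 'flipping' signal)."""
    return ((value + 2**15) % 2**16) - 2**15

# 30000 + 5000 exceeds the 16-bit range:
print(saturate16(30000 + 5000))  # 32767  (small error)
print(wrap16(30000 + 5000))      # -30536 (sign flip, large error)
```

The saturated result deviates from the true value by only a few thousand, whereas the wrapped result even changes sign, which is exactly the envelope distortion shown in Figure 2.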


In commercially available DSP core architectures three saturation mechanisms are commonly used.

DSP cores support accumulator registers, differing from microcontroller architectures with their plain 16-bit and 32-bit registers. Accumulators are registers with additional bits called guard bits; for example, the SC140 from Starcore LLC [32] supports 40-bit wide accumulator registers. The guard bits allow storage of intermediate results exceeding the regular data range of, for example, 32 bits. To store the final result to data memory, the value has to fit into the data range supported by the data memory port. Therefore the result held in the accumulator register has to be evaluated against the required data range: if it exceeds the maximum value supported by the data memory port (indicated by guard bits differing from the sign bit), it has to be saturated.
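The guard-bit check described above can be modeled as follows. The sketch assumes a 40-bit accumulator (8 guard bits plus 32 data bits) and a 32-bit memory port, matching the example in the text; the function names are illustrative.

```python
# Illustrative model of storing a 40-bit accumulator through a 32-bit port.
# "Guard bits differ from the sign bit" is equivalent to the value not
# fitting into the 32-bit signed range.
INT32_MAX = 2**31 - 1
INT32_MIN = -2**31

def needs_saturation(acc40):
    """True if the 40-bit value does not fit the 32-bit memory port."""
    return not (INT32_MIN <= acc40 <= INT32_MAX)

def store32(acc40):
    """Saturating store of a 40-bit accumulator to a 32-bit destination."""
    if acc40 > INT32_MAX:
        return INT32_MAX
    if acc40 < INT32_MIN:
        return INT32_MIN
    return acc40

acc = INT32_MAX + 1000          # intermediate result held in the guard bits
print(needs_saturation(acc))    # True
print(store32(acc))             # 2147483647
```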

Some DSP cores, for example the Blackfin from Analog Devices and Intel [8], support an additional saturation method. The overflow flag is evaluated to indicate the necessity of saturation: if overflow or underflow takes place, e.g. for a 16-bit value, the result is saturated to fit into the 16-bit destination register.

The third mechanism uses a saturation mode where the computation results are matched to the allowed word length after each arithmetic operation [21].

[Figure 2 plots twenty samples of three signals: the original signal, the saturated signal, and the overflowing signal.]

Figure 2: Principle of Saturation.

Figure 2 illustrates three signals. The solid line shows the original signal without limitation. The dotted line illustrates the signal flow when the data range of the destination register is exceeded, which leads to signal flipping. The dashed line represents the saturated signal: when the original signal exceeds the data range, in this example that of a 16-bit register, the result is saturated and the largest representable value is stored. The generated error is thus kept minimal compared with the flipping signal.


2.3 Rounding

DSP core architectures support different rounding modes. In this subsection the two rounding modes available in most commercial core architectures are introduced, illustrated with 40-bit accumulator registers. Commercial core architectures define the rounding modes only for rounding 32-bit values to 16-bit values; the guard bits remain unchanged.

2.3.1 Two's Complement Rounding

Two's Complement rounding is also called the round-to-nearest technique [116]. If the value of the lower half of the data word is greater than or equal to half of the LSB of the resulting rounded word, the value is rounded up; all smaller values are rounded down. Statistically, a small positive bias is therefore introduced. Two's Complement rounding is illustrated in Figure 3: independent of the Least Significant Bit (LSB) of the high word, one is added at the half-LSB position, which causes the positive bias.

Figure 3: Two's Complement Rounding (Motorola 56000 family).

2.3.2 Convergent Rounding

A slightly improved rounding methodology is convergent rounding, also called round-to-nearest even number [116]. The bias caused by the decision at the halfway point is compensated by rounding down if the high portion is even and rounding up if the high portion is odd. Convergent rounding is illustrated in Figure 4.


Figure 4: Convergent Rounding (Motorola 56000 family).

Different from Two's Complement rounding, the addition of one is only performed if the LSB of the high word is equal to one.
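The two modes can be sketched for the 32-bit-to-16-bit case (guard bits ignored). This is an assumption-laden behavioral model with hypothetical function names, not a cycle-exact description of the Motorola 56000 logic; it shows how the modes differ only at the exact halfway point.

```python
# Reduce a non-negative 32-bit value to its 16-bit high word.

def twos_complement_round(x32):
    """Round-to-nearest: unconditionally add half an LSB of the high word."""
    return (x32 + 0x8000) >> 16

def convergent_round(x32):
    """Round-to-nearest-even: at the halfway point, round toward the even
    high word, removing the positive bias of round-to-nearest."""
    high, low = x32 >> 16, x32 & 0xFFFF
    if low > 0x8000 or (low == 0x8000 and high & 1):
        return high + 1
    return high

x_even = (4 << 16) | 0x8000   # exactly halfway between 4 and 5
x_odd  = (5 << 16) | 0x8000   # exactly halfway between 5 and 6
print(twos_complement_round(x_even), twos_complement_round(x_odd))  # 5 6
print(convergent_round(x_even), convergent_round(x_odd))            # 4 6
```

Away from the halfway point both modes agree; only the tie case is decided differently, which is why the bias compensation is statistical.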

2.4 Fixed-Point, Floating-Point

DSP core architectures can be divided into those supporting fixed-point arithmetic and those supporting floating-point arithmetic; the floating-point DSP architectures mostly also support integer fixed-point arithmetic, for example for address calculation. Floating-point data representation uses a combination of significand and exponent:

value = significand * 2^exponent

Fixed-point representation is chosen for most of the available core architectures, especially those used as embedded cores for SoC or SiP applications. Algorithm development for fixed-point arithmetic requires more care, but the hardware implementation consumes less power and area and is therefore favored.

Fixed-point numbers are also referred to as fractional representation, with integer as a special case. The place of the virtual binary point in the data word determines the number of integer and fraction bits. In integers the point is to the right of the LSB, whereas in fractional numbers the point is directly right of the sign bit. Some operations like addressing and control code functions are inherently of type integer. Filtering operations, on the other hand, make use of fractional data representation.

The difference is illustrated in Figure 5 for a 40-bit wide accumulator register. A common fixed-point format used for DSP cores is S[15.1], where the S stands for signed [172]; it scales the data to the range -1 ≤ X < 1. The radix point is located on the left side of the register, between the sign bit and the next bit. The bits located right of the radix point encode the fraction.

The guard bits, as mentioned before, are used to store intermediate results exceeding the data range of the destination register. In the example in Figure 5 eight guard bits are supported, which extends the data range to -256 ≤ X < 256. Integer can be interpreted as a special case of fractional where the radix point is located at the right end of the register, because no fraction bits are present.

Figure 5: Integer versus Fractional Data Representation.

The advantage of using fractional representation for traditional DSP algorithms like filtering is that reducing the number of bits available for representing a value (e.g. during a rounding operation) changes the accuracy, but the value remains correct. Integers are used for control code and address generation. In [172] definitions of accuracy, data range and different variants of fractional data representation can be found.
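The fractional interpretation of a 16-bit word can be sketched as follows. The sketch assumes a sign bit followed by 15 fraction bits (the common 16-bit fractional layout described above); the helper names are hypothetical.

```python
# Fractional interpretation of a 16-bit word: -1 <= x < 1 in steps of 2^-15.
FRAC_BITS = 15

def to_fractional(bits16):
    """Interpret a signed 16-bit integer pattern as a fractional value."""
    return bits16 / 2**FRAC_BITS

def from_real(x):
    """Quantize a real number in [-1, 1) to the 16-bit fractional grid."""
    return int(round(x * 2**FRAC_BITS))

print(to_fractional(from_real(0.5)))   # 0.5
print(to_fractional(-2**15))           # -1.0 (most negative representable value)
# Dropping low-order bits changes accuracy, not the magnitude of the value:
print(to_fractional(from_real(0.333)))
```

Note how truncating fraction bits only coarsens the grid: the stored value stays close to the intended one, whereas truncating an integer would change its magnitude.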

With the exception of multiplication there is no difference concerning hardware implementation of arithmetic functions for fractional and integer data types. Fractional multiplication requires a left shift by one bit, setting the LSB to zero, to correct the position of the radix point. In Figure 6 the required shift for signed fractional multiplication is illustrated.


Figure 6: Fractional Multiplication Including Left Shift.
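The radix-point correction can be sketched numerically. Assuming 16-bit operands with 15 fraction bits, the integer product of two such operands carries two sign bits and 30 fraction bits; the left shift restores a single sign bit and 31 fraction bits, with the inserted LSB being zero.

```python
# Sketch of fractional multiplication with the corrective left shift.
def frac_mul(a_bits, b_bits):
    """Multiply two 16-bit fractional operands into a 32-bit fractional result.
    The left shift re-aligns the radix point; the shifted-in LSB is zero."""
    product = a_bits * b_bits   # double sign bit after the integer multiply
    return product << 1         # radix point corrected

half = 1 << 14                  # 0.5 in 16-bit fractional representation
result = frac_mul(half, half)
print(result / 2**31)           # 0.25
```

Without the shift, interpreting the raw product with 31 fraction bits would yield 0.125, i.e. the result would be off by a factor of two.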

2.5 Hardware Loops

Filter algorithms, often executed on DSPs, are loop-centric. The code example in Figure 7 represents a Finite Impulse Response (FIR) filter [103]. The filter kernel takes only one clock cycle: it loads the operands (including address calculation) and calculates one filter tap. Software pipelining is used to compensate the load-in-use dependency caused by split execution, which is illustrated in a later section.

ld (r0)+, d0 || ld (r1)+, d1
ld (r0)+, d0 || ld (r1)+, d1 || rep n
mac d0,d1,a4
mac d0,d1,a4

Figure 7: Assembly Code Example for Finite Impulse Response (FIR) Filter.

Executing the same algorithm on a traditional microcontroller architecture requires additional instructions and clock cycles for loop handling. Traditional microcontrollers do not provide a single Multiply and Accumulate (MAC) instruction and therefore require at least two instructions (multiply and accumulate) to calculate a filter tap. Single-issue microcontroller architectures also do not provide enough hardware resources to make use of software pipelining. Loop handling consists of setting the loop counter once, decrementing it with each loop iteration, and continuously evaluating the end-of-loop condition. If no explicit loop instruction is available, conditional branch instructions are required to implement the jump back to the loop start.

To implement loop constructs more efficiently, DSP core architectures provide zero-overhead loop instructions. The loop is invoked such that the loop length and the number of iterations are part of the loop instruction. The remaining loop handling, i.e. decrementing the loop counter and jumping back to the loop start in each iteration, is handled implicitly in hardware. The branch delays caused by regular branch instructions are thereby avoided.
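The behavior of the FIR kernel of Figure 7 can be sketched as follows. This is a behavioral model under the assumption of unit-stride post-increment addressing; the per-iteration counter and branch work, modeled here by the Python for-loop itself, is exactly what the hardware `rep n` instruction eliminates.

```python
# Behavioral sketch of the FIR kernel: per tap, two operand loads with
# post-increment addressing and one multiply-accumulate.
def fir_tap_loop(samples, coeffs, n):
    acc = 0
    r0, r1 = 0, 0                  # address registers
    for _ in range(n):             # 'rep n': handled implicitly in hardware
        d0 = samples[r0]; r0 += 1  # ld (r0)+, d0
        d1 = coeffs[r1]; r1 += 1   # ld (r1)+, d1
        acc += d0 * d1             # mac d0, d1, a4
    return acc

print(fir_tap_loop([1, 2, 3], [4, 5, 6], 3))  # 32
```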

2.6 Addressing Modes

This subsection introduces different addressing modes supported by DSP core architectures [116], some of which also exist in traditional microcontroller architectures.

2.6.1 Implied Addressing

With Implied Addressing the addresses of the source operands are implicitly coded in the instruction word. Examples can be found in older core architectures like the Lucent 16xx core family, where the multiplier source operands are located in two registers X and Y. Even though the assembler syntax for the multiplication contains explicitly named registers (X, Y, as illustrated in Figure 8), the multiplication does not allow different registers to be assigned.

P = X*Y

Figure 8: Example for Implied Addressing: Multiply Operation (Lucent 16xx).

Similar examples can be found in later DSP architectures like the Starcore SC140, where instructions like MAX2VIT use implied addressing [26]. Two register pairs are supported and selected by a mode bit. An example is illustrated in Figure 9.

MAX2VIT D2, D4

Figure 9: Example for Implied Addressing: MAX2VIT Instruction (Starcore SC140).

Implied addressing can be used to increase code density but restricts the use of the implied registers during register allocation for other instructions.

2.6.2 Immediate Data Addressing

Immediate Data Addressing is used for operations where the operand is part of the instruction word. Examples can be found in most core architectures supporting the preload of register values with immediate data, e.g. the move constant (movc) instruction of xDSPcore illustrated in Figure 10.

MOVC 27,D4

Figure 10: Example for Immediate Data Addressing: MOVC instruction (xDSPcore).

2.6.3 Memory Direct Addressing

Memory Direct Addressing is also called absolute addressing. The address where data has to be fetched from or stored to is part of the instruction word. The reachable address space is therefore limited by the available coding space in the instruction word or words, which is the main factor limiting the use of absolute addressing.


2.6.4 Register Direct Addressing

Register Direct Addressing is used for instructions which receive their operands from registers addressed as part of the instruction. The difference from implied addressing is that the registers are explicitly coded inside the instruction word, which allows different registers to be assigned to the same instruction during register allocation.

SUBF R1, R4

Figure 11: Example for Register Direct Addressing: Subtraction (TI C62x).

In Figure 11 an example from the TI C62x is illustrated [36]. The two-operand subtract instruction allows different source and destination operands to be assigned to the same subtraction instruction.

2.6.5 Register Indirect Addressing

Register Indirect Addressing and its variants, explained in the following subsections, are frequently used in algorithms executed on DSP cores. The core architectures provide registers which contain memory addresses and can be used for accessing memory entries. The memory addresses can be stored either in specialized address registers used only for this purpose or in general purpose registers, which can also be used for other operations. Large address spaces can be addressed with little coding effort, which has a significant influence on code density. The principle of register indirect addressing is introduced in Figure 12.

Figure 12: Principle of Register Indirect Addressing.

Register Indirect Pre/Post Addressing

The Register Indirect addressing mode can be used with a Pre/Post Addressing option as illustrated in Figure 13. In particular, algorithms executed on DSP architectures process blocks of data and therefore use consecutive addresses. The post-operation mode accesses a memory address and afterwards increments or decrements the address stored in the address register. The pre-operation mode accesses the data memory location with the already updated address.


Figure 13: Principle of Pre/Post Operation Mode.

The value for incrementing or decrementing the address located in the address registers can be one (or equal to the granularity of the addressed memory space) or for some core architectures a programmable offset.

LD (R0+), D0    ; pre-operation

LD (R0)+, D0    ; post-operation

Figure 14: Assembly Code Example for Pre/Post Increment Instructions (xDSPcore).

In many DSP core architectures the pre-operation requires an additional clock cycle for address calculation. xDSPcore supports both modes without requiring an additional clock cycle; the related assembly code example is given in Figure 14.
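The semantic difference between the two modes can be sketched as follows: both leave the address register in the same updated state, but they fetch from different locations. The function names are illustrative, not taken from any core's documentation.

```python
# Sketch of pre- versus post-modification register indirect addressing.
def load_post_increment(mem, addr_reg, step=1):
    """LD (R0)+, D0 : fetch at the current address, then update it."""
    value = mem[addr_reg]
    return value, addr_reg + step

def load_pre_increment(mem, addr_reg, step=1):
    """LD (R0+), D0 : update the address first, then fetch."""
    addr_reg += step
    return mem[addr_reg], addr_reg

mem = [10, 20, 30]
print(load_post_increment(mem, 0))  # (10, 1)
print(load_pre_increment(mem, 0))   # (20, 1)
```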

Register Indirect Addressing with Indexing

For Register Indirect Addressing with Indexing, the contents of two address registers are added and the result is used for addressing data memory locations. The difference from the pre/post modification addressing scheme introduced above is that neither register value is modified.

Two reasons favor this addressing mode: Register Indirect Addressing with Indexing allows the use of the same program code with different data sets. Between different data sets only the index register value has to be set to the start address of the new block of data.

The second reason is its use by compilers for communicating arguments to subroutines by passing data via the stack. One address register is assigned as stack frame pointer, so the subroutine does not have to know absolute addresses: the transferred arguments are located relative to the stack frame pointer.

Register Indirect Addressing with Modulo Arithmetic

Modulo arithmetic can be used for implementing circular buffers [171]. The data values, as illustrated in Figure 15 [78], are located at consecutive addresses in data memory. If the address pointer reaches the end of the circular buffer, specialized hardware circuits reset the pointer to the start address.


Figure 15: Principle of Using a Modulo Buffer for Address Generation (a buffer of size m starting at address begin, with end = begin + m).

This implicit boundary check reduces the effort for manual control of the buffer addressing. Separate modulo registers are supported to store the size of the chosen buffer.
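The implicit wrap-around can be sketched as a pointer update. This is a behavioral model assuming a buffer of size `size` starting at address `begin` (held in the modulo and base registers); no explicit compare appears in the "program", mirroring the hardware boundary check.

```python
# Sketch of modulo (circular buffer) address post-increment.
def modulo_post_increment(addr, begin, size, step=1):
    """Advance an address register, wrapping inside [begin, begin + size)."""
    return begin + (addr - begin + step) % size

begin, size = 100, 4              # buffer occupies addresses 100..103
addr = begin
visited = []
for _ in range(6):
    visited.append(addr)
    addr = modulo_post_increment(addr, begin, size)
print(visited)                    # [100, 101, 102, 103, 100, 101]
```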

Some commercially available core architectures support circular buffer addressing with a defined start address aligned to the size of the supported buffer, e.g. a circular buffer with the buffer size of 256 can start at the addresses 0, 256, 512 and so on. The drawback of this implementation is fragmented data memory.

To overcome the fragmentation problem, some core architectures like the SC140 [32] or Carmel [12] support a programmable size and start address for the circular buffer, which requires an additional base address register and an additional adder circuit for address calculation.

Register Indirect with Bit Reversal

The Register Indirect with Bit Reversal addressing mode is also called reverse-carry addressing. This addressing mode is used only during the execution of FFT algorithms, which have the drawback of taking either their input or their output values in scrambled order. To complicate matters further, the scrambling depends on the particular version of the FFT algorithm [103].

In Figure 16 [78] an example is illustrated. The lower bits of the generated addresses are mirrored and allow scrambling of the addresses as required by FFT algorithms [116].
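The mirroring of the low address bits can be sketched directly. The sketch assumes an N-point FFT whose addresses differ in the lowest log2(N) bits; the function name is illustrative.

```python
# Sketch of reverse-carry (bit-reversal) address generation.
def bit_reverse(index, bits):
    """Mirror the lowest 'bits' bits of an index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # shift the lowest bit in from the left
        index >>= 1
    return result

# Access order for an 8-point FFT (3 address bits):
print([bit_reverse(i, 3) for i in range(8)])  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Applying the function twice returns the original index, which is why the same mechanism serves both for scrambled inputs and scrambled outputs.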


Figure 16: Principle of the Bit Reversal Addressing Scheme.

2.6.6 Short Addressing Modes

Code density is a significant factor influencing the silicon area consumed by the core subsystem. Many of the addressing schemes described above use two instruction words to store immediate or offset values. To increase code density, DSP core architectures support instructions with small immediate values which can be coded into a single instruction word.

Short Immediate Data

One example is short immediate data addressing, where a constant that is part of the instruction word can be stored in a register. Restricting the data range of the constant allows a single instruction word to be used. Figure 17 illustrates an example supported by xDSPcore.

MOVC 0, d0

MOVCL 1234, d0

Figure 17: Assembly Code Example for Short Immediate Data (xDSPcore).

The short version of the instruction supports a data range for the constant of -32 ≤ constant < 32. For assigning constants exceeding this range, a second instruction with the same function is provided which uses an additional instruction word to store the immediate value.
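The selection between the two variants can be sketched as an assembler decision. This is a hypothetical model of how a tool might pick the encoding; the instruction names match Figure 17, but the selection logic itself is an illustrative assumption.

```python
# Sketch: choose the one-word MOVC if the constant fits the short range,
# otherwise the two-word MOVCL.
def select_move_constant(constant):
    """Return the instruction variant and its size in instruction words."""
    if -32 <= constant < 32:
        return ("MOVC", 1)   # constant fits into the instruction word
    return ("MOVCL", 2)      # extra instruction word holds the constant

print(select_move_constant(0))     # ('MOVC', 1)
print(select_move_constant(1234))  # ('MOVCL', 2)
```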

Short Memory-Direct Addressing

As mentioned before, the use of memory direct (absolute) addressing is limited by the required coding space. However, some core architectures support this addressing mode within one instruction word. It can then be used in combination with special features, as for example in the Motorola 56000 family [20][21], where I/O registers can be addressed this way. The small offset which fits into one instruction word (e.g. 6 bits in this example) is extended inside the core to a physical address in the 64 kbyte address space.


Paged Memory-Direct Addressing

This address scheme splits the available address range into pages. Once a page is set, a reduced coding space can be used to access addresses within it, allowing the short version of the addressing scheme to be used. The overhead of the paging mechanism is not negligible: changing the memory page requires additional instructions (influencing code density) and additional execution cycles, so the scheme only increases code density if the executed algorithm allows data to be mapped to pages.

2.7 Multiple Memory Banks

Traditional algorithms executed on DSP architectures are data-flow algorithms: data values describing a signal are fetched, processed by digital signal processing algorithms, and stored back into data memory.

The implementation of filter algorithms based on MAC instructions as illustrated in Figure 7 requires fetching two data values for the multiply operation. The summation of the multiplication results takes place in a local register. To obtain two independent data fetch operations, at least two independent memory ports are required.

Figure 18: Processor Architectures: von Neumann, Harvard, modified Dual-Harvard.

Figure 18 illustrates the principles of the von Neumann architecture with a combined data and program memory, the Harvard architecture where data and program memories are split, and the modified Dual-Harvard architecture with two independent data memory ports [96][113][149]. Some core architectures, for example Carmel [10], support fetching up to four independent data values, which can be used to increase the execution speed of filtering or FFT algorithms.

Some commercial DSP cores like the SC140 [32] from Starcore LLC feature a single address space for data and program memory, which eases the transfer of data between them. Others, including xDSPcore, feature separate address spaces.

The X/Y memory splitting as used in the OAKDSP [29] is well suited if the two fetched operands are located in the two different memory spaces (e.g. in the example in Figure 7). If the fetched operands are located in the same address space, the memory operations have to be serialized, which leads to reduced system performance.


The Starcore SC140 [32] features interleaved addressing, which can be used to reduce the probability of memory hazards. The memory mapping is illustrated in Figure 19. The concept exploits an implementation aspect: the performance of memory operations is limited, especially for large memory blocks, so the memory is physically split into small blocks which reach higher clock frequencies.

Figure 19: Example for Interleaved Memory Addressing (SC 140).

The small memory blocks can be accessed separately as illustrated in Figure 19 where 4k physical memory blocks are supported.
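The bank-selection idea can be sketched as a mapping from linear addresses to physical blocks. The bank count and block size here are illustrative assumptions (two banks interleaved at the 4 KB granularity mentioned above), not the documented SC140 mapping.

```python
# Sketch of interleaved bank selection: consecutive 4 KB regions of the
# linear address space map to alternating physical memory blocks, so
# accesses to adjacent regions can proceed in parallel.
BLOCK_SIZE = 4 * 1024
NUM_BANKS = 2

def bank_of(address):
    """Physical bank holding a given linear address."""
    return (address // BLOCK_SIZE) % NUM_BANKS

# Two operand streams 4 KB apart land in different banks: no memory hazard.
print(bank_of(0x0000), bank_of(0x1000))  # 0 1
```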

2.8 CISC Instruction Sets

CISC instructions are built up of several micro-instructions. The Multiply and Accumulate (MAC) instruction [97] introduced in Figure 20 serves as an example: two data values are required for the multiplication.

A third operand is required for the accumulation with the multiplication result. The result of the accumulation is stored in an accumulator register, the same register used as the third source operand. The example in Figure 20 also illustrates the additional left shift required for multiplying fractional data values.


Figure 20: Example for CISC Instructions: Multiply and Accumulate (MAC).

If the core architecture is load/store based, the operands are fetched from a register file. For direct-memory architectures the operands have to be fetched from data memory, which requires memory addressing to be coded in the instruction word and thus increases the complexity of the MAC instruction. Some core architectures include rounding and saturation logic as part of the MAC instruction.

Driven by compiler requirements modern DSP core architectures feature RISC instruction sets with some CISC extensions for increasing code density and performance during the execution of filtering and FFT algorithms.

2.9 Orthogonality

In the definition of a DSP core at the beginning of this chapter, one of the major arguments was that orthogonality is sacrificed for the sake of increased efficiency in terms of silicon area and power dissipation. Nevertheless, the feature orthogonality is mentioned at least once in white papers, product briefs and even technical documentation of DSP vendors, often in combination with the instruction set or core architecture [14][35].

In [116] orthogonality is defined as the extent "to which a processor's instruction set is consistent". It is also mentioned that orthogonality is not easily measured. Besides instruction set consistency, the degree to which operands and addressing modes are uniformly available for all operations is used as a measure of orthogonality.

Examples of missing orthogonality can be found in existing core architectures, e.g. the address registers of the Motorola 56k family [20][21], which are banked: four of the eight registers are assigned to one Address Generation Unit (AGU) and the remaining four to the second AGU, which limits register allocation and instruction scheduling. The SC140 [32] does not allow the higher eight address registers to be used during modulo addressing because they are then used as base registers.

In the following, some further examples of missing orthogonality are given.

Reduced Number of Operations

Reducing the number of instructions relaxes the pressure on coding the instruction set; an example is the rotate instruction missing from the Lucent 16xx architecture [19].

Reduced Number of Addressing Modes

Providing all of the known addressing modes for all instructions demands a lot of coding effort inside the instruction set. To increase code density, core architectures support only a subset of the addressing modes and restrict their use to a group of instructions.

Reduced Number of Source/Destination Operands

Allowing orthogonal use of all registers by each instruction requires a large instruction space and decreases code density. Therefore core architects limit the use of some registers to specialized functions, for example the MAC2VIT instruction of the SC140 [32].

Use of Mode Bits

Most commercial DSP cores make use of mode bits. Depending on the mode indicated by the mode bits, the meaning of an instruction changes. Mode bits are often used for specialized addressing modes or for saturation and rounding modes. The advantage of increased code density can be outweighed by limitations during register allocation and instruction scheduling. In a later section this problem is discussed in detail.

2.10 Real-Time Requirements Real-time requirements are the last aspect where digital signal processing algorithms have specific requirements and influence the architecture of digital signal processors. Analyzing micro-architectural improvements in microcontrollers over recent years, it is apparent that most of the improvements have taken place in cache structures. Cache structures are well suited to reducing the average execution time of an algorithm.

A similar phenomenon has taken place in DSP cores, for example the SC140 of StarCore LLC [32] or the Blackfin from Analog Devices and Intel [8], where cache structures have been introduced for data and program memory. The drawback of introducing caches is that a strong requirement of real-time applications is lost: minimizing the worst-case execution time.

The purpose of Worst Case Execution Time (WCET) analysis is to determine a priori the worst-case execution time of a piece of code. WCET is used in real-time and embedded systems to schedule tasks, to determine whether performance goals for periodic events are met, and to analyze, for example, interrupts and their response times [80]. The main influences on execution time come from program flow aspects like loop iterations and function calls and from architectural features like pipeline structures and cache architectures [80].
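The structural part of such an analysis can be reduced to a very small sketch. Assuming (hypothetically) that the code has already been decomposed into blocks with known worst-case cycle counts and known loop bounds, a WCET bound is just a weighted sum:

```python
def wcet(blocks):
    """Upper bound on execution time from a flow decomposition.

    blocks: list of (worst_case_cycles, max_executions) pairs, where
    max_executions comes from loop-bound analysis of the program flow.
    """
    return sum(cycles * count for cycles, count in blocks)

# Hypothetical three-block program: prologue (once), loop body
# (bounded at 64 iterations), epilogue (once):
bound = wcet([(5, 1), (4, 64), (3, 1)])   # 5 + 256 + 3 = 264 cycles
```

The hard part, as the text explains, is obtaining trustworthy per-block cycle counts in the presence of caches, pipelines and branch prediction; this sketch assumes those numbers are given.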

In research, several algorithms and tools for analyzing the WCET of application code have been introduced [81][91][93]. The program flow analysis for this purpose can be split into a global low-level analysis and a local low-level analysis.

The global low-level analysis considers the effect of architectural features like data caches [111][170], instruction cache structures [47][83][94][124][154] and branch prediction [65]. These analyses determine only global effects but do not generate any actual execution time values.

The local low-level analysis handles effects caused by single instructions and their neighboring instructions, for example pipeline effects [79][146][155] and the influence of memory accesses on the execution time. The influence of caches on the WCET is significant, as discussed in [50][119][127][135][168].

If a core architecture supports instructions whose latency depends on the input values, calculating the WCET becomes more complicated; an example is the multiplication instruction of ARM [5], whose execution time can differ between 1 and 4 clock cycles. The multiplication of the PowerPC 603 [30][57][58] can even consume between 2 and 6 clock cycles depending on the source operands. In the Alpha 21604 [16][17] the execution time of a software division algorithm differs between 16 and 144 clock cycles, a ratio of 1:9.

In [144] the contributions of different architectural features to the variation in execution time, and therefore to the uncertainty in WCET analysis, are illustrated. The largest impact arises from Translation Lookaside Buffer (TLB) accesses, followed by data and instruction caches. The influence of instruction execution compared with these dominating aspects is negligible [74].

To summarize, caches and prediction algorithms are counterproductive to fulfilling real-time requirements, i.e. to minimizing the worst-case execution time. The requirements for developing an optimizing compiler are similar: simple issue rules and architectures with few restrictions are preferred, as they allow more accurate results.

3 DSP cores This section starts with an introduction of the design space of DSP core architectures and the main parameters influencing their design, and illustrates the limiting parameters which cause the gap between theoretical and practical performance. The second part introduces some architectural alternatives and discusses their advantages and disadvantages. The third part describes commercially available core architectures, from cores of the early 1990s up to the latest announcements. The chapter ends with a brief introduction of xDSPcore.

3.1 Design Space This section introduces the possible design space for RISC-based core architectures; today most DSP core architectures are RISC-based load-store architectures. The trade-offs between the main architectural features regarding silicon area, performance and power dissipation are briefly illustrated by some examples. The design space of xDSPcore and the possibilities to influence these parameters by configuration settings can be found in [P10][P11].

The purpose of this section is to illustrate the complexity of choosing the "best core" and to show that there is no general solution [13]. A DSP core is well suited when it solves an application-specific problem efficiently in terms of consumed silicon area and power consumption. However, it also has to be considered that the overall application partitioning has a significant influence on the costs, and that the costs of a product are not caused by silicon production and packaging alone: software development, maintenance and portability contribute significantly to the costs of SoC and SiP solutions.

3.1.1 Silicon Area This subsection introduces the main contributors to the silicon consumption of a core subsystem, with special focus on DSP architectures. Besides the core and the memory subsystem, the instruction set architecture (ISA) and its influence is then chosen as an example to illustrate the complexity and the mutual influence of these aspects.

Core

Increasing system complexity has led to large programs being executed on core architectures, so the contribution of the core area to the die area of the core subsystem has become less significant. Nevertheless, this key number is still taken as a decision point for choosing a particular core. With the increasing complexity of modern silicon technologies such a comparison has become even more difficult: performance figures, for example the core area in mm², require additional information like the chosen technology, silicon foundry, number of metal layers, temperature range and supply voltage.

Memory Subsystem

The increasing size of programs executed on embedded core architectures increases the importance of the memory subsystem's contribution to area. Therefore the importance of code density, with its influence on the program memory area, has increased. In the following item the instruction set is taken as an example to illustrate the influence on core and memory subsystems.

Instruction Set

The instruction set of a core architecture can be split into two aspects: the instruction set architecture and the related binary coding. The instruction set is taken as an example to illustrate the cross coupling of different subsystem features. Further examples can be found in the design space discussion of [P10] and [P11].

The instruction set architecture mirrors the functionality supported by the core architecture: for example, the support of two- or three-operand instructions, features like addressing modes and saturation modes, or complex instructions like division.

Instructions and the related binary coding are necessary to program the available units. The mapping of the instruction set architecture to instructions must consider micro-architectural aspects. If the native instruction word size is 16 bits and the ISA requires three-operand instructions with 4-bit register coding per operand, it will be difficult to map the ISA onto the native instruction word. In this case it is necessary to map the three-operand instructions onto two instruction words or to increase the size of the native instruction word.
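The bit budget behind this trade-off can be made explicit with a small sketch (the numbers below are the example values from the text, not a specific core):

```python
def opcode_bits(word_bits: int, operands: int, reg_bits: int) -> int:
    """Bits left for the opcode after coding the register operands."""
    return word_bits - operands * reg_bits

# Three 4-bit register operands in a 16-bit word leave only 4 opcode
# bits, i.e. at most 16 distinct three-operand instructions:
assert opcode_bits(16, 3, 4) == 4
# A 20-bit native word leaves 8 opcode bits (256 encodings) for the
# same ISA, at the price of wider program memory:
assert opcode_bits(20, 3, 4) == 8
```

This is why a rich three-operand ISA either forces long-word (double instruction word) encodings or a wider native instruction word.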

Figure 21: Influence of Binary Coding on Application Code Density (using the same ISA).

In Figure 21 the influence of the chosen binary coding is illustrated. The same ISA is mapped once onto 16-bit wide instruction words and once onto 20-bit wide instruction words. To illustrate the influence on code density, a piece of traditional control code is used, for example some PC benchmarks [18]. The results in Figure 21 show that the shorter native instruction word requires an increased number of long words, which are simply additional instruction words for the identical instruction. This is plausible because the coding space for immediate values and offsets is reduced in 16-bit wide native instructions. The overall code density for this example, normalized in bytes, is nevertheless improved by 16 % when using the 16-bit native instruction words; however, the result will differ for other application code examples.

3.1.2 Performance The performance of DSP cores is measured in Million Instructions Per Second (MIPS) or Million Operations Per Second (MOPS) [25]. MOPS was introduced when multi-issue core architectures appeared on the market. These numbers are calculated by multiplying the reachable core frequency by the number of instructions executable in parallel. This led to announcements like the Texas Instruments TI C64x [39] with 8 GOPS (eight parallel executed instructions multiplied by a 1 GHz clock frequency).

Berkeley Design Technology Inc. (BDTi) introduced the so-called BDTi benchmark suite, containing a dozen algorithmic examples. Most of these are based on small loop-centric kernels for filtering and vector operations; further examples include an FFT, a Viterbi implementation and a control code example. Certain coding requirements restrict the implementation in order to simplify comparison between different core architectures. These small kernels are often not representative of application code executed on DSP cores.

Another possibility to measure performance is counting the million MAC instructions per second (MMACs). During the execution of control code, for example, the number of executable MAC instructions does not significantly influence performance, and micro-architectural limitations (e.g. as illustrated in Figure 24) reduce the accuracy of this performance factor for a mixture of DSP and control code.

Theoretical versus Practical Performance

The example of Texas Instruments can be used to illustrate the term theoretical performance: the theoretical performance of the 1 GHz TI C62x is 8 GOPS [37]; another example can be found in [133]. The practical performance is a measure of how efficiently a certain algorithm can make use of the resources provided by the core architecture. Some of the factors limiting the reachable practical performance are introduced in this sub-section to illustrate the gap between theoretical and practical performance.

Define-in-use Dependency

One way to increase the number of MIPS and MOPS is to increase the reachable clock frequency of the core architecture. The clock frequency can be increased at the technological level by smaller feature sizes and at the architectural level by increasing the number of pipeline stages. This leads to super-pipelined architectures with 10 pipeline stages and more. Increasing the number of pipeline stages in the execution phase increases the define-in-use dependency [149], as illustrated in Figure 22. The five-stage pipeline of Figure 22 supports split execution, where two clock cycles are used for a calculation, e.g. one MAC instruction. The operands are read at the beginning of EX1 and the result is written at the end of stage EX2. Filtering operations, for example the FIR filter as illustrated in Figure 7, require consecutive MAC instructions for a cycle-efficient implementation. Because the result of the first MAC instruction is a source operand of the second MAC instruction, a NOP cycle is required to prevent data hazards [149]. For the core architecture of Figure 22, which features a lean pipeline structure, the additional NOP cycle is reasonable. However, the TI C62x [36] provides an eleven-stage pipeline containing 5 execution stages, where the define-in-use dependency increases significantly. A method to compensate this problem is bypass circuits (bypassing intermediate results to the next instruction).

Figure 22: Principle of Define-in-use Dependency.

The xDSPcore architecture, which utilizes the pipeline structure of Figure 22, allows fetching the accumulator operand of the MAC instruction at the beginning of EX2 (as illustrated in Figure 58), which compensates the define-in-use dependency during the execution of filter operations.

This example illustrates that increasing the reachable clock frequency by adding pipeline stages increases the theoretical performance (by relaxing the critical implementation path and thus reaching a higher clock frequency), but data and control dependencies in the application code can limit the increase in practical performance.
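The NOP insertion forced by the define-in-use dependency can be modeled with a toy scheduler. This is a simplified sketch, not the xDSPcore scheduler: each operation names one destination and one source register, and `latency` is the number of cycles from issue until the result is available (2 for the split EX1/EX2 execution described above):

```python
def schedule_with_nops(ops, latency=2):
    """Insert NOPs so an op that reads a result waits until it is written.

    ops: list of (dest, src) register-name pairs, in program order.
    """
    ready = {}          # register -> cycle its value becomes available
    cycle = 0
    schedule = []
    for dest, src in ops:
        while ready.get(src, 0) > cycle:
            schedule.append("NOP")      # stall: source not yet written
            cycle += 1
        schedule.append(f"MAC {dest}, {src}")
        ready[dest] = cycle + latency
        cycle += 1
    return schedule

# Two dependent MACs need one interleaved NOP with a 2-cycle latency:
sched = schedule_with_nops([("acc", "x"), ("acc2", "acc")])
# -> ["MAC acc, x", "NOP", "MAC acc2, acc"]
```

With a deeper execution phase (larger `latency`), the same dependence chain accumulates proportionally more NOPs, which is exactly the gap between theoretical and practical performance discussed above.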

Load-in-use Dependency

A similar problem is the load-in-use dependency [149], illustrated in Figure 23. To relax the timing at the data memory ports, additional pipeline stages are introduced for memory access. The execution of instructions that depend on the fetched data entries has to be delayed until the memory access has finished.

Figure 23: Principle of Load-in-use Dependency.

In contrast to the define-in-use dependency, bypass circuits cannot be used to even partly compensate this data dependency. The load-in-use dependency can cause a significant mismatch between theoretical and practical performance, especially during the execution of control code featuring short branch distances.

Data Memory Bandwidth

The data memory bandwidth of a core architecture is characterized by the number of load/store instructions executed in parallel, the size of the data memory ports and the structure of the access. The structure of the access covers alignment requirements at the data memory port and the number of independent addresses which can be generated and accessed each clock cycle. To prevent the practical performance falling short of the theoretical performance, the necessary operands for each of the executed instructions have to be provided.

Figure 24: Example for Data Memory Bandwidth Limitations (Starcore SC140).

The core architecture illustrated in Figure 24 allows execution of up to four MAC instructions in parallel (e.g. the SC140 [32]), which can be used to increase performance during the execution of filter algorithms. However, each of the four MAC instructions requires two operands every clock cycle. The example in Figure 24 enables fetching of two independent data values from data memory each cycle. Assuming 16-bit wide operands for the MAC instructions, fetching two times 64 bits provides sufficient memory bandwidth for executing the four MAC instructions in parallel. The structure in Figure 24, however, illustrates a limitation on storing data in data memory: the data has to be placed so that the operands for all four MACs can be fetched in parallel by addressing only two independent data entries. This limitation can require a large number of operations to position the data according to the required scheme, which is normally not accounted for in benchmark results, e.g. [9][28].
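The bandwidth argument is simple arithmetic and can be checked explicitly. The sketch below uses the figures from the SC140 example (two 64-bit ports, four MACs, 16-bit operands); the function itself is a generic, hypothetical helper:

```python
def bandwidth_sufficient(macs: int, operand_bits: int,
                         ports: int, port_bits: int) -> bool:
    """Can `ports` memory ports of `port_bits` each feed `macs`
    parallel MAC units needing two operands per cycle?"""
    needed = macs * 2 * operand_bits     # bits required per cycle
    available = ports * port_bits        # bits fetchable per cycle
    return available >= needed

# Two 64-bit fetches feed four MACs with 16-bit operands
# (4 * 2 * 16 = 128 bits needed, 2 * 64 = 128 bits available):
assert bandwidth_sufficient(4, 16, 2, 64)
# ...but not four MACs with 32-bit operands (256 bits needed):
assert not bandwidth_sufficient(4, 32, 2, 64)
```

Note that this raw check ignores the alignment constraint described above: even when the bit count suffices, the eight operands must sit in only two independently addressable entries.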

Program Memory Port

The program memory port is used to fetch instructions from program memory. Multi-way VLIW architectures require a large number of instruction words to program the available parallel resources. An example is the Texas Instruments TI C6xx family [36], which requires a program memory port width of 256 bits, demanding significant wiring effort. In combination with the poor code density of the TI C6xx family, its usage in area- and power-critical applications is not recommended. Therefore core architects have introduced architectural features to prevent large program memory ports.

Providing a small program memory port requires less wiring to the program memory but leads to poor usage of the available parallel units. During the execution of control code this limitation is reasonable, because data and control dependencies limit the average ILP to 2-3, as illustrated in [106][112][115][159]. Loop-centric algorithms, often used for typical DSP functions, can exploit more parallelism, so the peak performance of the core architecture would be limited by the reduced size of the program memory port. To increase the peak performance of core architectures, extended program memory ports have been introduced [157].

Branch Delays

Branch delays are unusable execution cycles caused by taken conditional branch instructions. Increasing the clock frequency by increasing the number of pipeline stages increases the number of branch delays and therefore decreases the practical performance. Compared with single-issue microcontroller cores, this is further deteriorated when executing control code with short branch distances on multi-way VLIW architectures. Branch prediction circuits as introduced in [31][44][173] can be used to reduce the number of branch delays, but the drawback of prediction circuits has already been pointed out in section 2.10.

An alternative to compensate branch delays is to avoid branch instructions by making use of predicated or conditional execution. Benchmark results in [132] illustrate that predicated execution can remove about 30% of conditional branch instructions. The chosen implementation of predicated execution influences the practical performance. For example, the implementation used in the SC140 [32], with only one flag and few conditions, can lead to poor usage of the resources in control code sections and several unused execution cycles. This limitation is caused by restrictions during instruction scheduling: scheduling of instructions between generation and evaluation of the status information is not allowed. For multi-way VLIW architectures featuring deep pipelines, the gap between theoretical and practical performance can increase significantly.
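The idea behind predicated execution, replacing a conditional branch by a flag-controlled select, can be illustrated with a branch-free select in Python. This is a didactic sketch of the principle, not the SC140 mechanism; the masking trick assumes a two's-complement view of the predicate:

```python
def select(pred: bool, a: int, b: int) -> int:
    """Branch-free select: return a if pred else b, via masking.

    The predicate is turned into an all-ones or all-zeros mask, the
    way a predicated machine gates results by a condition flag instead
    of branching around the computation.
    """
    m = -int(pred)             # True -> ...111 mask, False -> 0
    return (a & m) | (b & ~m)  # both operands computed, one selected

assert select(True, 7, 3) == 7
assert select(False, 7, 3) == 3
```

Both paths are always evaluated, so on a real core this pays off only when the eliminated branch delays outweigh the wasted execution slots, which is exactly the trade-off discussed above.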

3.1.3 Power Consumption The power consumption of a core subsystem is influenced by a number of factors, namely the core architecture itself, the memory subsystem, the execution frequency and the executed algorithms, which influence the traffic on the data memory bus [139][143]. This section considers power consumption aspects where embedded core architectures can contribute to reducing power dissipation; technology aspects are not considered in detail.

Power dissipation in CMOS circuits is mainly caused by three sources: leakage current, short-circuit current, and the charging and discharging of capacitive loads during logic changes.

P = Pleak + Pshort + Pdynamic   [1]

Leakage current is primarily determined by the fabrication technology and the circuit area. Short-circuit currents can be avoided by careful design [59][62][72][158][161], and the same holds for leakage [70][73][85][104][141][165].

Three degrees of freedom are an inherent part of the low-power design space: voltage, physical capacitance and data activity. These factors are briefly discussed in this sub-section. Equation 2 contains the factors which mainly influence dynamic power consumption [110].

Psw = 1/2 · V² · Σi (Ci · Di / Ti)   [2]

where Ci is the physical capacitance of node i, Di its data activity and Ti its data period.
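Assuming equation 2 takes this form (reconstructed here from the factors named in the text: voltage, physical capacitance, data activity and data period), a quick numerical sketch shows the quadratic voltage dependence; all numbers are invented example values:

```python
def dynamic_power(v: float, nodes) -> float:
    """Psw = 1/2 * V^2 * sum(C_i * D_i / T_i) over all switching nodes.

    nodes: iterable of (capacitance_F, activity, data_period_s) triples.
    """
    return 0.5 * v ** 2 * sum(c * d / t for c, d, t in nodes)

# One node: 10 pF, activity 0.25, 10 ns data period, at 1.2 V supply:
p12 = dynamic_power(1.2, [(10e-12, 0.25, 10e-9)])   # about 0.18 mW
# Halving the supply voltage quarters the dynamic power:
p06 = dynamic_power(0.6, [(10e-12, 0.25, 10e-9)])
```

The ratio p12 / p06 is exactly 4, which is why voltage scaling is singled out below as the most effective lever.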

Voltage

The quadratic relation between voltage and power dissipation makes this parameter an effective lever for reducing power dissipation. However, voltage scaling influences not only one part of an SoC solution; system aspects have to be considered carefully, as a speed penalty is evident with decreasing supply voltage [56][107][131][156].

In [61] an architecture-driven voltage scaling strategy is presented, where pipelined and parallel architectures are used to compensate the throughput problem caused by the reduced supply voltage. A different approach is illustrated in [175].

Another possibility to compensate the speed decrease caused by reduced supply voltage is to decrease Vt. This is limited by constraints of noise margins and by the need to control the increase of sub-threshold leakage current. Dual-Vt techniques such as those introduced in [107] require multi-threshold CMOS transistors (MTCMOS) which have to be supported by the target technology.

Physical Capacitance

Dynamic power consumption depends linearly on the switched physical capacitance. Therefore, besides reducing the supply voltage, a reduction of the capacitance can be used to reduce power dissipation. The physical capacitance can be reduced by using less logic, smaller devices and shorter wires.

On the other hand, as already mentioned for voltage scaling, it is not possible to minimize one parameter without influencing others; for example, reducing the device size reduces the current drive of the transistors, in turn resulting in slower operating speed of the circuits.

Switching Activity

Reducing the switching activity also linearly influences the dynamic power dissipation. A circuit containing a large amount of physical capacitance shows no dynamic power dissipation when there is no switching.

However, calculating the switching activity is not simple, because the switching consists of spurious activity and functional activity. In certain circuits like adders and multipliers [52] the spurious activity can dominate.

Combining data activity with physical capacitance leads to the switched capacitance, describing the average capacitance charged during each data period.

Summary

The design space for low-power design is mainly influenced by the following parameters: supply voltage, capacitance and switching activity, which are cross-related to each other and also have influence on static power dissipation. For an embedded DSP core the design space is even more limited, because aspects like voltage scaling or dual-Vt techniques are system or technology aspects and thus cannot be influenced by the core architecture itself. The DSP core architecture introduced in this thesis supports architectural features [P5] and compiler-related aspects [99] for reducing switching activity. Implementation aspects to reduce capacitance are addressed by making use of manual full-custom design [P4].

3.2 Architectural Alternatives DSP cores are processors that provide specific features for efficient implementation of digital signal processing algorithms, as illustrated in section 2. Each core architecture aims to solve specific problems; an efficient architecture meets the requirements of the executed algorithms. Meeting the requirements can be subsumed in the key features: area consumption, which drives costs; low power dissipation, which increases battery lifetime or allows higher integration density; and system development costs, which are mainly dominated by software development costs. In this section some architectural alternatives used in current DSP core architectures are briefly introduced.

The solution space is multi-dimensional, and the different parameters have a mutually coupled influence upon it. More details concerning the available design space for DSP core architectures and a methodology for finding the best solution to a certain application-specific problem can be found in [P10][P11][13].

3.2.1 Single Issue versus Multi Issue Single-issue architectures invoke only one instruction each execution cycle. This concept is well established for microcontroller architectures, for example ARM microcontrollers. The problem of efficient instruction scheduling is simplified to a linear problem, and programming a single-issue core is straightforward. Control code, typically executed on microcontroller architectures, is linear code with many dependencies; therefore executing more than one instruction per execution bundle (instructions executed during the same clock cycle) does not significantly increase the performance of the core architecture. To increase the performance of these core architectures, more complex instructions can be used [20][29].

Figure 25: Architectural Alternatives: Issue Rates for Available DSP Core Architectures.

DSP algorithms are loop-centric algorithms where a significant amount of execution time is spent in loop iterations. Therefore increasing the performance during the execution of the loop bodies significantly increases the performance of the core architecture. Software pipelining and loop unrolling, as introduced in a later section, allow the execution of several instructions in parallel to increase system performance.

In Figure 25 the issue rate of available DSP core architectures over time is illustrated. While most core architectures of the 1980s allowed the execution of one instruction per cycle, only 10 years later up to 8 instructions could be executed in parallel. Core architects have increased the number of instructions executed in parallel to increase the relative performance of their core architectures.

There are also other aspects to consider, for example Instruction Level Parallelism (ILP). The average ILP indicates the average number of instructions executed in parallel. The ILP is limited by the core resources (the issue rate) and by data and control dependencies in the executed algorithm. Due to its issue rate, a single-issue core cannot reach an average ILP of more than one. Increasing the possible number of instructions executed in parallel will not increase the average ILP when executing an algorithm primarily consisting of control code. For loop-centric algorithms, the increased parallelism can be used to increase core performance.

It is nearly impossible to manually develop code for a multi-issue DSP core architecture with deep pipelines and the related dependencies; therefore the use of high-level language compilers like a C compiler is required.

3.2.2 VLIW versus Superscalar Scalar and superscalar architectures are common for microcontrollers. Scalar processors support the execution of one instruction per cycle, which limits the attainable performance. Superscalar processor architectures overcome this problem by supporting the execution of several instructions in parallel, where the resolving of dependencies in the executed application code is done by hardware circuits. Issue queues [149], scoreboards [149] and highly sophisticated branch prediction circuits [153] take care of making efficient use of the core resources. The programming model is based on dynamic scheduling, i.e. the execution order of instructions is defined during run-time based on dependency analysis [148]. Superscalar architectures allow a change of the program execution order as long as dependencies are considered, which minimizes the average execution time.

The Very Long Instruction Word (VLIW) programming model is based on static scheduling. Dependencies in the application code are already resolved during compile time, and the execution order of instructions is not changed during run-time; hardware circuits for dependency resolution are not supported in VLIW architectures. The advantage is reduced core complexity, which simplifies hardware development. Using caches with VLIW architectures leads to penalty cycles during cache misses, which cannot be used to execute different code sections. One possibility to overcome this limitation is multithreading, with the drawback of increased core complexity. Static scheduling allows the minimization of the worst-case execution time, which is required for algorithms with real-time requirements. Developing a C compiler for VLIW architectures is more complex because of the dependency analysis and sophisticated instruction scheduling algorithms [45][166].

Most of the latest DSP architectures are based on the VLIW programming model, driven by the real-time requirements of the algorithms executed on them. To overcome the drawback of traditional VLIW, namely poor code density, enhanced implementations like Variable Length Execution Set (VLES) [36], Configurable Long Instruction Word (CLIW) [157] and scalable Long Instruction Word (xLIW) [P2] are used in existing core architectures.

3.2.3 Deep Pipeline versus Lean Pipeline

Pipelines were already introduced in supercomputers in the 1960s. The motivation for pipelining is to increase instruction throughput through better utilization of hardware resources. This is achieved by splitting operations into sub-operations and issuing new sub-operations as early as possible; the split into sub-operations also allows higher clock frequencies to be reached. In Figure 39 the concept of SW-pipelining is illustrated for four operations; the main concept is the same for hardware and software pipelines.
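The throughput gain can be sketched with a simple cycle-count model (idealized: no stalls, equal-length sub-operations, one new operation issued per cycle once the pipeline is full):

```python
# Cycle counts with and without pipelining, assuming each operation is
# split into `stages` equal sub-operations.
def unpipelined_cycles(n_ops, stages):
    return n_ops * stages          # each op occupies the hardware fully

def pipelined_cycles(n_ops, stages):
    return stages + n_ops - 1      # fill latency, then one result per cycle

# Four operations on a four-stage pipeline, as in the figure:
print(unpipelined_cycles(4, 4), pipelined_cycles(4, 4))  # 16 7
```

For long instruction streams the pipelined cycle count approaches one per operation, which is the usage increase the text refers to; the later discussion of dependencies shows why this ideal is rarely reached.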

The pipeline structures used in DSPs are CISC and RISC pipelines. Direct memory architectures are based on CISC pipelines: besides the fetch, decode and execute stages typical for RISC pipelines, the memory operation for fetching operands from data memory requires additional pipeline stages, which are inserted after the decode stage. In Figure 26 the two types of pipelines are illustrated. In RISC pipelines separate instructions are used to fetch data from data memory, and these instructions use the same pipeline structure as the arithmetic instructions. In CISC pipelines the memory operations are part of the arithmetic instructions.
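The two pipeline shapes can be sketched as stage lists; the stage names are illustrative rather than taken from any specific core:

```python
# RISC pipeline: memory accesses are separate load/store instructions
# that flow through the very same stages as arithmetic instructions.
risc_pipeline = ["fetch", "decode", "execute", "write-back"]

# CISC pipeline: the operand fetch from data memory is part of the
# arithmetic instruction, so an extra stage sits directly after decode.
cisc_pipeline = ["fetch", "decode", "operand fetch", "execute", "write-back"]

extra = [s for s in cisc_pipeline if s not in risc_pipeline]
print(extra)  # ['operand fetch']
```

The placement of the extra stage after decode matters: the address of the operand is only known once the instruction has been decoded.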

Figure 26: Architectural Alternatives: RISC versus CISC Pipeline.

Besides the chosen pipeline structure (which is mainly influenced by the general architectural concept) the number of clock cycles used to implement the pipeline is an important performance aspect.

To increase the computational power of core architectures, operations are split into small sub-operations, so that several clock cycles are used to execute one "natural" pipeline stage; this leads to super-pipelined architectures. Dependencies between pipeline stages (a more detailed discussion can be found in the Design Space section) can lead to poor usage of the available hardware resources.

Figure 27: Architectural Alternatives: Pipeline Depth of Available DSP Cores.

In the worst case the increased clock frequency produces high power dissipation while data and control dependencies reduce system performance. In Figure 27 the pipeline depth of available core architectures is illustrated. To overcome the data and control dependencies caused by deep pipeline structures, bypass and branch prediction circuits are introduced.
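A toy cycle-count model makes the trade-off visible; the hazard count and the assumption that the stall penalty grows with pipeline depth are illustrative values, not measurements:

```python
# Total cycles for a block of instructions on a pipelined core:
# fill latency + one cycle per instruction + stalls for each hazard,
# where the stall penalty is assumed to grow with pipeline depth.
def total_cycles(n_instr, n_hazards, depth):
    stall = depth // 4             # assumption: deeper pipeline, longer stall
    return (depth - 1) + n_instr + n_hazards * stall

print(total_cycles(100, 20, 4), total_cycles(100, 20, 12))  # 123 171
```

Tripling the depth here raises the cycle count from 123 to 171; if the clock frequency does not scale by at least that ratio (or if hazards are more frequent, as in control-dominated code), the deeper pipeline delivers less system performance at higher power dissipation.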

3.2.4 Direct Memory versus Load-Store

In load-store architectures separate instructions are used to transfer data between data memory and register file. In direct-memory architectures the data transfer is coded inside the arithmetic instruction. For load-store architectures the register file plays a central role and is therefore traditionally located between data memory and execution units, whereas for direct memory architectures the register file is placed in parallel to the execution units, where intermediate results can be stored. The difference between load-store and direct memory architecture is illustrated in Figure 28.

Figure 28: Architectural Alternatives: Direct Memory versus Load-Store.

Assuming traditional DSP algorithms, e.g. filtering, the direct memory architecture increases code density by using fewer instruction words, since the load/store operations are already included in the coding of the arithmetic instructions. However, the coding space for the data transfer has to be provided inside the instruction word, leading to more complex and wider instruction words, for example the 24-bit instruction word of Carmel [11]. For code sections which cannot make use of the more complex instructions, code density decreases: control code executed with CISC instructions makes poor use of the binary coding. With less complex instructions the application code requires more instruction words per operation, but their greater flexibility can be used to increase code density at the application level.
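The trade-off can be illustrated with a toy bit count; the 24-bit and 16-bit word widths are borrowed from the text, while the instruction sequences and mnemonics are hypothetical:

```python
# One FIR tap and one piece of control code under both encodings.
WORD_CISC, WORD_RISC = 24, 16                 # bits per instruction word

fir_tap_cisc = ["mac (a0)+,(a1)+,acc"]        # operand fetches folded in
fir_tap_risc = ["ld (a0)+,x", "ld (a1)+,y", "mac x,y,acc"]

ctrl_cisc = ["mov r0,r1"]                     # plain move still costs a wide word
ctrl_risc = ["mov r0,r1"]

print(len(fir_tap_cisc) * WORD_CISC, len(fir_tap_risc) * WORD_RISC)  # 24 48
print(len(ctrl_cisc) * WORD_CISC, len(ctrl_risc) * WORD_RISC)        # 24 16
```

The DSP kernel is twice as dense under the CISC encoding, while the control code wastes a third of each wide instruction word; which effect dominates depends on the instruction mix of the application.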

3.2.5 Mode Register versus Instruction Coding

Memory dominates the area consumption of embedded DSP subsystems. Code density reflects how efficiently a certain algorithm can make use of the provided core resources, the related instruction set architecture and the instruction coding. High code density reduces the required program memory and therefore the necessary die area of the DSP subsystem.

One possibility to increase code density is the use of a mode register, which allows the meaning of an instruction to change by modifying the related mode bit. Typical examples [32] are mode bits for addressing modes or saturation modes.

The disadvantage of these mode registers shows up during instruction scheduling. In Figure 29 the problem is illustrated: an instruction is not necessarily dependent only on the instruction word; its meaning also depends on the mode set for a certain code section.

Figure 29: Architectural Alternatives: Mode Dependent Limitations during Instruction Scheduling.

It is not possible to move an instruction out of a section without also changing the mode for the other section, which again requires additional instructions: the new mode has to be set and reset. This is impossible for multi-issue VLIW DSP core architectures. Therefore mode registers can help to increase the code density of a small kernel, but the reduced freedom in scheduling instructions can lead to an increased number of execution cycles and even to decreased code density.
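The effect can be sketched with a hypothetical instruction stream (mnemonics and the saturation mode bit are invented for illustration, in the spirit of Figure 29):

```python
# The meaning of 'add' depends on the current saturation mode, so an
# instruction cannot simply be hoisted across a mode boundary.
section_a = ["set_mode sat=on", "add r0,r1", "add r2,r3"]
section_b = ["set_mode sat=off", "add r4,r5"]

# Hoisting 'add r4,r5' into section A requires switching the mode off
# before it and restoring the mode afterwards: one extra instruction
# overall, and two extra serialization points for the scheduler.
hoisted = ["set_mode sat=on", "add r0,r1",
           "set_mode sat=off", "add r4,r5", "set_mode sat=on",
           "add r2,r3"]
print(len(section_a) + len(section_b), len(hoisted))  # 5 6
```

The mode bit saved coding space in each `add`, but the scheduler pays for it: every mode write acts as a barrier, which is why the net effect on cycle count and code density can be negative.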

3.3 Available DSP Core Architectures

This section introduces commercially available DSP cores. Each core architecture is first introduced and its main aspects are discussed, such as the available arithmetic units, the pipeline structure, the supported addressing modes and core-specific features. At the end of each description a short summary briefly assesses the features of the core architecture from an orthogonality point of view, which is a major aspect for developing a C-compiler. This section deliberately does not contain a table comparing available DSP cores on metrics like achievable clock frequency, number of instructions executed in parallel or pipeline depth, in order to prevent superficial comparisons: such tables hide micro-architectural limitations with influence on practical performance, as introduced in 3.1.2.

DSP cores are quite often rated by the number of MAC instructions they support per clock cycle. The first three cores described in this section, OAKDSP, Motorola 56000 and TI C54x, are chosen as examples of so-called single-MAC DSP cores (DSP core architectures supporting the execution of one MAC instruction at a time). The ZSP has been chosen as an example of a DSP core based on the superscalar programming model. As examples of dual-MAC architectures the Infineon Carmel DSP, TI C62x and Blackfin have been chosen. As a last example the SC140 of Starcore LCC has been chosen as a DSP core architecture supporting the execution of four MAC instructions in parallel. This thesis does not consider vector processor architectures due to their specialized nature, which can only be used for one class of algorithmic problems.

3.3.1 OAKDSP

The OAKDSP [29] core was introduced in the early 1990s by DSP Group (now Ceva [14]) as a successor to the PineDSP core. The OAKDSP core is a single-issue 16-bit DSP core based on a traditional direct memory architecture, i.e. arithmetic instructions fetch their operands from memory and store the results back into memory. The instruction set is based on a native 16-bit instruction word; long immediate values or offsets are stored in an additional instruction word. The pipeline consists of four stages: fetch, decode, operand fetch and execution. The data memory space is split into an asymmetric X and Y memory space. The Computation Unit (CU) as illustrated in Figure 30 contains a sixteen by sixteen multiplier (also supporting double precision), an ALU/shifter data path for the implementation of MAC instructions, and a separate Bit Manipulation Unit (BMU) containing a barrel shifter, which is the major difference to the PineDSP.

Figure 30: Architectural Overview: OAKDSP Core.

Four shadowed accumulator registers are supported, each 36 bits wide and containing four guard bits. Two of these are assigned to the BMU and two to the CU. Zero overhead loop instructions with four nesting levels are supported. A software stack allows the execution of recursive function calls.

The address generation unit supports post-increment/decrement operations and modulo address generation. Reverse carry addressing is not supported. Three status registers are available, containing flags, status bits, control bits, user I/O and paging bits. Most of the fields can be modified by the user. The first status register contains the flags (zero, minus, normalized, overflow, carry, etc.) which are influenced by the last CU or BMU operation.

Most of the registers are shadowed, which allows a task switch with reduced spilling of the core status to data memory. No separate interrupt control unit is available; three different interrupt sources and an NMI are supported.

Figure 31: Architectural Overview: Motorola 56300.

Summary: The OAKDSP core is a single-issue DSP core, which limits its relative performance. The limited address space requires a paging mechanism with the related limitations for instruction scheduling. Compared with modern core architectures, the reduced feature set enables good code density for typical DSP algorithms like filtering. The missing support for the reverse carry address mode limits the possibility of an efficient implementation of FFT algorithms on the OAKDSP core. Status and configuration registers limit instruction scheduling. Flags influenced by the last CU or BMU instruction limit the use of conditionally executed instructions. The missing support of nested interrupts, caused by only one level of shadow registers, limits the usage of interrupt service routines. Although the support of a software stack eases the development of a C-compiler, the architectural features of the OAKDSP are not orthogonal and therefore implementing a powerful C-compiler is questionable.

3.3.2 Motorola 56300

The Motorola 56300 DSP [22] core is a powerful member of the Motorola 56k family [20], introduced in 1995. The 56300 is a single-issue 24-bit load/store architecture where arithmetic instructions fetch their operands from the two operand registers X and Y.

The native instruction set consists of 24-bit wide instruction words; long offsets or immediate values are stored in an additional instruction word. The instruction format can be split into a parallel and a non-parallel instruction format. The parallel instruction format supports CISC instructions: in addition to the operation op-code and operand, operations taking place on the X and Y memory buses and a condition can be coded.

The pipeline consists of seven stages: prefetch I+II, decode, address generation I+II and execute I+II, all of which are hidden from the programmer. The memory space is split into X and Y memory. The computation unit as illustrated in Figure 31 contains a 24 by 24 bit multiplier, an accumulator including a shifter, and a separate bit-field unit including a barrel shifter.

The register file consists of six accumulator registers Ax and Bx, each being 56 bits wide. The supported data types are 24-bit based (also including byte support); in addition, 8 guard bits support higher precision calculation. Zero overhead loops are supported and can be nested. The nesting level is not limited because the loop handling registers are spilled to the software stack (limited only by the available data space).

The 56300 supports register direct and indirect address modes (including pre- and post-operation) as well as specialized address modes used for efficient implementation of traditional DSP algorithms, namely reverse carry addressing, which allows efficient implementation of FFT algorithms, and modulo addressing. The size of the modulo buffer stored in the modulo register is configurable, whereas the start address has to be aligned. The address register file is banked, which means half of the registers are assigned to each of the two Address Generation Units (AGU).
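Reverse carry addressing propagates the carry of the address increment from the MSB downwards, which yields the bit-reversed index order needed to unscramble FFT results. The sequence it generates can be reproduced with a small functional sketch (illustrating the addressing pattern, not the 56300 hardware):

```python
# Bit-reversed index generation for a 2**bits point FFT.
def bit_reverse(i, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)   # peel LSBs off i, push them in as MSBs of r
        i >>= 1
    return r

# Stepping linearly through 0..7 visits the bit-reversed addresses:
order = [bit_reverse(i, 3) for i in range(8)]
print(order)  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Without hardware support, this reordering costs extra instructions in the FFT inner loop; with reverse carry addressing it comes for free with each post-increment.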

The current core status is stored as flag information like carry, zero and overflow in a status register. Mode registers are available for choosing the saturation and rounding modes, and an operation mode register determines the status of the core (e.g. stack overflow). Four interrupts and one NMI are supported.

Although the 56300 is a 24-bit DSP processor, a compatibility mode supporting 16-bit data format is available. The unused bits are cleared or sign extended, depending on the position.

Summary: The 56300 from Motorola supports a 24-bit datapath and is therefore well suited for audio algorithms. The seven-stage pipeline allows higher clock frequencies to be reached, but dependencies in the application code can lead to limited usage of the clock cycles. Configuration and mode registers limit instruction scheduling. Even if the parallel instruction format allows an increase in the performance of the DSP core architecture, it is still a single-issue core. Control code sections in particular will suffer from poor code density due to the 24-bit native instruction word size. The supported address modes allow efficient implementation of traditional DSP algorithms including FFT. The use of the address registers is limited by the banked implementation of the address register file.

3.3.3 Texas Instruments 54x

Texas Instruments introduced two major DSP families, namely the embedded core family TI C5xx and the high performance stand-alone DSP family TI C6xx, as outlined in [39]. The TI C54x [33] has been chosen as an example, as illustrated in Figure 32. Several members of the core family are available, e.g. [34], which provide different features, but the TI C54x has been chosen because it is still the most referenced DSP core: Berkeley Design Technology Inc. (BDTi) normalizes the performance figures of analyzed DSP cores to the performance figures of the TI C54x [9].

The TI C54x is based on a direct memory architecture and supports three data busses and an independent program memory bus, each of which is 16 bits wide. Arithmetic instructions include the operand fetch from data memory. The native instruction word is 16 bits wide. Several instructions require a second instruction word (e.g. branch instructions). The second instruction word has to be fetched sequentially and therefore some pipeline cycles remain unused. Conditional execution is supported, which reduces the number of branch delays.

The pipeline consists of six stages: pre-fetch, fetch, decode, access, read and execute. Executing branch instructions results in three branch delays: the first delay is caused by executing the branch instruction itself, which consists of two instruction words, whereas the next two execution cycles have to be flushed. To overcome this problem the TI C54x supports delayed branching, which allows the branch delays to be used by unconditionally executed instructions. The Central Processing Unit (CPU) contains a 17 by 17 multiplier which supports double precision arithmetic, an adder and a barrel shifter.

Figure 32: Architectural Overview: TI C54x.

The DSP core architecture features two accumulator registers, each 40 bits wide and including 8 guard bits for internal high-precision arithmetic. Zero overhead hardware loop instructions are supported.

The TI C54x supports several addressing modes including absolute and indirect addressing. The specialized address modes reverse carry and modulo addressing are also available; therefore an efficient implementation of FFT algorithms is possible.

Flags indicating the status of the core architecture are stored in core status registers. To increase code density some functions, like the saturation of multiplication results, are controlled by mode registers. The TI C54x supports several hardware and software interrupts, including prioritization of the different interrupt sources.

Summary: The core architecture of the Texas Instruments TI C54x is well suited for efficient implementation of traditional DSP algorithms. Therefore it is quite often used as a reference concerning code density and power dissipation. The reachable performance is limited by the single-issue execution logic, by a small-sized program memory port which leads to stalls when executing instructions consisting of two instruction words, and by the missing register file, which requires fetching the operands from data memory for each arithmetic instruction. Only a few functions are stored in mode and status registers, therefore instruction scheduling is less limited. Dynamically generated flags are stored in status registers, which limits instruction scheduling when using predicated execution (i.e. no instructions are allowed to be scheduled between generating the condition and the conditionally executed instruction). The support of delayed branching reduces the drawback of branch delays.

3.3.4 ZSP 400

The ZSP 400 DSP [42] core architecture illustrated in Figure 33 is a member of the ZSP DSP family of LSI Logic. In contrast to the other core architectures introduced in this section, the ZSP uses the superscalar programming model. The core is based on a RISC load-store architecture where arithmetic instructions get their operands from the register file. Separate instructions are used to fetch and store data from and to data memory.

Figure 33: Architectural Overview: ZSP400.

The instruction set is based on a native 16-bit instruction word. The core supports the execution of up to four instructions each execution cycle, with some limitations concerning the grouping of instructions. The ZSP is therefore a multi-issue DSP core where dependencies are resolved during run-time (dynamic scheduling). The pipeline of the ZSP 400 contains five stages: fetch/decode, group, read, execute and write-back. The data memory bandwidth is four words; a cache is located between memory and core, and the communication is established via data links. No alignment restrictions have to be considered during data memory access. The Computation Unit consists of two MAC units and two ALU paths, each 16 bits wide; they can be combined into a single 32-bit ALU path.

The register file is built up of sixteen 16-bit wide general purpose registers. Two 16-bit registers can be addressed as one 32-bit register. Two of the 32-bit registers contain an additional 8 guard bits used for internal higher precision calculation. Eight of the 16-bit registers are shadowed; switching between the two sets is done with a configuration bit.

The ZSP 400 supports two circular buffers; explicit address registers are not provided. The first 13 registers can be used for reverse carry addressing, which enables efficient implementation of FFT algorithms. The load/store instructions support auto-increment and offset address calculation, as usual for state-of-the-art DSP architectures.
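Circular (modulo) buffer addressing, as provided by the two buffers mentioned above, can be sketched functionally; the base address, buffer length and step are invented values for illustration:

```python
# Post-increment inside a circular buffer: the address wraps back to the
# buffer base once it passes the end, as used for FIR delay lines.
def modulo_increment(addr, base, length, step=1):
    return base + (addr - base + step) % length

addrs, a = [], 100          # hypothetical buffer at address 100, length 4
for _ in range(6):
    addrs.append(a)
    a = modulo_increment(a, base=100, length=4)
print(addrs)  # [100, 101, 102, 103, 100, 101]
```

In hardware this wrap costs nothing per access; without it, every pointer update in the kernel would need an explicit compare-and-reset sequence.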

Mode registers are provided to enable saturation and rounding modes. Several other core functions are controlled by additional configuration registers. Status registers contain core status information like hardware flags indicating, for example, overflow, zero or pending interrupts.

Summary: The ZSP 400 is an example of the ZSP family from LSI Logic. The core is based on the superscalar programming model. This implies dynamic instruction scheduling during run-time and the intensive usage of cache structures, which limits the possibility of minimizing the worst-case execution time. The advantage of a unified register file for instruction scheduling is counteracted by a huge number of restrictions and non-orthogonal architectural features: for example, only a few registers are shadowed, and while some registers can be used for any addressing mode, others do not support all of them. To increase code density several typical DSP functions like saturation and rounding modes are shifted to mode registers, with limitations for instruction scheduling.

The ZSP architecture is well suited for implementing a C-compiler, since the superscalar architecture eases compiler development, but the restrictions and non-orthogonal architectural features limit the possibilities of an optimizing compiler. Comparing the ZSP with traditional DSP core architectures indicates that the ZSP is not a typical DSP core; it is rather a microcontroller with some features used in DSP core architectures (like address modes, MAC units and circular buffers).

3.3.5 Carmel

The Carmel DSP [10][11][12] core was introduced in the mid-1990s by Infineon Technologies, the former Siemens Semiconductor group. Carmel is a 16-bit fixed point direct-memory architecture where arithmetic instructions fetch their operands from memory locations. This is reflected in the 8-stage pipeline: program address, program read, decode I+II, data read address, operand fetch, execution and write address, data write.

The native instruction word size is 24 bits and instructions are built of up to two instruction words. The instruction coding is optimized for code density, which requires two pipeline stages for instruction decoding. Carmel is based on the VLIW programming model. The implementation of the program memory port is patented as Configurable Long Instruction Word (CLIW) [157]. The regular program memory port is 48 bits wide; an extended memory port of 96 bits allows the fetching of up to 144-bit instruction words. The CLIW memory contains parameterized instruction combinations, and some of the instructions are only supported as part of CLIW instructions.

Figure 34: Architectural Overview: Carmel.

Carmel supports the fetching of data from up to four independent memory locations; the data memory is split into A1, A2, B1 and B2 memory blocks. The execution unit as illustrated in Figure 34 contains two data paths, each of which contains a MAC unit and an ALU. The left path additionally supports a shifter and an exponent unit. The results can be stored into an intermediate register file of six 40-bit wide accumulator registers or written directly to data memory via two 16-bit wide write ports.

Zero-overhead hardware loops are supported with a nesting level of four. Similar to the OAKDSP, a fast context switch is enabled by the support of two secondary accumulator registers.

Carmel supports a 16-bit address space with the traditional addressing modes found in DSP cores. Efficient implementation of FFT algorithms is enabled by the support of a bit-reverse addressing mode. The first type of modulo addressing scheme requires aligned boundary addresses; a second modulo addressing mode supports non-aligned boundary addresses and thus allows memory fragmentation to be prevented.

Configuration registers are used to choose the rounding mode, to activate saturation and to enable the fractional data format. Conditional execution is supported for most of the instructions and can utilize two condition registers.

Summary: The Carmel DSP core is the 16-bit embedded DSP core created by Infineon Technologies in cooperation with the Israeli company ICCOM. The traditional direct-memory architecture favors CISC instructions. To increase code density and execution speed for traditional DSP algorithms like filtering and FFT, orthogonality aspects have been ignored. The extended program memory port, with the restriction that some instructions can only be used with this port, limits the development of an optimizing C-compiler. The two pipeline stages necessary for decoding increase the number of branch delays. Configuration registers for major DSP functions like saturation and rounding modes limit instruction scheduling. Considering Carmel as a high performance DSP core, the limitation of a 16-bit address space is crucial and requires a paging mechanism with its related drawbacks. The conditional execution supported by Carmel limits instruction scheduling by supporting only two registers for storing conditions.

In 2002 Carmel was sold to Starcore LCC and in the same year it was discontinued. Carmel is an example of typical DSP core developments of the mid-1990s and one of the last examples of direct-memory architectures. Carmel's BDTi benchmarks are still leading for traditional DSP algorithms and especially for FFT algorithms.

3.3.6 Texas Instruments 62xx

The Texas Instruments C62x [36][37] is a member of the Texas Instruments C6xx high performance DSP core family. The TI C62x is a 16-bit fixed point multi-issue DSP core based on a RISC load-store architecture. Operands for the arithmetic instructions are fetched from the register file, and separate instructions are provided to move data between register file and data memory. The instruction set is based on a 32-bit native instruction word. Up to eight instructions can be executed each clock cycle. The programming model is based on VLIW; to overcome the drawback of poor code density Texas Instruments introduced the Variable Length Execution Set (VLES), which allows the decoupling of fetch and execution bundles.

Two independent data busses as illustrated in Figure 35 connect the register file with data memory. The execution unit contains eight units: two for data exchange with data memory, two multipliers, two ALUs and two shift and bit manipulation units. The features of the units partly overlap, as explained in [36], which allows the shifting of functions from one unit to another. There is no explicit support of Multiply and Accumulate (MAC), which therefore requires two instructions: one for the multiplication and one for the accumulation.

The register file contains sixteen 32-bit wide general purpose registers. It is possible to store 40-bit values in two consecutive 32-bit registers. The pipeline consists of three phases, fetch, decode and execute, which are split over 11 clock cycles; the pipeline of the TI C62x can therefore be called super-pipelined.

TI C62x provides full predicated execution which means that each instruction can be executed conditionally. Six registers of the register file can be used to build the condition.
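Predicated execution enables if-conversion: the branch is removed and both paths are issued under a guard. A functional sketch of the idea (the comments indicate how the instructions might be guarded; register and predicate names are illustrative, not C62x encodings):

```python
# Branch-based form: compiled to a compare plus a conditional branch,
# which costs branch delay slots on a deep pipeline.
def branchy(x):
    if x > 0:
        return x + 1
    return x - 1

# If-converted form: both paths occupy issue slots, but no branch is
# executed, so no branch delays are incurred.
def predicated(x):
    p = x > 0                        # cmpgt -> predicate register
    a = x + 1                        # [ p] add
    b = x - 1                        # [!p] sub
    return a if p else b             # predicated write-back

print(predicated(3), predicated(-2))  # 4 -3
```

The transformation pays off when the cost of executing both paths is lower than the expected branch penalty, which on an 11-cycle super-pipeline it usually is; the limited set of predicate registers bounds how many such conditions can be live at once.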

Figure 35: Architectural Overview: TI C6xx.

Summary: The Texas Instruments TI C62x, as an example of the C6xx family, is a high performance processor architecture. A large register file and up to eight instructions executed in parallel provide an impressive peak performance. The deep pipeline structure allows high clock frequencies to be reached, but long define-use and load-use dependencies can result in poor usage of the available resources (details in the Design Space section). Some of the characteristics of DSP architectures like 40-bit accumulators or MAC units are not available. These specialized functions are emulated, for example by using two 32-bit registers to implement an accumulator or by combining a multiplication with an accumulation to realize a MAC instruction. Predicated execution is a powerful feature for implementing control code, reducing the number of branch instructions with their unusable branch delays; however, the limitation to a few of the available registers for building the condition restricts the use of this feature. An additional limitation for instruction scheduling is the banked register file with only one path for transferring data between the two banks. An important drawback not to be overlooked is the poor code density: it is not feasible to use the core as an embedded core, and most applications use the TI C62x as a standalone device with external memory. The described DSP core is well suited for the development of an optimizing C-compiler.

3.3.7 Starcore SC140

The Starcore SC140 [32] is the high performance DSP core of the Starcore DSP family. Starcore LCC was founded by Motorola and Agere, the former semiconductor group of Lucent Technologies; Infineon Technologies joined this cooperation just two years ago. The SC140 is a multi-issue high performance 16-bit fixed-point DSP core based on a RISC, nearly "pure" load-store architecture: most of the instructions get their operands from the register file, but a few get operands directly from memory.

The instruction set is based on a 16-bit wide native instruction word, where instructions are 16, 32 or 48 bits wide. Up to six instructions can be grouped into a Variable Length Execution Set (VLES); some limitations exist concerning the grouping. The SC140 features a five-stage pipeline: pre-fetch, fetch, dispatch, address generation and execution. Two independent data busses connect the core with data memory. The memory addresses are interleaved to reduce the possibility of address conflicts and the related stall cycles.

The CU of the SC140 as illustrated in Figure 36 consists of four independent data paths, each of which supports the execution of a MAC instruction or an ALU operation including shifting. During the execution of filter algorithms the four available MAC units provide significant peak performance.

Figure 36: Architectural Overview: Starcore SC140.

The register file consists of sixteen 40-bit entries. The accumulator registers contain 8 guard bits for internal higher-precision calculation. Zero-overhead looping is supported up to a nesting level of four.

Two independent address generation units support the addressing modes available on state-of-the-art DSP cores, and thanks to reverse-carry support an efficient implementation of FFT algorithms is possible. The modulo addressing support allows the addressing of modulo buffers of any size starting at any position, which prevents fragmented data memory.
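Reverse-carry (bit-reversed) addressing is what makes in-place radix-2 FFTs efficient: the AGU generates the bit-reversed index sequence in hardware instead of in software. As an illustration only (plain Python, not SC140 assembler), the address sequence such an AGU produces for an 8-point FFT can be modeled as:

```python
def bit_reverse(index, bits):
    """Reverse the 'bits' least-significant bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

# Address sequence an AGU with reverse-carry support would generate
# for an 8-point FFT (3 address bits):
sequence = [bit_reverse(i, 3) for i in range(8)]
# sequence == [0, 4, 2, 6, 1, 5, 3, 7]
```

In hardware this costs nothing extra: the AGU simply propagates the address-increment carry from the most significant bit downward.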

Mode registers for coding special addressing, rounding and saturation modes are available. Only one flag (T) is available for dynamic evaluation of the core status used for conditional or predicated execution. The SC140 features a separate interrupt control unit (ICU) with a feature set similar in power to those available in microcontroller architectures [15].


Summary: The SC140 is the powerful DSP core architecture of the Starcore DSP family, where the support of four independent MAC units increases peak performance during execution of traditional DSP algorithms like filtering. The support of only one flag significantly limits the use of predicated execution for improving the execution of control code on the DSP architecture. The grouping mechanism used to identify the VLES decreases code density; benchmark results illustrate the poor code density [28]. The register file used for address generation is not fully orthogonal, and making use of some addressing modes limits the use of all registers. Mode registers restrict instruction scheduling. Some specialized instructions are limited to certain source and destination registers, which limits register allocation and instruction scheduling.

Figure 37: Architectural Overview: Blackfin.

3.3.8 Blackfin DSP The Blackfin DSP [1][2][8] core was co-developed by Intel and Analog Devices. Blackfin is a high-performance 16-bit fixed-point DSP core based on a RISC load-store architecture. Instructions are available for transferring data between the register file and data memory, whereas arithmetic instructions receive their operands from the register file. The register file consists of eight 32-bit wide entries, each of which can be addressed as two 16-bit entries. Two of the eight 32-bit register entries are extended by eight guard bits each and are used as accumulator registers for internal higher-precision calculation.

The instruction set is based on 16-bit wide native instruction words; instructions are 16, 32 or 64 bits wide. The fetch bundle contains 64 bits. Blackfin features an eight-stage pipeline: fetch I+II, decode, address generation, execute I+II+III and write back. Nested loops with a nesting level of two are supported.


The execution unit of Blackfin, as introduced in Figure 37, contains two 16-by-16-bit multipliers, two 40-bit wide ALU datapaths and one shifter unit. Typical DSP addressing modes are available, including circular buffers (with no restrictions on the start address or the buffer size) and reverse-carry addressing, which enables efficient implementation of FFT algorithms.

Status registers are available which mirror the core status, including hardware flags and also configuration details, for example the rounding mode.

Summary: Blackfin (ADSP-21535) is a high-performance 16-bit DSP processor developed by Analog Devices and Intel. The register file contains only two accumulator registers, and the remaining registers are 32 bits wide, which limits instruction scheduling. The core description emphasizes cache architecture, with L1 and L2 caches provided for both data and program. The main problem of a cache architecture is the unpredictability of cache hit and cache miss events, which makes it difficult to bound the worst-case execution time of real-time critical algorithms.

3.4 xDSPcore xDSPcore is a fixed-point embedded DSP core architecture based on a modified dual-Harvard load-store architecture. A brief overview of the core architecture can be found in Figure 38. The bit-width of the datapath is parameterized, whereas the first implementation has a 16-bit datapath. The operands for the arithmetic instructions are fetched from register files and the results are stored in the register files. Two independent data memory ports are used to transfer data values between data memory and register file.

The native instruction word size is also parameterized. The first implementation uses a 20-bit wide instruction word, which allows the coding of all 3-operand arithmetic instructions within one instruction word. A parallel word is used to store long immediate or offset values, but a rich set of short addressing modes enables high code density. The chosen programming model is VLIW, and to overcome its code density drawback a scalable Long Instruction Word (xLIW) is introduced [P2][P3][P4].

For the core a RISC pipeline with three phases is chosen, namely instruction fetch, decode and execute. The number of clock cycles used to implement this structure can be parameterized. The first implementation contains a five-stage pipeline: fetch, align, decode and execute I+II.


Figure 38: Architectural Overview: xDSPcore.

The register file is split into three parts. The data register file contains eight accumulator registers, each of which is 40 bits wide; the accumulator without guard bits can be addressed as a 32-bit long register, which itself can be accessed as two 16-bit data registers. The number of entries can be scaled. The second part of the register file contains eight address registers and related modifier registers, which are used for the bit-reversal and modulo addressing schemes. The third register file, called the branch file, contains the flags and reflects the core status used for conditional branch instructions and predicated execution. The register files are orthogonal and no register is assigned to specific functions.

Zero-overhead loop instructions are supported with a scalable nesting level, whereas the first implementation supports four nesting levels. Further nesting levels require spilling of the loop counter and loop addresses.

xDSPcore supports the addressing modes usually found on state-of-the-art DSP cores. Pre- and post-modify operations are supported without additional clock cycles. Bit-reversal addressing allows efficient implementation of FFT algorithms. The size of the modulo buffer is programmable and the start address has to be aligned. The address registers are structured orthogonally, which means each can be used by both AGUs.

No configuration or mode registers are present because all functions are coded inside the instruction word. Core status flags like the zero or sign flags are assigned to the destination registers. The flags are used for predicated execution, which reduces branch instructions in control code sections (if-then-else) without limitations or restrictions on instruction scheduling and register allocation.

In the publication part of the thesis the main architectural features are introduced in detail.


4 High Level Language Compiler Issues This section covers the compiler aspects considered during the definition of xDSPcore, starting with an introduction of coding practices used for implementing algorithms on DSP core architectures. The second part gives a brief overview of the structure of high-level language compilers, followed by a discussion of the architectural requirements for implementing an optimizing compiler. It ends with a short summary of why xDSPcore can be called a compiler-friendly architecture.

4.1 Coding Practices in DSPs Traditional DSP algorithms like filtering are loop-centric: roughly 90% of the execution cycles are spent in code sections comprising less than 10% of the application code. Increasing the usage of core resources in loop constructs therefore leads to a significantly decreased number of required execution cycles.

This section introduces coding practices used on digital signal processors for increasing the ILP of VLIW architectures, which leads to better performance during execution of loop constructs. The first part covers software pipelining, which reduces the number of execution cycles and increases the usage of core resources. The limitations and restrictions of software pipelining are investigated, as can be reviewed in [78].

The second part introduces loop unrolling, which is often used in combination with software pipelining. In contrast to software pipelining, loop unrolling is used to increase the work carried out inside the loop kernel. At the end of the section the specific implementation of predicated execution for xDSPcore is introduced, which can also be used for increasing the performance of loops. In [23][46][49][71][123][136] and [140] aspects of writing C-code for efficient code generation are discussed, which indicates the importance of high-level language programming of digital signal processors.

4.1.1 Software Pipelining Software pipelining tries to invoke the next loop iteration as early as possible, resulting in overlapped execution. The example in Figure 39 illustrates the functionality of software pipelining on a loop body with four instructions A, B, C and D. In the first row A is executed for the first time (A1). In the second row, when B is executed for the first time (B1), the next loop iteration is initiated in parallel (A2).


Figure 39: Principle of Software Pipelining.

In row four of the example in Figure 39 four instructions are executed in parallel (D1, C2, B3 and A4) and the usage of resources reaches its maximum. Row four is called the kernel; the rows before it, used for filling the pipeline, form the prolog, and the execution cycles for flushing the pipeline form the epilog.

Software pipelines face similar limitations as hardware pipelines e.g. data dependencies between instructions of the loop. Some of these limitations are considered in the next sub-sections.
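The overlap of Figure 39 can also be sketched programmatically. The following Python sketch is illustrative only: it assumes one single-cycle stage per instruction and ignores data dependencies, and simply lists, per cycle, which iteration of each instruction is in flight:

```python
def software_pipeline(body, iterations):
    """Return per-cycle instruction groups for a fully overlapped loop.

    body: instruction names, one issued per cycle per iteration.
    Stage s of iteration i executes in cycle i + s.
    """
    depth = len(body)
    cycles = []
    for cycle in range(iterations + depth - 1):
        group = []
        # Walk stages from the deepest (oldest iteration) to the newest,
        # so the kernel row reads D1, C2, B3, A4 as in Figure 39.
        for stage, inst in reversed(list(enumerate(body))):
            iteration = cycle - stage
            if 0 <= iteration < iterations:
                group.append(f"{inst}{iteration + 1}")
        cycles.append(group)
    return cycles

rows = software_pipeline(["A", "B", "C", "D"], 4)
# rows[3], the kernel row, is ['D1', 'C2', 'B3', 'A4']
```

The first depth-1 rows reproduce the prolog, the last depth-1 rows the epilog, and every full row in between is a copy of the kernel.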

Trip Count

The trip count is equal to the number of loop iterations; when the loop iteration count reaches the trip count, the loop is terminated. A minimum number of loop iterations is required to fill the software pipeline, and software pipelining does not increase system performance when the trip count is below this minimum. For the example in Figure 39 the pipeline depth is four, which requires a trip count of at least four to make use of the advantages of software pipelining.

Minimum Initiation Interval

The Minimum Initiation Interval (MII) is equal to the minimum number of execution bundles building up a software-pipelined loop kernel. The MII is restricted by data dependencies, introduced later as the live-too-long problem, and by the number of available architectural resources.

Modulo Iteration Interval Scheduling

Modulo iteration interval scheduling provides a methodology for keeping track of resources that are a modulo iteration interval away from each other. For example, in a two-cycle loop, instructions scheduled on cycle n cannot use the same resources as instructions scheduled on cycles n+2, n+4, and so on.

The xDSPcore architecture supports the execution of five instructions in parallel: two load/store, two arithmetic and one program flow instruction. In Figure 40 the Data Flow Graph (DFG) of a small kernel is illustrated. The sum of two values first loaded from data memory is calculated and the result is then stored in data memory again. The memory addresses are auto-incremented.

Figure 40: Data Flow Graph of an Example Issuing Summation of two Data Values.

The relative dependency between the load instructions and the add instruction is two, which requires one cycle of distance between fetching the data and issuing the add operation. The dependency between the ADD and the store operation is however zero; therefore both instructions can be issued during the same cycle.

Unit   Cycle 0        Cycle 1   Cycle 2        Cycle 3        Cycle 4   Cycle 5
MOV1   LD (R0)+, D0   -         -              ST D4, (R2)+   -         …
MOV2   LD (R1)+, D1   -         -              -              -         -
CMP1   -              -         ADD D0,D1,D4   -              -         -
CMP2   -              -         -              -              -         -
BR     -              -         -              -              -         -

Table 1: Principle of Resource Allocation Table.

In this example three load/store instructions are executed, which requires a minimum kernel length of two execution bundles. A resource allocation table as shown in Table 1 is useful for manually performing software pipelining. The two load instructions are scheduled into cycle 0. The data dependency leaves the second cycle unused. The third cycle (cycle 2) is used to invoke the add operation. As mentioned above it is possible to invoke the store operation in the same cycle as the ADD operation. To prevent resource conflicts during software pipelining the store operation is shifted to the fourth cycle (cycle 3), which has no influence on the overall cycle count, already limited by an MII of two caused by the three required load/store instructions.

In Table 2 software pipelining is manually introduced for this example: the first column is copied into the second column and the second into the third.

Unit   Cycle 0        Cycle 1   Cycle 2        Cycle 3        Cycle 4        Cycle 5
MOV1   LD (R0)+, D0   -         LD (R0)+, D0   ST D4, (R2)+   -              ST D4, (R2)+
MOV2   LD (R1)+, D1   -         LD (R1)+, D1   -              -              -
CMP1   -              -         ADD D0,D1,D4   -              ADD D0,D1,D4   -
CMP2   -              -         -              -              -              -
BR     -              -         -              -              -              -

Table 2: Resource Allocation Table including Software Pipeline Technology for increased Usage of Core Resources. (Cycles 0-1 form the prolog column, the grey-shaded cycles 2-3 the kernel column, and cycles 4-5 the epilog column.)

In Table 2 it becomes apparent why the move of the store operation into the fourth cycle (cycle 3), carried out to prevent data hazards, has no influence on the overall performance. The second column of Table 2 (grey shaded) is equal to the kernel as illustrated in Figure 41. Column one is equal to the prolog, but instead of using a NOP instruction it is possible to schedule the loop instantiation into the free execution cycle (which increases code density). Column three is equal to the epilog of the software-pipelined loop.

Prolog:  LD (R0)+, D0 || LD (R1)+, D1
         BKREP N-1, epilog

Kernel:  LD (R0)+, D0 || LD (R1)+, D1 || ADD D0, D1, D4
         ST D4, (R2)+

Epilog:  ADD D0, D1, D4 || ST D4, (R2)+

Figure 41: Example for Assembler Code Implementation including Software Pipelining (xDSPcore).

Live Too Long Problem

An additional limitation is the live-too-long problem. Suppose, for example, that a loop kernel consists of two execution cycles. It is then not possible to use a register entry for more than two cycles, because the next loop iteration would overwrite the value before it has been used. The two aspects influencing the live-too-long problem are the loop carry path (LCP) and the split join path (SJP).

To illustrate the related limitations, a code example different from that of the previous subsection is necessary. Figure 42 introduces the implementation of a search function, where the maximum value of a vector has to be found.

Figure 42: Data Flow Graph for Maximum Search Example.

Loop Carry Path

A loop carry path is caused by an instruction writing a result whose value is used in the next loop iteration. For the example in Figure 42 the LCP is between the CMP function, responsible for comparing the current maximum value with the newly loaded value, and the conditionally executed move register function, which updates the latest maximum entry.

In the example of Figure 42 the LCP is equal to two and restricts the MII to a value of two.

Split Join Path

If the same value is used by more than one instruction, it has to remain valid until all of these instructions have been executed. The longest such path determines the minimum length of the MII that still guarantees correct semantics.

For the example in Figure 42 the SJP is between the load instruction and the conditional move register instruction and is equal to three. Therefore the minimum number of execution cycles building up a valid loop kernel is equal to three.

For the example in Figure 42 the SJP dominates the MII. In some code sections it is possible to reduce the LCP by moving the instruction overwriting the value to a later execution cycle. Adding additional register move instructions can be used to split the SJP into smaller parts and therefore to reduce the MII.
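The two bounds discussed in this section combine into a lower bound on the MII: the kernel must be long enough both for the available resource slots and for the longest loop-carried recurrence. A hedged sketch, using the numbers of the summation example of Figure 40 (the function names and unit classes are invented for illustration, not taken from the xDSPcore tools):

```python
import math

def resource_mii(instruction_counts, units_per_class):
    """Lower MII bound from resources: ceil(uses / units) per unit class."""
    return max(math.ceil(instruction_counts[c] / units_per_class[c])
               for c in instruction_counts)

def recurrence_mii(loop_carry_latency, iteration_distance=1):
    """Lower MII bound from a loop-carried dependency cycle."""
    return math.ceil(loop_carry_latency / iteration_distance)

# Summation example (Figure 40): three load/store instructions on two
# load/store units, one ALU operation on two arithmetic slots.
mii = max(resource_mii({"load_store": 3, "alu": 1},
                       {"load_store": 2, "alu": 2}),
          recurrence_mii(1))
# mii == 2, matching the kernel of two execution bundles in the text
```

For the maximum-search example the recurrence bound (LCP) and the split join path, not the resources, are what dominate the achievable MII.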

4.1.2 Compiler Support The C-compiler for xDSPcore supports software pipelining. In contrast to commercially available C-compilers, the use of compiler-known functions and intrinsics is not required. Figure 43 shows a small C-code example, calculating 16 products and accumulating the results in an accumulator register.

for (j = 0; j < 16; j++)
    sum += a[j] * b[j];

Figure 43: C-Code Example for Illustration of Software Pipelining.

Figure 44 illustrates the code generated by the xDSPcore C-compiler without software pipelining, which requires 52 execution cycles to calculate the result. The loop kernel consists of two load instructions (loading data from memory) and a MAC instruction, which executes the multiplication and the summation as one instruction.

        CLR A0 || BKREP 15, loopend
        LD (R0)+, D2 || LD (R1)+, D3
        NOP
        FMAC D2, D3, A0
loopend: RET
        NOP
        NOP

Figure 44: Generated Assembler Code without Software Pipelining (xDSPcore).

The NOP is necessary due to a data dependency (load-use dependency). The two NOP instructions after the return (RET) are branch delay NOPs caused by the five-stage pipeline of xDSPcore. Making use of software pipelining, the loop kernel can be reduced to one execution bundle and the number of cycles used to calculate the result decreases to 19 execution cycles (illustrated in Figure 45).

        LD (R0)+, D2 || LD (R1)+, D3 || CLR A0
        LD (R0)+, D2 || LD (R1)+, D3 || REP 13
        FMAC D2, D3, A0 || LD (R0)+, D2 || LD (R1)+, D3
        FMAC D2, D3, A0 || RET
        FMAC D2, D3, A0
        NOP

Figure 45: Generated Assembler Code including Software Pipelining (xDSPcore).

One of the two branch delays can be filled with an instruction of the epilog, so only one NOP instruction remains. This remaining NOP instruction can be removed by making use of a non-delayed RET instruction. Another possibility is to use a delayed RET instruction and to make use of predicated execution; for this purpose the predicated execution implementation of xDSPcore supports loop conditions. Further details can be found in a later section.

The software pipelining algorithm is based on modulo scheduling [134], which estimates the MII. The instruction scheduler tries to find a solution fulfilling the estimated value. If no valid solution is found, the MII is increased until a solution can be obtained or the number of necessary execution cycles exceeds that of the originally scheduled loop.
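The retry loop just described can be sketched as follows; the actual scheduler is abstracted into a callable, and all names are illustrative rather than taken from the xDSPcore compiler:

```python
def modulo_schedule(try_schedule_at, estimated_mii, fallback_cycles):
    """Try initiation intervals from the MII estimate upward.

    try_schedule_at(ii) returns a schedule or None if no valid schedule
    exists for initiation interval ii.  Give up once the interval reaches
    the cycle count of the original, non-pipelined loop.
    """
    ii = estimated_mii
    while ii < fallback_cycles:
        schedule = try_schedule_at(ii)
        if schedule is not None:
            return ii, schedule
        ii += 1                       # relax the interval and retry
    return fallback_cycles, None      # keep the original loop schedule

# Toy scheduler that only succeeds for an interval of at least 2:
ii, sched = modulo_schedule(lambda ii: "kernel" if ii >= 2 else None,
                            estimated_mii=1, fallback_cycles=4)
# ii == 2, sched == "kernel"
```

The estimate comes from the resource and recurrence bounds; the search only ever widens the interval, so the pipelined result is never worse than the original schedule.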

4.1.3 Loop Unrolling Loop unrolling is quite often used in combination with software pipelining. If dependencies, i.e. the SJP and the LCP, limit the performance increase achievable by software pipelining, loop unrolling can decrease the number of execution bundles.

In Figure 46 the basic functionality is illustrated. The loop has to be executed N times. If the number of available resources is sufficient and several loop bodies can be implemented in parallel, it is possible to reduce the number of loop iterations, for example by executing two loop bodies in parallel to reduce the loop iterations by a factor of two.

Figure 46: Principle of Loop Unrolling.

To illustrate loop unrolling, the summation-of-vector-elements example is used again; the related DFG was illustrated in Figure 40. Software pipelining is limited by the three required load/store instructions and therefore by an MII of two. If the elements are split into two halves (even and odd) and the two kernel functions are executed in parallel, the kernel grows by one cycle but the necessary iteration count is reduced to half of the loop count. This leads to a reduction of the necessary execution cycles by about 25%. The drawback is a decreased code density due to the duplicated kernel; however, the contribution of loop kernels to the application code density is negligible.
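The even/odd split can be written out directly. A Python sketch of the transformed source (in C the compiler would keep two partial sums that a VLIW core can accumulate in parallel MAC slots; the function name is invented for illustration):

```python
def dot_product_unrolled(a, b):
    """Dot product with the loop unrolled by two (even/odd split).

    The two partial sums are independent, so a multi-issue core can
    execute both MACs in the same bundle; assumes len(a) is even.
    """
    sum_even = 0
    sum_odd = 0
    for j in range(0, len(a), 2):
        sum_even += a[j] * b[j]           # even elements
        sum_odd += a[j + 1] * b[j + 1]    # odd elements
    return sum_even + sum_odd             # combine once, after the loop

a = list(range(16))
b = [1] * 16
assert dot_product_unrolled(a, b) == sum(x * y for x, y in zip(a, b))
```

The key point is that the two accumulators break the single-accumulator recurrence: each partial sum only depends on itself every second iteration.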


4.1.4 Predicated Execution using Loop Flags xDSPcore supports a predicated execution implementation which allows loop flags indicating the status of the loop iteration to be used for building up the condition. Therefore it is possible to move instructions from the epilog or prolog into the loop kernel and execute them only once (e.g. during the first or last loop iteration). Figure 47 illustrates an example making use of this feature: the first implementation shows a standard loop, and the second illustrates the advantage of using predicated execution.

Prolog:  inst1 || inst2 || inst3
         inst4 || inst5 || BKREP N, loopend
Kernel:  inst1 || inst2 || inst3
         inst4 || inst5 || inst6
Epilog:
loopend: inst6

         BKREP N, loopend
Loop:    inst1 || inst2 || inst3
         FSEL sf0=0: dc:inst4 || dc:inst5 || true:inst6
loopend: inst6

Figure 47: Principle of Predicated Execution using Loop Flags.

The drawback of predicated execution is the decreased code density caused by coding the condition. For the implementation example illustrated in Figure 47, however, the code density may even be increased. In contrast to the example in Figure 47, where parts of the prolog are shifted into the loop kernel, it is also possible to move the epilog into the loop kernel. For example, moving a return operation into the loop kernel, where it is executed only once, can be used to compensate the drawback of branch delays. More details can be found in [P6][P7].


4.2 Compiler Overview The compiler for xDSPcore can be split into two parts, namely an architecture-independent front-end and an architecture-dependent back-end [43].

The front-end performs lexical, syntax and semantic analysis of the source code, and architecture-independent code optimizations are carried out in this part. An intermediate representation (IR), for example the IR of the Open Compiler Environment (OCE) of ATAIR, is built up and used to transfer the parsed and pre-optimized application code to the back-end.

The back-end transforms the IR into the processor-specific language, considering the available processor resources and performing architecture-dependent optimizations.

Figure 48: General High-level Language Compiler Structure.

The same front-end (as illustrated in Figure 48) can be used for different processor architectures; the optimizations applied there are architecture-independent, whereas each input language requires a modified front-end. The back-end contains the architecture-dependent optimizations and is therefore bound to the target architecture. The same back-end can be used in combination with different front-ends, and therefore different input languages, as long as the IR is the same.

The OCE from ATAIR is used as the front-end of the xDSPcore C-compiler with some minor modifications. The back-end was developed in cooperation between Infineon Technologies and the Christian Doppler Gesellschaft (CDG) in Vienna, Austria. Some implementation details of the back-end are illustrated in the next subsection; more details can be found in [98][163].

The methods introduced on the following pages are well known and have been used in various compiler back-ends [43][46][114]. Some of these aspects are important when discussing the suitability of architectural core features and are therefore briefly introduced [99][105].


Instruction Selection

Instruction selection takes the abstract syntax tree provided by the front-end as input and generates machine instructions or low-level intermediate instructions. The most favored approach is bottom-up tree rewriting [86]. A set of rules specifies the tree patterns which can be matched, the semantic actions for these patterns (e.g. the machine instructions to emit) and the related costs, e.g. the number of cycles required to execute the machine instructions. Typeform symbols determine the location and representation of the values at run time, for example address register, immediate constant and so on; they appear as operands and results of the rules.

During a first recursive traversal of the tree all nodes are labeled in bottom-up order, following the rules according to the patterns matched by each node and its descendants. The label information includes the typeform of the result and the minimum cost to achieve this result. The generated information is used to determine the applicable rules for ancestor nodes and the cumulated costs for ancestor sub-trees. In a second pass the tree is traversed recursively, applying the rules selected in the first pass.
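A miniature of the bottom-up labeling pass, with an invented rule set and invented costs (a real selector would also track typeforms and keep alternative rules per node):

```python
# Minimal bottom-up labeling for instruction selection.
# A tree node is (op, children); each rule maps an op to (instruction, cost).
RULES = {
    "const": ("load-immediate", 1),
    "mem":   ("load",           2),
    "add":   ("add",            1),
    "mul":   ("mult",           2),
}

def label(node):
    """Return (total cost, instructions in bottom-up order) for a tree."""
    op, children = node
    cost, insts = 0, []
    for child in children:
        child_cost, child_insts = label(child)   # label descendants first
        cost += child_cost
        insts += child_insts
    inst, own_cost = RULES[op]
    return cost + own_cost, insts + [inst]

# a + b * 4, with a and b in memory:
tree = ("add", [("mem", []), ("mul", [("mem", []), ("const", [])])])
cost, insts = label(tree)
# cost == 8; a MAC rule covering add(x, mul(y, z)) in one instruction
# is exactly the kind of larger tile a real selector would prefer.
```

With per-node minimum costs recorded, the second pass simply replays the cheapest rule chosen at each node.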

Instruction Scheduling

During instruction scheduling the logical instruction stream is reordered without altering the code semantics. The aim of this reordering is to reduce the number of execution cycles. For a multi-issue architecture instructions which can be executed in parallel can be identified. Unused execution cycles caused by data and control dependencies (for example branch delay slots) can be filled with independent instructions.

One major issue concerning instruction scheduling is the phase-ordering problem between instruction scheduling and register allocation. Instruction scheduling performed before register allocation (prepass scheduling) offers maximum flexibility but can lead to high register pressure, causing a need for additional instructions for saving and restoring register entries (spilling). Performing register allocation before instruction scheduling can introduce false dependencies between instructions, which reduce the possible instruction-level parallelism (ILP). Some strategies are known to lessen this problem, for example obeying register pressure during scheduling or bottom-up reorganization of scheduled code in order to reduce register pressure [63][90].
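A toy list scheduler illustrates the reordering step: ready instructions are greedily packed into bundles of a fixed issue width while dependencies are respected (illustrative only; a real scheduler also models operation latencies, unit classes and register pressure):

```python
def list_schedule(deps, issue_width):
    """Group instructions into parallel bundles.

    deps maps an instruction to the set of instructions that must
    complete in an earlier bundle.  Returns the list of bundles.
    """
    remaining = set(deps)
    done = set()
    bundles = []
    while remaining:
        # An instruction is ready when all its predecessors are done.
        ready = sorted(i for i in remaining if deps[i] <= done)
        if not ready:
            raise ValueError("cyclic dependency")
        bundle = ready[:issue_width]
        bundles.append(bundle)
        done |= set(bundle)
        remaining -= set(bundle)
    return bundles

# Two loads feed an add, the add feeds a store; two-wide issue:
deps = {"ld1": set(), "ld2": set(), "add": {"ld1", "ld2"}, "st": {"add"}}
bundles = list_schedule(deps, issue_width=2)
# bundles == [['ld1', 'ld2'], ['add'], ['st']]
```

Run before register allocation this is prepass scheduling; running it on already-allocated code would have to treat register reuse as additional (false) dependencies.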

Register Allocation

Early passes in the compiler assume an unlimited number of registers, called symbolic registers. During register allocation these numerous symbolic registers are mapped to the physically available CPU registers.

The standard solution is the graph-coloring method introduced by Chaitin [60] and refined by Briggs [54][55]. This method examines data and control flow and builds an interference graph. The nodes of this graph represent the symbolic registers; whenever two nodes are alive at the same time, an edge between them is added to the graph. Two nodes connected by an edge cannot use the same CPU register. An attempt is then made to color the graph with N colors, where N is the number of available CPU registers [100][142][147]. If the mapping cannot be performed completely, some of the symbolic registers have to be stored in memory. This technique is called spilling. Spilling causes additional instructions and additional execution cycles, and therefore choosing the right nodes for spilling is an important issue in keeping the spilling costs low [66][130][152].

A second task of register allocation is the elimination of copy instructions, called register coalescing [87]. Earlier passes, for example Static Single Assignment (SSA) based optimizations [46][68][69], may introduce move instructions which are redundant if the source and destination of the move do not interfere. The two nodes representing the source and destination operands are coalesced into one node, which increases code density and decreases the number of execution cycles.
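The simplify/select phases of Chaitin-style coloring can be sketched in a few lines (a deliberately simplified sketch: no coalescing and no spill-cost heuristic; the spill candidate is simply the node of highest degree):

```python
def color_registers(interference, n_regs):
    """Chaitin-style graph coloring; returns (assignment, spilled)."""
    graph = {v: set(neigh) for v, neigh in interference.items()}
    stack, spilled = [], []
    while graph:
        # Simplify: a node with fewer than n_regs neighbours is always colorable.
        safe = [v for v in graph if len(graph[v]) < n_regs]
        victim = safe[0] if safe else max(graph, key=lambda v: len(graph[v]))
        if not safe:
            spilled.append(victim)          # spill candidate
        stack.append(victim)
        for neigh in graph.pop(victim):
            graph[neigh].discard(victim)
    assignment = {}
    for v in reversed(stack):               # Select: pop and color
        if v in spilled:
            continue
        used = {assignment[n] for n in interference[v] if n in assignment}
        assignment[v] = min(c for c in range(n_regs) if c not in used)
    return assignment, spilled

# Three mutually interfering symbolic registers, two CPU registers:
interference = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
assignment, spilled = color_registers(interference, 2)
# one symbolic register spills; the other two get different colors
```

A spilled node would, in a real allocator, be rewritten as short live ranges around loads and stores, and the coloring would be retried.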


4.3 Requirements The requirements section is split into two parts. It begins by describing the requirements a programmer expects a C-compiler to fulfill, followed by an introduction of the architectural features which support the development of an optimizing HLL compiler.

4.3.1 Requirements on the C-Compiler State-of-the-art C-compilers have to support the ANSI-C standard. For efficient use of DSP-specific functions like saturation and special rounding modes, support of the Embedded-C standard is required. Compilers supporting C++ or other object-oriented languages are built up in the same manner as an ANSI-C compiler, with a modified front-end as illustrated in Figure 48. The question is whether the support of an object-oriented language is required for describing digital signal processing algorithms.

A state-of-the-art C-compiler has to support fractional data types, whereas ANSI-C is based on integer data types; this can be done by type conversion through destination variables of the new type. Multiple data memory banks are another feature of DSP core architectures which has to be supported by the compiler. The challenge of multiple memory banks for the compiler is variable assignment: if it is not possible to distribute the variables across the two data ports, only half of the memory bandwidth can be used. Methods for interleaved address assignment can be found in [151].
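For instance, with 16-bit Q15 fractional values in [-1, 1), a fractional multiply is an integer multiply followed by a shift and saturation. A Python model of the semantics the compiler has to map onto the core's MAC unit (illustrative only; the Embedded-C extensions express this with dedicated fixed-point types):

```python
Q15_MAX, Q15_MIN = 0x7FFF, -0x8000

def saturate(value):
    """Clamp to the 16-bit fractional range."""
    return max(Q15_MIN, min(Q15_MAX, value))

def fract_mul(a, b):
    """Q15 fractional multiply: keep the high 16 bits of the product."""
    if a == Q15_MIN and b == Q15_MIN:
        return Q15_MAX              # -1 * -1 saturates to almost +1
    return saturate((a * b) >> 15)

# 0.5 * 0.5 == 0.25 in Q15:
half = 0x4000
assert fract_mul(half, half) == 0x2000
# -1 * -1 saturates instead of wrapping around to -1:
assert fract_mul(Q15_MIN, Q15_MIN) == Q15_MAX
```

On a DSP core with hardware saturation this whole function collapses into a single fractional MAC instruction, which is exactly why compiler support for the type matters.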

HLL compilers for DSP cores produce significant code overhead compared with manually generated code. Some of the reasons for the reduced code density are discussed in the following subsections. Part of the lack of optimizing C-compilers for DSP architectures is due to the architectures themselves; however, commercial aspects should also be considered.

The commercial pressure to provide optimizing compiler technology has not been overly strong. The application code for high-volume products is coded manually, and its development costs are negligible compared with the costs of poorly used silicon caused by low code density. Most DSP core vendors favor a third-party tool concept. Companies like Green Hills, Tasking and Metrowerks have experience in the area of tools for microcontrollers. Obtaining efficient code for a DSP core requires additional knowledge and effort, but the possible market for selling licenses is small, especially for embedded DSP cores; a third-party tool provider cannot expect to sell thousands of licenses each year.

4.3.2 Architectural Requirements The concept of defining a DSP core architecture driven by algorithmic and hardware development aspects and only afterwards developing a tool-chain has led to the status quo of C-compilers which generate significant code overhead compared with manual coding. These are not suitable for high volume products. The second limitation is the large amount of manually developed assembly code (legacy code) which limits the efficiency of products and system solutions.

Load Store Architecture

Load/store architectures provide separate instructions for transferring data entries between the register files and data memory. Arithmetic instructions fetch their operands from the register files. The decoupling of arithmetic and data move instructions allows the use of lean pipeline structures and a reduction of the native instruction word size.

Software pipelining and loop unrolling are simplified by decoupling arithmetic and data move operations. An additional advantage during execution of control code is that not every instruction requires a data transfer to data memory; intermediate results can be stored inside the register files.

Large Uniform Register Sets

Register files significantly influence the die area of the core. The internal registers are used to store intermediate values which reduces the number of data memory accesses. During instruction scheduling an infinite number of registers is assumed. During register allocation the used variables are mapped to the physically available register sets.

Figure 49: Example for banked Register Files (TI C62x).

The requirement of compiler architects to support a large number of registers is neither economic nor feasible from a hardware point of view. Supporting one large register file influences the attainable clock frequency of the subsystem, due to the multiplexer and address decoder logic and the wiring effort caused by the increased number of read/write ports, and decreases code density due to the required addressing space. The TI C62x provides a large register file (32 32-bit wide registers), and the same registers are used for arithmetic and addressing modes (as illustrated in Figure 49), which is a compiler-friendly aspect. The missing accumulator support (e.g. 40-bit, as common in most of today's core architectures) leads to wasting two 32-bit registers to hold a 40-bit intermediate result. To reach a higher clock frequency the register file is banked, with a cross connection used to transfer data between the two banks.

Banked register files restrict instruction scheduling. Instructions which use results of prior instructions have to be executed on the same data path, and the register allocator of the compiler is limited to using the registers of one bank for consecutive instructions. Unused registers in the second bank cannot be exploited. Transferring data entries between the two register banks costs additional clock cycles, resulting in an increased define-in-use dependency.

No Modes

Mode registers are used to increase code density. The same instruction coding takes on a different meaning depending on the chosen mode. This is often used for DSP specific functions like saturation and rounding modes.

Figure 50: Limitations during Instruction Scheduling caused by Processor Modes.

In Figure 50 two small code sections are illustrated where the chosen assembly code represents xDSPcore assembler. For the first code block the saturate mode is activated and for the second code block it is not. Saturation influences the result of arithmetic instructions when the result of the operation exceeds the data range of the destination register.

Moving instructions from one code block into another where a different mode is set is expensive in terms of code density and cycle count: the mode has to be changed before executing the moved instruction and reset afterwards. In the example in Figure 50, moving the add instruction is not possible due to the different mode settings. The same mechanism as illustrated in Figure 50 applies to instructions of different basic blocks with different mode settings, where basic blocks are code sections between branch instructions.

The advantage of the increased code density gained by introducing mode bits can be over-compensated by the instructions required for setting and resetting modes. The alternative is poor usage of the provided core resources. Mode settings have a similar influence on code fragmentation as branch instructions.

Orthogonal Instruction Set

Orthogonality is defined in the section where DSP-specific aspects are discussed. An example of omitting orthogonality to increase the theoretically attainable performance through a higher clock frequency is illustrated in Figure 51. The AGU of the Motorola 56k DSP family supports address, modifier and base registers. Each of the AGUs has a group of registers assigned to it, which means that not every address register can be used for every address operation.

Figure 51: Example for Address Generation Unit (Motorola 56300).

This restricts the use of the address registers and the AGUs during register allocation and instruction scheduling, which can lead to a significant decrease in core performance.

In Figure 52 a second example of not orthogonal instructions is illustrated. The MAX2VIT D2,D4 instruction of the Starcore SC140 is introduced to increase performance when calculating Viterbi algorithms [26].

Figure 52: Example for not Orthogonal Instructions: MAX2VIT D4,D2 (Starcore SC140).


The instruction can only make use of two predefined data registers, where one of the two data registers is also used as the destination register. Even worse, a mode bit is used to switch between these two data registers and a second register couple. The implicitly addressed register entries provide no flexibility to the register allocator and even restrict the use of the register couple in code sections making use of the MAX2VIT instruction.

Simple Issue Rules

Dependencies between instructions increase complexity during instruction scheduling, and deep pipelines with implicit dependencies add further complexity when control dependencies are considered. The data-value-dependent execution time of the multiplication operation, as supported by ARM [6], can be used as an example: depending on the data values of the multiplication operands, the execution time can differ between 1 and 4 clock cycles. The related define-in-use dependency has to be assumed worst-case at compile time, which leads to unused execution cycles.

Hidden cluster latency, caused for example by the banked registers of the Texas Instruments TI C62x mentioned in Figure 49, increases issue complexity. The fixed relationship between register bank and data path reduces flexibility during instruction scheduling.

Efficient Stack Frame Access

Compilers communicate arguments to subroutines by using a stack. One of the address registers is used as the stack frame pointer, so the called subroutines do not have to know the absolute address where the arguments are located: the passed data values are addressed relative to the stack frame pointer. For this reason efficient stack frame addressing is important when using a compiler for automatic code generation.

4.3.3 Architectural Obstacles The next subsections illustrate some further examples of architectural features used in commercially available core architectures to increase code density and reduce the consumed silicon area, and point out the drawbacks of these features.

Modes for Different Instruction Sets

To increase code density, an efficient binary coding of the instruction set architecture is required. In the case of microcontroller architectures code density is less important, because the code is often stored in external, and therefore cheap, memory.


Figure 53: Example for Mode Dependent Instruction Sets: ARM Thumb Decompression Logic.

In the mid-1990s ARM introduced the Thumb instruction set architecture [3] to increase code density. Thumb is a condensed version of the ARM instruction set with reduced support for operands as well as immediate and offset values. Decompression logic, as illustrated in Figure 53, is used to internally reconstruct regular ARM instructions, which requires additional hardware.

For example, the multiply instruction supports 4 bits for operand coding in a regular ARM instruction but only 3 bits in a Thumb multiplication. Thumb can be used to increase code density, but it has the drawbacks of irregularity and mode dependency. Instruction selection has a significant impact on the output of register allocation: the compiler has to make conservative assumptions when using the reduced register set of Thumb instructions. Making use of the reduced register set increases register pressure and produces more spill code, and the additional move instructions required decrease code density.

Irregular Instructions

The Starcore SC140 [32] supports up to 16 data registers as illustrated in Figure 54. The upper 8 data registers are also used as base registers in modulo addressing modes, which limits their use even in code sections not using modulo addressing.


Figure 54: Example for Address Generation Unit (Starcore SC140).

Conservative assumptions by the compiler prevent the use of the upper eight data registers: an instruction that makes use of the upper data registers must not be moved into a code section where modulo addressing is used.

Implicit Dependencies

The implementation of predicated execution in the Starcore SC140 [32] is an example of implicit dependencies. The Starcore is taken as an example, but the problem is similar in most of today's core architectures.

Predicated execution or conditional execution can be used to reduce the number of conditional branch instructions in control code sections typically caused by if-then-else constructs [149]. The drawback of branch delays caused by branch instructions can be compensated with conditional execution. In most of the available implementations the disadvantage is decreased code density. A more detailed analysis can be found in [149].

The implementation for the Starcore supports one status flag, called the T-flag, which is used to build up the condition. During instruction scheduling no instruction which influences the T-flag is allowed to be scheduled between status generation and evaluation, which significantly limits the instruction scheduler. Multi-way VLIW, which supports the execution of several instructions in parallel (e.g. up to 6 instructions on the Starcore), results in short branch distances. This limitation can have a significant influence on the execution time of the application code.


Complex Instruction Sets

In the first section the main differences between microcontroller and DSP architectures were discussed. One major aspect to increase code density is the support of complex instructions like the MAC instruction with implicit operand fetch from memory.

Complex instruction set architectures, for example the instruction set architecture of the Carmel DSP, are not well suited for automatic code generation tools. To overcome the drawback of the low code density of VLIW architectures, CLIW [12][157] was introduced (illustrated in Figure 55). An extended program memory port supports fetching additional instructions during the moments when peak performance of the DSP core is required.

Making use of the extension port requires special instruction coding, and some of the instructions are only supported in CLIW. Both aspects limit the attainable performance of the instruction scheduler.

Figure 55: Configurable Long Instruction Word (CLIW of Carmel DSP Core).

RISC instructions allow the use of the available parallel resources, and coding practices as illustrated earlier in this section aim to increase code density and decrease the required execution cycles.

4.4 HLL-Compiler Friendly Core Architecture This sub-section briefly summarizes why xDSPcore is suitable as a target architecture for an optimizing C-compiler. Many of the aspects have already been illustrated in the previous subsections; their individual implementation in xDSPcore is illustrated in the following subsections.

Load-Store Architecture

xDSPcore is based on a modified dual-Harvard load-store architecture. Separate instructions are available for transferring data values between the register file and data memory. A brief overview is illustrated in Figure 56. Program and data are assigned to different address spaces, and an instruction buffer with cache logic [92] is used to increase code density and decrease power dissipation by reducing the switching activity at the program memory port [P2][P3][P4][P5].

Figure 56: xDSPcore Core Overview.

Orthogonal Register Set

The register set of xDSPcore is split into three parts: a data register file, an address register file and a branch file (illustrated in Figure 57). The registers inside each of the register files are orthogonal; none of the registers is assigned to a certain instruction.

Figure 57: Orthogonal Register File.

The first implementation combines the data and address register files in one register file structure, which increases the number of read and write ports of that single register file but reduces the total number of read and write ports when the register files are considered as a whole.

No Mode Register

xDSPcore does not contain mode registers. Supported features and instructions are coded inside the instruction word. The status bits indicating the core status of xDSPcore, like the sign, zero or overflow flags, are destination-register related, which increases flexibility for the instruction scheduler. The flags are stored in a separate branch file as illustrated in Figure 57.


Orthogonal Instruction Set

The instruction set is orthogonal in the sense of the definition by BDTi [116]. None of the instructions contain implicit operand addressing or micro architectural limitations in the sense of mode dependent resource allocation, examples of which can be found in [3].

Simple Issue Rules

The lean RISC pipeline structure as introduced in Figure 58 allows short define-in-use and load-in-use dependencies. The supported instructions and their dependencies inside the pipeline can be summarized in five different cases; Figure 58 summarizes and illustrates the issuing rules.

Figure 58: Issuing Rules for xDSPcore Architecture.

The n-way VLIW architecture allows the execution of n instructions in parallel. The first implementation supports 5 parallel units. The decoder structure is split into separate units, so removing or adding units has only a minor influence on the decoder architecture.

Efficient Stack Frame Addressing

Addressing modes common in DSP architectures are supported and index addressing is frequently used by compilers for subroutine data exchange. The support of short addressing modes allows an increase in code density.

Examples

For verification of the aspects introduced in section 4.3, small application code examples are used; the results in this section are generated using the C-compiler. These comparisons are not made to illustrate the advantage of the proposed DSP core over existing solutions. Rather, the figures illustrate the influence on the outcome of an HLL compiler when the aspects introduced in sections 4.3.2 and 4.3.3 are considered.

The first example is control code (parts of the Dhrystone benchmark suite [18]), where a 32-bit microcontroller [40] and a 16-bit DSP core [10] have been chosen for comparison. The results focus on code density. C-compilers based on the same compiler technology from ATAIR are used for the comparison between the two DSP cores. The C-compiler for xDSPcore is a prototype in which several optimizations have not yet been included. In Figure 59 the memory footprints, normalized in bytes, are illustrated. The ISA of xDSPcore can be scaled in an application-specific way to increase code density; this has not been done for this comparison, and the standard 20-bit native instruction word has been used.

Figure 59: Results for Dhrystone Benchmarks generated by C-Compiler.

xLIW, the scalable long instruction word concept [P2], the lean pipeline structure and predicated execution [P6] allow a code density similar to that of microcontroller architectures. The second code example (parts of the enhanced full rate (EFR) speech codec algorithm) compares the required code size for the two DSP cores.

Figure 60: Results for EFR Benchmarks generated by C-Compiler.

The results illustrated in Figure 60 show a code density improvement of approximately 50%. The improvement has been achieved by considering the aspects in section 4.3, as both compilers are based on the same technology.

[Chart data for Figures 59 and 60: program memory footprint in bytes. Figure 59 compares Tricore, Carmel and xDSPcore on Dhrystone 1 and Dhrystone 2 (scale 0 to 3500 bytes); Figure 60 compares Carmel and xDSPcore on the EFR encoder (scale 0 to 45000 bytes).]


5 Summary of Publications This chapter summarizes the publications included in Part II of the thesis. The publications can be split into two parts: the first consists of the publications covering the architectural features of the xDSPcore architecture, and the second contains the publications introducing DSPxPlore, a design space exploration methodology for RISC-based core architectures.

5.1 Architectural Aspects of Scalable DSP Core In Figure 61 the main architectural features of xDSPcore are illustrated, with the publications, numbered [P1], [P2] and so on, assigned to the relevant architectural blocks.

Figure 61: xDSPcore Overview.

Publication [P1]: xDSPcore – a Configurable DSP Core. This publication provides an overview of the xDSPcore architecture, introduces the main architectural features and briefly illustrates the concept of DSPxPlore, the design space exploration methodology. xDSPcore is a RISC DSP core which considers the development of an optimizing high-level language compiler already during architecture definition. The core architecture introduced in this thesis is the outcome of research done in collaboration between Infineon Technologies Austria, Vienna University of Technology and Tampere University of Technology. To meet the power and area consumption requirements of SoC applications, the main architectural core features can be parameterized to obtain application-specific implementations. DSPxPlore is used to analyze the requirements of the application code and adapt the core configuration to meet power dissipation and area targets.


Publication [P2]: xLIW – a Scaleable Long Instruction Word. A main architectural feature of xDSPcore is the architecture of the program memory port. The core is based on a VLIW programming model making use of VLES to increase code density, and it reduces the size of the program memory port by utilizing an instruction buffer architecture. This publication illustrates the problem that VLIW exhibits poor code density. To overcome this drawback, xLIW (a scalable long instruction word) is introduced. The main architectural blocks for implementing the features of xLIW, like the instruction buffer, are illustrated. The possibility to minimize the worst-case execution time strongly influences the chosen structure.

Publication [P3]: Align Unit for a Configurable DSP Core. The Align Unit is the central part of the xDSPcore program memory port. It reassembles the execution bundles during run-time. The Align Unit contains an instruction buffer for compensating the memory bandwidth problem caused by the reduced program memory port during peak performance of the core architecture. This publication introduces architectural details of the Align Unit, including the instruction buffer management. The alignment process building up the execution bundles is illustrated in detail, including an analysis of limitations and possible stall cycles. A separate section covers the behavior during loop handling and hardware interrupt handling, which has a significant influence on the buffer management.

Publication [P4]: A Scaleable Instruction Buffer for a Configurable DSP Core. The main topic of this publication is the implementation details of the instruction buffer of the Align Unit. The Align Unit is used to compensate the memory bandwidth mismatch between fetch bundle and worst-case execution bundle, and the instruction buffer is used to reduce power dissipation during execution of loop constructs by reducing the number of program memory accesses. The reduced switching activity at the program memory port reduces dynamic power dissipation, as illustrated in [P5]. The loop body is fetched once and then executed from the buffer. To obtain a balanced relation between buffer size (which increases silicon consumption) and the available storage space for instructions, the number of entries is parameterized. A regular structure such as the instruction buffer of the Align Unit is well suited to a manually optimized full-custom design. The DPG of RWTH Aachen is used to implement the regular parts of the instruction buffer. This methodology exploits the advantages of manual full-custom design, like increased performance and decreased power dissipation [88][167], and satisfies the scalability requirements of xDSPcore. Besides architectural and implementation details, the publication illustrates the influence of different buffer configurations on the core area.

Publication [P5]: A Scaleable Instruction Buffer and Align Unit for xDSPcore. The publication is an extended version of publication [P4]. The concept and advantages of xLIW are summarized, and benchmarks highlight the relevance of using an instruction buffer when handling the loop-centric algorithms found in typical DSP applications.


The influence of the switching activity at the program memory ports on the overall dynamic power consumption is illustrated. The publication also contains a short overview of the DPG of RWTH Aachen, the chosen full-custom methodology used for implementing the instruction buffer.

Publication [P6]: FSEL – Selective Predicated Execution for a Configurable DSP Core. Increasing application complexity leads to a shift in the system partitioning between DSP cores and microcontroller architectures: DSP cores also have to handle control code sections efficiently. Typically, control code contains if-then-else constructs for implementing decision paths. The drawback of branch instructions is branch delays (unused clock cycles caused by the break in the program flow), which decrease the practical performance of the core architecture. The number of branch delays can be reduced by minimizing the number of conditional branch instructions and by introducing predicated or conditional execution. The publication introduces FSEL, the predicated execution implementation for xDSPcore. FSEL allows a reduction in the number of branch instructions without decreasing code density; this decrease in code density is the major drawback of available implementations. Benchmark results illustrate the advantages of FSEL. The implementation (destination-register-related flags) allows efficient use by a high-level language compiler.

Publication [P7]: A Branch File for a Configurable DSP Core. For cycle-efficient implementation of control code, xDSPcore supports exhaustive predicated execution and a rich set of delayed and non-delayed conditional branch instructions. Both features require hardware flags indicating the status of the core architecture. The publication introduces the concept of static and dynamic flags and illustrates the structure of the branch file, which is used for storing the status information. The separate register file relieves the regular register files of read/write ports, which are already stressed due to orthogonality requirements.

Publication [P8]: A Scaleable Shadow Stack for a Configurable DSP Concept. The performance of core architectures can be improved by increasing the number of pipeline stages. xDSPcore supports a 3-phase RISC pipeline, and the first implementation uses 5 clock cycles for mapping the three phases. The execution phase is split into two execution cycles, which relaxes timing during execution of MAC instructions (including write-back into the register file). During the handling of interrupt service routines a data consistency problem can occur. A known solution to this problem is adding shadow registers for storing intermediate results. This publication introduces the shadow stack, which takes care of this data consistency problem without requiring any MIPS or instruction words. Benchmarks in the publication illustrate the advantage, especially when supporting nested interrupts, where the shadow stack reduces the required silicon area and provides additional flexibility.


Publication [P9]: xICU – a Scaleable Interrupt Unit for a Configurable DSP Core. Changing requirements mean that DSP core architectures need to handle interrupts more efficiently than was common for early DSP architectures. In this publication some of the commonly used features are illustrated and compared with the features of interrupt control units (ICUs) used in microcontroller architectures. Prioritizing interrupt sources is supported by most ICUs; xDSPcore additionally supports a feature called priority morphing. The interrupt priority of an interrupt source can be changed during run-time, either by explicitly programming the priority or automatically over time, which allows the priority to increase or decrease depending on the number of elapsed clock cycles. This feature cannot be used during handling of real-time critical code segments, but it can be used by operating systems (OS) and hardware schedulers to change the program flow during run-time.

5.2 Design Space Exploration This subsection introduces the publications concerning the design space exploration methodology, DSPxPlore. xDSPcore is a core architecture whose main architectural features can be modified according to the application. To understand the requirements of the application code in an early phase of a project, the analysis possibilities of DSPxPlore can be used. The methodology is briefly introduced in Figure 62. At the end of this subsection the topic of validation is briefly covered.

Figure 62: DSPxPlore Overview.

Publication [P10]: DSPxPlore – Design Space Exploration for a Configurable DSP Core. This publication gives an overview of the concept for design space exploration, the tools used, and the parameters used to quantify architectural changes in terms of area consumption and necessary cycle count. Examples of static and dynamic results are included.

Publication [P11]: Design Space Exploration for an Embedded DSP Core. The publication illustrates the further development of the design space exploration methodology. Besides a brief core overview focusing on the parameters which have a significant influence on the core performance, the design space for RISC-based core architectures is discussed. The parameters generated during static and dynamic analysis are outlined, and results on a set of benchmark programs illustrate the potential of DSPxPlore.

Publication [P12]: An Automatic Decoder Generator for a Scaleable DSP Architecture.

The current core configuration is stored in an XML-based configuration file, which is used to keep the tools, the hardware description and the documentation consistent. As an example for the hardware description, this publication introduces a decoder generator tool which provides the hardware decoder for xDSPcore in VHDL-RTL [48][64]. For this publication the configuration was stored in a file in xls format; the latest version of the decoder generator already uses the common XML-based configuration file.

5.3 Author's Contribution to Published Work In this section the author's contribution to the afore-mentioned publications is pointed out. The author is the primary author of most of the publications, with the agreement of the co-authors. None of the publications has been used before as part of an academic thesis or dissertation. The publications were written by the author together with Prof. Jari Nurmi, who provided guidance throughout the published work and took care to polish the text and reduce the number of errors and Germanisms to an acceptable level.

Publication [P1]: The DSP core introduced in the publication has been developed in collaboration between Infineon Technologies Austria, Vienna University of Technology and Tampere University of Technology. The core architecture has been defined by the author, and topics concerning compiler friendliness have been contributed by Prof. Andreas Krall. Dr. Reinhard Rückriem, working at Infineon Technologies, contributed to the architecture by pointing out weaknesses during the definition stage. Besides the architecture and documentation work, the author was involved in the VHDL description of the core architecture. The name of the core architecture was chosen by the author; the x indicates the flexibility and the possibility to configure the core architecture to application-specific requirements. The logo for the core architecture was designed by Marco Pertl, employed at Infineon Technologies Austria.

Publication [P2]: xLIW, the scalable long instruction word of xDSPcore, was defined by the author, who also coined its name.

Publication [P3]: The first implementation of the Align Unit in VHDL was done by Raimund Leitner as part of his master thesis [122], under supervision of the author. The first implementation used a circular buffer, whereas a later version instead used an n-way cache logic for controlling the instruction buffer content. An additional contribution concerning the cache logic was made by Michael Bramberger as part of his master thesis, also under supervision of the author [53].

Publication [P4]: The manual full-custom implementation of the instruction buffer as illustrated in the publication was done by Michael Bramberger, as his master thesis under supervision of the author, and by colleagues at Infineon Technologies Austria [53].

Publication [P5]: In this publication the author was responsible for presenting the idea and the advantages of the chosen program memory port, and for illustrating architectural details and benchmark results to outline the advantages. The raw data for the benchmarks was generated together with Ulrich Hirnschrott at Vienna University of Technology. Volker Gierenz of Catena contributed valuable input to the power dissipation analysis and the introduction of the DPG.

Publication [P6]: The predicated execution implementation was contributed by the author; Prof. Andreas Krall contributed aspects concerning its suitability for use with a high-level language compiler. Raimund Leitner and Gunther Laure contributed implementation details of the FSEL instructions. The work has been patented under German patent number DE10101949C1, with the author as the inventor.

Publication [P7]: The initial idea of a separate branch file was introduced by Manish Bardwaj during his stay at Infineon Technologies Austria. The branch file structure used for xDSPcore and its implementation details were introduced and published by the author. Volker Gierenz, employed at Catena Inc., contributed implementation-relevant topics.

Publication [P8]: The shadow stack structure for compensating data consistency problems during interrupt handling was defined by the author. Raimund Leitner contributed valuably and partly coded a first behavioral description of the functionality in VHDL. The final implementation for xDSPcore was carried out by the author in VHDL-RTL. The benchmark analysis of the advantages compared with existing solutions was also carried out by the author.

Publication [P9]: The architecture and the feature set of the xICU, the scalable interrupt control unit for xDSPcore was defined by the author. The first implementation in VHDL-RTL was carried out by Johannes Hohl at Infineon Technologies Austria, the final VHDL-RTL coding of xICU by the author.

Publication [P10]: The initial idea for DSPxPlore came from the author, as did the main features of the development flow presented in the publication. Prof. Jari Nurmi provided the name for the design space exploration methodology. Details concerning static analysis were added by Ulrich Hirnschrott. Dynamic analysis results were contributed by Gunther Laure [117] and Wolfgang Lazian [118], responsible for the development of the ISS for xDSPcore, called xSIM.

Publication [P11]: Ulrich Hirnschrott at Vienna University of Technology and the author investigated the topic of design space exploration in detail and analyzed the benchmark results which form part of this publication. These benchmarks are used to illustrate the features and the use of design space exploration.

Publication [P12]: The initial idea and structure of a central configuration file were introduced by the author. The main contributors to the final structure and the tags in the configuration file were Gunther Laure, Wolfgang Lazian, Stefan Farfeleder and Ulrich Hirnschrott. The idea of generating parts of the VHDL coding automatically from this description was introduced by the author in the publication. The first version of the decoder generator tool was developed by Armin Schilke as his master thesis at CTI under the supervision of the author [145]. Taking the first implementation into consideration, Ulrich Hirnschrott carried out the final version of the tool.

6 Conclusion

This section summarizes the outcome of the research and development project; please keep in mind, however, that not all of the work was carried out by the author. This part is followed by a brief introduction to planned research work to increase the use of the processor resources and to reduce area and power dissipation.

6.1 Main Results

This subsection provides a brief overview of the outcome of the research work; the scientific part is covered in detail by the publications in the second part of the thesis. The short summary below closes the circle back to the goals defined in the introduction.

6.1.1 Core architecture

In the technical report [P1] the xDSPcore architecture is introduced. The following are the key features of the core:

• RISC, load-store architecture

• Scalable long instruction word (xLIW) including instruction buffer

• Lean pipeline structure (short pipeline implementations require no bypass circuits)

• Orthogonal register file

• Destination-register-based predicated execution enabling efficient control code execution
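The general idea behind destination-register-based predication can be sketched as a select on the destination write. The function names below are illustrative assumptions and do not reflect the actual xDSPcore ISA semantics:

```python
def fsel(pred, val_true, val_false):
    """Predicate selects which value is written to the destination register."""
    return val_true if pred else val_false

def pmax(a, b):
    # Branch-free max: both operands are computed, the predicate only
    # steers the destination write, so no branch and no pipeline flush
    # is needed for this piece of control code.
    return fsel(a > b, a, b)

print(pmax(3, 7))
```

Replacing short conditional branches with such selects is what makes predicated execution attractive for control code on deeply pipelined cores.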

The first implementation of the core architecture was carried out in VHDL-RTL. The first tapeout, planned in cooperation with Catena Inc. as a prototype for a Digital Radio Mondiale (DRM) project, will take place at the end of 2004. The estimated gate count is about 120 k gates (depending on the target frequency and on the chosen core configuration) and the frequency required for the prototype is about 120 MHz (0.13 µm CMOS technology, worst case conditions). Further improvements will take place when taking into account the results from the first prototype.

To increase the potential performance and to reduce power dissipation, regular parts of the core architecture are implemented in manual full-custom design (as introduced in [P5], but not considered for the first prototype). The influence of different buffer variants on the die area is also illustrated in that publication. The register file and the data path are regular structures and will also be implemented in full-custom design methodology, making use of the locality aspect during implementation (limiting the scalability). To ease the verification flow for the manual full-custom design, these parts are implemented in Verilog HDL [128].

6.1.2 Tools

The first draft tool-chain of xDSPcore consists of a prototype C-compiler, an Assembler/Linker and a cycle-true Instruction Set Simulator (xSIM). The tool-chain was developed in cooperation with Vienna University of Technology. The lean core architecture, considering the aspects introduced in section 4.3, enables the development of a prototype C-compiler for a DSP core even with the limited resources of a research project. Figure 63 shows a screenshot of the current simulator environment.

Figure 63: Screenshot of xSIM

6.1.3 Design Space Exploration

The core architecture introduced in this thesis enables application-specific scaling of the main architectural features such as:

• data paths (number, feature set)

• register file (structure, number of entries, size of entries)

• memory bandwidth

• instruction set (ISA, binary encoding)

• pipeline (number of clock cycles)

• instruction buffer (size and number of entries)

In publication [P10], DSPxPlore, a design space exploration methodology, is first introduced, and the possibilities to analyze the influence of different core configurations on the overall system performance are briefly explained. In [P11] further improvements are introduced, and some application code examples are used to illustrate the influence of changing some of the parameters.
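As an illustration of the exploration idea, the sketch below enumerates a few hypothetical configurations and ranks them with a toy cost model. The parameters, the benchmark model and the weights are assumptions for illustration only, not DSPxPlore's actual analysis:

```python
# Toy design-space sweep: enumerate candidate (data paths, registers)
# configurations and rank them by a weighted performance/area cost.
from itertools import product

def cost(cycles, gates, w_perf=1.0, w_area=0.5):
    """Toy cost: weighted execution time plus area in kilogates."""
    return w_perf * cycles + w_area * gates / 1000.0

def sweep():
    candidates = []
    for datapaths, registers in product([1, 2, 4], [16, 32]):
        # Hypothetical benchmark model: more parallel resources shorten
        # the schedule but cost silicon area.
        cycles = 1000 // (datapaths * (2 if registers == 32 else 1))
        gates = 50_000 + 20_000 * datapaths + 300 * registers
        candidates.append(((datapaths, registers), cost(cycles, gates)))
    return min(candidates, key=lambda c: c[1])

best_config, best_cost = sweep()
print(best_config, round(best_cost, 1))
```

The real methodology replaces the toy model with static and dynamic analysis of compiled application code, but the ranking step is conceptually the same.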

Figure 64: DSPxPlore Design Flow

The design flow is illustrated in more detail in Figure 64. However, the current status of DSPxPlore still requires manual analysis of statistics and an understanding of the influence of different core parameters on the overall system performance. Further research is necessary to allow automatic suggestion of core configurations.

6.1.4 Validation

In publication [P1] the aspect of consistency is mentioned. A scalable core architecture requires a configuration file. For the DSP core introduced in this thesis, an XML-based configuration file keeps the tools of the tool-chain, the hardware description and the documentation consistent.

The decoder generator introduced in [P12] is an example of updating the hardware description when the core configuration changes. The configuration file used for the first implementation as published in [P12] is based on an xls sheet, whereas the updated decoder generator already uses the XML-based configuration file. Further generator tools are planned, for example for the full-custom description, to better support scalability.

6.2 Future Research

xDSPcore is a compiler-based configurable DSP core whose main architectural features and C-compiler were developed together with Vienna University of Technology and the Christian Doppler Laboratory of Prof. Dr. Andreas Krall, "Compilation and De-compilation Techniques for Embedded Processor Architectures", based on the OCE (Open Compiler Environment) of ATAIR.

In this section further architectural details are illustrated which will be investigated and weighed for their suitability to increase performance, reduce silicon area and decrease power consumption of the core subsystem in the future.

6.2.1 Multithreading

Multithreading opens up the possibility to increase the instruction-level parallelism of core architectures; however, not all multithreading variants are suitable for VLIW architectures [148].

In research, the different multithreading techniques for superscalar processor architectures are being intensively investigated. A comparison between superscalar processors supporting single threads, fine-grain multithreading and simultaneous multithreading can be found in [76][125][160], where the influence on system performance and the additional hardware effort necessary for implementing multithreading on a superscalar architecture are investigated. Simultaneous multithreading has the advantage of increasing the utilization of available hardware resources on superscalar processors, due to the possibility to combine instruction-level parallelism and thread-level parallelism.

A comparison between fine-grain and coarse-grain multithreaded architectures can be found in [176]. The results indicate that only a few threads are needed to use the processor efficiently. Throughput at a network node can become the main bottleneck in several applications. In [67] so-called Network Processors are analyzed. Benchmarks are used to compare performance results of an OOO (out-of-order) superscalar processor, a fine-grained multithreaded processor, a single-chip multiprocessor and a simultaneous multithreaded processor. For these kinds of applications the SMT (simultaneous multithreading) processor delivers the best performance results, due to its capability to handle both instruction-level parallelism and thread-level parallelism.

Intel presents in [164] an approach using an SMT (simultaneous multithreading) processor to overcome the data cache miss problem with speculative pre-computation. Their analysis shows that only a few loads are responsible for long-latency cache misses; otherwise unused resources are utilized for the speculative pre-computation.

If more threads have to be handled than the available hardware of the SMT processor supports, a priority decision has to be taken as to which threads will be executed during the next clock cycle. In [129] several approaches are discussed and their impact on the necessary hardware and software effort is estimated. The results illustrate that prioritizing the highest-throughput threads can increase CPU utilization by up to 15% (at the cost of only a few internal counters).
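One family of such counter-based policies can be sketched as follows. The ICOUNT-style heuristic shown here (fetch from the thread with the fewest instructions in flight) is a common textbook example, not necessarily the exact policy evaluated in [129]:

```python
# Counter-based thread priority sketch: each cycle, fetch from the thread
# whose instructions occupy the smallest share of the pipeline, which tends
# to favor fast-moving, high-throughput threads.
def pick_thread(inflight):
    """Return the thread id with the fewest in-flight instructions
    (ties broken by lower thread id)."""
    return min(inflight, key=lambda tid: (inflight[tid], tid))

# Thread 2 occupies the smallest share of the pipeline, so it is fetched next.
print(pick_thread({0: 7, 1: 4, 2: 1, 3: 5}))
```

The only hardware needed is one small counter per thread, which is why the utilization gain comes so cheaply.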

The impact of simultaneous multithreading on operating systems is analyzed in [137]. For the measurements the DEC Unix 4.0d OS was modified. The results show that SMT architectures provide a higher throughput rate with only a few minor adaptations of the operating system.

In [126] an approach is investigated to share register contents and to allow the usage of a single global register file for several threads in an SMT processor by register renaming. The results show that a speedup of up to three is possible when between 2 and 4 threads are running in parallel.

Mini-threads are considered in [138] to overcome the drawback of enlarged register files caused by the hardware support of several threads in parallel. The mini-threads share the same register file.

In [77] a method to simplify issue queuing is analyzed; the complexity of the issue queue is a significant problem in architectures supporting multithreading.

A commercial implementation of multithreading technology can, for example, be found in the Alpha 21164 [75], one of the first products implementing SMT technology. It provides hardware support for 4 threads in parallel (4-way SMT).

Intel is promoting hyper-threading technology [27]. Hyper-threading technology is based on a single physical processor which appears as several logical processors. The architecture state is replicated for each thread, while the threads share a single set of physical execution resources. At the program level it looks like a set of logical processors, whereas at the micro-architecture level the threads are executed in parallel on the shared execution resources.

In May 2001 Imagination Technologies announced the Meta-1 architecture using multithreading technology. From their announcements one can assume that the Meta-1 supports fine-grain and simultaneous multithreading approaches specifically adapted to this core architecture.

In the area of VLIW architectures, evidently less research is being carried out. One reason may be that, due to static instruction scheduling, simultaneous multithreading is not as well suited for traditional VLIW architectures. The Starcore SC140 is the basis for the investigations in [108], where simultaneous multithreading is introduced by supporting up to five tasks. The motivation to introduce multithreading was to reduce system power dissipation for wireless applications. The drawback of the described approach is the additional pipeline stage needed to decide which functional unit will be used by which thread during the next cycle. To date, no commercial product from Starcore supporting simultaneous multithreading has been announced.

The best-known companies in the area of VLIW DSP architectures, such as Texas Instruments, Analog Devices and Motorola, have no commercial products available supporting hardware-based multithreading. They provide the handling of several threads in software (controlled by the RTOS). The core architectures support different user levels and the possibility to restrict the use of memory addresses. In combination with an efficient task switch, software-based multithreading is possible.

Tensilica claims support of multithreading for FLIX VLIW [41][101]. No architectural details are available to verify which kind of multithreading technology is supported.

Sandbridge Technologies announced at the end of 2001 the support of multithreading techniques in the Sandblaster Multithreaded DSP [89]. The available descriptions (white paper and product brief) provide no details about the supported technology or its influence on system performance. Sandbridge Technologies announced a vector-processor-oriented architecture and efficient execution of JAVA code.

Summary for Multithreading

Multithreading is a known methodology for superscalar core architectures which can be used to increase the usage of available core resources. Making use of multithreading technology for VLIW architectures is limited by static instruction scheduling and the resulting missing support for dependency resolution during run-time. Slight modifications of the xLIW technology introduced in [P2] allow dynamic scheduling of execution bundles. This will be a topic of further research.

6.2.2 Code compaction

Increasing code density in core subsystems reduces the required program memory. xDSPcore with the xLIW concept provides an efficient approach, but the instructions are still stored uncompressed. Several ideas for code compaction are available in [82][84][120][121][150].

The major problem of code compaction is the time required for unpacking [51]. This could be handled for linear code, but the major part of the code consists of control code, traditionally with short branch distances. Further research will be carried out.
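A toy dictionary-based scheme illustrates the principle and why branches into compressed code are awkward; the instruction stream and helper names are hypothetical, and real schemes such as those in [82] are considerably more involved:

```python
# Minimal dictionary compression sketch: frequent instruction words are
# replaced by short indices into a table; decompression is a lookup.
from collections import Counter

def compress(words, dict_size=2):
    """Replace the most frequent instruction words by dictionary indices."""
    table = [w for w, _ in Counter(words).most_common(dict_size)]
    index = {w: i for i, w in enumerate(table)}
    stream = [("idx", index[w]) if w in index else ("raw", w) for w in words]
    return table, stream

def decompress(table, stream):
    # Decompression is a table lookup per packed entry. A branch target in
    # the middle of a packed block no longer sits at a fixed byte offset,
    # which is the control-code problem alluded to above.
    return [table[v] if kind == "idx" else v for kind, v in stream]

prog = ["nop", "add r0,r1", "nop", "ld r2", "nop"]
table, stream = compress(prog)
assert decompress(table, stream) == prog
```

Linear code can be unpacked ahead of the fetch stream; short-distance branches force either address translation tables or block-aligned compression, both of which eat into the density gain.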

6.2.3 Design Space Exploration

In [P10] and [P11] DSPxPlore, the design space exploration methodology of xDSPcore, is introduced. The presented concepts have to be improved, and further research will enable automatically generated feedback. Making use of DSPxPlore still requires a deep understanding of processor architectures and of the influence of different features on core performance. In the near future DSPxPlore shall provide suggestions about which parameters should be changed to increase the performance of the core architecture for the algorithms being executed on it. For example, automatic assignment of binary coding to a chosen instruction set architecture to reduce switching activity at the program memory port is feasible.
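The binary-coding idea can be illustrated with a toy model: opcodes are assigned so that instruction pairs which frequently follow each other in the fetch stream differ in few bits. The trace, the codings and the two-bit opcode width below are invented for illustration:

```python
# Toy model of opcode assignment for low switching activity on the
# program memory port: fewer bit flips between consecutive fetches
# means less dynamic power on the bus.
def hamming(a, b):
    """Number of differing bits between two opcode encodings."""
    return bin(a ^ b).count("1")

def switching_activity(trace, coding):
    """Total bit toggles on the program memory port for a given coding."""
    return sum(hamming(coding[x], coding[y]) for x, y in zip(trace, trace[1:]))

trace = ["ld", "mac", "ld", "mac", "st"]
naive = {"ld": 0b00, "mac": 0b11, "st": 0b01}  # ld<->mac flips two bits
tuned = {"ld": 0b00, "mac": 0b01, "st": 0b11}  # ld<->mac flips one bit
assert switching_activity(trace, tuned) < switching_activity(trace, naive)
```

An automatic assignment step would profile the instruction adjacency statistics of representative applications and search for a coding that minimizes this toggle count.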

In [P12] the decoder generator is introduced. Another example of making use of the XML-based configuration file could be a tool providing automatic generation or adaptation of the VHDL-RTL core description.

7 References

[1] "ADSP-21535, Blackfin DSP Hardware Reference", Analog Devices Inc., Preliminary Edition, November 2001.

[2] "ADSP-21535, Blackfin DSP Hardware Reference", Digital Signal Processor Division, Analog Devices Inc., Revision 1.0, November 2002.

[3] "An Introduction to Thumb", Advanced RISC Machines Ltd., Version 2.0, March 1995.

[4] "ARCtangent - A5 Microprocessor with DSP Extensions", White Paper, ARC International, USA, September 2003.

[5] "ARM7TDMI-S, Technical Reference Manual", Advanced RISC Machines Ltd., Revision 3, 2000.

[6] "ARM920T, Technical Reference Manual", Advanced RISC Machines Ltd., Revision 1, 2000.

[7] "ARM9TDMI, Technical Reference Manual", Advanced RISC Machines Ltd., Revision 3, 2000.

[8] "Blackfin DSP Instruction Set Reference", Digital Signal Processor Division, Analog Devices Inc., First Revision, March 2002.

[9] "Buyer's Guide to DSP Processors, 2004 Edition", Berkeley Design Technology, Inc. (BDTi), 2004.

[10] "Carmel DSP Core Architecture Specification", Infineon Technologies, 2001.

[11] "Carmel Architecture Overview", Infineon Technologies North America, Revision 1.0, January 6, 2000.

[12] "Carmel 10xx Users Manual", Infineon Technologies, June 16, 2001.

[13] "Choosing a DSP Processor", Technical Report, Berkeley Design Technology, Inc., 2000.

[14] "Ceva-X Architecture, Ceva-X 1620 Datasheet", Ceva Inc., 2003.

[15] "CPM Interrupt Controller", Motorola Inc., 2002.

[16] "DECchip 21064-AA Microprocessor Hardware Reference Manual", Digital Equipment Corporation, 1992.

[17] "DECchip 21064 and DECchip 21064a Alpha AXP Microprocessors Hardware Reference Manual", EC-Q9ZUA-TE, DEC, Maynard, Massachusetts, 1994.

[18] "Dhrystone Benchmark, History, Analysis, Scores and Recommendations", White Paper, October 1, 2002.

[19] "DSP 16xx, Programmers Guide", Lucent, 1997.

[20] "DSP56000 - 24 Bit Digital Signal Processor Family Manual", DSP56KFAMUM/AD, Motorola Inc., Austin, Texas, USA, 1995.

[21] "DSP56002 - 24 Bit Digital Signal Processor Users Manual", DSP56KFAMUM/AD, Motorola Inc., Austin, Texas, USA, 1995.

[22] "DSP 56300 Family Manual", Motorola Inc., DSP56300FM/AD, Revision 0, May 2001.

[23] "DSP Compilers: Challenges for Efficient DSP Code Generation", DACS Software Pvt. Ltd., Version 1.0, March 2001.

[24] "DSP-C Emulation from ACE Associated Compiler Experts Offers Design Flow Breakthrough for New Architectures", ACE Associated Compiler Experts, Orlando, Florida, USA, November 1, 1999.

[25] "Evaluating DSP Processor Performance", Technical Report, Berkeley Design Technology, Inc., 2000.

[26] "How to Implement a Viterbi Decoder on the Starcore SC140", Motorola Inc. Application Note, ANSC140VIT/D, Alpha Release, July 18, 2000.

[27] "Hyper-Threading Technology Architecture and Micro-Architecture", Intel Technology Journal, Volume 6, Issue 1, February 14, 2002.

[28] "Inside the StarCore SC140", Berkeley Design Technology, Inc. (BDTi), Berkeley, California, USA, 2000.

[29] "OAKDSP Core, Programmers Reference Manual", Siemens, January 1998.

[30] "Power PC603 RISC Microprocessor Technical Summary", MPC603/D, Motorola Inc., 1994.

[31] "Power PC620 RISC Microprocessor Technical Summary", MPC620/D, Motorola Inc., 1994.

[32] "SC140 DSP Core Reference Manual", Motorola Inc., MNSC140CORE/D, Revision 3, November 2001.

[33] "TMS320C54x DSP Reference Set, Volume 1: CPU and Peripherals", Texas Instruments, SPRU131G, March 2001.

[34] "TMS320C55x DSP CPU Reference Guide, Preliminary Draft", SPRU371D, Texas Instruments, May 2001.

[35] "TMS320C55x Technical Overview", SPRU393, Texas Instruments, February 2000.

[36] "TMS320C6000 CPU and Instruction Set Reference Guide", Texas Instruments, October 2000.

[37] "TMS320C6000 Optimizing Compiler Tutorial", Texas Instruments, SPRH046, 2001.

[38] "TMS320C6201 Technical Overview", Texas Instruments, SPRS051G, January 1997 (revised November 2000).

[39] "TMS320C64x Technical Overview", Texas Instruments, SPRU395, February 2000.

[40] "Tricore 2, 32-bit Unified Processor Core, V2.0 Architecture", Infineon Technologies, June 2003.

[41] "Xtensa Architecture and Performance, White Paper", Tensilica Inc., September 2002.

[42] "ZSP 400, Digital Signal Processor, Architecture", LSI Logic Corporation, DB14-000121-03 (Fourth Edition), December 2001.

[43] A.V. Aho, J.D. Ullman, "Principles of Compiler Design", Addison-Wesley, Narosa, 1999.

[44] D. Albert, D. Avnon, "Architecture of the Pentium Microprocessors", IEEE Micro, pp. 11-21, June 1993.

[45] R. Allen, K. Kennedy, "Optimizing Compilers for Modern Architectures", Morgan Kaufmann Publishers, September 2001.

[46] A.W. Appel, "Modern Compiler Implementation in C", Cambridge University Press, 2000.

[47] R. Arnold, F. Mueller, D. Whalley and M. Harmon, "Bounding Worst-Case Instruction Cache Performance", IEEE Real-Time Systems Symposium, pp. 172-181, December 1994.

[48] P.J. Ashenden, "The Designer's Guide to VHDL", Morgan Kaufmann Publishers, San Francisco, California, USA, 1995.

[49] K. Baldwin et al., "Guidelines for Efficient C Code Generation in Accumulator Based DSPs - the C54x as an Example", Texas Instruments Inc., ESC 1999.

[50] S. Basumallick, K. Nilsen, "Cache Issues in Real-Time Systems", in Proceedings of the 1st ACM SIGPLAN Workshop on Language, Compiler and Tool Support for Real-Time Systems, June 21, 1994.

[51] M. Benes, A. Wolfe, S.M. Nowick, "A High-Speed Asynchronous Decompression Circuit for Embedded Processors", in Proceedings of the 17th Conference on Advanced Research in VLSI (ARVLSI'97), pp. 219-236, September 15-16, 1997.

[52] L. Benini, M. Favalli, B. Ricco, "Analysis of Hazard Contribution to Power Dissipation in CMOS IC's", in IEEE International Workshop on Low Power Design, pp. 27-32, May 1994.

[53] M. Bramberger, "Design of a Cache Structure for a DSP Concept in Full-Custom Design", Master Thesis, University of Technology, Graz, Austria, January 2002.

[54] P. Briggs, K.D. Cooper, L. Torczon, "Coloring Register Pairs", ACM Letters on Programming Languages and Systems, pp. 3-13, March 1992.

[55] P. Briggs, K.D. Cooper, L. Torczon, "Improvements to Graph Coloring Register Allocation", ACM Transactions on Programming Languages and Systems, Volume 16, Issue 3, pp. 428-455, May 1994.

[56] T.D. Burd, R.W. Brodersen, "Design Issues for Dynamic Voltage Scaling", in Proceedings of the 2000 International Symposium on Low Power Electronics and Design, Rapallo, Italy, pp. 9-14, 2000.

[57] B. Burgess, N. Ullah, P. Overen, D. Ogden, "The PowerPC 603 Microprocessor", Communications of the ACM, pp. 34-42, June 1994.

[58] B. Burgess, M. Alexander, H. Yingh-wai, S.P. Licht, S. Mallick, D. Ogden, P. Sung-Ho, J. Slaton, "The PowerPC 603 Microprocessor: A High Performance, Low Power, Superscalar RISC Microprocessor", in Proceedings of COMPCON, San Francisco, California, USA, pp. 300-306, February 28 - March 4, 1994.

[59] J.A. Butts and G.S. Sohi, "A Static Power Model for Architects", in Proceedings of the 33rd Annual International Symposium on Microarchitecture, Monterey, California, USA, pp. 191-201, December 10-13, 2000.

[60] G.J. Chaitin, "Register Allocation and Spilling via Graph Coloring", in Proceedings of the 1982 SIGPLAN Symposium on Compiler Construction (SIGPLAN'82), Boston, Massachusetts, USA, pp. 98-105, 1982.

[61] A. Chandrakasan et al., "Design Considerations and Tools for Low-Voltage Digital System Design", in Proceedings of the 33rd Annual Design Automation Conference, Las Vegas, Nevada, USA, pp. 113-118, 1996.

[62] A. Chandrakasan, S. Sheng, R. Brodersen, "Low-Power CMOS Design", IEEE Journal of Solid-State Circuits, pp. 472-484, April 1992.

[63] G. Chen and M.D. Smith, "Reorganizing Global Schedules for Register Allocation", in Proceedings of the 1999 Conference on Supercomputing, ACM SIGARCH, ACM Press, New York, pp. 408-416, June 20-25, 1999.

[64] D.R. Coelho, "The VHDL Handbook", Kluwer Academic Publishers, Norwell, Massachusetts, USA, 1990.

[65] A. Colin, I. Puaut, "Worst Case Execution Time Analysis for a Processor with Branch Prediction", Journal of Real-Time Systems, Special Issue on Worst-Case Execution Time Analysis, pp. 249-274, April 2000.

[66] K.D. Cooper, P. Briggs, K. Kennedy, L. Torczon, "Coloring Heuristics for Register Allocation", SIGPLAN Notices, pp. 275-284, July 1989.

[67] P. Crowley, M.E. Fiuczynski, J.-L. Baer, B.N. Bershad, "Characterizing Processor Architectures for Programmable Network Interfaces", in Proceedings of the 2000 International Conference on Supercomputing, Santa Fe, New Mexico, USA, pp. 54-64, May 2000.

[68] R. Cytron, J. Ferrante, B.K. Rosen, M.N. Wegman, F.K. Zadeck, "An Efficient Method of Computing Static Single Assignment Form", in Proceedings of the 16th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Austin, Texas, USA, pp. 25-35, 1989.

[69] R. Cytron, J. Ferrante, B.K. Rosen, M.N. Wegman, F.K. Zadeck, "Efficiently Computing Static Single Assignment Form and the Control Dependence Graph", ACM Transactions on Programming Languages and Systems (TOPLAS), pp. 451-490, October 1991.

[70] B. Davari, "CMOS Technology Scaling, 0.1 µm and Beyond", in Proceedings of the International Electron Devices Meeting, pp. 555-558, 1996.

[71] A. Davis, "Tips for Writing more Efficient DSP C Code", TI's Software Development Systems Semiconductor Group, EDN Magazine, June 1997.

[72] V. De, "Leakage-Tolerant Design Techniques for High Performance Processors (Invited Paper)", in Proceedings of the 2002 International Symposium on Physical Design (ISPD'02), San Diego, California, USA, p. 28, April 7-10, 2002.

[73] V. De, S. Borkar, "Technology and Design Challenges for Low Power and High Performance", in Proceedings of the International Symposium on Low Power Electronics and Design, San Diego, California, USA, pp. 163-168, 1999.

[74] S. Dropsho, "Real-Time Penalties in RISC Processing", Technical Report TR-95-110, Department of Computer Science, University of Massachusetts-Amherst, December 12, 1995.

[75] J. Edmondson, P. Rubinfeld, R. Preston, V. Rajagopalan, "Superscalar Instruction Execution in the 21164 Alpha Microprocessor", IEEE Micro, Volume 15, Issue 2, pp. 33-43, April 1995.

[76] S. Eggers, J. Emer, H. Levy, R. Stamm and D. Tullsen, "Simultaneous Multithreading: A Foundation for Next Generation Processors", IEEE Micro, pp. 17-22, August 1997.

[77] A. El-Moursy, D.H. Albonesi, "Front-End Policies for Improved Issue Efficiency in SMT Processors", in Proceedings of the 9th International Symposium on High-Performance Computer Architecture (HPCA'03), Anaheim, California, USA, p. 31, February 8-12, 2003.

[78] P. Elbischger, "Performance and Architectural Analysis of a Digital Signal Processor", Master Thesis, University of Technology, Graz, Austria, June 2001.

[79] J. Engblom, "Processor Pipelines and Static Worst-Case Execution Time Analysis", PhD Thesis, Department of Computer Systems, Uppsala University, Uppsala, Sweden, 2002.

[80] J. Engblom, A. Ermedahl, M. Sjödin, "Worst-Case Execution-Time Analysis for Embedded Real-Time Systems", in Software Tools for Technology Transfer (STTT), Special Issue on ASTEC, 2001.

[81] A. Ermedahl, J. Gustafsson, "Deriving Annotations for Tight Calculation of Execution Time", in Proceedings of Euro-Par'97 Parallel Processing, Springer Verlag, pp. 1298-1307, August 1997.

[82] J. Ernst, W. Evans, C.W. Fraser, S. Lucco, T.A. Proebsting, "Code Compression", in Proceedings of the ACM SIGPLAN'97 Conference on Programming Language Design and Implementation (PLDI), Las Vegas, Nevada, USA, pp. 358-365, June 15-18, 1997.

[83] C. Ferdinand, F. Martin, R. Wilhelm, "Applying Compiler Techniques to Cache Behavior Prediction", in Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers and Tools for Real-Time Systems, Las Vegas, Nevada, USA, June 15, 1997.

[84] G. Fettweis, "Embedded SIMD Vector Signal Processor Design", SAMOS Workshop, Samos, Greece, July 21-23, 2003.

[85] K. Flautner, N.S. Kim, S. Martin, D. Blaauw, T. Mudge, "Drowsy Caches: Simple Techniques for Reducing Leakage Power", International Symposium on Computer Architecture, Anchorage, Alaska, pp. 148-157, May 25-29, 2002.

[86] C.W. Fraser, D.R. Hanson, T.A. Proebsting, "Engineering a Simple, Efficient Code-Generator Generator", ACM Letters on Programming Languages and Systems (LOPLAS), Volume 1, Issue 3, pp. 213-226, September 1992.

[87] L. George, A.W. Appel, "Iterated Register Coalescing", ACM Transactions on Programming Languages and Systems, pp. 300-324, May 1996.

[88] V.S. Gierenz, R. Schwann, T.G. Noll, "A Low Power Digital Beamformer for Handheld Ultrasound Systems", in Proceedings of the ESSCIRC 2001, Villach, Austria, pp. 276-279, September 18-20, 2001.

[89] J. Glossner, E. Hokenek, M. Mondgibl, "An Overview of the Sandbridge Processor Technology", White Paper, Sandbridge Technology Inc., 2002.

[90] J.R. Goodman, W. Hsu, "Code Scheduling and Register Allocation in Large Basic Blocks", International Conference on Supercomputing, St. Malo, France, pp. 442-452, 1988.

[91] J. Gustafsson, "Analyzing Execution-Time of Object-Oriented Programs Using Abstract Interpretation", PhD Thesis, Department of Computer Systems, Information Technology, Uppsala University, May 2000.

[92] J. Handy, "The Cache Memory Book", Academic Press, 1998.

[93] C. Healy, M. Sjödin, V. Rustagi, D. Whalley, "Bounding Loop Iterations for Timing Analysis", in Proceedings of the 4th IEEE Real-Time Technology and Applications Symposium (RTAS'98), p. 12, June 3-5, 1998.

[94] C. Healy, R. Arnold, F. Müller, D. Whalley, M. Harmon, "Bounding Pipeline and Instruction Cache Performance", IEEE Transactions on Computers, Volume 48, Issue 1, pp. 53-70, January 1999.

[95] F. Hedley, "ARM DSP-Enhanced Extension", ARM White Paper, ARM Ltd., May 2001.

[96] J.L. Hennessy, D.A. Patterson, "Computer Architecture: A Quantitative Approach", San Mateo, CA, Morgan Kaufmann Publishers, 1996.

[97] J.L. Hennessy, D.A. Patterson, "Computer Organization and Design: The Hardware/Software Interface", Morgan Kaufmann Publishers, 2nd edition, 1997.

[98] U. Hirnschrott, "DSP Compiler Optimization", Master Thesis, Vienna University of Technology, Vienna, Austria, January 2001.

[99] U. Hirnschrott, A. Krall, "VLIW Operation Refinement for Reducing Energy Consumption", in Proceedings of the 2003 International Symposium on System-on-Chip (SOC'03), Tampere, Finland, pp. 131-134, November 19-21, 2003.

[100] U. Hirnschrott, A. Krall, B. Scholz, "Graph Coloring versus Optimal Register Allocation for Optimizing Compilers", in Proceedings of the International Conference on Compilers, Architectures and Synthesis of Embedded Systems (CASES'02), Grenoble, France, October 8-11, 2002.

[101] B. Huffman, M.A. Mohamed, "Flexible Length Instruction Extensions - The Next Generation Xtensa ISA", Microprocessor Forum 2002, October 16, 2002.

[102] W. Hwu, "Computer Microarchitecture: Hardware and Software", Lecture material, University of Illinois, Urbana-Champaign, 1999.

[103] E.C. Ifeachor, B.W. Jervis, "Digital Signal Processing, A Practical Approach", Prentice Hall, Second Edition, 2002.

[104] M. Johnson, D. Somasekhar, K. Roy, "Models and Algorithms for Bounds on Leakage in CMOS Circuits", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Volume 18, Number 6, pp. 714-725, June 1999.

[105] R. Johnson, M.S. Schlankser, "Analysis Techniques for Predicated Code", in Proceedings of the 29th Annual International Symposium on Microarchitecture, Paris, France, pp. 100-113, December 2-4, 1996.

[106] N.P. Jouppi, "The Nonuniform Distribution of Instruction-Level and Machine Parallelism and its Effect on Performance", in IEEE Transactions on Computers, pp. 1645-1658, December 1989.

[107] J. Kao, A. Chandrakasan, "Dual-Threshold Voltage Techniques for Low-Power Digital Circuits", in IEEE Journal of Solid-State Circuits, Volume 35, Issue 7, pp. 1009-1018, July 2000.

[108] S. Kaxiras, G. Narlikar, A.D. Berenbaum, Z. Hu, "Comparing Power Consumption of an SMT and a CMP DSP for Mobile Phone Workloads", in CASES'01, Atlanta, Georgia, USA, pp. 211-220, November 16-17, 2001.

[109] B. Kernighan, D. Ritchie, "The C Programming Language", Prentice-Hall Software Series, 2nd Edition, March 22, 1988.

[110] M. Ketkar, S.S. Sapatnekar, P. Patra, "Convexity-Based Optimization for Power-Delay Tradeoff using Transistor Sizing", in Proceedings of the IEEE/ACM International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems (Tau'00), Austin, Texas, USA, pp. 52-57, 2000.

[111] S.K. Kim, S.L. Min, R. Ha, "Efficient Worst Case Timing Analysis of Data Caching", in Proceedings of the 2nd IEEE Real-Time Technology and Applications Symposium (RTAS'96), Boston, MA, USA, pp. 230-240, June 10-12, 1996.

[112] D.J. Kuck, Y. Muraoka, S.-C. Chen, "On the Number of Operations Simultaneously Executable in Fortran-like Programs and their Resulting Speedup", IEEE Transactions on Computers, pp. 1293-1310, December 1972.

[113] M. Kuulusa, "DSP Processor Core-Based Wireless System Design", PhD Thesis, Digital and Computer Systems Laboratory, Tampere University of Technology, August 18, 2000.

[114] M. Lam, "Software Pipelining: an Effective Scheduling Technique for VLIW Machines", in Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation, Atlanta, Georgia, USA, pp. 318-328, 1988.

[115] M.S. Lam, R.P. Wilson, "Limits of Control Flow on Parallelism", in Proceedings of the 19th Annual International Symposium on Computer Architecture (ISCA), Queensland, Australia, pp. 46-57, 1992.

[116] P. Lapsley, J. Bier, A. Shoham, E.A. Lee, "DSP Processor Fundamentals, Architectures and Features", New York, IEEE Press, 1997.

[117] G. Laure, "Creation of a Configurable Component Based Framework", Draft of Master Thesis, University of Technology, Graz, Austria, June 2004.

[118] W. Lazian, "Simulation of a DSP through a Component Based Framework", Draft of Master Thesis, University of Technology, Graz, Austria, June 2004.

[119] A.R. Lebeck, D.A. Wood, "Cache Profiling and the SPEC Benchmarks: A Case Study", IEEE Computer, pp. 15-26, October 1994.

[120] C. Lefurgy, P. Bird, I.C. Chen, T. Mudge, "Improving Code Density using Compression Techniques", in Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO), Research Triangle Park, NC, USA, pp. 194-203, December 1-3, 1997.

[121] C. Lefurgy, T. Mudge, "Code Compression for DSP", in Proceedings of Compiler and Architecture Support for Embedded Computing Systems (CASES'98), December 4-5, 1998.

[122] R. Leitner, "VHDL Model of a Digital Signal Processor", Master Thesis, University of Technology, Graz, Austria, March 2001.

[123] M. Levy, "C Compilers for DSPs flex their Muscles", EDN Magazine, June 1997.

[124] S.S. Lim, Y.H. Bae, C.T. Jang, B.D. Rhee, S.L. Min, C.Y. Park, H. Shin, K. Park, C.S. Kim, "An Accurate Worst-Case Timing Analysis for RISC Processors", IEEE Transactions on Software Engineering, Volume 21, Issue 7, pp. 593-604, July 1995.

[125] J.L. Lo, S. Eggers, J.S. Emer, H.M. Levy, R.L. Stamm, D.M. Tullsen, "Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading", ACM Transactions on Computer Systems, Volume 15, Number 3, pp. 322-354, August 1997.

[126] J.L. Lo, S.S. Parekh, S.J. Eggers, H.M. Levy, D.M. Tullsen, "Software-Directed Register Deallocation for Simultaneous Multithreaded Processors", IEEE Transactions on Parallel and Distributed Systems, Volume 10, Issue 9, pp. 922-933, September 1999.

[127] F. Müller, "Timing Predictions for Multi-Level Caches", in Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers and Tools for Real-Time Systems, pp. 29-36, June 1997.

[128] S. Palnitkar, "Verilog HDL - A Guide to Digital Design and Synthesis", Sun Microsystems, Mountain View, California, USA, 1996.

[129] S. Parekh, S. Eggers, H. Levy, J. Lo, "Thread-Sensitive Scheduling for SMT Processors", Technical Report, Department of Computer Science, University of Washington, 2000.

[130] J. Park, S.M. Moon, "Optimistic Register Coalescing", in Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques (PACT'98), Paris, France, pp. 196-204, October 12-18, 1998.

[131] T. Pering, T. Burd, R. Brodersen, "The Simulation and Evaluation of Dynamic Voltage Scaling Algorithms", in Proceedings of the 1998 International Symposium on Low Power Electronics and Design (ISLPED), Monterey, California, USA, pp. 76-81, 1998.

[132] D.N. Pnevmatikatos, G.S. Sohi, "Guarded Execution and Branch Prediction in Dynamic ILP Processors", in Proceedings of the 21st Annual International Symposium on Computer Architecture (ISCA), pp. 120-129, 1994.

[133] U. Ramacher, W. Raab, N. Brüls, U. Hachmann, C. Sauer, A. Schackow, J. Gliese, J. Harnisch, M. Richter, E. Sicheneder, R. Schüffny, U. Schulze, H. Feldkämper, C. Lütkemeyer, H. Süsse, S. Altmann, "A 53-GOPS Programmable Vision Processor for Processing, Coding-Decoding and Synthesizing of Images", in Proceedings of the European Solid State Circuits Conference (ESSCIRC), Villach, Austria, September 18-20, 2001.

[134] B.R. Rau, "Iterative Modulo Scheduling: an Algorithm for Software Pipelining Loops", in Proceedings of the 27th Annual International Symposium on Microarchitecture, ACM Press, pp. 63-74, 1994.

[135] J. Rawat, "Static Analysis of Cache Performance for Real-Time Programming", Technical Report TR93-19, Iowa State University of Science and Technology, November 1993.

[136] J. Rayfield, "Using HLLs to Develop DSP Applications", ARM Inc., DSP World Fall 1999.

[137] J.A. Redstone, S.J. Eggers, H.M. Levy, "An Analysis of Operating System Behavior on a Simultaneous Multithreaded Architecture", in Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 35, Issue 11, pp. 245-256, November 2000.

[138] J. Redstone, S. Eggers, H. Levy, "Mini-threads: Increasing TLP on Small-Scale SMT Processors", in Proceedings of the 9th International Symposium on High Performance Computer Architecture (HPCA-9), pp. 19-30, February 8-12, 2003.

[139] S. Rele, S. Pande, S. Onder, R. Gupta, "Optimizing Static Power Dissipation by Functional Units in Superscalar Processors", in Proceedings of the 11th International Conference on Compiler Construction, pp. 261-275, April 8-12, 2002.

[140] R.J. Ridder, "Trends in Application Programming for Digital Signal Processing", Tasking Inc., ESC 1999.

[141] K. Roy, "Leakage Power Reduction in Low-Voltage CMOS Design", in Proceedings of the IEEE International Conference on Circuits and Systems, pp. 167-173, 1998.

[142] J. Runeson, S.-O. Nyström, "Retargetable Graph-Coloring Register Allocation for Irregular Architectures", in Proceedings of the 7th International Workshop, SCOPES 2003, Vienna, Austria, pp. 240-254, September 18-20, 2003.

[143] S. Rusu, "Trends and Challenges in VLSI Technology Scaling Towards 100 nm", Invited Talk at the European Solid State Circuits Conference (ESSCIRC 2001), Villach, Austria, September 18-20, 2001.

[144] R.H. Saavedra, A.J. Smith, "Measuring Cache and TLB Performance and Their Effect on Benchmark Runtimes", IEEE Transactions on Computers, Volume 44, Issue 10, pp. 1223-1235, October 1995.

[145] A. Schilke, "An Automatic Decoder Generator for a Scalable DSP Architecture", Master Thesis, Carinthian Tech Institute, Villach, Austria, September 2002.

[146] J. Schneider, C. Ferdinand, "Pipeline Behavior Prediction for Superscalar Processors by Abstract Interpretation", in Proceedings of the ACM SIGPLAN 1999 Workshop on Languages, Compilers and Tools for Embedded Systems, Atlanta, Georgia, USA, pp. 35-44, May 1999.

[147] B. Scholz, E. Eckstein, "Register Allocation for Irregular Architectures", in Proceedings of the Joint Conference on Languages, Compilers and Tools for Embedded Systems, ACM Press, pp. 139-148, 2002.

[148] J. Silc, B. Robic, T. Ungerer, "Processor Architecture, From Dataflow to Superscalar and Beyond", Springer-Verlag, Berlin-Heidelberg, 1999.

[149] D. Sima, T. Fountain, P. Kacsuk, "Advanced Computer Architectures: A Design Space Approach", Addison Wesley Publishing Company, Harlow, 1997.

[150] P. Simonen, I. Saastamoinen, J. Nurmi, "Variable-Length Instruction Compression for Area Minimization", in the 14th International Conference on Application-specific Systems, Architectures and Processors, The Hague, Netherlands, pp. 155-160, June 24-26, 2003.

[151] V. Sipkova, "Efficient Variable Allocation to Dual Memory Banks of DSPs", in Proceedings of the 7th International Workshop on Software and Compilers for Embedded Systems (SCOPES'03), Vienna, Austria, pp. 359-372, September 2003.

[152] M.D. Smith, G. Holloway, "Graph-Coloring Register Allocation for Irregular Architectures", PLDI'01, 2001.

[153] J.E. Smith, "A Study of Branch Prediction Strategies", in Proceedings of the 8th Annual Symposium on Computer Architecture (ISCA), Minneapolis, Minnesota, USA, pp. 135-148, 1981.

[154] F. Stappert, P. Altenbernd, "Complete Worst-Case Execution Time Analysis of Straight-line Hard Real-Time Programs", Journal of Systems Architecture, Volume 46, Number 4, pp. 339-355, February 2000.

[155] F. Stappert, J. Engblom, A. Ermedahl, "Efficient Longest Executable Path Search for Programs with Complex Flows and Pipeline Effects", in Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, Atlanta, Georgia, USA, pp. 132-140, 2001.

[156] A. Stratakos, R.W. Brodersen, S.R. Sanders, "High-Efficiency Low-Voltage DC-DC Conversion for Portable Applications", in Proceedings of the 1994 International Workshop on Low-Power Design, Napa Valley, CA, April 1994.

[157] R. Sucher, R. Niggebaum, G. Fettweis, A. Rom, "Carmel - A New High Performance DSP Core Using CLIW", 9th Annual International Conference on Signal Processing Applications and Technology, pp. 499-503, September 1998.

[158] S. Thompson, P. Packan, M. Bohr, "CMOS Scaling: Transistor Challenges for the 21st Century", Intel Technology Journal, Q3 1998.

[159] G.S. Tjaden, M.J. Flynn, "Detection and Parallel Execution of Independent Instructions", IEEE Transactions on Computers, Vol. C-19, No. 10, pp. 889-895, October 1970.

[160] D.M. Tullsen, S.J. Eggers, H.M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism", in Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA'95), Santa Margherita Ligure, Italy, p. 392, June 22-24, 1995.

[161] H.J.M. Veendrick, "Short-Circuit Dissipation of Static CMOS Circuitry and its Impact on the Design of Buffer Circuits", IEEE Journal of Solid State Circuits, Volume SC-19, pp. 468-473, August 1984.

[162] G. Vink, "Programming DSPs using C: Efficiency and Portability Trade-offs", Journal of Embedded Systems, May 2000.

[163] K. Vögler, "A DSP C-Compiler", Master Thesis, Vienna University of Technology, Vienna, Austria, April 2002.

[164] H. Wang, P.H. Wang, R.D. Weldon, S.M. Ettinger, H. Saito, M. Girkar, S. Shih-wei Liao, J.P. Shen, "Speculative Precomputation: Exploring the Use of Multithreading for Latency Tolerance", Intel Technology Journal, Volume 6, Issue 1, Q1 2002.

[165] L. Wanhammar, "DSP Integrated Circuits", Academic Press, February 1999.

[166] H.S. Warren, "Instruction Scheduling for the IBM RISC System/6000 Processor", IBM J. Res. Dev., pp. 85-92, 1990.

[167] O. Weiss, M. Gansen, T.G. Noll, "A Flexible Datapath Generator for Physical Oriented Design", in Proceedings of the ESSCIRC 2001, Villach, Austria, pp. 408-411, September 18-20, 2001.

[168] D.B. Whalley, R. Arnold, F. Mueller, M. Harmon, "Bounding Worst-Case Instruction Cache Performance", IEEE Symposium on Real-Time Systems, pp. 172-181, December 1994.

[169] WG14 N1005, "Programming Languages, their Environments and System Software Interfaces - Extensions for the Programming Language C to Support Embedded Processors", ISO/IEC DTR 18037.2, April 25, 2003.

[170] R. White, F. Müller, C. Healy, D. Whalley, M. Harmon, "Timing Analysis for Data Caches and Set-Associative Caches", in Proceedings of the 3rd IEEE Real-Time Technology and Applications Symposium (RTAS'97), Montreal, Canada, pp. 192-202, June 9-11, 1997.

[171] M. Willems, H. Keding, V. Zivojnovic, "Modulo-Addressing Utilization in Automatic Software Synthesis for Digital Signal Processors", in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-97), Munich, Germany, Volume 1, pp. 687-690, April 21-24, 1997.

[172] R. Yates, "Fixed-Point Arithmetic: An Introduction", Digital Sound Labs, March 3, 2001.

[173] T.Y. Yeh, Y.N. Patt, "Alternative Implementations of Two-Level Adaptive Branch Prediction", in Proceedings of the 19th Annual International Symposium on Computer Architecture (ISCA), Queensland, Australia, pp. 124-134, 1992.

[174] S. Zammattio, "How to Reduce Time-to-Market for System-on-Chip Design", White Paper, ARC International, 2002.

[175] W. Zhang, N. Vijaykrishnan, M. Kandemir, M.J. Irwin, D. Duarte, Y-F. Tsai, "Exploiting VLIW Schedule Slacks for Dynamic and Leakage Energy Reduction", in Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, IEEE Computer Society, Austin, Texas, USA, pp. 102-113, 2001.

[176] W.M. Zuberek, "Performance Comparison of Fine-Grain and Block Multithreaded Architectures", in Proceedings of High Performance Computing (HPC 2000), Advanced Simulation Technologies Conference, Washington DC, USA, pp. 383-388, April 16-20, 2000.


PART II

PUBLICATIONS


PUBLICATION 1

C. Panis, J. Nurmi, "xDSPcore - a Configurable DSP Core", Technical Report 1-2004, Tampere University of Technology, Institute of Digital and Computer Systems, Tampere, Finland, May 2004.

©2004 Tampere University of Technology. Reprinted, with permission, from Technical Report 1-2004.


xDSPcore - a Configurable DSP Core Christian Panis (Carinthian Tech Institute)

Jari Nurmi Institute of Digital and Computer Systems

Tampere University of Technology Tampere, Finland

Abstract Exponentially increasing mask costs for submicron technologies have led to a strong demand for reconfigurable hardware and software-programmable embedded cores for SoC (System-on-Chip) applications. General-purpose DSP cores suffer from inadequate usage of the core resources and therefore from an overhead in silicon area and power dissipation. The programs executed on these cores are written in assembly language to prevent even more overhead caused by poorly performing C compilers. This paper introduces xDSPcore, a configurable DSP core architecture efficiently programmable in C. The main architectural features influencing overall system performance can be scaled or configured to meet application-specific requirements and thereby reduce wasted silicon area and power dissipation. Programming in C enables architecture-independent description of the algorithms and overcomes software compatibility issues. A methodology called DSPxPlore is used for design space exploration, to understand the requirements of the application in an early project phase and refine the core architecture configuration.

Introduction Increasing complexity of SoC applications increases the demand for powerful embedded cores. The flexibility provided by software-programmable cores quite often leads to increased silicon area and power dissipation; therefore dedicated hardware has been favored over software-based platform solutions. The picture is changing: significantly increasing mask costs due to advanced process technologies, together with the difficulty of bringing to a heterogeneous market the high-volume products that would justify the high non-recurring cost, increase the pressure to develop product platforms. Such a platform serves a group of applications, and the software executed on its programmable core architectures is used to differentiate the products.

Embedded general-purpose core architectures use the core resources inefficiently. Each class of applications has requirements which cannot be efficiently covered by a general-purpose architecture. To close the gap between dedicated hardware and software-based solutions, a platform-specific adaptation of the core architecture is required.

For embedded DSP cores (Digital Signal Processors) an additional problem is acute. Due to non-orthogonal core architectures, which have been preferred because they obtain better performance and consume less area when mapping DSP algorithms onto a processor, DSPs are still programmed manually in assembly language [1]. The price for the better usage of the available processor resources is an architecture-dependent description of the algorithms, which makes changes in the core architecture difficult and costly (due to compatibility issues) and prohibits application-specific adaptations. Therefore products based on a programmable core architecture stick to the same architecture for a long time frame, even when it is no longer state-of-the-art. The additional risks and costs of changing the core architecture thus lead to solutions that are not competitive.

Additional consequences of using assembly language are long development cycles. Ten years ago, algorithms executed on DSP cores consisted of several hundred lines of code. Manual coding was reasonable even if minor changes in the application code required several weeks of coding and verification. Today's DSP cores are more powerful and can execute large programs consisting of several hundred thousand lines of code. DSP cores are not used only for filtering operations any more; especially in low-cost products, where for cost reasons not more than one core is reasonable, the control code is also executed on the DSP core.

To increase the performance of DSP subsystems, a high degree of parallelism and deep pipeline structures are provided. Unfortunately, manually programming highly parallelized DSP core architectures with deep pipelines, and manually resolving the data and control dependencies, is of limited feasibility or even impossible. Therefore the motivation of using assembly code to increase the use of the available resources is not valid any more.

In late 1999, the development of xDSPcore was started based on these considerations. To overcome the programming problems and the increased time-to-market pressure, one aim of this project is to provide a DSP core architecture which can be efficiently programmed in C. Efficiently means in this case less than 10% overhead at application level compared with manual assembly programming. To address area consumption and power dissipation, the main architectural features of the DSP core can be configured to application-specific requirements. Compatibility issues between different core architectures are covered by the architecture-independent algorithm description (e.g. Embedded-C [2]) and automatic assembly generation.
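As an illustration of such an architecture-independent description, a typical DSP kernel can be written in plain C and left to the compiler to map onto the core's MAC resources. The sketch below is hypothetical (it does not appear in the paper); the explicit saturation shown here is exactly what Embedded-C [2] fixed-point types would express directly:

```c
#include <stdint.h>

/* Hypothetical example: a Q15 FIR filter in plain, architecture-
 * independent C. The compiler maps the multiply-accumulate loop onto
 * the core's MAC units; saturation is written out explicitly. */
static int16_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;   /* clamp on overflow  */
    if (v < INT16_MIN) return INT16_MIN;   /* clamp on underflow */
    return (int16_t)v;
}

int16_t fir(const int16_t *x, const int16_t *h, int taps)
{
    int32_t acc = 0;                       /* wide accumulator   */
    for (int i = 0; i < taps; i++)
        acc += (int32_t)x[i] * h[i];       /* Q15 * Q15 -> Q30   */
    return sat16(acc >> 15);               /* back to Q15 sample */
}
```

On a real DSP core the accumulator would typically carry guard bits (e.g. 40 bits) so that longer filters do not overflow the 32-bit accumulator assumed here.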

Unlike available DSP cores, whose architecture is mainly influenced by algorithmic aspects and limited by hardware restrictions, xDSPcore considered aspects of software development already during the definition of the instruction set and the core architecture, to enable the development of an efficient C compiler.

This paper introduces xDSPcore. The first part gives an overview of the xDSPcore architecture. The requirements of the C compiler (such as orthogonality and avoiding configuration and mode registers) and the aspect of scalability have influenced the main architectural features. The motto "keep it simple" helps to develop efficient tools and to allow changes in the implementation.

The second part introduces DSPxPlore, the design space exploration methodology for xDSPcore. A configurable and scalable DSP core leads to the question: "which kind of core architecture is well suited to meet my application requirements?" DSPxPlore can be used to evaluate how efficiently application code can make use of the resources provided by the DSP core. These results can be used to refine the HW/SW partitioning and the core architecture, optimizing the core subsystem to the application requirements. DSPxPlore thus helps to close the gap between dedicated hardware and software-based solutions.
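The kind of figure such an exploration produces can be sketched as a simple issue-slot utilization metric; the structure and numbers below are invented for illustration and are not DSPxPlore's actual interface:

```c
/* Hypothetical design-space-exploration metric: given a cycle profile
 * of compiled application code, estimate how well the issue slots of a
 * candidate VLIW configuration are used. Low utilization suggests the
 * core could be narrowed; utilization near 1.0 suggests widening. */
typedef struct {
    long cycles;     /* total execution cycles of the application  */
    long ops_issued; /* operations the compiler actually scheduled */
} profile_t;

double slot_utilization(profile_t p, int issue_width)
{
    return (double)p.ops_issued / ((double)p.cycles * (double)issue_width);
}
```

For example, 300,000 operations issued over 120,000 cycles on a 4-way configuration give a utilization of 0.625, hinting that a narrower configuration might suffice for that application.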

The third part discusses aspects of configurability. To keep the configuration parameters consistent, a unique XML-based configuration file is used; its structure is introduced at the beginning of part three. The configuration file is used for generating parts of the VHDL RTL code, as the basis for the tool chain, and for automatically updating the documentation. For illustration, the VHDL code generator, xSIM (a cycle-true Instruction Set Simulator, ISS), and the documentation generation flow are briefly introduced.
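Such a configuration file might look like the following fragment; the element and attribute names are invented for illustration, since the paper does not reproduce the actual schema:

```xml
<!-- Hypothetical xDSPcore configuration fragment (invented tag names).
     One file drives RTL generation, the tool chain and documentation. -->
<core name="example_audio_config">
  <datapath mac_units="2" alu_units="2" accumulator_bits="40"/>
  <registers data="16" address="8"/>
  <memory program_word_bits="128" data_banks="2"/>
  <pipeline stages="5"/>
</core>
```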

State-of-the-art This section briefly introduces available concepts and positions the DSP core project presented in this paper.

Traditional DSP core architectures The latest announcements and products based on traditional DSP core architectures are the StarCore SC1200/SC1400 [3], announced by StarCore LLC, a cooperation between Motorola, Agere and Infineon Technologies, and Blackfin [4], the outcome of a cooperation between Analog Devices and Intel. Both core concepts are RISC (Reduced Instruction Set Computer) based load-store architectures, claiming to be efficiently programmable in high-level languages like C/C++.

The ISA (Instruction Set Architecture) and the core architecture are fixed, which prevents the application-specific modifications required to close the gap between hardwired implementations and software-based solutions.

Scalable core architectures The most famous representatives of scalable cores are Tensilica and ARC [5]. Both concepts are based on traditional microcontroller architectures; e.g. Tensilica's Xtensa [6] is based on the MIPS architecture. Therefore efficient implementation of traditional DSP algorithms is not possible, and aspects like minimizing the worst-case execution time are not sufficiently covered. The software support for using the DSP-specific features is quite insufficient. By adding "just an additional MAC unit" the main focus is to increase theoretical performance, instead of analyzing the overall system performance.

Architecture Description Languages LISA (Language for Instruction Set Architecture) from CoWare [7], mainly developed at RWTH Aachen [8], is the most famous architecture description language. Later projects include e.g. the ArchC project in Brazil [9]. The concept of defining your own specific core architecture fulfilling the requirements of your application code sounds brilliant. However, automatically generating the core microarchitecture from a behavioral-level description results in poorly used silicon.

The large solution space provided by these concepts allows describing any kind of core architecture, but only a few of the described architectures allow the development of an optimizing high-level language compiler. Most design teams are interested in integrating a core into their System-on-Chip (SoC) or System-in-Package (SiP) solution. Using an architecture description language like LISA to generate efficient solutions requires deep processor architecture knowledge.

Design space exploration, which is strongly required when supporting scaling or configuration of a core architecture, has to be based on a high-level language compiler. The automatic generation of high-level language compilers is still not feasible, and even with approaches like COSY from ACE [10], the quality of the code produced by an automatically generated compiler is poor. The quality of the generated code matters because it is the basis for architectural modifications; poor results can even be misleading. The problem can be summarized as a chicken-and-egg problem: understanding the requirements an application places on a core architecture requires an efficient high-level language compiler, which is not feasible to generate for each candidate core architecture.

Summary of state-of-the-art The concepts briefly introduced in this section can be split into three groups: traditional DSP core architectures, lacking the possibility of application-specific adaptations; scalable core architectures,


lacking support for efficiently implementing DSP algorithms; and, last but not least, architecture description languages, which open a large design space but lack efficient software support and mapping to hardware implementations, which is a strong requirement for success.

The concept of xDSPcore introduced in this paper addresses the problems described above. xDSPcore is a general-purpose DSP core architecture based on a RISC load-store concept that enables efficient execution of traditional DSP algorithms. This also covers system aspects like the possibility of minimizing the worst-case execution time. To close the gap between hardwired ASIC implementations and software-based solutions, the core concept enables scaling of the main architectural features while the micro-architectural concept remains unchanged. The micro-architecture has been defined with the requirements of an optimizing C-compiler in mind, which makes it possible to analyze the design space for a given application code. To keep the validation and verification effort low, a unique XML-based configuration file is introduced which allows scaling the core features without having to ensure manually that all changes are reflected in hardware, tool-chain and documentation.


DSP Core Architecture This section briefly introduces the xDSPcore architecture. Already during the definition of the instruction set architecture and micro-architecture, the development of an efficient C-compiler was considered [11]. To enable an area- and power-efficient DSP subsystem, the architectural features with significant influence on area and power consumption can be configured. Adapting the core subsystem to application-specific requirements reduces the gap between systems based on dedicated hardware and software-based solutions. The configuration aspects are considered in a later section.

Overview xDSPcore is based on a modified dual-Harvard load-store architecture [11]. VLIW (Very Long Instruction Word) is used as the programming model, where static scheduling reduces core complexity since data and control dependencies are resolved at compile time [12]. Figure 1 gives a generic overview of the core architecture. Two data buses connect the data memory with the DSP core; to prevent address conflicts, the addresses are interleaved. An independent bus connects the program memory and is used for fetching instructions. Data and program memory have separate address spaces. The register file plays a central role in the core architecture: load/store architectures feature instructions for moving entries from data memory to the register file and back, while separate instructions are used for coding arithmetic functions. The size of the memory ports is scalable (therefore no values are assigned in Figure 1). An instruction buffer between program memory and instruction decoder increases code density and reduces power dissipation during the execution of loop constructs, which are quite often a central part of traditional DSP algorithms.
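The bank-conflict behavior of such an interleaved data memory can be sketched as follows. This is an illustrative model, not the actual xDSPcore hardware logic; a two-bank layout is assumed, matching the two data buses of the example architecture:

```python
# Illustrative model of interleaved data memory banks (assumption: two
# banks, consecutive word addresses alternate between them). Two parallel
# accesses conflict only when both map to the same physical bank.

NUM_BANKS = 2  # assumed; the thesis leaves the memory port sizes configurable

def bank_of(address: int) -> int:
    """Bank selection for an interleaved memory layout."""
    return address % NUM_BANKS

def conflicts(addr_a: int, addr_b: int) -> bool:
    """True if two parallel accesses hit the same bank and would have to
    be serialized at run time, as described for the xDSPcore AGUs."""
    return bank_of(addr_a) == bank_of(addr_b)
```

With this layout, accesses to consecutive addresses (e.g. 0 and 1) proceed in parallel, while two even addresses (e.g. 0 and 2) collide and are serialized.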

Figure 1: Core Overview.

xDSPcore features a 3-phase RISC pipeline with instruction fetch, decode and execute. To increase the reachable clock frequency (especially to relax the timing at the memory ports), the three pipeline phases can be split over several clock cycles. The example architecture illustrated in Figure 2 uses five clock cycles to map the three basic pipeline phases.

Figure 2: Pipeline.

The instruction fetch phase is split over two clock cycles (fetch and align), the decode stage takes one clock cycle, and the execution phase is again split over two clock cycles (EX1, EX2). Spending several clock cycles on one pipeline phase can be used to reach higher core clock frequencies, but considering system aspects it can even reduce system performance: using several clock cycles for the instruction fetch phase increases the number of branch delays,


and additional clock cycles for the execution phase increase the load-in-use and define-in-use dependencies [12]. To consider the trade-off between reachable clock frequency and system performance, DSPxPlore is introduced in a later section. All instructions of xDSPcore can be assigned to one of three operation classes: load/store instructions used to transfer data between data memory and register file, arithmetic instructions (including miscellaneous instructions like interrupt disable), and branch instructions influencing the program flow. Each instruction consists of one or two instruction words. The first three bits of the instruction word are used for operation-class and alignment information. The parallel word (as illustrated in Figure 3) is used for long offset and long immediate values.

Figure 3: Instruction Coding.
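The text fixes only that the first three bits of an instruction word carry operation-class and alignment information; a hypothetical decode of such a header might look as follows. The concrete bit values and field layout below are invented for illustration and are not taken from the actual xDSPcore coding:

```python
# Hypothetical header decode: the thesis states that the first three bits
# hold operation class and alignment information, but does not publish the
# concrete encoding. The class values and the use of the low header bit as
# a "parallel word follows" flag are assumptions for illustration only.

OP_CLASS = {
    0b00: "load/store",
    0b01: "arithmetic",
    0b10: "branch",
}

def decode_header(word: int, width: int = 16):
    """Split an instruction word into (operation class, long-word flag)."""
    header = (word >> (width - 3)) & 0b111       # top three bits
    op_class = OP_CLASS[(header >> 1) & 0b11]    # assumed class field
    has_parallel_word = bool(header & 0b1)       # assumed alignment flag
    return op_class, has_parallel_word
```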

Data copy operations between registers of the register file can be omitted thanks to 3-operand instructions and an orthogonal register file. To prevent limitations during instruction scheduling, no mode bits are used: in contrast to available core architectures where mode bits are used to increase code density, all functions are coded in instruction words.

Figure 4: Parallelism.

To allow efficient use of the core resources, the number of instructions executed in parallel can be adapted to the application. The example architecture in Figure 4 allows the execution of two load/store, two arithmetic and one branch instruction in parallel. As already mentioned, VLIW is used as the programming model. To overcome the low code density of traditional VLIW architectures, xLIW (a scalable long instruction word) is introduced. xLIW is based on VLES (Variable Length Execution Set) and additionally supports a reduced program memory port; it is introduced in detail in [13]. The example architecture illustrated in Figure 1 features two buses to data memory. Two independent AGUs (Address Generation Units) enable generating two independent addresses. Each AGU can make use of all address registers; they are not banked. If the addresses generated in parallel access the same physical memory block, the hazard is detected at run time and the memory operations are serialized. The AGU of xDSPcore supports all common DSP addressing modes, like memory-direct, register-direct and register-indirect addressing. The auto-increment/decrement address operation supports pre- and post-address calculation and efficient stack frame addressing. The size of the modulo buffers for modulo addressing schemes is programmable; the start address has to be aligned.
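The programmable modulo buffer addressing described above can be sketched as a post-increment that wraps inside a circular buffer. This is an illustrative model assuming the aligned start address the core requires; the actual AGU implementation is of course hardware:

```python
# Sketch of modulo (circular-buffer) post-increment addressing: the next
# address always stays inside the buffer [start, start + size). The fixed
# pairing of address and modifier registers (r0 with m0, etc.) described in
# the text would supply `size` from the modifier register.

def modulo_post_increment(addr: int, step: int, start: int, size: int) -> int:
    """Return the next address inside the circular buffer after a
    post-increment by `step`."""
    return start + (addr + step - start) % size
```

For a buffer of 8 words starting at 0x100, stepping past the last word wraps back to the start, which is exactly what makes delay lines and FIR sample windows cheap to implement.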

Figure 5: Register Files.


The load-store architecture implies that all operands for arithmetic instructions are fetched from the register file. The structure of the register file and the number and size of registers are configurable; Figure 5 illustrates the register file of the example architecture. The register file is split into three parts: a data register file, an address register file and a branch file.

Figure 6: Data Register File.

The data register file shown in Figure 6 consists of 8 accumulators, 8 long registers or 16 data registers. Two consecutive data registers can be addressed as a long register; a long register including additional guard bits (for higher-precision calculation, e.g. 8 additional bits) can be addressed as an accumulator. The register file is orthogonal, which means that each register can be used for each operation and none of them is assigned to a certain instruction or has a predefined functionality. The drawback of an orthogonal register file is the crossbar needed to map the read and write ports to the registers. The address register file contains 8 address registers and 8 modifier registers. Each modifier register is directly coupled to its corresponding address register; it is therefore not possible to use, e.g., address register r0 with modifier register m7. The modifier registers are used for modulo addressing and for bit-reversal addressing, which can be used to optimize FFT implementations.
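The bit-reversal addressing supported by the modifier registers corresponds to the classic FFT reordering scheme. A software sketch of the index transformation, assuming a power-of-two buffer (the hardware performs this as part of address generation):

```python
# Bit-reversal index transformation as used for FFT result reordering:
# the low `bits` bits of the index are mirrored, so e.g. for an 8-point
# FFT (bits=3) index 6 (110) maps to 3 (011).

def bit_reverse(index: int, bits: int) -> int:
    """Reverse the low `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result
```

Stepping through `bit_reverse(i, log2(N))` for i = 0..N-1 visits the bit-reversed access pattern that a radix-2 FFT needs, without any explicit reordering pass over the data.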

Figure 7: Branch File.

The third part of the register file is the branch file. The branch file contains flags reflecting the status of the core and the related register files. A separate branch file is used to avoid adding more read and write ports to the data and address register files, which are already heavily loaded due to the orthogonality requirements. xDSPcore supports a rich set of conditions for conditional branch instructions and predicated execution [14]. Predicated execution is used to reduce the number of branch instructions and therefore the number of unused branch delays [15], [16], [17], [18].
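The effect of predicated execution can be illustrated by if-conversion: a short conditional is replaced by operations executed under a predicate, so no branch (and no branch delay slots) are needed. A behavioral sketch, not the actual xDSPcore instruction semantics:

```python
# If-conversion sketch: the branching form of abs() costs a taken branch
# (plus delay slots on a deep pipeline); the predicated form computes a
# condition flag and selects the result without any control transfer.

def absolute_with_branch(x: int) -> int:
    """Branching form: control flow depends on the data."""
    if x < 0:
        return -x
    return x

def absolute_predicated(x: int) -> int:
    """Predicated form: a flag (as held in the branch file) selects
    between the results of two unconditionally issued operations."""
    p = x < 0                  # condition flag
    neg, pos = -x, x           # both candidate results are computed
    return neg if p else pos   # select under predicate p (no branch taken)
```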

Figure 8: SIMD Cross Operation.

The instruction set of xDSPcore features 3-operand arithmetic instructions. Each arithmetic instruction is available for all supported data types (data, long, accu). SIMD instructions are


also used to increase code density and system performance [11]. xDSPcore supports integer and fractional data types. The SIMD-cross instructions illustrated in Figure 8 reduce the number of data move instructions between registers. For efficient implementation of control code (e.g. framing), bit field instructions are included (insertion, extraction). xDSPcore features a combination of hardware and software stack. The hardware stack is used for automatically storing the program counter; its handling requires no clock cycles or instructions, and the number of hardware stack entries can be configured. For recursive function calls a software stack is supported. A separate instruction is available to move the last hardware stack entry into the register file, as illustrated in Figure 9, and a software stack can be built up in data memory with regular load/store instructions. When returning from a function call, the jump-register-indirect instruction is used to preload the program counter with a register value.
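The bit field instructions mentioned above (insertion, extraction) behave like the usual mask-and-shift sequences they replace; a sketch of their semantics, with illustrative field positions and widths:

```python
# Semantics sketch of bit field extraction and insertion, the kind of
# operation xDSPcore provides as single instructions for control code
# such as framing. Positions/widths below are illustrative.

def extract(word: int, pos: int, width: int) -> int:
    """Extract a `width`-bit field starting at bit `pos`."""
    return (word >> pos) & ((1 << width) - 1)

def insert(word: int, field: int, pos: int, width: int) -> int:
    """Insert `field` into a `width`-bit slot at bit `pos`,
    leaving all other bits of `word` untouched."""
    mask = ((1 << width) - 1) << pos
    return (word & ~mask) | ((field << pos) & mask)
```

Done in software, each of these is a multi-instruction mask-and-shift sequence; as dedicated instructions they take a single slot in an execution bundle.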

Figure 9: Stack Concept (program counter, hardware stack with stack pointer, register file; pop and jump-register-indirect operations).

Today's DSP cores have to provide efficient interrupt handling, characterized by low latency and small overhead during task switching. The xICU (scalable interrupt control unit) of xDSPcore supports low-latency interrupt handling [19]. Priority morphing is added to prevent interrupt starvation and to enable control of the execution order of interrupt service routines at run time. The scheduling itself is taken care of by the operating system.


Design Space Exploration This section introduces DSPxPlore, a design space exploration methodology for xDSPcore. DSPxPlore can be used to understand application-specific requirements on the processor architecture already at an early stage of a project; at later stages it enables fine-tuning of the chosen core subsystem. The methodology is not limited to xDSPcore. The first part of the section discusses the design space of RISC-based DSP core architectures, with xDSPcore taken for illustration, and briefly discusses the influence of different configuration parameters on the silicon area and power dissipation of the core subsystem. The second part introduces the exploration parameters used for application-specific optimizations. The third part illustrates the methodology DSPxPlore is based on, ending with some examples and results.

Motivation To make use of the additional degree of freedom offered by a parameterized core architecture, the requirements of the application have to be understood. Quite often the core decisions are made by the most experienced engineers focusing on the questions "what is already available?" and "what has already been proven in silicon?" in order to reduce risk. When one core subsystem is used for applications with different requirements, the result is suboptimal in terms of silicon area and power consumption. In the price-critical consumer IC market this can be crucial for a company's market position and revenues.

Design Space The design space of RISC-based DSP architectures can be reduced to a few parameters, which are briefly introduced in this subsection. A more detailed analysis can be found in [20].

• Register File The register file plays a major role in load/store DSP architectures. Arithmetic instructions use entries of the register file as operands, and separate instructions are used for exchanging data between memory and the register file. Providing too few entries leads to additional spill code, which decreases code density. Spill code copies entries of the register file to memory in order to free space in the register file for new results.

Huge register files with many entries have a significant influence on the core area. The register entries have to be encoded, so increasing the number of registers decreases code density. The requirement of an orthogonal architecture for efficient instruction scheduling rules out banked register files, with the consequence that the growing crossbar to the execution units limits the reachable core clock frequency.

• Data Paths The number of data paths available for executing instructions in parallel influences the reachable core performance. Dependencies (control and data) in the application code and micro-architectural limitations like reduced program memory ports (e.g. Carmel DSP [21]) limit system performance.

Providing several parallel data paths influences system performance due to the crossbars between the register file and the data paths. Adding data paths not only increases the core area but also decreases code density through the additional instruction space necessary for coding new instructions. Therefore a balanced relation between application-specific requirements and the number of parallel data paths is required, allowing efficient use of silicon area.


• Memory Bandwidth The memory bandwidth determines the possible usage of the available parallel data paths. Supporting the execution of four MAC instructions in parallel without providing enough data to feed them is wasted silicon. Therefore a balance between the number of instructions executed in parallel and the memory bandwidth is required for efficient use of the core resources. This is true for both data and program memory ports.

• Instruction Set/Size/Encoding The instruction set architecture is a key feature influencing code density, and code density is a significant factor in the silicon area consumed by the core subsystem. Application-specific instruction encoding therefore allows efficient usage of the coding space, and application-specific reduction of the instruction set architecture makes it possible to resize the instruction word. The switching activity at the program memory port can be reduced by adapting the binary coding of the fetched instruction words.

• Pipeline Structure The number of clock cycles spent implementing the pipeline structure influences the reachable clock frequency and therefore the performance of the core subsystem. On the other hand, branch delays and load-in-use and define-in-use dependencies limit the utilization of the core resources. Increasing the number of clock cycles enables core architectures with a theoretically higher computation performance; if the restrictions in the application code do not allow this theoretical performance to be exploited, only the power dissipation has been increased (due to a higher clock frequency).

• Instruction Buffer The instruction buffer is a specific feature of xDSPcore, although several commercial cores are getting a similar feature in their newer versions, which makes the aspect worth discussing. The instruction buffer of xDSPcore compensates, on average, the memory bandwidth mismatch between fetch and execution bundles, and allows power-efficient execution of loop constructs by preventing repeated fetch cycles from program memory. The number of entries limits the size of the loop bodies which can be handled efficiently; on the other hand, too many buffer entries increase the size of the core and limit the reachable performance due to the crossbar connecting the buffer and the decoder ports.
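A simple model makes the loop argument concrete. The buffer policy assumed here (fetch the body once, then replay it from the buffer) is a simplification of the real mechanism, but it shows why the number of buffer entries bounds the loop bodies that benefit:

```python
# Simplified fetch-count model for a loop executed `iterations` times
# (assumption: a loop body that fits into the instruction buffer is
# fetched from program memory once and replayed from the buffer).

def program_memory_fetches(body_bundles: int, iterations: int,
                           buffer_entries: int) -> int:
    """Program memory fetches needed to execute the whole loop."""
    if body_bundles <= buffer_entries:
        return body_bundles               # fetched once, then replayed
    return body_bundles * iterations      # too large: fetched every iteration
```

An 8-bundle body in a 16-entry buffer costs 8 fetches regardless of the iteration count, whereas a 32-bundle body run 100 times costs 3200 fetches, which is where the power saving of the buffer comes from.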

Methodology Providing a configurable DSP core architecture that meets application-specific requirements reduces area consumption and power dissipation. To identify an application-specific core architecture, it is important to understand the specific requirements of the application.

For this purpose DSPxPlore is introduced. DSPxPlore is not a tool but a methodology, based on an optimizing C-compiler and a configurable ISS (Instruction Set Simulator).

Figure 10 provides an overview of the exploration methodology. The optimizing C-compiler is used to generate static analysis results; for evaluating dynamic results, the cycle-true ISS called xSIM is used. Together, both sets of results can be used to analyze the application requirements for the core subsystem. The core configuration is described in an XML-based configuration file, introduced in more detail in the next section. The next subsection introduces some of the analysis results.


Figure 10: DSPxPlore Overview.

Analysis As illustrated in Figure 10, the analysis results are split into static and dynamic analysis. This subsection briefly introduces some analysis results; a more detailed description can be found in [20].

Static analysis Static analysis results are generated by the C-compiler. Some examples are

• Code size The silicon area consumed by a DSP subsystem is significantly influenced by the program memory and therefore by code density. Code density is a measure of how efficiently the application can make use of the chosen instruction set architecture and the related binary coding. Furthermore, micro-architectural limitations like the structure of the program memory port are mirrored in code density. The code size result is normalized to bytes, which makes results from differently sized instruction words comparable.

• Parallelism Supporting the execution of several instructions in parallel increases the theoretical performance of a DSP core, but control and data dependencies in the application code limit the actual usage of the parallel execution units. The parallelism result is therefore used to analyze how efficiently the application code can make use of the hardware resources.

• Instruction histogram The mapping of an instruction set architecture to its binary coding influences power dissipation (switching activity at the program memory port) and code density. The instruction histogram lists the instructions used and the frequency of their occurrence. This result can be used to identify the most-used instructions when optimizing the binary coding.
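Such a histogram is straightforward to compute from a decoded instruction stream; a minimal sketch (the mnemonics in the example are placeholders, not the xDSPcore instruction set):

```python
# Minimal static instruction histogram: count how often each mnemonic
# occurs in the compiled program, most frequent first, as input for
# optimizing the binary coding of the most-used instructions.

from collections import Counter

def instruction_histogram(mnemonics):
    """Return (mnemonic, count) pairs sorted by descending frequency."""
    return Counter(mnemonics).most_common()
```

For a weighted (dynamic) variant, each mnemonic would be counted once per execution rather than once per occurrence in the binary, which is exactly the role of the execution-count results described below.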

Dynamic analysis Dynamic analysis results are generated by the cycle-true instruction set simulator xSIM. Some examples are

• Execution count per bundle

It is typical for DSP algorithms that the central parts are implemented as loop constructs. Execution bundles (instructions executed during the same clock cycle) that are part of a loop construct are therefore executed frequently, which makes optimizing them more rewarding than optimizing less frequently executed bundles. The result execution count per bundle is used to identify frequently executed bundles; it is also used for weighting the static parallelism result.


• Execution count per instruction Instructions that are part of loop constructs are executed more frequently than instructions in sequential control code. The result execution count per instruction identifies the execution frequency of the instructions used; it is also used for weighting the static instruction histogram result.

• Program memory fetch The number of program memory fetch cycles has significant influence on the power dissipation of the DSP subsystem. Especially control code with short branch distances can lead to a significant number of fetch cycles that fetch instructions which are never executed. The result program memory fetch is used to identify unused fetch bundles and to optimize the fetch process at the program memory port.

Example Results This subsection illustrates some results generated using the methodology described above. The application code used for generating these results consists of typical DSP code examples like FFT, cryptographic algorithms and control code. A more detailed description can be found in [20].

Register file The number of entries in the register file influences the code density and performance of the core subsystem. Too many entries increase the core area and decrease code density due to the additional coding space needed; too few entries decrease code density through the additional spill code needed.

Figure 11: Influence of the Register File (x-axis: number of registers, 4 to 12; plotted series: nr. bundles, nr. instructions, nr. delay NOPs, code size in bytes).

The example in Figure 11 illustrates the influence of the size of the register file on the core subsystem. The x-axis gives the number of register entries; e.g., 4 means 4 accumulator registers, equivalent to 8 data registers. The number of instructions needed to code the application decreases with an increasing number of register file entries (less spill code is needed), and the same is true for the number of execution bundles needed to execute the algorithm. The number of branch delay NOPs does not change significantly. Therefore the code size, normalized to bytes, decreases with the number of provided registers. As can be seen in Figure 11, the influence of the register file size on code density and the number of execution bundles starts to saturate after 12 entries; adding more register entries, e.g. 16 or 32, has no significant influence on system performance any more.


Data paths Executing several instructions in parallel can be used to increase the system performance of a DSP subsystem, but control and data dependencies in the application code can lead to poor usage of the core resources.

Figure 12: Influence of Parallel Data Paths (x-axis: models 0 to 4; plotted series: nr. bundles, nr. instructions, nr. delay NOPs, code size in bytes).

Figure 12 illustrates the influence of different VLIW models on code density and execution bundles. Model 0 supports the execution of up to 3 instructions in parallel, model 4 up to 6 instructions. The different models are described in Figure 13.

              model 0   model 1   model 2   model 3   model 4
computation      1         1         2         2         3
load/store       1         2         1         2         2
branch           1         1         1         1         1

Figure 13: Model Description.

Increasing the number of parallel data paths influences the number of execution bundles needed, but the benefit of adding parallel units is limited by dependencies in the application code. As illustrated in Figure 12, model 2 (a 4-way VLIW) provides a significant reduction of the execution bundles needed; adding more data paths has less influence.

The increasing number of branch delay NOPs is caused by the compiler scheduling instructions as early as possible, which could be changed. In any case, wider VLIW architectures decrease the branch distance and therefore increase the number of branch delay NOPs.

Pipeline structure Increasing the number of clock cycles per pipeline stage is used to increase the reachable clock frequency of a core subsystem. However, dependencies in the application code and branch delays can even decrease system performance. Branch delays are caused by taken branch instructions and lead to unused clock cycles; delayed branching allows making use of the branch delay slots, but this is limited by the application code.

The example in Figure 14 compares execution on a 3-way VLIW architecture and on a 6-way VLIW core. Doubling the number of instructions which can be executed in parallel reduces the number of needed execution bundles by about 15%. However, reducing the number of execution bundles does not reduce the code size, which stays the same: the advantages of the wider VLIW are compensated by additional branch delay NOPs caused by the reduced branch distance.


Figure 14: Influence of Branch Delay Slots (x-axis: number of branch delays, 2 to 4; plotted series: nr. bundles and code size, each for model 0 and model 4).

As illustrated in Figure 14, increasing the number of branch delays decreases code density: additional branch delay NOPs are needed for the branch delay slots which cannot be filled with useful code.

Summary of examples The examples in this section illustrate the complexity of measuring system performance. Changing one core architectural feature influences several other aspects; none of them can be analyzed in isolation.

The application code executed on the core subsystem has significant influence on the results. Evaluation of core architectures therefore has to be done with the real application code; traditional benchmark examples, as available for commercial core architectures, can be misleading.


Configuration The DSP core architecture introduced above enables application-specific adaptation of the main architectural features. Some exploration parameters were introduced in the previous section to illustrate the design space of RISC-based DSP architectures.

For easy maintenance of configurations and for keeping tools, core and documentation consistent, xDSPcore has a single common configuration file: information concerning core details is stored exactly once. In this section the configuration file is briefly introduced, together with the related tools making use of it.

Configuration File Structure The configuration file for the xDSPcore architecture is based on XML. Using a standard language makes it possible to use existing tools, which is especially helpful for documentation generation. The configuration file is split into several sections.

Architectural configuration The first part contains the chosen architectural configuration. It describes the structure of the register file, the number and size of the different register types and the mapping of the registers, as well as the chosen pipeline structure and the number of clock cycles for each pipeline stage. An example for this section of the configuration file is illustrated in Figure 15.

Figure 15: Register Configuration.

The example in Figure 15 defines a register as illustrated in Figure 6. The accumulator with the name A0 is defined as a register with 40 bits word length. The long register L0 is defined as a shared register; as illustrated in Figure 6, L0 is part of A0. L0 is 32 bits wide, starting at bit 0, and therefore the MSB (most significant bit) is located at bit position 31. Sign extension is done automatically. To fully describe one accumulator register as introduced in Figure 6, the data registers have to be defined (illustrated in Figure 16). For example D1, equal to the high word of L0, is again a shared register, 16 bits wide, located between bits 16 and 31. As for the long register, sign extension is done automatically. Data register D0, the low word of long register L0, is defined in a similar way. The only difference to D1, besides the position of the data word, is the configuration parameter

<config type="register" name="A0">
  <data name="width" value="40"/>
</config>

<config type="shared_register" name="L0">
  <connect method="getValue">
    <hook method="getValue" name="A0"/>
  </connect>
  <connect method="setValue">
    <hook method="setValue" name="A0"/>
  </connect>
  <data name="width" value="32"/>
  <data name="msb" value="31"/>
  <data name="lsb" value="0"/>
  <data name="sign_extension" value="1"/>
</config>


sign_extension, which is set to 0. Changing the entry of D0 does not automatically change the leading bits of the accumulator.

Figure 16: Data Register Definition.

Analogously to this example of register definition and register file structure, the other configurable architectural features introduced in the previous section are described in the configuration file.
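Because the configuration file is plain XML, it can be consumed with standard tooling. A minimal sketch using Python's standard library; the tag and attribute names follow the excerpt shown in the text, while the wrapping root element is added here only to make the fragment well-formed:

```python
# Sketch: reading a Figure 15-style register description with the standard
# library. Tag/attribute names (config, data, name, value) are taken from
# the excerpt in the text; the <registers> root is an assumption added to
# make the standalone fragment parseable.

import xml.etree.ElementTree as ET

EXAMPLE = """
<registers>
  <config type="register" name="A0">
    <data name="width" value="40"/>
  </config>
  <config type="shared_register" name="L0">
    <data name="width" value="32"/>
    <data name="msb" value="31"/>
    <data name="lsb" value="0"/>
    <data name="sign_extension" value="1"/>
  </config>
</registers>
"""

def register_table(xml_text):
    """Map each register name to its numeric parameters."""
    root = ET.fromstring(xml_text)
    table = {}
    for cfg in root.iter("config"):
        params = {d.get("name"): int(d.get("value"))
                  for d in cfg.findall("data")}
        table[cfg.get("name")] = params
    return table
```

A generator built this way is what keeps hardware model, tool-chain and documentation consistent: all three consume the same parsed table instead of maintaining their own copies of the register description.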

Instruction set description

To keep the core application-specific, it is possible to add, remove and change entries of the instruction set architecture. To obtain lower switching activity at the program memory port, the binary coding can be changed. For this reason the instruction set architecture and the instruction set are part of the configuration file.
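The effect of re-coding on the program memory port can be quantified as the number of bit toggles between consecutively fetched instruction words. The following Python sketch illustrates the metric only; the function names and the 20-bit example codings are invented here and are not part of the configuration file:

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions that differ between two words."""
    return bin(a ^ b).count("1")

def switching_activity(words):
    """Total bit toggles on the memory port for a sequence of fetches."""
    return sum(hamming(w1, w2) for w1, w2 in zip(words, words[1:]))

# Two hypothetical codings of the same three-instruction trace (20-bit words):
coding_a = [0b10000000000000000001, 0b01111111111111111110, 0b10000000000000000001]
coding_b = [0b10000000000000000001, 0b10000000000000000011, 0b10000000000000000001]
assert switching_activity(coding_a) == 40   # every bit toggles on each fetch
assert switching_activity(coding_b) == 2    # re-coded: one bit toggles per fetch
```

A coding chosen so that frequently adjacent instructions differ in few bits lowers this count and hence the dynamic power at the memory port.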

For illustration, the definition of a three-operand addition of data registers is shown in Figure 17. The instruction is defined once, and version numbering allows evolutionary modifications of the configuration file. The binary coding consists of fixed bits and parameters; in the example below, parameters are used for coding the register entries. The assembler mnemonic is introduced and a name is specified for the documentation. The operands, their order and the chosen syntax are defined below that. The execution cycle section maps functions of the instruction (e.g. reading operand 1 from a register) to pipeline stages; these functions are defined in the core architecture section of the file and linked to the instruction here.

The restriction, description and operation tags are used for the documentation part of the instruction.

<config type="shared_register" name="D1">
  <connect method="getValue">
    <hook method="getValue" name="A0"/>
  </connect>
  <connect method="setValue">
    <hook method="setValue" name="A0"/>
  </connect>
  <data name="width" value="16"/>
  <data name="msb" value="31"/>
  <data name="lsb" value="16"/>
  <data name="sign_extension" value="1"/>
</config>

<config type="shared_register" name="D0">
  <connect method="getValue">
    <hook method="getValue" name="A0"/>
  </connect>
  <connect method="setValue">
    <hook method="setValue" name="A0"/>
  </connect>
  <data name="width" value="16"/>
  <data name="msb" value="15"/>
  <data name="lsb" value="0"/>
  <data name="sign_extension" value="0"/>
</config>
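The register semantics configured above can be mimicked in a few lines of Python. The sketch below is an illustrative model (the class and method names are invented here), showing how a write to D1 with sign_extension set to 1 extends into the accumulator's upper bits, while a write to D0 with sign_extension set to 0 leaves them untouched:

```python
class RegisterFile:
    """Minimal model of one accumulator and its shared register views.

    Field layout follows the configuration above: A0 is 40 bits; L0, D1
    and D0 are aliases into A0 with a per-view sign_extension flag.
    """
    WIDTH = 40

    def __init__(self):
        self.a0 = 0  # backing storage, 40 bits

    def _write(self, value, lsb, msb, sign_ext):
        width = msb - lsb + 1
        value &= (1 << width) - 1
        if sign_ext:
            # Replicate the field's MSB into all bits above it.
            if value >> (width - 1):
                value |= ((1 << (self.WIDTH - lsb - width)) - 1) << width
            keep = (1 << lsb) - 1  # only bits below the field survive
        else:
            keep = ~(((1 << width) - 1) << lsb) & ((1 << self.WIDTH) - 1)
        self.a0 = (self.a0 & keep) | (value << lsb)

    def set_l0(self, v): self._write(v, 0, 31, sign_ext=True)
    def set_d1(self, v): self._write(v, 16, 31, sign_ext=True)
    def set_d0(self, v): self._write(v, 0, 15, sign_ext=False)

rf = RegisterFile()
rf.set_d1(0x8000)   # negative high word: sign-extends into A0[39:32]
assert rf.a0 == 0xFF80000000
rf.set_d0(0xFFFF)   # sign_extension = 0: A0[39:16] stays untouched
assert rf.a0 == 0xFF8000FFFF
```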


Figure 17: Instruction Definition.

All instructions supported by xDSPcore are located in the instruction set description. In the next sub-sections some "tools" making use of the configuration file are introduced.

Decoder generator

The configuration file is used to automatically generate the VHDL description of the instruction decoder. Considering the possibility of a change in the binary coding of the instructions and the related coding and verification effort, such a tool is required.

Figure 18 briefly introduces the structure of the decoder generator. The central block is the database, which is fed with the instruction set and the chosen binary coding. A spreadsheet is used for better visualization of the instruction set during definition. The instructions are described in instruction groups, and symbols identify the instructions bundled into instruction groups.

The second input is the actual register configuration. In [22] a separate part of the spreadsheet was used for this; the final version of the decoder generator uses the content of the configuration file.

<instruction name="ADDITION_DATA_REGISTER" type="instance" version="0.1.0">
  <var_info>
    <opcode>10p00000aaaabbbbcccc</opcode>
    <execution_model/> <!-- link to one of the specified models -->
    <image/>
  </var_info>
  <const_info>
    <mnemonic>add</mnemonic>
    <name>ADDITION (DATA)</name>
    <instruction_type>CMP</instruction_type>
    <operands>
      <operand char="a" order="1">
        <operand_type>DATA_REGISTER</operand_type>
      </operand>
      <operand char="b" order="2">
        <operand_type>DATA_REGISTER</operand_type>
      </operand>
      <operand char="c" order="3">
        <operand_type>DATA_REGISTER</operand_type>
      </operand>
    </operands>
    <syntax>
      <define_value>op1, op2, op3</define_value>
    </syntax>
    <execution_cycle>
    </execution_cycle>
    <restrictions>None</restrictions>
    <description>Performs an addition of two data register values (op1,op2)
      and stores the result in a data register (op3).</description>
    <operation><description><para>op3 = op1 + op2</para></description></operation>
  </const_info>
</instruction>
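The opcode entry above mixes fixed bits ('0'/'1') with parameter characters (p, a, b, c) whose run lengths define the field widths. As a sketch of how an assembler or the decoder generator could fill such a pattern, the following Python function (hypothetical helper code, not the actual tool) substitutes register numbers into the parameter fields:

```python
def encode(pattern: str, **fields) -> int:
    """Fill the parameter characters of an opcode pattern with field values.

    'pattern' is a string of '0'/'1' (fixed bits) and letters (parameter
    bits), MSB first, as in the instruction definition above. Each letter's
    run length gives that field's width.
    """
    word, i = 0, 0
    while i < len(pattern):
        ch, run = pattern[i], 1
        # Collect the run of identical characters starting at position i.
        while i + run < len(pattern) and pattern[i + run] == ch:
            run += 1
        if ch in "01":
            bits = int(pattern[i:i + run], 2)        # fixed coding bits
        else:
            bits = fields[ch] & ((1 << run) - 1)     # parameter field
        word = (word << run) | bits
        i += run
    return word

# add D2, D1, D4  ->  fill a=2, b=1, c=4 into "10p00000aaaabbbbcccc"
op = encode("10p00000aaaabbbbcccc", p=0, a=2, b=1, c=4)
assert op == 0b10000000001000010100
```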


Figure 18: Decoder Generator Structure.

The third input to the database is called the container. The container is a C++ class which maps the instruction groups to VHDL statements. Therefore, major changes in the instruction set, such as adding new instruction groups, make changes in the container structure necessary; adapting the existing instruction set does not require any changes to the container structure. Examples of container classes can be found in [22].

When building up the database in Figure 18, the three input sources are read and a consistency check prevents ambiguous coding of instruction groups. All instructions are mapped to instruction groups, and the instruction groups are mapped to container classes. A further consistency check prevents instructions from being missing in the VHDL description of the instruction decoder.

A tree structure is used to map the op-codes of the instructions generated from the content of the database. Each node in the tree can take one of three states (don't care, zero or one), which are used for the symbols of the instruction words. Each branch in the tree represents an instruction group.
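The decode step that such a tree implements can be sketched in Python as pattern matching with don't-care bits. The 8-bit group codings below are invented purely for illustration (the real instruction words are wider), and a linear scan stands in for the factored case statements of the generated VHDL:

```python
def matches(pattern: str, word: str) -> bool:
    """True if 'word' fits 'pattern'; 'x' marks don't-care positions."""
    return all(p in ("x", w) for p, w in zip(pattern, word))

def decode(groups, word: str) -> str:
    """Return the first instruction group whose coding matches 'word'.

    Generating nested VHDL case statements corresponds to factoring the
    shared prefixes of these patterns into a tree; a linear scan is
    enough to show the idea.
    """
    for name, pattern in groups.items():
        if matches(pattern, word):
            return name
    raise ValueError("unmapped coding: " + word)

GROUPS = {                      # hypothetical 8-bit group codings
    "ADDI_Family": "1xxxxxxx",  # a set MSB selects the immediate-add group
    "MOVR_Family": "0000xxxx",
    "LOAD_Family": "0001xxxx",
}
assert decode(GROUPS, "10110001") == "ADDI_Family"
assert decode(GROUPS, "00010010") == "LOAD_Family"
```

The consistency check mentioned above corresponds to verifying that no two group patterns can match the same word and that every defined instruction matches some group.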

Figure 19: Decoder Code Example.

During output generation the status of each node is checked, starting at the root of the tree. Nodes with don't-care status are handled first, because the related coding bit could possibly be used in a different instruction group. When the end of a branch is reached without further sub-branch connections, the remaining coding of the instructions can be handled in one statement. If further sub-branches are found, an additional split in the coding (implemented as a case statement) becomes necessary.

case instruction(11) is
  when '1' =>                      -- ADDI_Family
    cmp_instruction := addi;
    cmp_ex1_add1.en := '1';
    cmp_ex1_add1.add_const := '1';
    cmp_ex1_write1 := setDxLxRx(instruction(4 downto 0));
    cmp_ex1_read1  := setDxLxRx(instruction(4 downto 0));
    cmp_ex1_cntrl1.const := signExtend16(instruction(10 downto 5));
  when others =>                   -- MOVR_Family
    cmp_instruction := movr;
    case instruction(10 downto 8) is
      when "000" =>
        cmp_ex1_write1 := setData(instruction(3 downto 0));
        cmp_ex1_read1  := setData(instruction(7 downto 4));
      ...

A code example of the generated VHDL-RTL code can be found in Figure 19. The control signals for the DSP core units are set and the register coding is assigned. The decoder is separated into sub-decoder structures for each datapath, each assigned to one of three operation classes (load/store, arithmetic and branch). The described procedure has to be carried out for each of the datapaths. Splitting the decoder into sub-decoder structures makes it possible to remove a sub-decoder if the related datapath is not included in a chosen core architecture. Therefore the silicon area consumed by the instruction decoder scales with the number of instructions supported for parallel execution.

Tool chain

As an example of making use of the configuration file, xSIM, the cycle-true instruction set simulator of xDSPcore, is chosen. Unlike the available simulator concepts introduced in [23][24][25] and [26], xSIM is based on a framework supporting run-time configurability and supports VLIW architectures. Both are mandatory for using an ISS for xDSPcore.

The framework is based on components. Each component consists of classes in the sense of object-oriented programming. A component takes input, provides a certain predefined functionality and produces output. The communication between components takes place via hooks. The components are observable, which means that the function of one component (e.g. the GUI, graphical user interface) can be activated by another component (e.g. by changing a register value). Configuration parameters of the components have to be predefined and can then be changed at run-time (e.g. changing the word length of a register).
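The hook mechanism is essentially an observer pattern. The following Python sketch (class and hook names are chosen here for illustration and do not reflect the actual framework API) shows a register component whose word length is a run-time configuration parameter and whose setValue hook notifies an observer, such as a GUI:

```python
class Component:
    """Minimal sketch of the simulator framework's component model.

    Components expose named hooks; other components connect callables to
    them and are notified on activity (the 'observable' property used,
    for instance, to refresh a GUI when a register changes).
    """
    def __init__(self):
        self.hooks = {}

    def connect(self, hook, callback):
        self.hooks.setdefault(hook, []).append(callback)

    def notify(self, hook, *args):
        for cb in self.hooks.get(hook, []):
            cb(*args)

class Register(Component):
    def __init__(self, width):
        super().__init__()
        self.width = width      # run-time configurable word length
        self.value = 0

    def set_value(self, v):
        self.value = v & ((1 << self.width) - 1)
        self.notify("setValue", self.value)

log = []
r = Register(width=16)
r.connect("setValue", log.append)   # e.g. a GUI observer
r.set_value(0x12345)
assert log == [0x2345]   # masked to the configured 16-bit width
```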

Figure 20: Simulator Generation.

Dynamic link libraries provide the components, which allows any user to add their own types of components following the predefined structure. Several components can be combined to set up a new component; the simulator itself is also a component. The configuration information stored in the XML-based configuration file is evaluated during start-up of the simulator (run-time configurable). Figure 20 illustrates the steps during start-up. First, reading the model configuration provides the information about which components are used and how they are connected to each other. The instantiated components are read from a dynamic component library. Already at this stage, the framework checks the connections between the components and verifies their features. If the framework passes this stage without error messages, it is guaranteed that the described model consists of properly connected components.

Figure 21 illustrates a layer model of the simulator for simulating the different architecture variants. The base layer reflects the functional model of the core architecture. Further layers can be added to simulate different aspects, e.g. power dissipation. The higher layers are again connected to the base layer with hooks. The layer-based structure allows different abstraction levels and therefore different simulation speeds. With only the base layer, describing the chosen architecture and verifying at the functional level can be done at high simulation speed; adding layers decreases the simulation speed.


Figure 21: Layer Concept.

To keep execution platform independence, the GUI is a separate layer. Supported interface libraries are GTK and QT.

Documentation

The configuration file is used as the source for automatic generation of documentation. For example, changing the binary coding of instructions requires an update of the documentation; automating this reduces the effort for changing and validation. The instruction definition, as in the example in Figure 17, is used as the source for updating the core documentation.

The same source is also used for, e.g., generating the instruction set description and the assembler documentation. Therefore the documents describing the core architecture are consistent without the need for manual verification.

Figure 22 illustrates the chosen flow. Using the XML core configuration file and style sheet information as input, an XSL processor (saxon [27]) generates a DocBook file, also based on XML [28]. Depending on the chosen output format, an HTML description can be generated directly (again using saxon and style sheet information, this time for DocBook to HTML), or a PDF version of the documents can be generated via an intermediate step (making use of the Apache FOP (Formatting Objects Processor) [29]).
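The real flow runs the configuration file through saxon with an XSLT style sheet; as a stand-in that only illustrates the mapping, the following Python sketch converts one instruction entry into a DocBook section (the element selection is simplified, with tag names taken from the example in Figure 17):

```python
import xml.etree.ElementTree as ET

CONFIG = """<instruction name="ADDITION_DATA_REGISTER">
  <const_info>
    <mnemonic>add</mnemonic>
    <description>Performs an addition of two data register values
(op1,op2) and stores the result in a data register (op3).</description>
  </const_info>
</instruction>"""

def to_docbook(config_xml: str) -> str:
    """Map one instruction entry to a DocBook section (a stand-in for the
    saxon/XSLT stage of the real flow)."""
    src = ET.fromstring(config_xml)
    sect = ET.Element("section")
    ET.SubElement(sect, "title").text = src.findtext("const_info/mnemonic")
    # Collapse the layout whitespace of the source description.
    ET.SubElement(sect, "para").text = \
        " ".join(src.findtext("const_info/description").split())
    return ET.tostring(sect, encoding="unicode")

out = to_docbook(CONFIG)
assert out.startswith("<section><title>add</title><para>")
```

Because the DocBook output is itself XML, the same intermediate file can then be fed to the HTML and PDF back-ends without touching the configuration source.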

Figure 22: Generating Documentation.

Making use of the DocBook intermediate representation makes it possible to use only one style sheet for the different output formats, which removes the need to keep several style sheets consistent [30].


Summary

This paper introduced xDSPcore, an application-specific configurable DSP core. The main architectural features influencing area and power dissipation can be scaled or configured to close the gap between dedicated hardware implementations and software-based solutions; therefore xDSPcore is well suited as a platform core. To identify the application requirements on the core subsystem, the design space exploration methodology DSPxPlore is introduced, which allows exploring the design space of RISC-based core architectures. Some results illustrate the complexity of changing parameters and their influence on the key aspects of area and power consumption. To obtain consistency between core hardware, programming tools and documentation, a unique XML-based configuration file is introduced, and some of the tools making use of it are briefly illustrated.

Acknowledgement

The work has been supported by the EC through the project SOC-Mobinet (IST-2000-30094), the Christian Doppler Lab "Compilation Techniques for Embedded Processors", and Infineon Technologies Austria.

References

[1] P. Lapsley, J. Bier, A. Shoham and E. A. Lee, "DSP Processor Fundamentals, Architectures and Features", New York, IEEE Press, 1997.
[2] ISO/IEC DTR 18037.2, "Information technology - Programming languages, their environments and system software interfaces - Programming Language C", version for DTR approval ballot, 25.04.2003.
[3] "SC140 DSP Core Reference Manual", Motorola Inc., MNSC140CORE/D, Revision 3, November 2001.
[4] "Blackfin DSP Instruction Set Reference", Digital Signal Processor Division, Analog Devices Inc., First Revision, March 2002.
[5] ARC 600 Reference Manual, http://www.arc.com.
[6] "Xtensa Architecture and Performance", white paper, Tensilica Inc., September 2002.
[7] CoWare, http://www.coware.com.
[8] "LISA 2.0 Language Reference Manual, Manual RM_2002.02", LisaTek, February 2002.
[9] "The ArchC Architecture Description Language v0.8.1, Reference Manual", www.archc.org, 2004.
[10] "DSP specific extension to ANSI C", ACE, http://www.ace.nl/products/cosydsp.htm.
[11] J. L. Hennessy and D. A. Patterson, "Computer Architecture: A Quantitative Approach", San Mateo, CA, Morgan Kaufmann Publishers, 1996.
[12] D. Sima, T. Fountain and P. Kacsuk, "Advanced Computer Architectures: A Design Space Approach", Addison-Wesley, Harlow, 1997.
[13] C. Panis, R. Leitner, H. Grünbacher and J. Nurmi, "xLIW - a Scaleable Long Instruction Word", ISCAS 2003, Bangkok, Thailand, 2003.
[14] C. Panis, U. Hirnschrott, A. Krall, G. Laure, W. Lazian and J. Nurmi, "FSEL - Selective Predicated Execution for a Configurable DSP Core", IEEE Annual Symposium on VLSI, Lafayette, LA, USA, February 2004.
[15] J. E. Smith, "A study of branch prediction strategies", in Proc. 8th ISCA, pp. 135-148, 1981.
[16] J. K. F. Lee and A. J. Smith, "Branch prediction strategies and branch target buffer design", Computer 17(1), pp. 6-22, 1984.
[17] T.-Y. Yeh and Y. N. Patt, "Alternative implementations of two-level adaptive branch prediction", in Proc. 19th ISCA, pp. 124-134, 1992.
[18] D. N. Pnevmatikatos and G. S. Sohi, "Guarded execution and branch prediction in dynamic ILP processors", in Proc. 21st ISCA, pp. 120-129, 1994.
[19] C. Panis, J. Hohl, H. Grünbacher and J. Nurmi, "xICU - a Scaleable Interrupt Unit for a Configurable DSP Core", SOC-03, Tampere, Finland, November 2003.
[20] C. Panis, U. Hirnschrott, G. Laure, W. Lazian and J. Nurmi, "Design Space Exploration for an Embedded DSP Core", SAC 04, Cyprus, March 2004.
[21] Infineon Technologies, "Carmel DSP Core Architecture Specification", Infineon Technologies, 2001.
[22] C. Panis, A. Schilke, Habiger and J. Nurmi, "An Automatic Decoder Generator for a Scaleable DSP Architecture", Norchip, Copenhagen, Denmark, November 2002.
[23] M. Rosenblum, E. Bugnion, A. Herrod and S. Devine, "Using the SimOS Machine Simulator to Study Complex Computer Systems", ACM Transactions on Modeling and Computer Simulation, Vol. 7, January 1997.
[24] K. Skadron and P. S. Ahuja, "HydraScalar: A Multipath-Capable Simulator", Newsletter of the IEEE Technical Committee on Computer Architecture (TCCA), January 2001.
[25] J. Emer, P. Ahuja, E. Borch, A. Klausner, C. Luk, S. Manne, S. Mukherjee, H. Patil, S. Wallace, N. Binkert, R. Espasa and T. Juan, "ASIM: A Performance Model Framework", IEEE Computer, February 2002.
[26] T. Austin, E. Larson and D. Ernst, "SimpleScalar: An Infrastructure for Computer System Modeling", IEEE Computer, February 2002.
[27] Michael H. Kay, "SAXON: The XSLT and XQuery Processor", saxon.sourceforge.net.
[28] Norman Walsh and Leonard Muellner, "DocBook: The Definitive Guide", O'Reilly, October 1999 (www.docbook.org).
[29] Apache FOP: XSL-FO Processor, xml.apache.org.
[30] Norman Walsh, "DocBook XSL Stylesheets", docbook.sourceforge.net.


PUBLICATION 2 C. Panis, R. Leitner, H. Grünbacher, J. Nurmi, �xLIW � a Scaleable Long Instruction Word�, in Proceedings The 2003 IEEE International Symposium on Circuits and Systems (ISCAS 2003), Bangkok, Thailand, May 25-28, 2003, pp. V69-V72.

©2003 IEEE. Reprinted, with permission, from proceedings of the 2003 IEEE International Symposium on Circuits and Systems.


[The reprinted text of Publication 2 is illegible in this copy: the source PDF was extracted with a damaged font encoding, and only the figure annotations survive. Recoverable figure content: an instruction dispatch diagram (instr 1-7 mapped through an align unit onto unit 1-5); an instruction word format figure (instruction, instruction long word, normal instruction, long instruction); an xLIW size figure (from 1 instruction word up to 10 instruction words); an align unit block diagram (instruction buffer, instruction mapping, align control, units 1, 2 and 5); an instruction buffer figure (write enable 0..n, PRAM[FC]..PRAM[FC+3]); an instruction decoder figure (instruction buffer 16x20, decoder unit 10x20, instruction address 10x4); and two bundle sequencing figures (bundles n..n+5, current instruction bundle, next instruction bundle start, valid end markers). The reprint spans proceedings pages V69-V72; its reference list is likewise illegible.]


PUBLICATION 3 C. Panis, R. Leitner, H. Grünbacher, J. Nurmi, �Align Unit for a Configurable DSP Core�, in Proceedings on the IASTED International Conference on Circuits, Signals and Systems (CSS 2003), Cancun, Mexico, May 19-21, 2003, pp. 247-252.

©2003 IASTED. Reprinted, with permission, from proceedings of the IASTED International Conference on Circuits, Signals and Systems.


ALIGN UNIT FOR A CONFIGURABLE DSP CORE

Christian Panis, Carinthian Tech Institute, A-9524 Villach, Austria

Raimund Leitner, Infineon Technologies, A-9500 Villach, Austria

Herbert Gruenbacher, Carinthian Tech Institute, A-9524 Villach, Austria

Jari Nurmi, Tampere University of Technology, FIN-33101 Tampere, Finland

ABSTRACT The increasing system complexity of SOC applications leads to a growing demand for powerful embedded DSP processors. To raise the performance of DSP processors, the number of instructions executed in parallel has been increased, and VLIW (Very Long Instruction Word) has been introduced to program the parallel units. Programming all parallel units at the same time either requires a wide program memory port or limits how many units can be used in parallel. Traditional VLIW architectures therefore feature poor code density and high area consumption in the program memory. To overcome this limitation, this paper describes the align unit, which enables the use of unaligned program memory without restricting core performance. The architecture, some implementation details, and the influence of the align unit on area consumption and power dissipation are discussed in this paper. The align unit is part of a development project for a configurable DSP.

KEY WORDS VLIW, unaligned program memory, configurable DSP, align unit

1. INTRODUCTION

The increasing requirements on computational power of embedded processors have led to deeper pipeline structures and to an increased number of parallel execution units. To program these parallel units, the VLIW (Very Long Instruction Word) [1] is used, which is built up of several instruction words. VLIW architectures have no hardware support such as scoreboards [2] (common in superscalar processor architectures) for resolving data dependencies between instructions.

The VLIW is fetched, decoded and then issued to the parallel execution units. In traditional VLIW architectures, a high number of parallel units therefore leads to a wide program memory port. To overcome this problem, this paper describes the architecture of the align unit. The align unit enables the use of unaligned program memory, which increases code density and thereby reduces the area consumption and power dissipation caused by the program memory. The first part of the paper discusses different DSP concepts with a focus on the micro-architecture of the program fetch unit. The second part explains the architecture of the align unit in detail and analyzes the influence of loop handling and of the execution of branches and interrupt service routines on the align unit. Finally, some realization results are discussed.

2. INSTRUCTION ALIGNMENT

This section explains the motivation for using unaligned program memory. Available DSP concepts are discussed with a focus on the micro-architecture of the instruction fetch unit. In the last subsection an example architecture is briefly introduced to explain the function of the proposed align unit.

2.1. Single Instruction Execution

For DSP architectures fetching and executing one instruction per cycle (e.g. the OAK DSP from DSP Group), the port to the program memory has the same size as the native instruction word. Figure 1 shows the OAKDSPCore from DSP Group Inc., which supports a 16-bit address bus PAB and a unidirectional data bus PDB, also 16 bits wide [3]. One instruction per cycle is fetched and then executed. The direct memory architecture of the OAKDSPCore allows two data memory fetch operations and the arithmetic operation to be encoded within one instruction word (a CISC instruction set).



Figure 1: OAKDSPCore

Therefore, even for MAC (Multiply and Accumulate) operations only one instruction word is needed.

2.2. VLIW Instruction Word

The increasing complexity of the applications using DSPs requires more powerful DSP architectures. One way to increase the performance of a DSP architecture is to increase the number of instructions executed in parallel. A VLIW architecture consists of multiple execution units performing multiple instructions in parallel.

Figure 2: VLIW TI C62x (fetch bundle split into execution packages EP1, EP2)

Figure 2 uses the C62x from Texas Instruments as an example. On the C62x, eight 32-bit instruction words can be fetched each clock cycle. Because of data dependencies between the instructions of the application code, it is not possible to use every unit in parallel each cycle. The unused instruction slots inside the VLIW have to be filled with NOP instruction words. This illustrates the problem of traditional VLIW architectures: the code density of the application is quite poor due to the NOP instructions. On the other hand, reducing the number of fetched instruction words per cycle leads to reduced use of the available data paths. The C62x of Texas Instruments provides the possibility to split the fetch bundle (the instructions fetched during the same clock cycle) into execution packages (EP) [4]. One fetch bundle can contain several execution bundles, which reduces the need for NOP instructions to align the execution bundle at the border of the fetch bundle. The Carmel DSP of Infineon solves the problem with CLIW, a Configurable Long Instruction Word [5]. Its program memory port is two native instruction words wide (2 times 24 bits). If more than two instructions have to be executed in parallel, an extended program memory port is available, providing an additional 96 bits for instruction coding.
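The code-density cost of NOP padding can be made concrete with a small Python sketch. It pads every execution bundle to a full eight-slot fetch bundle, as a purely traditional VLIW without execution packages must; the bundle contents and the one-bundle-per-fetch simplification are illustrative, not the C62x's actual packing.

```python
def pack_traditional_vliw(exec_bundles, slots=8):
    """Pad every execution bundle to a full fetch bundle with NOPs,
    as a purely traditional VLIW without execution packages must."""
    return [list(b) + ["NOP"] * (slots - len(b)) for b in exec_bundles]

# Three bundles that actually need 3, 1 and 5 of the 8 slots:
bundles = pack_traditional_vliw([["mac", "ld", "ld"], ["br"], ["mac"] * 5])
nops = sum(b.count("NOP") for b in bundles)
assert nops == 15   # 15 of the 24 fetched slots are wasted on NOPs
```

Execution packages, CLIW and prefix grouping all attack exactly these wasted slots.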

In the SC140 Starcore a prefix and a serial grouping mechanism is available to identify the instructions executing in parallel during the same clock cycle [6].

2.3. xLIW Instruction Word

The align unit described in this paper is part of a configurable DSP concept. Instead of a traditional VLIW architecture, xLIW, a scaleable long instruction word, is used. The native instruction word size is 20 bits, and long offsets or immediate values are coded in an additional 20-bit word. Four instruction words can be fetched each clock cycle, so the size of the fetch bundle is 80 bits, as illustrated in Figure 3.

Figure 3: Fetch Bundle (80 bits)

The size of the execution bundle (the execution bundle contains the instructions executed in parallel) depends on the number of available units and on their possible usage (which depends on the executed algorithm). In the implemented version five parallel units are available, so an execution bundle can contain from one up to five instruction words. Figure 4 shows examples of possible execution bundles. The first consists of only one instruction word; the second assumes the maximum number of instructions executed in parallel, each of them carrying an additional word for offsets or immediate values.

Figure 4: Execution Bundle (from 20 bits up to 200 bits)

To increase the code density, unaligned program memory is supported, which means that the beginning of an execution bundle does not have to coincide with the beginning of a fetch bundle. For an execution bundle with more than four instruction words, the execution bundle has to be spread over several fetch bundles. In the example of Figure 5 the execution bundle consists of 10 instruction words (the grey fields), which are spread over four fetch bundles (the rows).
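The number of fetch bundles an unaligned execution bundle touches follows directly from the four-words-per-fetch-bundle geometry described above. A minimal sketch (the function name is ours):

```python
WORDS_PER_FETCH = 4   # 80-bit fetch bundle / 20-bit instruction words

def fetch_bundles_touched(start_word, n_words):
    """How many fetch bundles (rows of program memory) an execution
    bundle of n_words occupies when it starts at word index start_word."""
    first_row = start_word // WORDS_PER_FETCH
    last_row = (start_word + n_words - 1) // WORDS_PER_FETCH
    return last_row - first_row + 1

# As in the Figure 5 example: a 10-word execution bundle starting at the
# last word of a fetch bundle spans four rows; aligned it would span three.
assert fetch_bundles_touched(3, 10) == 4
assert fetch_bundles_touched(0, 10) == 3
```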

248

Page 163: Scalable DSP Core Architecture Addressing Compiler ...edu.cs.tut.fi/panis483.pdf · Christian Panis Scalable DSP Core Architecture Addressing Compiler Requirements ... my time in

Figure 5: Unaligned Program Memory

VLIW architectures do not contain hardware support for resolving data dependencies between instructions; the scheduling is therefore done in software (e.g. by the compiler) [7]. Before the instructions fetched from the program memory can be executed, the execution bundle has to be aligned. This is done by the align unit explained in the following section.

3. ALIGN UNIT

This section illustrates the architecture of the align unit. The align unit consists of a fetch unit with an instruction buffer, used for fetching instruction words from the program memory and storing them into the buffer. The align phase sets up the execution bundle and maps the instruction words to the decoder ports. Additionally, the impact of hardware loops, branches and interrupts on the architecture of the align unit is discussed.

3.1. Pipeline structure

The proposed DSP architecture has a RISC-like pipeline based on three phases: instruction fetch, decode and execute. Each phase can be split over several clock cycles to increase the reachable core frequency. The align unit is part of the instruction fetch phase, as illustrated in Figure 6.

Figure 6: Pipeline Structure

In the example architecture, the instruction fetch from the program memory and the align process are each executed during one clock cycle.

3.2. Fetch phase

Each fetch bundle contains four instruction words, which are fetched from one physical memory address. The fetched values are stored in the instruction buffer, which is organized as a circular buffer built up of several buffer cells, as shown in Figure 7.

Figure 7: Buffer Cell

The four instruction words are stored in the buffer cell, and the related physical address is stored in the associated address field. The address field contains two additional bits: the executed bit E and the valid bit V. Each time a buffer cell is loaded with instructions from the program memory, the valid bit is set to one, indicating that the buffer cell contains fetched data. During the align phase, the executed bit is set as soon as the buffer entry has been used for execution. If there is no space left inside the instruction buffer for storing further program data, buffer cells whose executed bit is set to one may be overwritten. Details and exceptions of this rule are discussed in subsections 3.5 and 3.6. Figure 8 shows an example instruction buffer, built up of buffer cells as shown in Figure 7. A circulating fill pointer FP assigns the next free buffer cell. Parallel word detection already takes place during the fetch phase. As pointed out in subsection 2.3, some instructions can have an additional parallel word associated with them. Such long instructions have to be decoded with the additional word included. Therefore, a first pre-decoding takes place while the fetched instruction words are stored in the buffer cell, to detect parallel words. If an instruction has been identified as containing a parallel word, an indication bit beside the instruction word in the instruction buffer is activated. This indication bit is used during the mapping of instruction words to the decoder ports, which takes place at the end of the align phase. At the beginning of the fetch phase the fetch counter is compared with all physical address entries stored inside the instruction buffer (only entries with an activated valid bit are considered). If the comparison gives a positive result, the fetch bundle is already stored inside the instruction buffer.
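The buffer-cell bookkeeping described above can be sketched in Python. Class and method names are ours, and the replacement policy is simplified to "reuse empty cells or cells with the executed bit set"; the real hardware performs the address comparison in parallel across all cells.

```python
from dataclasses import dataclass, field

@dataclass
class BufferCell:
    address: int = -1                           # physical fetch-bundle address
    words: list = field(default_factory=list)   # four 20-bit instruction words
    valid: bool = False                         # V bit: cell holds fetched data
    executed: bool = False                      # E bit: content used, reusable

class InstructionBuffer:
    """Sketch of the circular instruction buffer (names are illustrative)."""

    def __init__(self, n_cells=8):
        self.cells = [BufferCell() for _ in range(n_cells)]
        self.fp = 0                             # fill pointer FP

    def lookup(self, fetch_addr):
        """Fetch-phase check: compare the fetch counter with all valid
        physical-address entries; on a hit the external program memory
        fetch can be disabled (saving power, e.g. during loops)."""
        return any(c.valid and c.address == fetch_addr for c in self.cells)

    def store(self, fetch_addr, words):
        """Fill the next reusable cell (empty, or executed bit set)."""
        while self.cells[self.fp].valid and not self.cells[self.fp].executed:
            self.fp = (self.fp + 1) % len(self.cells)   # skip protected cells
        cell = self.cells[self.fp]
        cell.address, cell.words = fetch_addr, list(words)
        cell.valid, cell.executed = True, False
        self.fp = (self.fp + 1) % len(self.cells)

buf = InstructionBuffer()
buf.store(0x100, ["i0", "i1", "i2", "i3"])
assert buf.lookup(0x100)       # hit: bundle already buffered
assert not buf.lookup(0x104)   # miss: fetch from program memory
```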


Figure 8: Buffer Fill

In that case the fetch from program memory is disabled, which reduces the power dissipation at the program memory port. Especially during loop execution this can reduce the traffic at the program memory port, and therefore the power dissipation, considerably. Details concerning loop handling are discussed in subsection 3.5.

3.3. Align phase

Figure 9 illustrates the read process from the instruction buffer. As already pointed out, the VLIW approach does not provide hardware support such as scoreboards to resolve data dependencies between instructions. Subsection 2.3 mentioned that an execution bundle can be spread over up to four fetch bundles. Therefore, if the execution bundle is spread over several fetch bundles and the instructions are not already stored inside the instruction buffer, stall cycles are necessary to fetch the instructions needed to build up the execution bundle. The program counter (pointing to the address of the next instruction) and the consecutive addresses are checked to determine whether instructions of the execution bundle are already inside the buffer. The two lower bits of the program counter are not used for the comparison with the physical address; they address the single instruction words inside the buffer cell. Besides the comparison with the reduced program counter, the consecutive addresses are compared as well, i.e. the program counter plus 4, 8 and 12 (or the reduced program counter +1, +2 and +3). The matching buffer cells are read from the instruction buffer and used for further analysis; missing addresses are fetched in the next clock cycle. In Figure 10 the consecutive instruction words have been assembled into an analysis bundle, which is used to identify the next execution bundle.
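The program counter split described above can be illustrated with a short sketch; the function name and the example address are ours, while the two word-select bits and the four consecutive cell addresses come from the text.

```python
def addresses_to_check(pc):
    """Split the program counter into the word offset inside a buffer
    cell (two lower bits) and the reduced program counter, and return
    the four consecutive cell addresses that are compared against the
    valid physical-address entries of the instruction buffer."""
    reduced_pc = pc >> 2      # drop the two word-select bits
    offset = pc & 0x3         # selects the word inside the buffer cell
    return offset, [reduced_pc + k for k in range(4)]

offset, cells = addresses_to_check(0x106)
assert offset == 2
assert cells == [0x41, 0x42, 0x43, 0x44]
```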

Figure 9: Buffer Read

Each of the blocks in Figure 10 corresponds to an instruction word. The p (for parallel) or s (for sequential) inside the blocks indicates whether the instruction word will be executed in parallel with the leading instruction word or separately in the next clock cycle.

p p s p p p p p s s p p s p p s

Figure 10: Analysis Bundle

For hardware realization reasons (a serial search of the cells would take too long), the check for the end of the execution bundle is done in parallel on subsections of the analysis bundle. In Figure 11 each buffer cell is analyzed separately. Starting from the program counter address (indicated in Figure 11 by the grey arrow), the next valid end (indicated by the arrows below the instruction words) is identified. The analysis of the separate buffer cells takes place in parallel. Because at most 10 instruction words build up one execution bundle, the size of 16 entries for the analysis bundle guarantees one valid end inside the analysis bundle.

p p s p p p p p s s p p s p p s

program counter

Figure 11: Valid End Detection
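The valid-end search can be sketched as follows. The single-character p/s encoding is illustrative (the core derives this information from three bits of each instruction word), and the sequential loop stands in for the parallel per-cell search done in hardware.

```python
def find_valid_end(flags, pc):
    """Search the analysis bundle for the end of the execution bundle
    starting at position pc. 'p' marks a word executed in parallel with
    the leading word, 's' the last word of its execution bundle.
    Returns None if no valid end lies at or after pc, in which case a
    stall cycle is needed."""
    for i in range(pc, len(flags)):
        if flags[i] == "s":
            return i
    return None

# The 16-entry analysis bundle of Figures 10 and 11:
analysis = list("ppspppppssppspps")
assert find_valid_end(analysis, 0) == 2   # bundle = words 0..2
assert find_valid_end(analysis, 3) == 8   # bundle = words 3..8

# Figure 12 case: truncated analysis bundle, no valid end after the pc
assert find_valid_end(list("ppsppppp"), 3) is None
```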

If a valid end has been detected, the next execution bundle is already available and can be forwarded to the decode ports. The proposed instruction set can be split into three operation classes: computation, load/store and program flow operations. The first three bits of each instruction word are used to encode the execution bundle information and the affiliation to the operation class. The operation class of the instruction word is analyzed at the end of the align phase and the instruction is mapped to the corresponding decoder port. Each decoder port is dedicated to a certain operation class. This reduces the overhead of the instruction decoder and allows a decoder port to be removed when the related data path is no longer available (considering that the align unit is part of a configurable DSP concept).

p p s p p p p p

program counter

Figure 12: No Valid End

Figure 12 illustrates an analysis bundle where stall cycles become necessary. The analysis bundle is the same as in the example of Figure 11, but not all four consecutive fetch addresses are available. The analysis gives a valid end, but it is located before the actual program counter and is therefore not relevant. The missing valid end in the analysis bundle leads to a stall cycle. Linear program flow code can be scheduled to prevent stall cycles due to missing instruction words. Branches are a little more critical. The compiler (or the programmer, if coded manually) has to take care that the branch target fits into one fetch bundle to prevent stall cycles. This can be managed either by short execution bundles at the branch target or by aligning the branch target to the start address of a fetch bundle. If the instruction buffer offers enough entries to keep the loop body, the alignment of the loop is not important. Executing the loop the first time may lead to a stall cycle, but further iterations are handled directly from the instruction buffer with no stall cycles necessary.

3.4. Mapping

The execution bundle has been identified and has to be mapped to the decoder ports. During the fetch and align phases the operation class and any associated parallel words have been detected. With this information, the instructions of the execution bundle are mapped to the related decoder ports. Each of the available decoder ports decodes only instructions of one operation class. Parallel words containing long immediate values and offsets are mapped directly to the execution units. The order of the instruction words inside the execution bundle is not restricted. Figure 13 illustrates the instruction mapping: the decoder port is chosen depending on the operation class.

Figure 13: Instruction Mapping (decoder ports: two load/store, two computation, one program flow)
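The port assignment can be sketched in Python. The mnemonic-to-class table and the per-class port counts (two load/store, two computation, one program flow, as suggested by Figure 13) are illustrative assumptions; the real core extracts the class from three bits of each instruction word.

```python
# Hypothetical mnemonic-to-class table (illustrative only):
OP_CLASS = {"mac": "computation", "add": "computation",
            "ld": "load/store", "st": "load/store",
            "br": "program flow"}
# Decoder ports per operation class, following Figure 13:
PORTS = {"computation": 2, "load/store": 2, "program flow": 1}

def map_to_ports(execution_bundle):
    """Assign each instruction of an execution bundle to a free decoder
    port of its operation class; the order inside the bundle is not
    restricted, only the per-class port count limits the mapping."""
    used = {cls: 0 for cls in PORTS}
    mapping = []
    for instr in execution_bundle:
        cls = OP_CLASS[instr]
        if used[cls] >= PORTS[cls]:
            raise ValueError(f"no free {cls} decoder port")
        mapping.append((instr, f"{cls}#{used[cls]}"))
        used[cls] += 1
    return mapping

m = map_to_ports(["ld", "mac", "st", "add", "br"])
assert [port for _, port in m] == ["load/store#0", "computation#0",
                                   "load/store#1", "computation#1",
                                   "program flow#0"]
```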

3.5. Loop handling

The instruction buffer can be used to store the instruction words of the loop body inside the core, so no further fetch operations from the program memory are necessary during loop execution. The proposed DSP concept supports the execution of single-instruction-cycle loops (repeat) and of loops consisting of several execution bundles (block repeat). When a repeat loop is executed, the execution bundle that will be executed multiple times is fetched once from the program memory. Further execution cycles take the instruction words from the entries of the instruction buffer. During loop execution, the fetch unit fetches consecutive instruction words into the instruction buffer to prevent stall cycles after the loop. To prevent loop body entries in the instruction buffer from being overwritten during loop execution, the executed bit of already executed instructions is not set to one. When the loop is executed for the last time (indicated by a hardware loop counter equal to zero), the executed bits of the instruction buffer entries are set to one (as in linear program flow) and the buffer cells can be filled with new instruction words. When a block repeat loop is executed, the relationship between the size of the instruction buffer and the size of the loop body influences the efficient use of the buffer. If the loop body fits into the entries of the instruction buffer, the handling is similar to that of a repeat loop: the loop body is fetched once from program memory, stored into the instruction buffer and then executed multiple times. During the execution of the loop body, the fetch counter is enabled to fill free buffer cell entries of the instruction buffer.


Figure 14: Bkrep Program Flow

All executed bits of the instruction buffer entries are set to one before the entries are filled with instructions of the loop body. This prevents already fetched entries that are not part of the loop body from occupying entries of the instruction buffer. Again, the executed bits of the loop body are not set to one until the loop body is executed for the last time. If the loop body does not fit into the instruction buffer, the normal loop handling is deactivated: after executing instructions of the loop body, the executed bits are set to one. The only difference to linear program flow is that the fetch counter is bound to the program counter, which means that no instructions outside the loop body are fetched during loop execution. Figure 14 illustrates the program flow for the handling of a block repeat loop (bkrep). A while loop is implemented with a conditional branch instruction; the hardware loop counter cannot be used and the loop body cannot be handled automatically inside the instruction buffer. Therefore, instructions are available to control the behavior of the instruction buffer manually. To prevent unintended malfunction during manual control of the buffer content, automatic control features are included.
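The executed-bit policy for a block-repeat loop that fits into the buffer can be condensed into a few lines; the function and dictionary names are ours, and the cells are modeled as plain dictionaries rather than hardware registers.

```python
def retire(cells, body_cells, last_iteration):
    """Executed-bit policy while a block-repeat loop runs: cells holding
    the loop body keep E = 0 so they cannot be overwritten; only on the
    last pass (hardware loop counter equal to zero) are they released."""
    for i in body_cells:
        cells[i]["E"] = 1 if last_iteration else 0

cells = [{"E": 0} for _ in range(8)]
retire(cells, [0, 1], last_iteration=False)   # mid-loop: body protected
assert cells[0]["E"] == 0
retire(cells, [0, 1], last_iteration=True)    # final pass: cells reusable
assert cells[0]["E"] == 1 and cells[1]["E"] == 1
```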

3.6. Branches, Interrupt handling

If a branch instruction is executed and the branch is taken, the executed bits of the instruction buffer entries are set to one. The fetch unit then starts filling the buffer cells with instruction words of the branch target. To prevent stall cycles, the branch target should fit into one fetch bundle. The handling of interrupts is similar to the behavior of unconditional branches. The branch target address is provided by the ICU (interrupt control unit). The executed bits of the already fetched entries are set to one and the instruction buffer is filled with entries of the ISR (interrupt service routine).

4. RESULTS

The described align unit (including the fetch unit) has been modeled in VHDL at register-transfer level. The chosen instruction buffer has 8 buffer cells and allows the storage of 32 instruction words. Simulation results have shown that most of the inner loops typically found in classical DSP functions can be handled with this instruction buffer size. Synthesis in a 0.13 µm technology gives an estimated size of 0.15 mm². For dedicated benchmark examples the use of the instruction buffer leads to 60% less switching activity at the program memory port and therefore to reduced power dissipation. However, the absolute numbers in this section have to be considered carefully due to the strong dependency on the chosen benchmark examples.

5. CONCLUSION

The align unit enables the usage of multiple data paths in parallel without the drawback of wasting program memory (a problem of traditional VLIW architectures). The instruction buffer inside the align unit is used to reduce the width of the program memory port. Loops can be handled inside the instruction buffer, which reduces the fetch activity at the program memory port and therefore the power dissipation. The unaligned program memory increases the code density of the VLIW architecture. The align unit is part of a development project for a configurable DSP.

REFERENCES

[1] P. Lapsley, J. Bier, A. Shoham and E. A. Lee, DSP Processor Fundamentals, Architectures and Features (New York, IEEE Press, 1997).

[2] Dezso Sima, Terence Fountain, Peter Kacsuk, Advanced Computer Architectures: A Design Space Approach (Harlow, Addison Wesley Publishing Company, 1997).

[3] Siemens, OAK DSP Core, Programmer's Reference Manual (Siemens AG, 01/98).

[4] Texas Instruments, TMS320C6000 CPU and Instruction Set Reference Guide (Texas Instruments, 10/2000).

[5] Infineon Technologies, Carmel DSP Core Architecture Specification (Infineon Technologies, 2001).

[6] Motorola, SC140 DSP Core Reference Manual (Motorola, Rev. 0, 12/99).

[7] J. L. Hennessy, D. A. Patterson, Computer Architecture. A Quantitative Approach (San Mateo CA, Morgan Kaufmann Publishers, 1996).


PUBLICATION 4 C. Panis, M. Bramberger, H. Grünbacher, J. Nurmi, �A Scaleable Instruction Buffer for a Configurable DSP Core�, in Proceedings of 29th European Solid State Conference (ESSCIRC 2003), Estoril, Portugal, September 16-18, 2003, pp. 49-52.

©2003 IEEE. Reprinted, with permission, from proceedings of 29th European Solid State Conference.


A Scaleable Instruction Buffer for a Configurable DSP Core

Christian Panis1, Michael Bramberger2, Herbert Grünbacher1, Jari Nurmi3

1 Carinthian Tech Institute, Europastrasse 4, A-9524 Villach, Austria
2 Infineon Technologies, Siemensstrasse 2, A-9500 Villach, Austria
3 Tampere University of Technology, P.O. Box 553, FIN-33101 Tampere, Finland

Abstract:

The increasing system complexity of SOC applications leads to a growing demand for powerful embedded DSP processors. To raise the performance of DSP processors, the number of instructions executed in parallel has been increased, and VLIW (Very Long Instruction Word) has been introduced to program the parallel units. Traditional VLIW architectures feature poor code density and therefore high area consumption caused by the program memory. To overcome this limitation, the proposed configurable DSP core supports unaligned program memory: to reduce the size of the program memory port, an execution bundle can be mapped onto several fetch bundles. To bridge the resulting memory bandwidth mismatch between fetch and execution bundle, an instruction buffer is introduced. Using the instruction buffer during the execution of inner loops reduces the power dissipation of the DSP subsystem. Cache logic controls the entries of the instruction buffer during out-of-order execution. This paper describes the architecture and the implementation of the instruction buffer, which is part of a project for a configurable DSP core.

1. Introduction

Increasing system complexity of SOC applications leads to a strong demand for powerful embedded processors. To increase the performance of embedded processors, the number of pipeline stages is increased to reach higher clock frequencies, and the number of instructions executed in parallel is increased to gain higher system performance. To program the available parallel units, VLIW has been introduced [1]. The drawback of traditional VLIW architectures is an increase of the program memory and therefore poor code density [2]. To overcome this problem, available DSP architectures decouple the fetch bundle and the execution bundle; the size of the fetch bundle (and therefore the size of the program memory port) is equal to the size of the maximum possible execution bundle. xLIW [3], a scaleable long instruction word, supports a reduced program memory port size. The size of the fetch bundle is constant, while the size of the execution bundle can differ each cycle (depending on application-specific requirements), and one execution bundle can be mapped onto several fetch bundles. To prevent stall cycles due to an incomplete execution bundle, an instruction buffer is introduced. The entries of the instruction buffer are controlled by cache logic to retain the advantages of the instruction buffer also during out-of-order execution. A typical feature of DSP algorithms are loop constructs; therefore the proposed DSP architecture supports zero-overhead loop instructions. Loop handling, such as decrementing the loop counter, is done in hardware and does not require further instructions. Using the instruction buffer during loop handling reduces the number of program memory fetch cycles and therefore the power dissipation: the loop is fetched once from memory and then executed from the instruction buffer. This paper describes the architecture and the implementation of the instruction buffer. The first section introduces VLIW architectures and the xLIW concept. The second part illustrates specific requirements of the proposed DSP core. The third part illustrates the architecture of the instruction buffer and is followed by a section describing implementation details.

2. Motivation This section is used to briefly introduce the drawback of traditional VLIW architectures concerning code density. Available solutions to overcome this problem are illustrated. At the end of this section xLIW, a scaleable long instruction word is briefly introduced. 2.1. VLIW Traditional VLIW architectures feature poor code density. Instruction scheduling is done in SW and therefore no hardware support for resolving data and instruction dependencies like scoreboards is available (static scheduling). The fetch bundle (instructions which are fetched in parallel) is fetched from program memory and the instructions are decoded. The size of the fetch bundle is equal to the maximum number of parallel executed instructions. An increasing number of parallel data paths leads to a wide program memory port and poor code density due to missing issuing queues. The data dependency inside the application code leads to a poor usage of the available data paths and for traditional VLIW architectures to a poor code density. 2.2. TI C62xx A possibility to overcome this problem is to decouple fetch and execution bundle (instructions which are executed in parallel). The C62xx from Texas Instruments enables to map several execution bundles to one fetch bundle. The size of the execution bundle is scaleable and

Page 170: Scalable DSP Core Architecture Addressing Compiler Requirements (edu.cs.tut.fi/panis483.pdf), Christian Panis

is called VLES (Variable-Length Execution Set). The program memory port is 8 instruction words wide, each instruction word 32 bits, which leads to a program memory port size of 256 bits. As illustrated in Figure 1, the fetch bundle can consist of several execution bundles, which are executed during consecutive clock cycles. The same figure illustrates a problem related to this implementation: the execution bundle has to fit completely into the fetch bundle (the execution bundles in Figure 1 are marked with n, n+1, …).

Figure 1: VLES of TI C'62xx

On the one hand this leads to a wide program memory port (the size of the fetch bundle is equivalent to the size of the largest possible execution bundle). On the other hand, unused program memory addresses remain due to the alignment requirements. If the execution bundle does not completely fit into the remaining space of the current fetch bundle, it has to be shifted into the next fetch bundle. Static scheduling does not allow changing the order of the execution bundles to optimize the consumed program memory space. 2.3. TI C64xx To overcome the drawback of unused program memory due to alignment restrictions, the C64xx of Texas Instruments allows mapping parts of an execution bundle into one fetch bundle; the remaining parts of the execution bundle are mapped to the next fetch bundle. A single bit in the instruction word is used to indicate the end of the execution bundle (the same is true for the C62xx architecture). 2.4. SC140 The Starcore SC140 supports a similar concept of decoupling fetch and execution bundle. Instead of a single bit, a prefix word is used to indicate the size of the execution bundle. The prefix word also contains further information, e.g. for predicated execution.
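The cost of these packing rules can be sketched with a small counting model (a rough sketch with invented bundle sizes; `fetch_bundles_needed` is an illustrative helper, not part of any cited architecture):

```python
import math

def fetch_bundles_needed(sizes, fetch_width=8, allow_split=False):
    """Count the fixed-width fetch bundles needed to hold a sequence of
    execution bundles in program order (static scheduling forbids
    reordering).  allow_split=False models the C62xx-style rule that an
    execution bundle must fit entirely into one fetch bundle;
    allow_split=True models the C64xx-style relaxation where a bundle
    may straddle the fetch-bundle boundary."""
    if allow_split:
        return math.ceil(sum(sizes) / fetch_width)
    bundles, free = 0, 0
    for size in sizes:
        if size > free:        # does not fit: pad the rest, open a new bundle
            bundles += 1
            free = fetch_width
        free -= size
    return bundles

# Hypothetical mix of control code and one full-width inner-loop bundle:
sizes = [2, 3, 6, 2, 8, 1]
print(fetch_bundles_needed(sizes))                    # aligned packing
print(fetch_bundles_needed(sizes, allow_split=True))  # split allowed
```

For this invented sequence the alignment restriction costs one extra fetch bundle (and the corresponding padded, unused program memory words), which is exactly the code-density loss discussed above.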

The introduced concepts allow sizing the execution bundle to the requirements of the application code and therefore increase the code density. However, these concepts still feature a wide program memory port, whose size is determined by the maximum number of instructions that can be executed in parallel. 2.5. xLIW The proposed DSP core features a similar concept of decoupling fetch and execution bundle to increase the code density compared with traditional VLIW architectures. The proposed DSP core allows configuring the main architectural features, which makes it possible to drive the core architecture into an application-specific optimum concerning power dissipation and area consumption. To reduce the size of the program memory port without limiting the calculation bandwidth, it is possible to map one execution bundle onto several fetch

bundles. On average, the fetch bandwidth and the required execution bandwidth (driven by the requirements of the application code) have to match. To compensate a bandwidth mismatch, which can occur in small code sections (e.g. inner loops), an instruction buffer (with n entries) is introduced. The buffer is filled on one side with fetch bundles of constant width, while on the other side execution bundles of variable size are assembled. To use the introduced instruction buffer for reducing power dissipation, loops are executed from the buffer. The instructions of the loop are fetched only once and executed several times without additional fetch cycles; the switching activity at the program memory port is reduced and therefore also the power dissipation of the DSP subsystem.
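The compensation can be sketched cycle by cycle (a rough behavioral model; the bundle sizes and port widths below are invented for illustration):

```python
def stalls_for(exec_sizes, fetch_width=4, buffer_words=32):
    """Each cycle one constant-width fetch bundle enters the buffer (if
    there is room) while one variable-size execution bundle tries to
    leave it.  A stall occurs only while a bundle wider than the fetch
    bundle waits for enough words to accumulate."""
    buffered, stalls, pending = 0, 0, list(exec_sizes)
    while pending:
        if buffered + fetch_width <= buffer_words:
            buffered += fetch_width        # program memory fetch
        if buffered >= pending[0]:
            buffered -= pending.pop(0)     # issue next execution bundle
        else:
            stalls += 1
    return stalls

# A 10-word bundle behind a 4-word port stalls briefly; since the
# average execution bandwidth matches the fetch bandwidth, the buffer
# catches up again afterwards.
print(stalls_for([2, 3, 10, 2, 1]))
```

In the real core the decoupled fetch counter can pre-fill the buffer ahead of such wide bundles, so even this transient stall can often be hidden.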

3. Architectural Requirements This section points out the requirements for the architecture of the instruction buffer. The requirements are scalability, fully deterministic behaviour and power-efficient handling of loop constructs (including, e.g., while loops, which are not handled as hardware loops). • Scalability: The proposed DSP core allows scaling of

most of the architectural features like the size of the register file, the width and number of the data paths, the instruction coding and the memory bandwidth [4]. This is necessary to drive the DSP core architecture to an application specific optimum in power dissipation and area consumption. For the architecture of the instruction buffer this means that the instruction size and the number of possible entries of the instruction buffer have to be scaleable.

• Deterministic behavior: DSP applications require deterministic timing behavior. The execution time of a certain program has to be constant (worst case timing has to be considered during the definition of the system architecture; therefore a prediction mechanism does not gain any system performance). For the architecture of the instruction buffer this requirement does not allow any influence on the execution time of the program, independent of whether the instruction words are already in the buffer or have to be fetched from program memory.

• Power-efficient loop handling: The instruction buffer should be used during loop handling to reduce the power dissipation at the program memory port. The proposed DSP architecture supports zero-overhead hardware loops. However, the advantage of the instruction buffer should also be available during, e.g., while loops (which are implemented by conditional branch instructions) and branch instructions inside loop bodies. To handle while loops, manual control of the buffer entries is necessary. During out-of-order execution the advantage of the instruction buffer has to be available as well.

Considering these requirements the following section will be used to illustrate the chosen instruction buffer architecture.


4. Instruction Buffer This section introduces the chosen architecture of the instruction buffer, mainly influenced by the requirements introduced in section 3. 4.1. Program Memory Fetch The instruction buffer is part of the fetch stage of the pipeline. To fulfil the requirement concerning deterministic behaviour, the execution of the program has to use the same number of cycles independent of whether the instructions are already inside the buffer or not. Therefore at the beginning of the fetch stage, in parallel to the access to program memory, the content of the instruction buffer is compared with the required instruction words. If the data has been fetched before and is already available inside the instruction buffer, the fetch cycle to the program memory is suspended. If more than one clock cycle is reserved for the program memory fetch, this is true for all of them. Therefore instructions already available inside the instruction buffer do not reduce the number of pipeline stages, but reduce the power dissipation at the program memory port.

Figure 2: Program Fetch Decision

In Figure 2 the fetch stage of the pipeline is illustrated. At the beginning of the first cycle the decision whether to suspend the memory access is made. 4.2. Buffer Structure In Figure 3 an example of the instruction buffer is illustrated (the fill operation of the buffer). A fill pointer fills empty cells. Each cell has additional control bits. The executed bits (E) indicate whether a fetched instruction word has already been used for execution and can therefore be overwritten with new entries. The valid bits (V) indicate valid entries inside the instruction buffer; therefore no initialization of the buffer entries is necessary. Assuming linear program flow, the buffer can be built up as a circular buffer.

Figure 3: Instruction Buffer Write
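The role of the V and E bits in the fill operation can be sketched as follows (a behavioral model with invented method names, not the RTL):

```python
class InstructionBuffer:
    """Behavioral sketch of Figure 3: each cell carries a valid (V) and
    an executed (E) bit, and the fill pointer walks the cells as a
    circular buffer.  Only empty (V=0) or already consumed (E=1) cells
    may be overwritten."""
    def __init__(self, n_cells=8):
        self.valid = [False] * n_cells     # V bits: cell holds a fetched bundle
        self.executed = [False] * n_cells  # E bits: bundle already consumed
        self.data = [None] * n_cells
        self.fill = 0                      # fill pointer

    def write(self, bundle):
        i = self.fill
        if self.valid[i] and not self.executed[i]:
            return False                   # live entry: fetch has to wait
        self.data[i] = bundle
        self.valid[i], self.executed[i] = True, False
        self.fill = (i + 1) % len(self.data)
        return True

    def consume(self, i):
        self.executed[i] = True            # cell may be overwritten again

buf = InstructionBuffer(n_cells=2)
buf.write("bundle0")
buf.write("bundle1")
print(buf.write("bundle2"))   # False: both cells still live
buf.consume(0)
print(buf.write("bundle2"))   # True: cell 0 was marked executed
```

Locking a loop body simply amounts to not setting the E bits of its cells, so the fill pointer skips over them until the last iteration.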

If branch instructions interrupt the program flow, already fetched instructions cannot be used any more. That is why cache logic is used to control the entries of the instruction buffer.

4.3. Cache Logic For the cache logic a set-associative approach has been chosen, as illustrated in Figure 4. The advantage compared with a fully-associative approach is that the cache directory and the cache data memory receive the address in parallel [5]. In the fully-associative approach the cache data memory receives the address sequentially (which increases the critical timing path). The tag address is used to distinguish the different pages of the memory address space. One possible problem is cache thrashing. Cache thrashing takes place when a frequently used location is replaced by another frequently used location. The problem of cache thrashing can be reduced by using an n-way set-associative cache.

Figure 4: Set Associative Cache
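How associativity reduces thrashing can be sketched with a tiny directory model (the parameters and the FIFO replacement are illustrative assumptions, not the implemented logic):

```python
class SetAssocCacheDir:
    """Sketch of the directory lookup: the index bits select a set (so
    directory and data memory can get the address in parallel), while
    the tag bits distinguish the memory pages mapping onto that set."""
    def __init__(self, n_sets=4, ways=2, line_words=4):
        self.n_sets, self.line_words = n_sets, line_words
        self.tags = [[None] * ways for _ in range(n_sets)]

    def access(self, addr):
        index = (addr // self.line_words) % self.n_sets
        tag = addr // (self.line_words * self.n_sets)
        if tag in self.tags[index]:
            return True                  # hit: program memory fetch suspended
        self.tags[index].pop(0)          # miss: FIFO replacement in this set
        self.tags[index].append(tag)
        return False

# Two hot cache lines mapping onto the same set thrash a direct-mapped
# (1-way) directory but coexist in a 2-way set:
pattern = [0, 64, 0, 64, 0, 64]
for ways in (1, 2):
    cache = SetAssocCacheDir(ways=ways)
    print(ways, sum(cache.access(a) for a in pattern))
```

With one way, every access evicts the other hot line and misses; with two ways, both lines stay resident after the first pass.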

5. Implementation This section illustrates the implementation aspects of the instruction buffer. For regular structures a full-custom implementation has significant advantages in power dissipation, area consumption and achievable frequency (which is quite important due to the critical timing of the program memory access). To meet the requirement of scalability, the DPG (Data Path Generator) of RWTH Aachen was used to develop the full-custom parts of the instruction buffer [6]. 5.1. Partitioning As shown in Figure 5, the set-associative cache is split into a CDM (cache data memory) and a CAD (cache address directory).

Figure 5: CDM and CAD

Both are well suited to full-custom implementation due to their regular architecture. The EBM (executed-bit management, which also includes the handling of the valid bits) is part of the control logic and therefore implemented in VHDL. The CDM is used to store the instructions inside the instruction buffer, which is organized in cache lines. The


number of cache lines is scaleable, as is the number of bits of each cache line. This enables the implementation of instruction buffer variants with a different number of entries; the size of the instruction words can also be changed. Each block in Figure 3 (consisting of 4 entries; the first one is marked grey) corresponds to one cache line inside the instruction buffer.

Figure 6: Cache Line

To prevent an asymmetrical height-to-width ratio, two successive bits are placed one upon the other as illustrated in Figure 6. The data are fed in vertically and the control of the cells (such as load, MatchX or debug control signals) is done horizontally. With the implementation example in Figure 6 the granularity of the size of the instruction words is limited to two bits. If the number of bits per instruction word is odd, one cell remains unused. The CAD is used to store the addresses of the instructions in the instruction buffer. Each of the address bits is compared with five entries (four data ports and one debug port, illustrated in Figure 7). If the address matches one of them, the related MatchX signal is set. The MatchX lines are cascaded to reduce the timing-critical path. In this implementation example four MatchX lines are grouped.

Figure 7: CAD Architecture

5.2. Results The instruction buffer has been implemented in a 0.13µm CMOS technology with a supply voltage of 1.2V (1.5V is the nominal supply voltage; the lower one has been chosen to reduce power dissipation).

cache lines | area      | MatchX | data present time
16          | 0.069 mm² | 920 ps | 1.2 ns
32          | 0.138 mm² | 920 ps | 1.3 ns
64          | 0.276 mm² | 920 ps | 1.4 ns
128         | 0.552 mm² | 920 ps | 1.6 ns

Table 1: Implementation Results

In Table 1 implementation results for several configurations are given. Due to the chosen architecture, the timing for the MatchX line is independent of the number of cache lines. The data present time (access time) increases with the number of cache lines due to the increasing capacitance of the data output lines. In Figure 8 the layout of one cache configuration is displayed (16 cache lines, 80 data bits, 14 address bits).

Figure 8: Layout

The upper part contains the CAD, followed by the driver of the MatchX signals. The lower part in Figure 8 contains the CDM.

6. Conclusion The scaleable instruction buffer proposed in this paper can be used to reduce the width of the program memory port without limiting the calculation bandwidth. Inner loops in typical DSP application code can be handled power efficiently due to a reduced number of program memory fetch cycles. The scalable implementation of the buffer architecture enables application-specific adaptations to minimize the consumed silicon area. The introduced cache logic controlling the buffer entries makes the advantages of the instruction buffer available during out-of-order execution. The chosen architecture allows minimizing the worst-case execution time as required by the real-time constraints of DSP algorithms. The scaleable instruction buffer is part of a project for a configurable DSP core.

7. Acknowledgement The work has been supported by RWTH Aachen and by the EC with the project SOC-Mobinet (IST-2000-30094).

Literature
[1] P. Lapsley, J. Bier, A. Shoham and E. A. Lee, “DSP Processor Fundamentals: Architectures and Features”, IEEE Press, New York, 1997.
[2] D. Sima, T. Fountain, P. Kacsuk, “Advanced Computer Architectures: A Design Space Approach”, Addison Wesley, Harlow, 1997.
[3] C. Panis, R. Leitner, H. Grünbacher, J. Nurmi, “xLIW – a Scaleable Long Instruction Word”, ISCAS 2003, Bangkok, Thailand, 2003.
[4] C. Panis, G. Laure, W. Lazian, A. Krall, H. Grünbacher, J. Nurmi, “DSPxPlore – Design Space Exploration for a Configurable DSP Core”, GSPx 2003, Dallas, Texas, USA, 2003.
[5] J. Handy, “The Cache Memory Book”, Academic Press, 1998.
[6] O. Weiss, M. Gansen, T. G. Noll, “A Flexible Datapath Generator for Physical Oriented Design”, Proceedings of ESSCIRC 2001, Villach, September 18-20, 2001, pp. 408-411.


PUBLICATION 5 C. Panis, H. Grünbacher, J. Nurmi, “A Scaleable Instruction Buffer and Align Unit for xDSPcore”, IEEE Journal of Solid-State Circuits, Volume 39, Number 7, July 2004, pp. 1094-1100.

©2004 IEEE. Reprinted, with permission, from IEEE Journal of Solid-State Circuits.


1094 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 39, NO. 7, JULY 2004

A Scalable Instruction Buffer and Align Unit for xDSPcore

Christian Panis, Member, IEEE, Herbert Grünbacher, Member, IEEE, and Jari Nurmi, Senior Member, IEEE

Abstract—Increasing mask costs and decreasing feature sizes together with productivity demand have led to the trend of platform design. Software programmable embedded cores are used to provide the necessary flexibility in integrated systems. Facing increasing system complexity, single-issue digital signal processors (DSPs) have been replaced by cores providing the execution of several instructions in parallel. The most common programming model for multi-issue DSP core architectures is Very Long Instruction Word (VLIW), which is based on static scheduling, enables minimization of the worst case execution time and reduces core complexity. The drawback of traditional VLIW is poor code density, which leads to high program memory requirements and, therefore, requires a large silicon area for the DSP subsystem. To overcome this problem without limiting the core performance, a scalable long instruction word (xLIW) is introduced. A special align unit is used for implementing the xLIW program memory interface. In this paper, the align unit and its main architectural feature, a scalable instruction buffer, are introduced in detail. xLIW is part of a project for a parameterized DSP core.

Index Terms—Application-specific integrated circuits (ASICs), buffer memories, cache memories, digital signal processors, parallel architectures, reduced instruction set computing.

I. INTRODUCTION

THE single-issue digital signal processor (DSP) architecture has proven to be inadequate to achieve the performance required by high-end media processing applications. DSP architects first tried to tackle this problem by adding an extra multiply-accumulate (MAC) unit to the traditional DSP architecture [1]–[3]. However, this solution is not scalable beyond two MAC units and even then it is cumbersome to program efficiently. Since the late 1990s, the Very Long Instruction Word (VLIW) model has dominated high-end DSP architectures. There are several advantages arising from this programming model, but several disadvantages, as well. These disadvantages are especially severe when a low-cost and low-power implementation is desired. Thus, improvements in the program memory interface and instruction representation have emerged to counteract the problems evident in VLIW.

While targeting at high performance of the data processing architecture, the ease of programming also needs to be addressed. It is important that processors are also efficiently programmable in high-level languages since instruction scheduling in VLIW

Manuscript received October 30, 2003; revised January 20, 2004. This work was supported by RWTH Aachen, the EC under Project SOC-Mobinet (IST-2000-30094), and Infineon Technologies Austria.

C. Panis and H. Grünbacher are with the Carinthian Technical Institute, A-9524 Villach, Austria (e-mail: [email protected]; [email protected]).

J. Nurmi is with the Tampere University of Technology, FIN-33101 Tampere, Finland (e-mail: [email protected]).

Digital Object Identifier 10.1109/JSSC.2004.829411

relies on compilers that are, at their best, able to reach near-optimal solutions. It is no longer practical to program a highly parallel DSP at assembly level, which was and is still extremely common with traditional DSP architectures. Therefore, the improved architectural features should make the construction of efficient compilers easier. One of the targets is orthogonality of the resources available for the programmer/compiler.

When working with DSP architectures, there is one thing that cannot be compromised: the predictability of the execution time. Superscalar DSP architectures have been suggested [4], but DSP programmers are wary of this approach since the instruction scheduling is performed dynamically. They cannot get a grip on the schedules before the application is running, but need to plan the timing based on the worst case scenario instead. The reason cache memories in real-time DSP systems are not used is exactly the same. It is due to the uncertainty of meeting the processing deadlines because of something beyond the control of the programmer. The importance of providing program memory interfaces that can guarantee real-time operation is, therefore, apparent.

This paper starts with an overview of existing program memory port architectures. The requirements for improvements are summarized and the properties of the xLIW processor architecture are presented. The related instruction buffer architecture alongside its operation and implementation are covered. The paper concludes with a presentation of results.

II. STATE OF THE ART

This section briefly introduces available implementations of program memory ports for DSPs. The first subsection illustrates VLIW and its related advantages and disadvantages, and the second part introduces some commercial implementations of the concept having already considered ways of overcoming the relevant drawbacks.

A. VLIW

VLIW is the most common programming model for high-performance DSPs [5]. To execute instructions on parallel available units, a very long instruction word is built up of several instruction words which are used to code the instructions for the different units. The long instruction word is fetched, decoded, and executed. The VLIW model differs from the superscalar programming model in that there is no issuing queue available [6]. Data and control dependencies are resolved in software by static scheduling [7]. The advantage of static scheduling is deterministic behavior of the executed algorithms allowing minimization

0018-9200/04$20.00 © 2004 IEEE



Fig. 1. VLIW featuring poor code density.

of the worst case execution time. For real-time critical DSP algorithms, it is essential to minimize the worst case execution time, thereby ensuring that the execution of a section of program meets a predefined schedule.

The requirement of supporting more computational power leads to the support of executing several instructions in parallel, and therefore, to an increased size of the VLIW. However, data and control dependencies in the algorithms result in poor utilization of the available parallelism (about two to three instructions per clock cycle [7]). The main disadvantage of VLIW is apparent during static mapping of a fetch bundle to an execution bundle. This is poor code density. Code density indicates how efficiently an algorithm can be mapped to certain core architectures and the related instruction set. It is the most significant factor influencing the consumed silicon area in a highly integrated system-on-chip (SoC).

However, an average instruction level parallelism (ILP) of 2 to 3 does not necessarily mean that executing several instructions in parallel is a poor solution for increasing the computational power of an embedded DSP core. Inner loops of traditional DSP algorithms such as filtering and fast Fourier transform (FFT) can make use of the provided parallelism. Code density of application code is influenced more by control parts of the algorithms executed.

Fig. 1 illustrates the described problem in a processor architecture providing the execution of up to five instructions in parallel. Assuming a traditional VLIW programming model and the execution of control code, the drawback of poor code density becomes obvious.

B. VLES

One way to overcome the problem of poor code density is decoupling of fetch and execution bundles. Variable Length Execution Set (VLES) allows resizing of the execution bundle depending on the algorithm executed. It is possible to map several execution bundles to one fetch bundle, which intensifies the usage of program memory. However, the size of the program memory port is equal to the maximum possible size of the execution bundle, which can lead to significant routing between DSP core and memory ports, see, e.g., [9] and [10].

C. CLIW

Configurable Long Instruction Word (CLIW) allows a reduction of the regular program memory port size. The disadvantage of reducing the size of the program memory port is the decreased peak performance of the core architecture. Available commercial architectures overcome this drawback by supporting an extended program memory port [11] used during execution of inner loops. Even if not frequently used, the size of the

Fig. 2. Constant fetch bundle; scalable execution bundle.

program memory port is still equal to the supported peak performance. The extended program memory port is a nonorthogonal feature and therefore efficient C-compilation is not feasible.

D. xLIW

xLIW is the scalable long instruction word of xDSPcore, an application-specific configurable DSP core architecture efficiently programmable in C [12]. Efficiently in this sense means featuring less than 10% overhead compared with manual coding, which allows omitting the manual assembly optimization, which is very slow and error-prone. This allows architecture-independent description of the algorithms.

xLIW is based on the traditional VLIW programming model (static scheduling) and, as with VLES, the fetch bundle and the execution bundle are decoupled. Additionally, the size of the program memory port is reduced: for the example in Fig. 2, the worst case execution bundle consists of up to ten instruction words, while the associated fetch bundles consist of only four instruction words. Considering an average ILP of two to three operations [7], the reduced size of the fetch bundles provides enough instructions for efficient use of the core resources. The memory bandwidth problem of the reduced fetch bundle compared with the worst case execution bundle while supporting peak performance is solved by introducing the align unit, which contains an internal instruction buffer. The align unit takes care of loading the fetch bundle from program memory and setting up the execution bundle, whose execution order is not allowed to be changed due to static scheduling.

In Fig. 3, the main functionality of the align unit is illustrated. The unit connects to the program memory to fetch the fetch bundle. It stores the fetched instructions to compensate the memory bandwidth problem during execution of inner loops and sets up the next execution bundle. The align unit is responsible for insertion of stall cycles if the consecutive execution bundle is not available. This can take place at nonaligned branch targets or if the execution set at the branch target exceeds the size of the fetch bundle. The central part of the align unit is an instruction buffer which is discussed in detail in the next section.

III. INSTRUCTION BUFFER

The instruction buffer compensates for the memory bandwidth mismatch between the fetch bundle and the worst case execution bundle. It enables a reduced program memory port without limiting the use of available core resources and reduces switching activity at the program memory port during loop execution. The first subsection describes the chosen architecture



Fig. 3. Align unit overview.

Fig. 4. Storage cell.

for the instruction buffer by considering the requirements discussed above. The second part describes some implementation details.

A. Architecture

The instruction buffer is built up of instruction storage cells. As illustrated in Fig. 4, one storage cell consists of n instruction words. The number of entries (n) is equal to the number of instruction words building up one fetch bundle. For the example in Fig. 4, each storage cell contains four entries. As well as the instruction words, the physical address is stored and later used for comparing fetch pointer content with addresses stored in the instruction buffer. Each storage cell includes two control bits, namely the valid (V) and the executed (E) bits. The valid bit is used for indicating valid entries in the storage cell. After reset, all V bits are flushed. The same feature can be used during run time, when the buffer entries are flushed without rewriting the related cells, thus saving power. The executed bit indicates the status of the entries in the storage cell. If all entries of a storage cell have already been used for building up an execution bundle, the related E bit is set. Storage cells with a set E bit can be overwritten with new fetch bundles fetched from program memory.

The main function of the instruction buffer is to compensate the possible memory bandwidth problem between the reduced fetch bundle and the worst case execution bundle. Furthermore, the instruction buffer can also be used during the execution of loop bodies. The loop is fetched once from program memory and then executed from the buffer. Therefore, the fetching can

Fig. 5. Hardware loop.

be suspended during loop execution, which reduces switching activity at the program memory port. During execution of a loop body, the E bits of already-used storage cells are not set, to prevent overwriting of the storage cells containing the loop body. This feature prevents cache thrashing [13]. When executing the last loop iteration, the E bits are handled as during handling of linear code, allowing the pre-fetching of instructions located after the loop body. During normal program flow, the fetch counter is slightly decoupled from the program counter. Therefore, it is possible to fetch data into the instruction buffer several cycles before the entries will be used to build up the next execution bundle, thus preventing stall cycles due to missing instructions. During loop execution, this feature is disabled to prevent fetching instructions that cannot be stored in the instruction buffer.

Many DSP architectures feature hardware support for loops. The control of the loop, such as decrementing of the loop counter, is implicitly done in hardware. Traditional DSP algorithms like filtering frequently contain loop constructs. Therefore, the hardware loop can be used to reduce execution cycles and, therefore, power dissipation, as well as the number of instructions, subsequently increasing code density. An example is illustrated in Fig. 5 [5]. The hardware loop instructions can be identified during run time and the loop body is executed from the entries of the instruction buffer.

Loop constructs where the loop execution count depends on a condition (e.g., while loops) have to be implemented as conditional branch instructions. The core hardware cannot identify these software loop constructs during run time. To make use of the features of the instruction buffer during execution of while loops, the handling of the buffer content can be manually controlled by locking and unlocking of storage cells.

The number of storage cells building up the instruction buffer is scalable according to the application requirements, which allows for making tradeoffs between area consumption caused by the storage cells and the advantage of executing loop constructs. If the number of buffer entries is too low, the loop body cannot be fully stored inside the instruction buffer (as the application examples in Fig. 11 illustrate).
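This tradeoff can be expressed as a first-order count of program memory fetches (an invented helper under simplifying assumptions; real counts also depend on the prefetching around the loop):

```python
def loop_fetch_bundles(loop_size, iterations, buffer_size):
    """First-order fetch count for a hardware loop (sizes in storage
    cells): a body that fits in the buffer is fetched once and then
    replayed from the buffer; a larger body must be refetched from
    program memory on every iteration."""
    if loop_size <= buffer_size:
        return loop_size
    return loop_size * iterations

print(loop_fetch_bundles(6, 100, 16))    # body fits: fetched only once
print(loop_fetch_bundles(24, 100, 16))   # body too large: refetched per pass
```

The step from "fits" to "does not fit" changes the fetch count by a factor of the iteration count, which is why the buffer size is worth tuning per application.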

B. Dynamic Power Dissipation

To determine and reduce power dissipation at the program memory port, the switching power dissipation model P = 1/2 · C · V² · D is used to identify the advantages of the reduced switching activity, where V is the supply voltage, C the output capacitance and D the transition density, all corresponding to one gate [14]. Reducing power dissipation by voltage scaling is efficient due to the quadratic dependency [15], but cannot be considered within an embedded DSP core separate from the complete system. Reducing output capacitance by considering and preserving locality between cells is an implementation issue and is discussed in a later section. Therefore, minimizing the



transition density factor D can be used to reduce dynamic power dissipation on core level.

xDSPcore development considers this aspect on all layers: high code density and, therefore, fewer program memory accesses on the architectural level, and reduced switching activity by instruction reordering during compile time [16] and during execution time by making use of the instruction buffer during loop handling. Benchmark results illustrate the efficient use of the instruction buffer for reducing transition activity during loop execution (illustrated in Section IV).
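The quadratic voltage dependency and the linear dependency on the transition density can be checked numerically (a sketch of the standard switching-power model; the capacitance and density values below are arbitrary placeholders):

```python
def switching_power(c_out, v_dd, density):
    """Per-gate switching power, P = 0.5 * C * Vdd**2 * D, where D is
    the transition density (output transitions per second)."""
    return 0.5 * c_out * v_dd ** 2 * density

# Scaling the supply from a nominal 1.5 V down to 1.2 V at unchanged
# activity keeps (1.2 / 1.5)**2 = 64% of the switching power, while
# halving the transition density D halves the power linearly.
p_nom = switching_power(1e-15, 1.5, 1e8)
p_low = switching_power(1e-15, 1.2, 1e8)
print(round(p_low / p_nom, 2))   # 0.64
```

Since supply voltage and capacitance are largely fixed at core level, the transition density D is the factor the instruction buffer attacks: suspended fetches mean fewer transitions at the program memory port.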

C. Buffer Control

Control of the entries of the instruction buffer is done with n-way set-associative cache logic [13]. The advantage compared with fully associative cache logic is the possibility of a parallel search in the cache tag directory and the cache memory. The reason to implement a more sophisticated cache logic has been the ability to make use of the instruction buffer during breaks in the program flow, even if branches occur in a loop construct. Control by this cache logic does not implicitly mean providing a conventional cache: the behavior is still fully deterministic. If a requested instruction word is already stored in the instruction buffer, it is possible to suspend the memory access and, therefore, to reduce power dissipation. The execution time remains unchanged. The same is true if the entry has to be fetched from program memory: there is no cache miss penalty.
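The deterministic behavior described above can be sketched in a few lines. This is a loose functional model, not the actual cache logic; the eviction policy shown is an assumption:

```python
# Loose functional model of the deterministic instruction buffer described
# above: a hit suppresses the program-memory access, but the cycle count is
# identical in the hit and miss cases, so there is no cache-miss penalty.
# The eviction policy (drop the oldest line) is an assumption of this sketch.

class InstructionBuffer:
    def __init__(self, lines):
        self.lines = lines          # number of cache lines (scalable)
        self.store = {}             # address -> fetch bundle
        self.memory_accesses = 0    # activity at the program memory port

    def fetch(self, memory, address):
        if address in self.store:               # hit: memory access suspended
            return self.store[address], 1       # fixed 1-cycle fetch time
        self.memory_accesses += 1               # miss: go to program memory
        if len(self.store) >= self.lines:
            self.store.pop(next(iter(self.store)))  # evict the oldest line
        self.store[address] = memory[address]
        return self.store[address], 1           # same 1-cycle fetch time

# A 4-word loop body executed 100 times touches program memory only 4 times.
memory = {a: f"insn_{a}" for a in range(4)}
buf = InstructionBuffer(lines=16)
for _ in range(100):
    for a in range(4):
        word, cycles = buf.fetch(memory, a)
print(buf.memory_accesses)  # -> 4
```

Because hit and miss both report the same fetch time, worst-case execution time analysis is unaffected by the buffer state, which is the deterministic property the text emphasizes.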

D. Implementation

The instruction buffer introduced in this paper is part of a parameterized embedded DSP core architecture named xDSPcore. xDSPcore enables parameterizing architectural key features, such as bit width and register file size, to meet application-specific requirements. For this purpose, a flexible implementation which is easily portable to different product variants and different silicon technologies is required. However, the performance requirements of different applications do not allow applying a synthesis-based semi-custom design flow.

The alternative, a manual full-custom design, would meet the performance requirements, but the effort for implementation and porting to different technologies is not tolerable. This becomes even more evident due to the fact that scalability, a main feature of xDSPcore, is not supported by a traditional manual full-custom design flow.

To satisfy the requirements of scalability and portability of the embedded DSP core architecture on the one hand and to meet the performance requirements on the other, critical structures of the align unit are implemented in a physical-oriented way by using a dedicated data path generator (DPG) [17].

Starting from a high-level description of the signal flow graph (SFG), the DPG assembles highly optimized macro layouts from abutment cells. These abutment cells are automatically derived from a small library of optimized leaf cells. This exploits the inherent regularity and locality typical for SFGs in digital signal processing for the optimization of silicon area, throughput rate, and especially power dissipation, and offers the possibility of iteratively optimizing the SFG by simply modifying the SFG description. Porting the align unit to a new

Fig. 6. CDM and CAD partitioning.

silicon technology only requires porting of the small leaf cell library, while the architectural description remains unchanged.

The above-described methodology is especially well suited for regular structures and makes use of locality, allowing short routing distances and low driving capacity and, therefore, leading to lower power dissipation [18]. VHDL-RTL is used for the implementation of the remaining parts of the align unit, which are mainly control logic taking care of storage cell management.

As illustrated in Fig. 6, the n-way associative buffer can be split into a cache data memory (CDM) and a cache address directory (CAD). Both are well suited to be implemented using the DPG; the executed bit management (EBM), including handling of the valid bit, is part of the control logic and, therefore, implemented in VHDL-RTL. The same is true for the logic that is responsible for building up the execution bundle. The CDM is used to store the instructions inside the instruction buffer, which is organized in cache lines. The number of cache lines (equal to the number of fetch bundles which can be stored inside the instruction buffer) and the number of bits of each cache line is scaleable (which allows changing the number of bits used for each instruction word). This enables the implementation of instruction buffer variants with different numbers of entries and with instruction words with different numbers of bits. Each storage cell as in Fig. 4 (consisting of four entries) is equal to one cache line inside the instruction buffer. To prevent an unsymmetrical height-to-width ratio (especially for an instruction buffer with a small number of entries), two successive bits are placed one upon the other, as illustrated in Fig. 7. The data is fed vertically, and the control (like load, MatchX, or debug control signals) of the cells is done horizontally. With the implementation example in Fig. 7, the granularity of the size of the instruction words is limited to two bits. If the number of bits per instruction word is odd, one cell remains unused. The debug port allows reading the content of storage cells without influencing the content of the instruction buffer or the buffer behavior.

The CAD is used to store the addresses of the instructions in the instruction buffer. Each of the address bits is compared with five entries (four data ports and one debug port, as illustrated in Fig. 8). If the compared entry matches one of them, then the

Page 179: Scalable DSP Core Architecture Addressing Compiler ...edu.cs.tut.fi/panis483.pdf · Christian Panis Scalable DSP Core Architecture Addressing Compiler Requirements ... my time in

1098 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 39, NO. 7, JULY 2004

Fig. 7. Bit alignment.

Fig. 8. Read ports/MatchX lines.

related MatchX signal is set. The MatchX lines are cascaded to reduce the timing-critical path. In this implementation example, four MatchX lines are grouped.
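The CAD lookup described above can be sketched as a parallel tag comparison. Function and signal names in this sketch are illustrative, not taken from the implementation:

```python
# Sketch of the cache address directory (CAD) lookup: every stored tag is
# compared in parallel against the five requested addresses (four data ports
# plus the debug port), raising the corresponding MatchX bit on a hit.

def cad_lookup(tags, requests):
    """tags: one tag per cache line; requests: the five port addresses.
    Returns one list of MatchX bits (one per port) for each cache line."""
    return [[int(tag == req) for req in requests] for tag in tags]

tags = [0x100, 0x104, 0x108, 0x10C]             # addresses held in the buffer
requests = [0x104, 0x108, 0x200, 0x204, 0x100]  # 4 data ports + 1 debug port
matchx = cad_lookup(tags, requests)
print(matchx[1])  # -> [1, 0, 0, 0, 0]: line 1 (0x104) hits the first port
```

In hardware all comparisons happen concurrently; cascading and grouping the MatchX lines, as the text notes, is what keeps this parallel structure off the timing-critical path.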

IV. RESULTS

This section is in two parts. The first part considers the architectural advantages of using the above-described instruction buffer. Implementation examples of algorithms are used to illustrate the reduced switching activity at the program memory port. The second part illustrates the results of the chosen implementation.

A. Application Examples

The instruction buffer is used for reducing switching density and, therefore, dynamic power dissipation. For the results in Fig. 9, different finite-impulse response (FIR) filter implementations have been used. The cycle count is normalized to 100%, while the x axis shows different filter configurations. The grey-shaded part indicates the number of cycles that make use of the instruction buffer. For these filter kernels, no access to program memory is required for most of the execution time.

TABLE I: RESULTS OF BUFFER VARIANTS

Vector operations are also based on inner loop constructs, as illustrated in Fig. 10. Increasing the number of operated elements leads to an increasing number of instruction cycles which make use of the instruction buffer.

Fig. 11 contains some application examples: the cryptography algorithm Blowfish, some examples of voice coders (ADPCM, G.711, G.723), and an implementation of Huffman coding. For ADPCM, Blowfish, and G.723, most of the algorithm can make use of the instruction buffer. The Huffman coding example illustrates the limitation of the instruction buffer implementation: the loop body produces so many execution cycles that it exceeds the size of the instruction buffer.

Summarizing the results: for loop-centric algorithms, which are typical for algorithms implemented on DSP cores, the instruction buffer significantly reduces switching activity at the program memory port.

B. Buffer Implementation

The instruction buffer has been implemented in a 0.13-µm CMOS technology with a supply voltage of 1.2 V. In Table I, implementation results for certain configurations are illustrated. Due to the chosen architecture, the timing for the MatchX line is independent of the number of cache lines. The data present time (access time) increases with the number of cache lines due to the increasing capacitance of the data output lines. For small filter kernels or vector operations (as illustrated in the previous subsection), a cache configuration supporting 16 cache lines is well suited to compensate for the memory bandwidth problem and to store loop bodies with reasonable area overhead.

In Fig. 12, the layout of the full-custom parts of a cache configuration is displayed (16 cache lines, 80 data bits, and 14 address bits). The upper part contains the CAD followed by the drivers of the MatchX signals. The lower part of Fig. 12 contains the CDM.

V. SUMMARY

This paper introduces a scaleable long instruction word (xLIW), which is part of a DSP core architecture that enables efficient programming in a high-level language like C. xLIW is based on traditional VLIW with static scheduling, which enables minimizing the worst-case execution time. xLIW enables efficient use of the program memory and, therefore, features high code density without limiting the usage of the available core resources. The key feature of xLIW is the align unit, whose main architectural unit is an instruction buffer, which is introduced in this paper in detail. Making use of the instruction buffer for loop handling reduces power dissipation


Fig. 9. FIR filter operations.

Fig. 10. Vector operations.

Fig. 11. Application examples.

both in hardware and software loops. The chosen architecture enables scaling of the main architectural features of the instruction buffer to satisfy application-specific requirements and to prevent silicon area overhead. Within practical buffer

sizes, the total access time in contemporary technologies is within 1–2 ns, which does not form a bottleneck for the DSP subsystem. xLIW is part of a project for a parameterized DSP core architecture.


Fig. 12. Layout (full-custom part).

REFERENCES

[1] “DSP 16000, Digital Signal Processor Core,” Agere Systems, Reference Manual, 06.2002.
[2] “Teak DSP Core,” ParthusCeva, Inc., Data Sheet, 2002.
[3] “ADSP 21535, Blackfin DSP Hardware Reference,” Analog Devices, Inc., Preliminary Edition, 11.2001.
[4] “ZSP 400, Digital Signal Processor Architecture,” LSI Logic Corp., DB14-00012103, 4th ed., Dec. 2001.
[5] P. Lapsley, J. Bier, A. Shoham, and E. A. Lee, DSP Processor Fundamentals, Architectures and Features. New York: IEEE Press, 1997.
[6] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. San Mateo, CA: Morgan Kaufmann, 1996.
[7] D. Sima, T. Fountain, and P. Kacsuk, Advanced Computer Architectures: A Design Space Approach. Harlow, MA: Addison Wesley, 1997.
[8] “TMS320C6000 CPU and Instruction Set Reference Guide,” Texas Instruments, Inc., 10.2002.
[9] “TMS320C6201 Technical Overview,” Texas Instruments, Inc., SPRS051G, 01.1997 (revised 11.2000).
[10] “TMS320C64x Technical Overview,” Texas Instruments, Inc., SPRU395, 02.2000.
[11] “Carmel DSP Core Architecture Specification,” Infineon Technologies, 2001.
[12] C. Panis, R. Leitner, H. Grünbacher, and J. Nurmi, “xLIW—a scaleable long instruction word,” in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS), Bangkok, Thailand, 2003, pp. V-69–V-72.
[13] J. Handy, The Cache Memory Book. New York: Academic, 1998.
[14] M. Ketkar, S. S. Sapatnekar, and P. Patra, “Convexity-based optimization for power-delay tradeoff using transistor sizing,” in Proc. ACM/IEEE Int. Workshop on Timing Issues in the Specification and Synthesis of Digital Systems (TAU’00), Austin, TX, 2000, pp. 52–57.
[15] T. Pering, T. Burd, and R. Brodersen, “The simulation and evaluation of dynamic voltage scaling algorithms,” in Proc. Int. Symp. Low Power Electronics and Design (ISLPED), Monterey, CA, 1998, pp. 76–81.
[16] U. Hirnschrott and A. Krall, “VLIW operation refinement for reducing energy consumption,” in Proc. IEEE Int. Symp. System-On-Chip (SOC), Tampere, Finland, 2003, pp. 131–134.
[17] O. Weiss, M. Gansen, and T. G. Noll, “A flexible datapath generator for physical oriented design,” in Proc. ESSCIRC, Villach, Austria, Sept. 2001, pp. 408–411.
[18] V. S. Gierenz, R. Schwann, and T. G. Noll, “A low power digital beamformer for handheld ultrasound systems,” in Proc. ESSCIRC, Villach, Austria, 2001, pp. 276–279.

Christian Panis (M’98) received the M.Sc. degree from Vienna University of Technology, Vienna, Austria, in 1995. Since 2002, he has been working toward the Ph.D. degree at Tampere University of Technology, Tampere, Finland. His research topic is in the area of digital signal processor architectures.

From 1996 to 2002, he was with Infineon Technologies Austria as a Development Engineer and Project Manager for wireline communication products.

Herbert Grünbacher (M’73) was born in Kitzbühel, Austria, in 1945. He received the Dipl.-Ing. degree (cum laude) from Graz University of Technology, Austria, in 1971. In 1979, he received the Ph.D. degree (cum laude) from the same university on the subject of computer-aided circuit analysis.

In 1982, he joined Austria Microsystems International, Graz, where he worked on analog and digital full-custom circuits and was responsible for the design groups in Unterpremstätten, Austria, and Swindon, U.K. From 1986 to 1987, he was with Siemens Components, Munich, Germany, as Deputy Director for MOS CAD. In 1987, he became a Full Professor of computer engineering at the Vienna University of Technology, Vienna, Austria, heading the VLSI Design Group of the Institute for Computer Engineering. From 1998 to 2003, he was on leave from Vienna University of Technology to head Carinthia Tech Institute, Villach, Austria. Since 2003, he has been back with Vienna University of Technology. His current research interest is systems on chip with a focus on automotive applications.

Dr. Grünbacher organized the European Solid-State Circuits Conference (ESSCIRC) in 1989 in Vienna, Austria, and in 2001 in Villach, Austria. He is a member of the ESSCIRC/ESSDERC steering committee and was its Executive Secretary from 1992 to 2002. He has also been a member of the steering committee of the FPL conference series since 1992 and an evaluator and reviewer in several European research programs.

Jari Nurmi (S’88–M’91–SM’01) received the M.Sc., Licentiate of Technology, and Doctor of Technology degrees from Tampere University of Technology (TUT), Finland, in 1988, 1990, and 1994, respectively.

From 1987 to 1994, he was with TUT as a Research Assistant, Teaching Assistant, Research Scientist, Project Manager, Senior Research Scientist, and Acting Associate Professor (1991–1993). From 1995 to 1998, he worked for VLSI Solution Oy, Tampere, as the company Vice President responsible for DSP processor development activities. Since January 1999, he has been a Professor at the Institute of Digital and Computer Systems of TUT. He is the head of the national TELESOC graduate school. He is the author or coauthor of over 80 international papers, coeditor of Interconnect-Centric Design for Advanced SoC and NoC (Kluwer), and has supervised more than 60 M.Sc., Licentiate, and Doctoral theses. His current research interests include system-on-chip integration, on-chip communication, embedded and application-specific processor architectures, and circuit implementations of digital communication and DSP systems.

Dr. Nurmi has been the Chairman of the annual International Symposium on System-on-Chip and its predecessor, the SoC Seminar in Tampere, since 1999, and is a board member of the NORCHIP conference series. He is a senior member of the IEEE Signal Processing Society, Circuits and Systems Society, Computer Society, Solid-State Circuits Society, and Communications Society, and a member of EIS (the Finnish Society of Electronics Engineers).


PUBLICATION 6

C. Panis, U. Hirnschrott, A. Krall, G. Laure, W. Lazian, J. Nurmi, “FSEL – Selective Predicated Execution for a Configurable DSP Core”, in Proceedings of the IEEE Annual Symposium on VLSI (ISVLSI-04), Lafayette, Louisiana, USA, February 19-20, 2004, pp. 317-320.

©2004 IEEE. Reprinted, with permission, from proceedings of IEEE Annual Symposium on VLSI.


FSEL - Selective Predicated Execution for a Configurable DSP Core

C. Panis, Carinthian Tech Institute, [email protected]
U. Hirnschrott, A. Krall, Vienna University of Technology, [email protected], [email protected]
G. Laure, W. Lazian, Infineon Technologies Austria, [email protected], [email protected]
J. Nurmi, Tampere University of Technology, [email protected]

Abstract

Increasing system complexity of SOC applications leads to an increased need for powerful embedded DSP processors. To fulfill the required computational bandwidth, state-of-the-art DSP processors allow executing several instructions in parallel, and for reaching higher clock frequencies they increase the number of pipeline stages. However, deeply pipelined processors have drawbacks in the execution of branch instructions: branch delays. On average, not more than two branch delay slots can be used; additional ones remain unused and decrease the overall system performance. Instead of compensating for the drawback of branch delays (e.g., with branch prediction circuits), it is possible to reduce the number of branch delays by reducing the number of branch instructions. Predicated execution (also called guarded execution or conditional execution) can be used for implementing if-then-else constructs without using branch instructions. The drawback of traditional predicated execution is decreased code density. This paper introduces selective predicated execution based on FSEL, which allows reducing the number of branch instructions without decreasing code density. Selective predicated execution based on FSEL is part of a project for a configurable DSP core.

1. Introduction

Increasing system complexity of SOC applications leads to an increasing demand for computational power of embedded processors. Deeply pipelined processors are used for reaching higher clock frequencies. But deeply pipelined processors have obstacles when executing branch instructions: branch delays [1].

Branch delays are caused by taken branch instructions, which cause a break in the linear program flow. Branch delays can lead to a significant performance lack in the processor subsystem. To overcome this drawback, branch prediction circuits have been introduced [2][3][4][5]. Especially in the area of DSP algorithms, deterministic behavior is required, which contradicts prediction approaches. During system definition, the worst-case execution time has to be considered and the prediction assumed not to be taken; therefore, prediction has no added value for system performance. Another approach for reducing the number of branch delays is reducing the number of branch instructions. Predicated execution can be efficiently used to remove conditional branch instructions caused by if-then-else constructs. Predicated or conditional execution was already introduced in the 80’s. The main drawback of predicated execution is increased program code space. This paper introduces selective predicated execution based on FSEL, enabling a reduced number of branch instructions without the drawback of increased code space. Only code sections which can make use of the advantage of selective predicated execution need additional instruction space. The chosen orthogonal implementation of FSEL can be efficiently used by a C compiler.

The first part of the paper illustrates the motivation for using predicated execution. The second part introduces two implementation examples of predicated execution (Texas Instruments C62xx, Starcore SC140). The third section introduces selective predicated execution based on FSEL. The results section contains some benchmark results comparing algorithm implementations using selective predicated execution.

2. Motivation

This section describes the branch delay problem caused by branch instructions in deeply pipelined processors. The number of branch delays depends on the number of pipeline stages located between the instruction fetch stage and the branch condition evaluation. Two possible solution approaches are


briefly explained in this section: branch prediction and predicated execution.

Today’s VLIW (Very Long Instruction Word) DSP processors provide additional computational bandwidth by supporting the execution of several instructions in parallel and by increasing the possible clock frequency due to deep pipeline structures. Whether an application can make use of the provided parallelism is mainly influenced by data dependences between instructions and by the branch instruction frequency. Fewer branch instructions lead to longer basic code blocks and, therefore, to a higher possibility to schedule instructions in parallel (increased instruction-level parallelism). In [6][7][8], the branch frequency of benchmark examples is analyzed. The ratio is different for scientific code and general-purpose (GP) programs. On average, general-purpose programs have a branch ratio between 20-30%; for scientific code it is still 5-10%. Even for scientific programs (which will be more significant for programs running on an embedded DSP core), every 10th to 20th instruction is a branch instruction. The ratio between conditional and unconditional branches is about 75% conditional branch instructions. Assuming a processor providing several parallel units, the distance between branch instructions gets quite low. Therefore, branch delay slots will consume a significant number of cycles.

One way to reduce the penalty of branch delays is the use of branch prediction. Grohoski [9] divides conditional branch instructions into loop-closing branch instructions (e.g., caused by while loops) and other conditional branches. Loop-closing conditional branches will be taken n-1 times out of n. Assuming that the remaining conditional branches will be taken with 50% probability, this leads to a ratio of 5/6 to 1/6 between taken and not-taken branch instructions. Other literature sources [6][10] estimate a ratio of 3/4 to 1/4, which still justifies the emphasis on the effective implementation of taken branch instructions.
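The 5/6-to-1/6 figure can be reproduced with a short calculation. The two-thirds share of loop-closing branches used below is an assumption chosen so that the arithmetic matches the quoted ratio; the text does not state the split between the two branch classes explicitly:

```python
# Reproducing the taken/not-taken ratio quoted in the text. The 2/3 share of
# loop-closing branches is an assumption chosen to match the 5/6 figure.

from fractions import Fraction

loop_closing = Fraction(2, 3)   # taken essentially always (n-1 of n times)
other = 1 - loop_closing        # taken with 50% probability

taken = loop_closing * 1 + other * Fraction(1, 2)
print(taken, 1 - taken)  # -> 5/6 1/6
```

Under these assumptions, five out of six conditional branches are taken, which is why efficient handling of the taken case dominates the design tradeoff.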

There are several branch prediction implementations available, getting trickier as the number of pipeline stages increases [2][3][4][5]. However, this is not the focus of this paper.

Another possibility to overcome the problem of unusable branch delays is predicated or guarded execution. It can be used to eliminate conditional branch instructions, e.g., those generated by if-then-else constructs, which are common in control code. A predicated instruction consists of a condition part and an operation part:

(condition) operation
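As an illustrative sketch of this transformation (the instruction model here is ours, not the paper's notation), both arms of an if-then-else are issued and the predicate selects which result is committed:

```python
# Sketch of removing a conditional branch by predication: both arms of the
# if-then-else are issued and the predicate selects the committed result.
# The "instructions" are plain Python expressions; this only models the idea.

def branch_version(cond, a, b):
    # Conventional control flow: a taken branch costs branch delay slots.
    if cond:
        return a + 1
    return b - 1

def predicated_version(cond, a, b):
    then_result = a + 1               # (cond)  issued unconditionally
    else_result = b - 1               # (!cond) issued unconditionally
    return then_result if cond else else_result  # predicate selects result

for cond in (True, False):
    assert branch_version(cond, 10, 20) == predicated_version(cond, 10, 20)
print("both versions agree")
```

The predicated form trades a branch (and its delay slots) for executing both arms, which is exactly the code-space cost discussed in the following paragraphs.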

Conditional execution was introduced as early as the HP Precision Architecture (1985). To quantify the advantage of this architectural feature, Pnevmatikatos and Sohi [11] analyzed benchmark programs (including Espresso, Gcc, and Yacc). About 20% of the instructions were conditional branch instructions and 5% unconditional branch instructions.

In their study, they distinguish between full guarding, which assumes that all instructions can be executed conditionally, and restricted guarding, which enables only arithmetic instructions to be executed under certain conditions. Detailed results can be found in [11]. For these benchmark examples, about one-third of the conditional and unconditional branches can be replaced using full guarding. For restricted guarding the numbers are lower: about 15% of the conditional and 2% of the unconditional branch instructions can be replaced.

The drawback of guarded execution is the growth of the basic block size. In the benchmark examples discussed above, the size of the basic blocks increases from 4.8 to 7.3 instructions for full guarding. Using restricted guarding, the enlargement is considerably smaller.

Today, most VLIW architectures support a mechanism of guarded execution. This is mainly because VLIW architectures support high ILP (instruction-level parallelism), which requires effective branch handling to prevent severe performance limitations.

3. Predicated Execution

This section illustrates implementation examples of predicated execution. The Texas Instruments C62x family and the Starcore SC140 have been chosen. Both VLIW DSP architectures provide the possibility to execute several instructions in parallel, and therefore predicated execution is mandatory to prevent a performance lack in code sequences with a high branch frequency.

3.1. TI C62x

The C62x architecture of Texas Instruments allows each instruction to be executed conditionally (full guarding) [12]. To obtain full guarding, 3 bits of each instruction word are used to encode the register whose status is needed to generate the condition. The possible registers are B0, B1, B2, A1, and A2. Under certain conditions, A0 can also be used. The remaining coding space (with 3 bits it is possible to encode 8 states) is used to encode unconditional execution, and one code combination remains reserved.

Figure 1: TI C62x instruction example (addk)


The instruction example in Figure 1 shows the leading three bits, labeled creg, used to encode one of the registers. The z bit following creg is used to decode whether the test is for equal to zero (z=1) or not equal to zero (z=0). Each instruction consumes 4 bits to encode the condition for predicated execution, which has an influence on the code density. The implementation of encoding static registers is useful for the scheduler of the C compiler, which has a certain freedom in reordering instructions; this is especially necessary for a VLIW architecture supporting the execution of several instructions in parallel. The limitation to a few registers of the register file supporting predicated execution leads to a restricted use of these registers.
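A hedged sketch of extracting this condition field follows. The creg value table is reconstructed from the description above for illustration only and should be checked against the TI instruction-set reference before relying on it:

```python
# Hedged sketch of decoding the C62x condition field described above: the
# leading three bits (creg) select the condition register, and the following
# z bit selects test-for-zero vs. test-for-nonzero. The creg value table is
# reconstructed for illustration; check it against the TI reference guide.

CREG_TABLE = {
    0b001: "B0",
    0b010: "B1",
    0b011: "B2",
    0b100: "A1",
    0b101: "A2",
}

def decode_condition(insn_word):
    """Return (register, test) from the top 4 bits of a 32-bit word."""
    creg = (insn_word >> 29) & 0b111
    z = (insn_word >> 28) & 0b1
    if creg not in CREG_TABLE:
        return ("always", None)      # unconditional (or reserved) encodings
    return (CREG_TABLE[creg], "eq0" if z == 1 else "ne0")

# creg=010 (B1), z=0: execute if B1 != 0; low bits are an arbitrary opcode.
word = (0b010 << 29) | (0b0 << 28) | 0x0ABCDE
print(decode_condition(word))  # -> ('B1', 'ne0')
```

The point of the sketch is the cost structure: every instruction word spends these four bits, whether or not the instruction is actually predicated, which is the code-density penalty the text criticizes.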

3.2. Starcore SC140

The architecture of the SC140 supports full guarding [13]. Instead of spending the coding space on each instruction, the prefix (already used to build up the execution bundle) is used to encode the condition. An execution bundle consists of those instructions executed during the same clock cycle. Two subsets per execution bundle are possible (even and odd). In the assembly syntax, three instructions are available. IFF is used for instructions of the current set which will be executed if the flag T is equal to zero; if T is one, the instructions are handled as NOPs. The IFT instruction is used for the inverse function: if T is equal to one, the instructions will be executed; if T is equal to zero, the instructions will be treated as NOP instructions. IFA is used for instructions of the same execution bundle which are executed unconditionally.

The predicated execution implementation of the Starcore SC140 consumes less coding space, but the limitation to the status of T significantly restricts efficient instruction scheduling.
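The IFT/IFF/IFA semantics described above can be modeled compactly. Representing a bundle as prefix/operation pairs is an assumption of this sketch, not the SC140 encoding:

```python
# Compact model of the SC140-style bundle predication described above: IFT
# instructions execute when flag T is one, IFF when T is zero, IFA always.

def runs(prefix, t_flag):
    if prefix == "IFA":
        return True
    if prefix == "IFT":
        return t_flag
    if prefix == "IFF":
        return not t_flag
    raise ValueError(f"unknown prefix: {prefix}")

bundle = [("IFT", "move d0,d1"), ("IFF", "move d2,d1"), ("IFA", "add d3,d4")]
executed = [op for prefix, op in bundle if runs(prefix, t_flag=True)]
print(executed)  # -> ['move d0,d1', 'add d3,d4']
```

Because every conditional instruction in the bundle depends on the single flag T, the scheduler cannot interleave operations predicated on independent conditions, which is the scheduling restriction noted above.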

4. Selective Predicated Execution

Selective predicated execution based on FSEL is implemented as a separate instruction, which enables executing the instructions of the same execution bundle conditionally (as illustrated in Figure 2). Therefore, the disadvantage of additional coding space (as pointed out in Section 2) is restricted to sections where predicated execution provides added value.

4.1. Architecture

Referring to Section 2, the proposed concept supports partial guarding, which means that not all of the instructions can be executed conditionally. Different from the definition of partial guarding by Pnevmatikatos and Sohi [11], all instructions with the exception of the program flow instructions can be conditionally executed. The FSEL instruction is part of the program flow execution slot, and therefore no program flow instruction can be part of the execution bundle. To enable conditional branch instructions, the condition is coded in the instruction word itself.

Figure 2: Influence of FSEL on instruction slots

In Figure 2, the influence of the FSEL instruction is illustrated. The FSEL instruction contains the execution condition for the instructions of the same execution bundle. However, not all of the instructions of the execution bundle have to be executed conditionally; therefore, the FSEL instruction provides coding space to enable unconditional execution of instructions in parallel (don’t-care section).

4.2. Code example

The code example in Figure 3 illustrates the feature of the FSEL instruction. An if-then-else construct is well suited for this purpose: if the condition is true, the first instruction shall be executed; if not, then the second one. On the right side of Figure 3, the related assembly code for the chosen DSP concept can be seen. Assuming a five-stage pipeline, two branch delays will decrease the system performance. In the example of Figure 3, the worst-case scenario has been pointed out: none of the available branch delays caused by branch instructions can be filled with useful instructions (NOP instructions are inserted). Assuming the if-then-else construct in Figure 3 is part of a longer code section, some of the branch delays get filled with instructions executed anyway. In this example, the available resources of the DSP core cannot be used; therefore, the short program sequence has to be executed sequentially.

Figure 3: Code example

Using FSEL the if-then-else construct can be coded within one assembly line and executed within one clock


cycle, as illustrated in Figure 4. The dc (don’t care) section is not used for this example but can be used for instructions executed unconditionally.

Figure 4: Code example using FSEL

Besides increasing code density (no NOP instructions are inserted), the number of execution cycles can be reduced. Both aspects have an influence on the power dissipation of the DSP subsystem. Fetching fewer instructions reduces the switching activity at the program memory port. Fewer cycles for executing a program reduce the required clock frequency.
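Along the lines of the FSEL description above, a loose functional model can be sketched; the slot markers and the bundle layout are assumptions, not the actual encoding:

```python
# Loose functional model of an FSEL-controlled execution bundle: FSEL sits in
# the program-flow slot and tells each remaining slot whether it executes on
# a true condition, on a false condition, or unconditionally (don't care).

def execute_fsel_bundle(condition, slots):
    """slots: (marker, operation) pairs; marker in {'true', 'false', 'dc'}."""
    executed = []
    for marker, op in slots:
        if marker == "dc" or (marker == "true") == condition:
            executed.append(op)
    return executed

# if (c) x = a + b; else x = a - b;  resolved in a single cycle, with one
# unconditional operation placed in the don't-care section.
bundle = [("true", "add x,a,b"), ("false", "sub x,a,b"), ("dc", "ld y,[p]")]
print(execute_fsel_bundle(True, bundle))   # -> ['add x,a,b', 'ld y,[p]']
print(execute_fsel_bundle(False, bundle))  # -> ['sub x,a,b', 'ld y,[p]']
```

Either way the condition resolves, the whole if-then-else commits in one cycle with no branch, no delay slots, and no NOP padding, which is the combined cycle-count and code-density gain claimed above.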

5. Results

In Table 1, some benchmark examples are illustrated. The first column contains the name of the chosen algorithm. The remaining columns contain relative numbers in %. The benchmark results are generated once without using predicated execution and once making use of selective predicated execution.

algorithm    Nr. of bundles (%)    Nr. of branch delay NOPs (%)    Code size (%)
Blowfish     98.4                  80.8                            99.2
Dspstone     98.8                  94.6                            100.1
Efr          91.0                  76.6                            98.3
Huffmann     88.6                  79.9                            99.3
Serpent      95.6                  88.9                            100.9

Table 1: Benchmark results

The results in the table indicate a reduction of execution bundles and of necessary branch delay NOPs when selective predicated execution is used. The influence on code density is negligible. Thus, selective predicated execution based on FSEL allows increasing system performance by reducing the number of branch delays without decreasing code density.

6. Acknowledgement

This work has been supported by the European Commission through the project SOC-Mobinet (IST-2000-30094) and by the CDG Gesellschaft.

7. Conclusion

Predicated execution can be used to reduce the number of branch delays by reducing the number of branch instructions. The number of execution cycles can be decreased (fewer branch delays), which reduces the clock frequency required for executing an algorithm. A reduced number of branch delay NOPs leads to reduced switching activity at the program memory port. Lower clock frequency and less switching activity at the program memory port decrease the power dissipation of the DSP subsystem.

Traditional implementations of predicated execution suffer from poor code density. Selective predicated execution as introduced in this paper provides the advantages of predicated execution, reducing the number of unused branch delays, without decreasing code density. Selective predicated execution based on FSEL is part of a project for a configurable DSP core.

8. References

[1] P. Lapsley, J. Bier, A. Shoham and E. A. Lee, DSP Processor Fundamentals: Architectures and Features, IEEE Press, New York, 1997.

[2] J. E. Smith, "A study of branch prediction strategies", in Proc. 8th ISCA, pp. 135-148, 1981.

[3] D. Albert and D. Avnon, "Architecture of the Pentium Microprocessors", IEEE Micro, June 1993.

[4] J. Heinrich, MIPS1000 Microprocessor Users Manual, Alpha Draft, 11 Oct., MIPS Technologies Inc., Mountain View, CA, 1994.

[5] Motorola Inc., PowerPC 620 RISC Microprocessor Technical Summary, MPC620/D, Motorola Inc., 1994.

[6] J. K. F. Lee and A. J. Smith, "Branch prediction strategies and branch target buffer design", Computer, 17(1), pp. 6-22, 1984.

[7] C. Stephens, B. Cogswell, J. Heinlein, G. Palmer and J. P. Shen, "Instruction level profiling and evaluation of the IBM RS/6000", in Proc. 18th ISCA, pp. 137-146, 1991.

[8] T.-Y. Yeh and Y. N. Patt, "Alternative implementations of two-level adaptive branch prediction", in Proc. 19th ISCA, pp. 124-134, 1992.

[9] G. F. Grohoski, "Machine organization of the IBM RISC System/6000 processor", IBM J. Res. Develop., 34(1), pp. 37-58, Jan. 1990.

[10] R. W. Edenfield, M. G. Gallup, W. B. Ledbetter Jr., R. C. McGarity, E. E. Quintana and R. A. Reininger, "The 68040 processor", IEEE Micro, pp. 66-78, Feb. 1990.

[11] D. N. Pnevmatikos and G. S. Sohi, "Guarded execution and branch prediction in dynamic ILP processors", in Proc. 21st ISCA, pp. 120-129, 1994.

[12] Texas Instruments, CPU and Instruction Set Reference Guide, SPRU189B, Texas Instruments, July 1997.

[13] Motorola Inc. and Lucent Technologies Inc., SC140 DSP Core Reference Manual, MNSC140CORE/D, Rev. 0, Dec. 1999.


PUBLICATION 7

C. Panis, G. Laure, W. Lazian, H. Grünbacher, J. Nurmi, "A Branch File for a Configurable DSP Core", in Proceedings of the International Conference on VLSI (VLSI'03), Las Vegas, Nevada, USA, June 23-26, 2003, pp. 7-12.

©2003 CSREA. Reprinted, with permission, from proceedings of the International Conference on VLSI.


A Branch File for a Configurable DSP Core

C. Panis, Carinthian Tech Institute, [email protected]

G. Laure, W. Lazian, Infineon Technologies, [email protected], [email protected]

H. Gruenbacher, Carinthian Tech Institute, [email protected]

J. Nurmi, Tampere University of Technology, [email protected]

Abstract

Increasing system complexity of System-on-Chip applications leads to an increasing need for powerful embedded processors. Low-cost applications do not allow using more than one embedded core, and therefore the DSP processor also has to handle control code and configuration sections efficiently. One characteristic of control code is an increased number of branch instructions compared with classical DSP functions like FIR filtering. Pipelined processors face an obstacle in branch instruction handling: branch delays, which can lead to a severe performance limitation. One possibility to reduce branch delays is to reduce the number of branch instructions. Conditional branches caused by if-then-else constructs can be removed by using predicated or conditional execution. For efficient use of predicated execution a wide range of conditions has to be available, and status flags are necessary to build up these conditions. This paper describes the architecture of a branch file. The branch file contains the status flags, which can be used for conditional branch instructions and for predicated execution. During architecture definition, efficient use by the C-compiler has been taken into account. The advantage of using a separate branch file is pointed out, and the handling of the status flags during exception handling is described. The branch file is part of a project for a configurable DSP core.

Keywords: Branch file, predicated execution, conditional instruction

1. Introduction

Increasing complexity of SoC (System-on-Chip) applications increases the demand for powerful software-programmable embedded processor cores. Low-cost applications do not allow supporting more than one processor; therefore the DSP core has to handle an increasing portion of control code. To fulfill the requirements of the applications, the reachable clock frequency of DSP cores has been increased by deeper pipelines. The possibility to execute several instructions in parallel additionally increases the computational bandwidth of embedded processors.

Compared with classical DSP functions like filtering, control code shows an increased number of branch instructions. Pipelined processors face an obstacle in executing branch instructions: branch delays [1]. Branch delays are caused by taken branch instructions, where a break in the linear program flow takes place. The instructions located after the branch instruction are fetched but not executed; alternatively, a number of instructions would have to be moved to these delay slots to be executed independently of whether the condition evaluates to true or false. One third of the conditional branch instructions are caused by if-then-else constructs and can therefore be removed by predicated execution. Especially in VLIW architectures supporting the execution of several instructions in parallel, predicated execution can prevent severe performance limitations caused by unusable branch delays. To build up different conditions a wide range of status flags is necessary.

The drawback of predicated execution is the additional code [2] necessary for encoding the condition. For the proposed DSP core this problem has been solved by adding the condition only in code sections which can make use of the advantages of predicated execution.

This paper describes the architecture and implementation of a branch file, a separate register file containing the status flags. The proposed concept is suited for a C-compiler, which can efficiently make use of the provided predicated execution and of the available flags for conditional branch instructions.

The first section introduces the configurable DSP core and points out the implementation of predicated execution. The second part briefly introduces available implementations. The third part covers the architecture of the branch file, introducing the different types of flags which are supported and pointing out the advantage of the chosen implementation during exception handling.


2. Motivation

This section illustrates the need for a branch file. In the first part the proposed configurable DSP core is briefly introduced. Besides conditional branch instructions, one use for the status flags located inside the branch file is predicated execution. Therefore the chosen implementation of predicated execution for this DSP core is also illustrated.

2.1. DSP Core Architecture

This sub-section introduces the configurable DSP core architecture. DSP cores have to handle control code efficiently in terms of code density and cycle count. During architecture definition of the DSP core, the development of an efficient C-compiler has been considered, efficient in the sense that the overhead compared with manual assembly coding is less than 10%. The modified dual Harvard architecture has two independent data memory buses, which in the chosen example are 32 bits wide. The instructions get their source operands from the register file and store the results back into the register file; data moves between the register file and data memory are explicitly coded as separate instructions (load/store architecture, as illustrated in Figure 1). The RISC-like pipeline consists of three phases: fetch, decode, and execute (including write-back). To obtain higher clock frequencies, each of the phases can be further split over several clock cycles.
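The load/store execution model described above can be sketched as a small interpreter. Everything here (register names, addresses, the instruction tuples) is illustrative; the point is only that memory traffic appears as explicit load/store instructions while arithmetic operates register-to-register.

```python
# Minimal sketch of a load/store architecture: operands live in a register
# file, and memory accesses are coded as separate load/store instructions.

regs = {f"d{i}": 0 for i in range(8)}   # data register file (hypothetical)
mem = {0x100: 5, 0x104: 7, 0x108: 0}    # data memory (illustrative addresses)

program = [
    ("load",  "d0", 0x100),        # d0 <- mem[0x100]
    ("load",  "d1", 0x104),        # d1 <- mem[0x104]
    ("add",   "d2", "d0", "d1"),   # d2 <- d0 + d1 (register-to-register)
    ("store", "d2", 0x108),        # mem[0x108] <- d2
]

for insn in program:
    op = insn[0]
    if op == "load":
        _, dst, addr = insn
        regs[dst] = mem[addr]
    elif op == "store":
        _, src, addr = insn
        mem[addr] = regs[src]
    elif op == "add":
        _, dst, a, b = insn
        regs[dst] = regs[a] + regs[b]

assert mem[0x108] == 12
```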

Figure 1: DSP Core overview (register file, memory port, execution unit)

The register file is split into three parts (illustrated in Figure 2). The data register file contains the data, long and accumulator registers. Two consecutive data registers can be addressed as a long register (e.g. l0), including guard bits as an accumulator (e.g. a0). Due to orthogonality requirements of the C-compiler, each of the registers can be used with each of the instructions. The second part contains the address registers, e.g. r0 (including modifier registers, e.g. m0, for modulo buffer and FFT addressing support). The third part is the branch file, which is the topic of this paper.

Figure 2: Register file

In the example of Figure 3 a 5-stage pipeline is considered, splitting the fetch phase over the pipeline stages instruction fetch (IF) and instruction alignment (AL) and the execution phase over EX1 and EX2. The operands are read when they are required: a MAC (multiply and accumulate) instruction, for example, calculates the multiply result in EX1 and the ADD operation in EX2.

Figure 3: Pipeline structure

The operands for the multiplier are fetched from the register file at the beginning of EX1; for the add operation (used to sum up the multiply results) the accumulator is fetched at the beginning of EX2. For instructions using only the first execution stage, the write-back takes place at the end of pipeline stage EX1. This reduces the define-in-use dependency between the instructions of consecutive execution bundles (an execution bundle marks instructions which are executed in parallel) [2]. The second operand fetch enables efficient calculation of filter structures: although the MAC instruction is split over two clock cycles, the accumulator is fetched at the beginning of pipeline stage EX2. Therefore the result of the MAC operation located in the preceding execution bundle can already be used (no clock cycles are lost due to the define-in-use dependency).
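The timing argument for the late accumulator fetch can be checked with a tiny model. The stage numbering below is an assumption made for the sketch (issue cycle t runs EX1, cycle t+1 runs EX2); it is not lifted from the paper's figures.

```python
# Toy timing model for the late accumulator fetch: a MAC issued in cycle t
# computes the product in EX1 (cycle t) and the add in EX2 (cycle t+1),
# reading the accumulator only at the start of EX2.

def acc_write_end(issue_cycle):
    # the MAC writes the accumulator at the end of its EX2 stage
    return issue_cycle + 1

def acc_read_start(issue_cycle):
    # the next MAC reads the accumulator at the start of its own EX2 stage
    return issue_cycle + 1

# Back-to-back MACs on the same accumulator: the consumer's EX2 read (start
# of cycle 2) happens after the producer's EX2 write (end of cycle 1), so
# no stall cycle is needed despite the two-cycle MAC.
producer, consumer = 0, 1
assert acc_read_start(consumer) > acc_write_end(producer)
```

Had the accumulator been read at the start of EX1 instead, the consumer would read in cycle 1, before the producer's write completes, forcing a stall; fetching it one stage later removes that dependency.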


2.2. Predicated Execution

In [3][4][5] the branch frequency in certain benchmark examples (including Espresso, Gcc and Yacc) is analyzed. The studies distinguish between general-purpose and scientific code. Assuming that applications running on an embedded DSP core are comparable with scientific code examples, the ratio of branch instructions is about 5-10%, which means that every 10th to 20th instruction is a branch instruction. VLIW DSP architectures support the execution of several instructions in parallel, e.g. up to five for the example architecture of this paper. Thus, the branch distance gets quite low, about two to ten execution cycles. In control code sections the need for branches can be even considerably higher, approaching the situation where almost every VLIW instruction cycle executes a branch.

Figure 4: FSEL example

The drawback of branch instructions in pipelined processors is branch delays. These can lead to severe performance limitations.

Figure 5: Influence of FSEL

Pnevmatikos and Sohi have shown in [6] that about one third of the branch instructions can be replaced by predicated execution. If-then-else code examples can be efficiently covered by predicated execution; the proposed DSP concept uses FSEL for this purpose. In Figure 4 an example of predicated execution is illustrated. The FSEL instruction is located in an execution bundle (illustrated in Figure 5) such that it enables the remaining instructions in the execution bundle to be executed conditionally.

The FSEL instruction can be used with a wide range of conditions. The bases for these conditions are status flags. The proposed DSP concept supports static flags (discussed in detail in sub-section 4.2) and dynamic flags (in sub-section 4.3). The status flags are stored in the branch file. The flags are also used for the conditional branch instructions, which cannot be replaced by predicated execution.

3. Implementation examples

This section illustrates implementation examples of predicated execution on available VLIW DSP cores. The Texas Instruments C62x family and the Starcore SC140 have been chosen to represent typical architectures. Both VLIW DSP architectures provide the possibility to execute several instructions in parallel, and therefore predicated execution is mandatory to prevent a performance loss in code sequences with a high branch frequency. For both processor cores, the available flags are analyzed concerning their suitability for use by a C-compiler. The same is done for conditional branches, which use the same flags to build up the condition.

3.1. TI C62x

The C62x architecture of Texas Instruments supports predicated execution for each instruction (full guarding) [7]. To obtain full guarding, 3 bits of each instruction word are used to encode the register whose status is needed to generate the condition. The possible registers are B0, B1, B2, A1 and A2; under certain conditions A0 can also be used. The remaining coding space (with 3 bits it is possible to encode 8 states) is used to encode unconditional execution, and one code combination remains reserved.

Figure 6: TI C62x instruction example (addk - add a constant)

The instruction example in Figure 6 shows the leading three bits, labeled creg, used to encode one of the registers. The z bit following creg decodes whether the test is for equal to zero (z=1) or not equal to zero (z=0).

Each instruction consumes 4 bits to encode the condition for predicated execution, which influences the code density. Encoding a static set of registers is useful for the scheduler of the C-compiler, which gains a certain freedom in reordering instructions; this is especially necessary for a VLIW architecture supporting the execution of several instructions in parallel. However, limiting predicated execution to a few registers of the register file restricts the use of these registers.
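A condition field of this kind can be decoded as sketched below. The bit positions (creg in the top three bits of a 32-bit word, z just below it) and the creg-to-register mapping are assumptions made for the illustration, not taken from the TI documentation; only the creg/z scheme itself follows the description above.

```python
# Illustrative decoder for a creg/z condition field: 3 bits select the
# predicate register, 1 bit selects the zero / non-zero test.

CREG_MAP = {                # hypothetical encoding of the predicate register
    0b000: None,            # unconditional execution
    0b001: "B0", 0b010: "B1", 0b011: "B2",
    0b100: "A1", 0b101: "A2", 0b110: "A0",
    # 0b111 is reserved; treated as unconditional here for simplicity
}

def decode_condition(word):
    creg = (word >> 29) & 0b111   # assumed: top three bits of a 32-bit word
    z = (word >> 28) & 0b1        # assumed: bit directly below creg
    reg = CREG_MAP.get(creg)
    if reg is None:
        return "always"
    return f"if {reg} == 0" if z else f"if {reg} != 0"

assert decode_condition(0) == "always"
assert decode_condition((0b010 << 29) | (1 << 28)) == "if B1 == 0"
```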


3.2. Starcore SC140

The architecture of the SC140 supports full guarding [8]. Instead of spending bits in each instruction, the prefix (already used to build up the execution bundle) is used to encode the condition. Two subsets per execution bundle are possible (even and odd). In the assembly syntax three instructions are available. IFF is used for instructions of the current set which are executed if the flag T is equal to zero; if T is one, the instructions are handled as NOPs. The IFT instruction provides the inverse function: if T is equal to one, the instructions are executed, and if T is equal to zero they are treated as NOP instructions. IFA is used for instructions of the same execution bundle which are executed unconditionally.

The predicated execution implementation of the Starcore SC140 consumes less code space. However, the restricted availability of status flags for encoding different conditions significantly limits efficient scheduling of the instructions (all conditions depend on the status of T).
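The IFT/IFF/IFA semantics described above can be sketched as follows. The instruction representation (prefix, destination, value tuples) is illustrative; only the three-prefix behavior follows the description.

```python
# Sketch of SC140-style bundle predication: each instruction carries one of
# the prefixes IFT / IFF / IFA, and the single T flag selects which subset
# of the bundle executes.

def execute_bundle(bundle, t_flag, state):
    """bundle: list of (prefix, dest, value) tuples."""
    for prefix, dest, value in bundle:
        if prefix == "IFA" or (prefix == "IFT" and t_flag) \
                           or (prefix == "IFF" and not t_flag):
            state[dest] = value        # instruction executes
        # otherwise the instruction behaves as a NOP

state = {}
bundle = [("IFT", "d0", 1), ("IFF", "d0", 2), ("IFA", "d1", 9)]
execute_bundle(bundle, t_flag=True, state=state)
assert state == {"d0": 1, "d1": 9}
execute_bundle(bundle, t_flag=False, state=state)
assert state == {"d0": 2, "d1": 9}
```

The single shared `t_flag` parameter is exactly the scheduling bottleneck noted above: any instruction that modifies T cannot be placed between the compare and the predicated instructions that consume it.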

Figure 7: Problem of limited flags

In Figure 7 an example illustrates the related problem for instruction scheduling: it is not possible to schedule any instruction influencing the T flag between the compare and the conditional instruction.

4. Branch File

This section describes the architecture of the branch file for the DSP core proposed in section 2.1 and discusses the different types of status flags. At the end, the requirements during exception handling are mentioned and the advantage of the proposed concept is pointed out.

4.1. Architecture

The branch file is a register file containing the status flags for predicated execution and conditional branch instructions. The branch file has independent read and write ports, with no dependency on the read and write ports of the data and address register files. This enables independent usage of a register and the related flags which represent the status of the register content (as also illustrated in Figure 12).

In the predicated execution example of Figure 4, the register d0 is used as the destination register of an add (addition) instruction, and in the same execution bundle (an execution bundle consists of instructions executed during the same cycle) the status of the register d0 is used to decide whether the load operation of the accumulator a4 takes place. If the status flags were part of the register file, the number of read and write ports of the register file would have to be doubled, which would reduce the reachable clock frequency.

Figure 8: Static flags

The update of the status flags takes place when the register contents are stored. At the beginning of the execution stage (in this example at the beginning of EX1, see Figure 3) the condition is built from the status flags located inside the branch file. If the condition is true, the conditional instruction is executed; if not, the instruction is suspended.

Figure 9: Dynamic flags

In Figure 8 the architecture of the branch file for the proposed DSP core is illustrated (only the part for the static flags). The branch file consists of two parts: a part containing flags associated with registers of the register file (static flags) and a part influenced by the program flow (dynamic flags). The next sub-sections introduce the different flag types and illustrate related possibilities to increase the core performance.


4.2. Static Flags

Static flags represent the status of the register contents. For each register of the data register file three flags are stored (illustrated in Figure 8): a Zero flag (Z), a Sign flag (S) and an Overflow flag (OF). The OF flag is not a static flag but is assigned to a register, which is why it is already mentioned in this sub-section. If an instruction result exceeds the available data width, the OF flag of the destination register is set. Because the long registers are a concatenation of two data registers, the flags of the data registers have to be combined to evaluate the status of a long register. The flags representing the status of the accumulator are stored separately.
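The per-register flag update can be sketched as follows. The 16-bit register width and the dictionary-based branch file are assumptions made for the illustration; the mechanism (flags refreshed on every register write) follows the description above.

```python
# Sketch of the static-flag update: whenever a result is written to a data
# register, its Z / S / OF entry in the branch file is refreshed in parallel.

WIDTH = 16                                   # assumed data register width
MIN, MAX = -(1 << (WIDTH - 1)), (1 << (WIDTH - 1)) - 1

branch_file = {}   # register name -> {"Z": ..., "S": ..., "OF": ...}
regs = {}

def write_reg(name, value):
    """Register write with parallel branch-file update."""
    branch_file[name] = {
        "Z": value == 0,
        "S": value < 0,
        "OF": not (MIN <= value <= MAX),     # result exceeded the data width
    }
    regs[name] = value   # (a real core would also wrap or saturate here)

write_reg("d0", 0)
write_reg("d1", 40000)                       # exceeds the 16-bit signed range
assert branch_file["d0"] == {"Z": True, "S": False, "OF": False}
assert branch_file["d1"]["OF"]
```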

The address registers are useful for implementing software counters. Therefore the Zero flag of each address register is evaluated and stored inside the branch file to enable an equal-to-zero condition.

Using flags related to the destination register of an instruction enables the instruction scheduler of the C-compiler to reschedule instructions (the dependency between generating the condition and executing the dependent instruction cannot be resolved, but there is no need to keep them adjacent).

4.3. Dynamic Flags

Dynamic flags are related to the program flow. Examples are the OF flag already introduced above and the loop status flags (FL, first loop; LL, last loop). Some examples are illustrated in Figure 9. The proposed DSP core supports zero-overhead loop handling. When the loop is executed for the first time, the FL flag is set. This makes it possible to schedule instructions into the loop body and use predicated execution to execute them only once.

In Figure 10 a loop example is illustrated. In the chosen example, software pipelining techniques have already been applied to reduce the load-in-use dependency (between the load instructions and the related arithmetic instructions). In this example it was possible to hide the load-in-use dependency; in other code sections this will not be possible, with the drawback of introducing NOP instructions, which means wasted clock cycles and wasted instruction space, because NOP instructions also have to be encoded.

In Figure 10 the load operation of the loop body is executed for the first time before the loop itself; therefore the loop has to be executed one iteration less. The add instruction is located directly in front of the loop body (on this core architecture the instruction can be scheduled in parallel to the loop instruction). The add operation is executed only once.

Figure 10: Loop example

In Figure 11 the same example is shown, this time using the FL flag. The add instruction can be shifted into the loop body. The add operation is still executed only once (during the first execution of the loop). Using the flag in combination with predicated execution, it is possible in this example to reduce the number of cycles.
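The FL-flag transformation can be sketched in a few lines. The loop model and names are illustrative: an instruction that must run exactly once is moved into the loop body and predicated on the first-loop flag instead of occupying a slot before the loop.

```python
# Sketch of the FL-flag optimization: the one-time "add" is predicated on
# the first-loop flag inside the loop body, so no pre-loop slot is needed.

def run_loop(iterations):
    acc = 0
    extra = 0
    for i in range(iterations):
        fl = (i == 0)            # FL: set only during the first execution
        if fl:
            extra = 10           # predicated instruction: executes once
        acc += 1                 # regular loop body work
    return acc, extra

assert run_loop(4) == (4, 10)    # body ran 4 times, the add ran once
```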

Figure 11: Loop example using FL

The drawback of an increased loop body, and therefore the need for additional fetch cycles during loop body execution, is compensated by an instruction buffer [9]. This buffer is used to execute loop bodies without further fetch operations from memory, which would lead to increased power dissipation. During the last execution of the hardware loop the LL flag is set. As with the first loop execution, it is possible to schedule instructions into the loop body and execute them only during the last iteration of the loop.

For each of the loop nesting levels a pair of loop flags is available, indicated by the number in front of the status flags.

4.4. Exception handling

During exception handling (handling of interrupt service routines or task switching) the static and the dynamic flags are handled differently.

The static flags are updated each time a value is stored into the register file. Therefore it is not necessary to explicitly save the static flags to data memory during exception handling. For example, at the beginning of a task switch all registers of the register file are stored to data memory. The register values used for the new task are fetched from memory and stored into the register file. During the store operation into the register file the static flags are automatically updated. When switching back to the first task, only the registers of the second task have to be stored to data memory, and the register contents of the first task are fetched from memory and again stored into the register file. The static flags are again automatically updated. Therefore the static flags do not have to be considered during exception handling; their content is consistent with the register contents at all times.

Figure 12: Handling of static flags

The dynamic flags (including the overflow flag) depend on the program flow, and therefore they have to be taken care of during exception handling. This can be done by regular load/store instructions.

In Figure 12 the mechanism described above is illustrated. In parallel to the update of the registers in the register file, the related entries in the branch file are updated. This is true for values fetched from data memory as well as for results of arithmetic operations. The advantages of generating the flags in parallel are: consistency of the register contents and the related branch file entries in each cycle, availability of the flags as soon as they are requested for conditional operations, and no need for explicit instructions to take care of the static flags during exception handling.
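Why the static flags survive a task switch for free can be sketched as follows. All names here are hypothetical; the point is that restoring a task's register values re-derives its flags, because the branch file is updated on every register write.

```python
# Sketch of flag-free context switching: only register contents are saved
# and restored; the static flags follow automatically on each write.

regs, branch_file = {}, {}

def write_reg(name, value):
    regs[name] = value
    branch_file[name] = {"Z": value == 0, "S": value < 0}  # parallel update

def context_switch(saved_regs):
    """Restore only the register contents of the resumed task."""
    for name, value in saved_regs.items():
        write_reg(name, value)     # each write refreshes the branch file

write_reg("d0", -3)                     # state of the task being suspended
context_switch({"d0": 0, "d1": 7})      # resume the other task
assert branch_file["d0"] == {"Z": True, "S": False}   # consistent again
```

No explicit flag save/restore appears anywhere in the switch path, which is exactly the overhead reduction claimed above; only the dynamic flags would still need regular load/store handling.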

5. Conclusion

The branch file described in this paper is used to store dynamic and static flags. Besides the conditional branch instructions which can make use of the flags of the branch file, predicated execution can be used to reduce the number of branch instructions, especially in typical control code constructs like if-then-else. The branch file with separate read and write ports enables using the register content and the status flags independently, without influencing the reachable clock frequency by doubling the read and write ports of the data register file. The register-related status flags enable the C-compiler to reschedule instructions and therefore to reduce the number of necessary execution cycles. During exception handling the static flags do not have to be taken care of: updating the register content automatically updates the static flags in the branch file. The automatic flag handling reduces the necessary overhead during task switching and handling of interrupt service routines. The branch file is part of a configurable DSP core.

References

[1] P. Lapsley, J. Bier, A. Shoham and E. A. Lee, DSP Processor Fundamentals: Architectures and Features, IEEE Press, New York, 1997.

[2] D. Sima, T. Fountain, P. Kacsuk, Advanced Computer Architectures: A Design Space Approach, Addison Wesley, Harlow, 1997.

[3] J. K. F. Lee and A. J. Smith, "Branch prediction strategies and branch target buffer design", Computer, 17(1), pp. 6-22, 1984.

[4] C. Stephens, B. Cogswell, J. Heinlein, G. Palmer and J. P. Shen, "Instruction level profiling and evaluation of the IBM RS/6000", in Proc. 18th ISCA, pp. 137-146, 1991.

[5] T.-Y. Yeh and Y. N. Patt, "Alternative implementations of two-level adaptive branch prediction", in Proc. 19th ISCA, pp. 124-134, 1992.

[6] D. N. Pnevmatikos and G. S. Sohi, "Guarded execution and branch prediction in dynamic ILP processors", in Proc. 21st ISCA, pp. 120-129, 1994.

[7] Texas Instruments, CPU and Instruction Set Reference Guide, SPRU189B, Texas Instruments, July 1997.

[8] Motorola Inc. and Lucent Technologies Inc., SC140 DSP Core Reference Manual, MNSC140CORE/D, Rev. 0, Dec. 1999.

[9] C. Panis, R. Leitner, H. Grünbacher, J. Nurmi, "xLIW - a Scaleable Long Instruction Word", in Proc. ISCAS 2003, Bangkok, Thailand, May 2003.


PUBLICATION 8

C. Panis, R. Leitner, J. Nurmi, "A Scaleable Shadow Stack for a Configurable DSP Concept", in Proceedings of The 3rd IEEE International Workshop on System-on-Chip for Real-Time Applications (IWSOC), Calgary, Canada, June 30-July 2, 2003, pp. 222-227.

©2003 IEEE. Reprinted, with permission, from proceedings of the IEEE International Workshop on System-on-Chip for Real-Time Applications.


Scaleable Shadow Stack for a Configurable DSP Concept

Christian Panis, Carinthian Tech Institute, [email protected]

Raimund Leitner, Infineon Technologies, [email protected]

Jari Nurmi, Tampere University of Technology, [email protected]

Abstract

SoC (System-on-Chip) applications map complex system functions onto a single die. The increasing importance of flexibility in SoC applications leads to a rising portion of the system being implemented in firmware. Therefore, the demand for computational power of the embedded processors in the application is increasing. The newest silicon technologies (e.g. 0.13 µm and below) help to increase the reachable frequency, but cannot satisfy the demand on their own. One approach to increase the processor frequency is the introduction of pipelining. Different methods have been developed to guarantee data consistency in deeply pipelined processors; additional complexity is introduced by the occurrence of interrupts. This paper describes a concept to ensure data consistency between the instructions of different pipeline stages in pipelined DSP kernels during interrupt service routines, without interaction of the DSP itself and with no restrictions on the nesting level of the interrupts. The scaleable shadow stack is part of a development project for a configurable DSP concept.

1. Introduction

Today's applications map complex systems onto a single die, and therefore the complexity of integrated circuits is increasing. The importance of software/firmware in applications is growing, mainly driven by higher flexibility requirements, which leads to an increasing relevance of embedded processors [1]. To achieve the frequency requirements for the embedded processors, it is not sufficient to gain the necessary frequency increase only from the smaller feature size of the newest CMOS technologies. Architectural features like pipelining (introduced originally in the late 1960s: IBM 360/91, 1967; CDC 7600, 1970) help to increase the core frequency by reducing the number of actions during one clock cycle [2][3].

The application requirements on DSP subsystems are changing. On one side, the classical number-crunching functions like FIR, IIR and FFT will be implemented in dedicated hardware. On the other side, many applications, especially in the low-cost area, do not justify using more than one processor core.

Therefore the configuration of these dedicated hardware blocks has to be done by the DSP, and due to the missing micro-controller more control code has to be executed as well. For these reasons, interrupts are becoming more important and generate many additional problems in DSP architectures. Besides real-time requirements, a lean task switch and several interrupt nesting levels have to be supported.

This paper describes a mechanism to overcome the data consistency problem between different pipeline stages for a DSP concept, with no restrictions on the nesting level of interrupts. The so-called scaleable shadow stack needs no interaction from the DSP kernel itself, and therefore no MIPS or program memory has to be spent.

The first part shortly introduces the DSP architecture. The structure of the pipeline is explained in more detail to illustrate the problems of nested interrupts and of data consistency between the different pipeline stages. The second part presents the architecture of the scaleable shadow stack, describes the structure of the stack packets and outlines the swapping mechanism. The third part analyzes at which nesting level the proposed concept leads to an advantage compared with classical approaches.

2. DSP Architecture

This section shortly introduces the DSP architecture, which is the basis for the scaleable shadow stack. To meet the requirements of handling control code on a DSP, a C-compiler entry is mandatory. Therefore the development of an efficient C-compiler has been considered during architecture definition (efficient meaning that the overhead compared with manual assembly coding is less than 10%). The modified dual Harvard architecture has two independent data memory busses, in the chosen example 32 bits wide.

0-7695-1944-X/03 $17.00 © 2003 IEEE


Instructions get their source operands from the register file and store their results back into the register file; data moves between the register file and data memory are explicitly coded as separate instructions (load/store architecture). The RISC-like pipeline has a split execution stage to prevent a timing-critical path inside the execution unit. These two pipeline stages are called EX1 and EX2, as illustrated in Figure 1.

2.1. Pipeline structure

To increase the reachable core frequency of the DSP kernel, the execution stage of the RISC-like pipeline structure has been split into two parts. For example, the MAC (multiply-accumulate) instruction is split into the multiply operation, calculated during the EX1 stage, and the accumulate operation, calculated during the EX2 stage.

To reduce define-use dependencies [4], the operands of an instruction are fetched from the register file as late as possible. This enables e.g. the implementation of an FIR filter without any stall cycles due to data dependencies.

Figure 1: Pipeline structure

For the multiply operation the multiplicand and multiplier are fetched at the beginning of the EX1 stage; for the final accumulation the remaining operand is fetched at the beginning of the EX2 stage. At the end of pipeline stage EX2 the result of the MAC instruction is stored into the register file. In Figure 2 two consecutive cycles are used to illustrate the data dependency between the instructions of different clock cycles (t, t+1).

Figure 2: Data dependency

The operand op1 of instruction t+1 cannot yet contain the result of instruction t. This allows the same register of the register file to be used by two instructions: first to hold an operand of instruction t+1 and then the result of instruction t.

If the register used for op2 of instruction t+1 is the same as the result register of instruction t, the result of instruction t is already used as an operand of instruction t+1. This increases the utilization of the available registers and decreases the data dependency between the instructions of different pipeline stages.
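The late-operand-fetch behaviour described above can be modelled in a few lines of Python. This is a behavioural sketch only: the register names, the instruction tuple format and the helper `run` are invented for illustration, not part of the core's actual specification.

```python
# Behavioural sketch of the split execution stage: op1 is read at the
# start of EX1, op2 at the start of EX2, and the result is written back
# at the end of EX2. Consequence: instruction t+1 sees the OLD value of
# a register as op1, but already the NEW value (result of t) as op2.

def run(instrs, regs):
    """Execute (dest, src1, src2, op) tuples on a register-file dict,
    modelling the one-cycle overlap between write-back of instruction t
    and the op2 read of instruction t+1."""
    pending = None  # (dest, value) of the instruction currently in EX2
    for dest, src1, src2, op in instrs:
        op1 = regs[src1]              # EX1: first operand, read early
        if pending:                   # end of previous EX2: write back
            regs[pending[0]] = pending[1]
        op2 = regs[src2]              # EX2: second operand, read late,
                                      # so it sees the fresh result
        pending = (dest, op(op1, op2))
    if pending:                       # drain the pipeline
        regs[pending[0]] = pending[1]
    return regs
```

Running e.g. `r1 := r0 + r1` followed by `r2 := r1 + r1` with `r0 = 2`, `r1 = 3` shows the effect: the second instruction reads the old `r1` (3) as op1 but the new `r1` (5) as op2, giving `r2 = 8`.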

2.2. Interrupt handling

If an interrupt disrupts the linear program flow, an interrupt service routine (ISR) has to be executed. The instructions already in the pipeline are finished and their results stored into the register file. Then the ISR is executed, and after the interrupt request has been served the original program flow can be resumed.

The handling of an ISR is illustrated in Figure 3. The results of instruction t are stored at the end of EX2 and the ISR can be started. After the ISR has finished, instruction t+1 (of Figure 2) is handled in cycle t+n. The ISR itself takes care of saving and restoring the contents of the registers that it needs to occupy.

Figure 3: Interrupt handling

Due to the delayed execution (from the system point of view) of instruction t+1 (now t+n), operand op1 has already been changed by instruction t, which wrote its result at the end of EX2. Therefore instruction t+1 delivers a different result depending on whether it is executed in cycle t+1 or in cycle t+n (delayed by the execution of the ISR).

Figure 4: Register file

An example with register values is now used to illustrate the problem: a MAC instruction followed by an ADD instruction, as shown in Figure 5. The register file in this example is built up as in Figure 4. Two consecutive data registers (16 bits wide, e.g. d0 and d1) can be accessed as a long register l0 (32 bits wide). The long register and the guard bits (gb) together can be addressed as accumulator a0.


[Content of Figures 1-3 and 5: pipeline stages ID, EX1, EX2 for cycles t and t+1, with op1 read in EX1, op2 read in EX2 and write-back at the end of EX2; the ISR delays instruction t+1 to cycle t+n. Example register values: d0 = 2, d1 = 3, d2 = 4, d3 = 5, d6 = 7; instructions mac d0,d1,d3 and add d6,d2,d4; without an ISR: a3 = 2*3 + 5 and d4 = 7 + 4; with an ISR in between: a3 = 2*3 + 5 and d4 = 11 + 4.]

Figure 5: Example

In the example of Figure 5 the source register d6 of the ADD instruction is physically part of the result register a3.

The ADD instruction gets its first operand at the beginning of EX1 and the second at the beginning of EX2 (in order to reduce the number of read ports of the register file). The normal program flow (left column) gives a different result than the flow in which the ISR is executed between the MAC and the ADD instruction (right column).
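The hazard of Figure 5 can be reproduced with a small Python sketch. The register values are taken from the example; the assumption that d6 holds the low 16 bits of accumulator a3, and the helper `mac_add`, are illustrative only.

```python
# Sketch of the aliasing hazard of Figure 5: d6 is physically part of
# accumulator a3, so whether the ADD reads d6 before or after the MAC
# write-back changes the result stored in d4.

def mac_add(isr_between):
    d = {'d0': 2, 'd1': 3, 'd2': 4, 'd3': 5, 'd6': 7}
    # mac d0,d1,d3 -> a3: multiply in EX1, accumulate in EX2
    a3 = d['d0'] * d['d1'] + d['d3']       # 2*3 + 5 = 11
    # add d6,d2,d4: op1 (d6) is read at EX1, i.e. BEFORE the MAC
    # write-back -- unless an ISR drains the pipeline in between.
    op1 = (a3 & 0xFFFF) if isr_between else d['d6']
    d['d6'] = a3 & 0xFFFF                  # MAC write-back (aliased d6)
    op2 = d['d2']                          # op2 read at EX2, after write-back
    return op1 + op2                       # value stored in d4
```

Without an ISR the ADD computes d4 = 7 + 4 = 11; with an ISR in between it computes d4 = 11 + 4 = 15, matching the two columns of Figure 5.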

2.3. Shadow register

One approach to overcome this problem is to save the results of the instructions of cycle t before starting the ISR. The results can be kept in registers during the execution of the ISR, and after the interrupt service has finished the values are restored into the register file. If more than one interrupt nesting level is supported, an additional set of storage registers has to be provided for each level, as shown in Figure 6. Today's VLIW architectures provide a high degree of instruction-level parallelism and therefore many possible parallel results.

Figure 6: Shadow register

For each instruction that can be executed in parallel, and for each nesting level, such a field of intermediate registers has to be provided [5]. This requires a lot of additional silicon area to handle the data consistency problem, and the number of possible interrupt nesting levels is limited by the available register resources.

3. Scaleable shadow stack

The scaleable shadow stack solves the described problem in a more elegant way, with no restrictions on the nesting level of interrupts. The following section explains the architecture of the stack, the structure of the shadow stack packet in detail, and the fine-tuning mechanism, which enables application-specific optimization.

3.1. Architecture

Figure 7 illustrates the integration of the shadow stack into the core architecture. If an interrupt service routine has to be executed, the results of the instructions in the last cycle before the ISR are not stored into the register file.

Figure 7: Shadow stack integration

In the example of Figure 7, four possible results have to be saved in the shadow stack. This is done without any instructions in the DSP program. After the ISR has finished, the register contents previously saved in the shadow stack are restored into the register file of the DSP core. To prevent the data inconsistency described in the previous section, the write-back has to be handled in the correct cycle of the pipeline.

Figure 8: Shadow stack structure

The shadow stack has access to the data memory ports of the DSP subsystem. The already available memory can be used to store parts of the shadow stack content. This has to be done to free up space for the next results (initiated by the next nested interrupt). When this storing actually becomes necessary depends on the size of the shadow stack. Details of the swapping mechanism are discussed in the next subsection.

From the core point of view the shadow stack contains the following information:

• Register content: the result values of the last instructions before the ISR execution.

• Target address: the target register into which the register content is written back after the ISR has finished.

• Interrupt level: if several levels of interrupts are possible, the stack entries must be assignable to a certain level.

The shadow stack structure is depicted in Figure 8. To the core, the stack looks like a hardware stack, with pointers handling the stack administration. The begin shadow ptr is incremented when packets are stored on the stack, and decremented when data has to be restored into the register file. The end shadow ptr handles the data exchange with the memories when stack contents are swapped to the data memories of the DSP core.

3.2. Shadow stack packet structure

Due to the instruction-level parallelism of VLIW architectures [3], the shadow stack has to handle several results at the same time (in this example architecture up to 4). Each of the result registers can have a different bit width (e.g. 16 bits for data registers and 40 bits for an accumulator). The width of the busses to the data memories of the DSP core is another factor influencing the structure of the shadow stack packet. In the example illustrated in Figure 9, two busses, each 32 bits wide, are assumed. The biggest result register is 40 bits wide (an accumulator).

The interface for storing the packets into the data memories is 32 bits wide, so the packet structure is limited to 32 bits: 20 bits are used for payload (the register content), 7 bits encode the destination register, one bit identifies the next interrupt level, and two bits record the previous packet location. A single bit for the interrupt level is sufficient because the shadow stack packets are served in FILO order (first in, last out), and this level bit indicates the start of the next interrupt level. The previous packet location has to be stored because the next free bus cycle is used regardless of which of the two busses to the data memory is available.

Figure 9: Packet structure

To reassemble the data correctly when they are fetched back from the data memories, it is necessary to know where they were stored.

Registers containing more than 20 bits of payload have to be split into 20-bit parts to fit the illustrated raster. To dimension the shadow stack for optimal utilization, a worst-case estimation has to be done (how many shadow stack packets have to be handled for one interrupt level at the same time). The same analysis is needed to set the thresholds for the swapping mechanism.
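The bit budget above (20 + 7 + 1 + 2 = 30 of 32 bits) can be made concrete with a packing sketch. The field order within the word and the helper names are assumptions for illustration; only the field widths come from the text.

```python
# Hypothetical packing of the 32-bit shadow stack packet: 20 payload
# bits, 7 destination-register bits, 1 interrupt-level bit and 2 bits
# for the previous packet location (the remaining 2 bits unused).
# The field ordering is an illustrative assumption.

def pack(payload, dest, level, prev_loc):
    assert payload < (1 << 20) and dest < (1 << 7)
    assert level < 2 and prev_loc < 4
    return payload | (dest << 20) | (level << 27) | (prev_loc << 28)

def unpack(word):
    return (word & 0xFFFFF,           # payload (register content)
            (word >> 20) & 0x7F,      # destination register
            (word >> 27) & 0x1,       # interrupt-level marker
            (word >> 28) & 0x3)       # previous packet location

def split_accumulator(value40):
    """A 40-bit accumulator exceeds the 20-bit payload field and is
    split into two packets, as described in the text."""
    return value40 & 0xFFFFF, (value40 >> 20) & 0xFFFFF
```

A pack/unpack round trip preserves all four fields, and a 40-bit accumulator value yields exactly two 20-bit payloads.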

3.3. Thresholds

The thresholds of the shadow stack structure can be seen in Figure 10. There are four limits, which can be configured to application-specific requirements.

• LIL_MIN: guarantees the availability of the results in the stack after executing the current interrupt service routine.

• LIL_MAX: guarantees enough space in the shadow stack to hold the next values if a new interrupt occurs.

• BAL_MIN: an optional limit used to keep the stack content well balanced. If no value is assigned, it defaults to LIL_MIN.

• BAL_MAX: an optional limit used to keep the stack content well balanced. If no value is assigned, it defaults to LIL_MAX.

If the stack contains too many entries, additional interrupts can lead to stalling cycles of the DSP core to free space for the next register values.

[Content of Figure 10: the stack limits, from bottom to top, are LIL_MIN, BAL_MIN, BAL_MAX and LIL_MAX; below LIL_MIN is space to restore at least the results of the most recent interrupts, above LIL_MAX is space to store at least the results of the next interrupts.]

Figure 10: Thresholds


If the stack is kept empty and all shadow stack packets have been swapped preventively to the data memory, the stack is well prepared for new interrupts. However, the register contents needed to finish the current interrupt service routine are then missing. If the shadow packets have to be fetched back, this again leads to stalling cycles and to additional power dissipation due to the switching at the data memory ports. For a balanced use of the stack, the two limits BAL_MIN and BAL_MAX can be configured individually.
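One way the four limits could steer the background swapping is sketched below. The limit names are from the text; the decision logic, the ordering LIL_MIN < BAL_MIN < BAL_MAX < LIL_MAX and the action labels are illustrative assumptions, not the actual hardware behaviour.

```python
# Illustrative mapping from the shadow stack fill level (occupied
# entries) to a background action of the swapping mechanism.

def next_action(fill, lil_min, bal_min, bal_max, lil_max):
    if fill >= lil_max:
        return 'store_lil'   # free space for the next interrupt level
    if fill <= lil_min:
        return 'load_lil'    # refetch results needed to finish the ISR
    if fill > bal_max:
        return 'store_lil'   # preventive swap-out keeps the stack balanced
    if fill < bal_min:
        return 'load_lil'    # preventive fetch-back
    return 'idle'            # between BAL_MIN and BAL_MAX: do nothing
```

With e.g. limits (2, 4, 12, 14), a fill level of 8 triggers no action, 13 triggers a preventive swap-out and 3 a preventive fetch-back.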

The FSM (finite state machine) handling the swapping of the shadow packets is illustrated in Figure 11. As long as there is enough space in the shadow stack to save new values, and the values needed after finishing the ISR are available for restoring, only the states Save Results and Restore Results are used. If there is no space left for the next interrupt level, the state Store LIL is used to free space inside the stack. If the result values of the ISR currently being served are not available inside the stack, the state Load LIL initiates a fetch from the data memory of the DSP subsystem. Neither of these states influences the performance of the DSP kernel, because the memory operations are performed during memory cycles not used by the DSP anyway, as explained in the next subsection. If the next interrupt occurs and there is too little space available inside the stack, the state Store Stalling stalls the DSP program and frees the stack until enough space is available. If the ISR is finished and the related register contents are not available to be restored into the register file, the state Load Stalling fetches the missing data from the data memory. Both of these states affect the DSP due to the stall mechanism.
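The stalling behaviour of this FSM can be summarised in a minimal event model. The state names are from the text; the event functions and their parameters are invented for illustration, and only the two stalling decisions are modelled (the background states Store LIL and Load LIL are driven by the thresholds discussed above).

```python
# Minimal sketch of the event-driven part of the swapping FSM:
# an interrupt needs room for a whole level of results, and a
# returning ISR needs its results present in the stack.

from enum import Enum, auto

class State(Enum):
    SAVE_RESULTS = auto()
    RESTORE_RESULTS = auto()
    STORE_STALLING = auto()   # core stalled to free stack space
    LOAD_STALLING = auto()    # core stalled for an emergency fetch

def on_interrupt(free_entries, level_size):
    """Chosen state when a new interrupt arrives."""
    if free_entries < level_size:
        return State.STORE_STALLING
    return State.SAVE_RESULTS

def on_isr_finished(results_present):
    """Chosen state when the current ISR returns."""
    if not results_present:
        return State.LOAD_STALLING
    return State.RESTORE_RESULTS
```

The model makes the trade-off visible: only when the stack is too full (on entry) or too empty (on return) does the core pay stall cycles.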

3.4. Cycle stealing

The scalable shadow stack has access to the ports of the data memory to swap stack contents to the memory of the DSP subsystem. To avoid stall cycles in which the program running on the DSP core would be halted to free the data busses, cycle stealing [2] is used to regulate the communication on the data bus. The DSP core is the master of the data busses and indicates via a signal which of the memory busses will not be used during the next cycle. During the state Store LIL the related bits in the shadow stack packet (previous packet location) are set and the free bus is used to store the shadow stack packet in the data memory. During the state Load LIL the previous packet location of the shadow stack packets is analyzed (to determine on which bus the next value was stored) and the data are fetched from the data memory of the DSP subsystem.

Figure 11: FSM for swapping mechanism

If an interrupt service routine finishes and the related shadow stack packets are not available inside the shadow stack, the shadow stack can claim the data bus of the DSP core and fetch the necessary packets. During this emergency fetching, the program flow on the DSP core is stalled.

4. Comparison

The minimal size of the shadow stack is one level of results. A level of results consists of the result values generated by the last instructions before executing the ISR. Such a small stack requires swapping of the stack content each time an interrupt occurs or an ISR finishes. Additionally, during the swapping no further interrupts are allowed, which increases the response time. In the example above, 4 cycles are needed to swap the data out and 4 cycles to restore them.

With a stack of two levels of results, there is still no advantage over the shadow registers in terms of silicon area. However, in this configuration not every interrupt causes swapping to the data memory of the DSP subsystem.


Figure 12: Comparison of shadow registers versus shadow stack

Assuming a stack size of three levels of results, the stack is big enough to exploit statistics: not every unit executing an instruction before an interrupt occurs will produce a valid result. Therefore even a 3-level shadow stack can handle more than three nesting levels without any swapping to the data memory, and no further registers have to be spent for deeper nesting levels.

In Figure 12 the trade-off between silicon area and possible nesting level is illustrated for the shadow registers and for the shadow stack solution. For the first two nesting levels the shadow stack has an overhead due to the stack administration. The break-even point is reached at about the third level. Further nesting levels do not require additional hardware in the case of the shadow stack; in the worst case some stack packets have to be spilled to the data memory. The shadow register solution requires additional hardware to support further nesting levels.

The scaleable architecture enables an application-specific adaptation of the shadow stack architecture. The following parameters can be configured:

• number of entries of the stack

• thresholds for the swapping administration

• number of memory ports

• bit width of the memory ports

• number of result registers per interrupt level

• size (bit width) of the result registers

Most of these parameters have influence on the consumed silicon area and the power dissipation.

An FSM is used to balance the fill level, to reduce the power dissipation at the memory ports and to prevent stalling cycles of the DSP core.

5. Conclusion

The scaleable shadow stack is an elegant solution to the problem of data consistency between the instructions of different pipeline stages during interrupt service routines, without any restrictions on the possible nesting level of the interrupts. The proposed concept ensures data consistency without using DSP instructions or cycles and has an area advantage over a shadow register concept at deeper interrupt nesting levels.

The scaling parameters can be used to adapt the shadow stack structure to application-specific requirements. The thresholds administrating the swapping mechanism to the data memory can be used to reduce the power dissipation through fewer memory accesses.

The shadow stack is a part of a development for a configurable DSP concept.

6. References

[1] J. Eyre and J. Bier, "The Evolution of DSP Processors," IEEE Signal Processing Magazine, vol. 17, no. 2, 2000.

[2] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, San Mateo, CA, 1990.

[3] P. Lapsley, J. Bier, A. Shoham and E. A. Lee, DSP Processor Fundamentals: Architectures and Features, IEEE Press, New York, 1997.

[4] D. Sima, T. Fountain and P. Kacsuk, Advanced Computer Architectures: A Design Space Approach, Addison-Wesley, Harlow, 1997.

[5] S. M. Mueller and W. J. Paul, Computer Architecture: Complexity and Correctness, Springer, New York, 2000.


PUBLICATION 9 C. Panis, J. Hohl, H. Grünbacher, J. Nurmi, "xICU - a Scaleable Interrupt Unit for a Configurable DSP Core", in Proceedings of the 2003 International Symposium on System-on-Chip (SOC'03), Tampere, Finland, November 19-21, 2003, pp. 75-78.

©2003 IEEE. Reprinted, with permission, from proceedings of the 2003 International Symposium on System-on-Chip.


xICU – an Interrupt Control Unit for a configurable DSP Core

C. Panis, Carinthian Tech Institute, [email protected]

J. Hohl, Infineon Technologies Austria, [email protected]

H. Gruenbacher, Carinthian Tech Institute, [email protected]

J. Nurmi, Tampere University of Technology, [email protected]

Abstract

The increasing complexity of SoC applications leads to a strong demand for powerful software-programmable embedded cores. Low-cost applications do not allow more than one core to be added; depending on the application, a DSP or a microcontroller is used. Therefore DSP cores also have to handle interrupts typically served by microcontroller subsystems, with low latency and small overhead in cycle count and code density.

This paper describes the architecture of an ICU (interrupt control unit) for a configurable DSP core. The main architectural features of the ICU can be configured to reduce the consumed silicon area to an application-specific optimum. Priority morphing is introduced to control the execution order of pending interrupt sources during run-time and to prevent the loss of interrupt information. A smooth integration into the program sequencer allows short interrupt latency and low overhead for serving ISRs (interrupt service routines). xICU is part of a project for a configurable DSP core.

1. Introduction

The integration of system solutions onto one die (System-on-Chip) or into one package (System-in-a-Package) leads to a strong demand for powerful embedded software-programmable cores. This is true for microcontroller and DSP cores. Especially for low-cost applications it is not possible to add more than one core; depending on the algorithms implemented in software, a microcontroller or a DSP core is chosen. Therefore DSP cores also have to handle control code and configuration code sections efficiently [1]. Interrupts have to be served as in microcontroller architectures, i.e. with low latency and little overhead in cycle count and code density [2].

This paper describes the architecture and integration of an interrupt control unit (ICU) for a configurable DSP core. The first section illustrates the changing importance of serving interrupts in DSP subsystems. The second part briefly introduces the requirements for the ICU architecture, with a focus on priority morphing, which is used to control the execution order of the interrupt sources during run-time and to prevent the loss of interrupt information. The third part briefly covers the architectural details of the integration into the proposed core architecture.

2. Motivation

This section briefly introduces the features of the interrupt control units of available core architectures. The chosen examples illustrate the increasing importance of interrupts for DSP subsystems. The OAK DSP core from DSPGroup does not provide a separate interrupt control unit. As a state-of-the-art architecture, the ICU of the Starcore SC140 is introduced. As a reference for ICUs, the Motorola PowerQuicc MPC860 is introduced.

2.1. OAK DSP Core

The OAK DSP core from DSPGroup is chosen as an example of traditional DSP core architectures [3] without an explicit interrupt control unit. The interrupts (one non-maskable and three maskable interrupts) are directly connected to the DSP core; the activation of the pins is high-level sensitive.

Figure 1: OAK DSP Core

The OAK DSP core supports context switching while serving interrupt service routines (ISRs). For the accumulator registers a set of shadow registers is available to handle the second task, so no initialization procedure is needed. The limitation to one set of shadow registers (supporting more than one would significantly increase the core area) does not allow the context switching mechanism to be used for nested interrupts. The OAK DSP is not restricted by this limitation, because nesting of the maskable interrupts, which are controlled by the IE bit, is not supported.

2.2. Starcore SC140

The implementation of the exception handling for the Starcore SC140 is split into two parts [4]. The PIC (programmable interrupt controller) is used for prioritization and arbitration of the different exception sources and is not part of the DSP core. Up to 7 different priority levels are supported (not including the NMI).

0-7803-8160-2/03/$17.00 © 2003 IEEE

Figure 2: SC140-PIC Interface

The interface of the PIC to the DSP core is illustrated in Figure 2. Exceptions can be generated by traditional external interrupt sources ("the left side" in Figure 2) or by the PSEQ unit of the DSP core, e.g. on execution of illegal instructions. The address of the interrupt vector consists of three parts: the vector base address (stored in the VBA), an offset (internal or external, depending on the kind of exception) and an aligned start address with the lower 6 bits initialized to zero (because the distance between two exception vectors is 64 bytes).

The PIC supports up to 32 interrupt requests including NMIs (non-maskable interrupts). Up to eight edge-triggered NMIs are supported; the remaining 24 inputs support edge- or level-triggered interrupt requests. The PIC provides the possibility to monitor pending interrupts, which is useful for debugging purposes.

The SC140 supports delayed return from exception handling, which allows making use of some of the branch delays of the return-from-interrupt instruction (up to two out of six branch delays).

Figure 3: CPM (CPIC)

2.3. Motorola Power Quicc

The exception handling of the Motorola PowerQuicc is chosen because this microcontroller is used in several applications today [5]. The interrupt controller of the MPC860 is called CPM. The CPIC (a part of the CPM) is used for synchronization and prioritization of all internal and external interrupt sources, similar to the PIC of the SC140 (illustrated in Figure 3).

For maskable interrupt sources a mask register is available, containing the mask information used during the selection of the next interrupt source. The priority of the different interrupt sources can be assigned by mapping them onto a priority matrix; the assigned numbers must be unique.

Comparing the exception handling of a microcontroller with that of a state-of-the-art DSP core, the differences are negligible. Both architectures have a predefined configuration: the number of supported interrupt sources, the number of available priority levels, the size of the address bus and the related address space are fixed.

xICU, the interrupt control unit introduced in this paper, allows these parameters to be adapted to application-specific requirements to reduce the consumed silicon area.

3. xICU Requirements

This section introduces the requirements for xICU. The main part illustrates priority morphing (and its specific implementation), which is used to prevent the loss of interrupt information and allows the order of served ISRs (interrupt service routines) to be changed during run-time.

3.1. Synchronization

Interrupt sources can generate asynchronous interrupt signals (e.g. from different clock domains). Therefore the ICU has to take care of synchronizing the incoming signals (as the PIC of the SC140 does).

3.2. Scalability

The proposed DSP core architecture enables the main architectural features to be adapted to application-specific requirements, to obtain an optimum in power dissipation and area consumption. The same is required for xICU: the number of interrupt sources, the number of supported priority levels (which influences the size of the related configuration registers) and the width of the interface to the core itself have to be scaleable.

3.3. Priority

Different interrupt sources can be assigned to different priority levels. There is no restriction on the number of priority levels, and the priority of an interrupt source can be changed during run-time. The same priority level can be assigned to more than one source at the same time; if two sources have the same priority, the time of occurrence decides the order of execution.
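The scheduling rule just described (highest priority first, ties broken by time of occurrence) can be sketched with a priority queue. This is a behavioural model only; the class name `Scheduler` and its methods are invented for illustration.

```python
# Behavioural sketch of the pending-interrupt selection: the request
# with the highest priority is served first, and requests of equal
# priority are served in order of occurrence.

import heapq

class Scheduler:
    def __init__(self):
        self._heap = []
        self._arrival = 0   # monotonically increasing occurrence stamp

    def request(self, source, priority):
        # heapq is a min-heap, so the priority is negated; the arrival
        # counter provides the FIFO tiebreak for equal priorities.
        heapq.heappush(self._heap, (-priority, self._arrival, source))
        self._arrival += 1

    def serve(self):
        """Return the source of the next interrupt to be served."""
        return heapq.heappop(self._heap)[2]
```

If source A requests with priority 1 and then B and C both request with priority 3, the serve order is B, C, A: priority first, occurrence time as tiebreak.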

3.4. Low latency

As pointed out in the introduction, an important requirement for efficient interrupt handling is low latency.

76

Page 209: Scalable DSP Core Architecture Addressing Compiler ...edu.cs.tut.fi/panis483.pdf · Christian Panis Scalable DSP Core Architecture Addressing Compiler Requirements ... my time in

The latency is influenced by the synchronization logic and by core restrictions. The proposed DSP core restricts the handling of ISRs only during the execution of branch delays (for the five-stage pipeline configuration, two branch delays are necessary).

3.5. Priority Morphing

Similar concepts are known from microprocessor architectures, where the priority changes on a time basis with wrap-around: starting at a low priority, the priority of an interrupt source is increased after each predefined time interval. If it reaches the highest priority without being served, the process starts again at the lowest priority.

Microcontroller architectures aim at minimizing the average execution time, whereas the real-time requirements of DSP algorithms require minimizing the worst-case execution time [6]. Therefore the feature has to be adapted accordingly.

Enabling the handling of ISRs during the execution of code sequences influences the execution time. One possibility to overcome this problem during the execution of real-time-critical code sections is to disable interrupt sources. Another possibility is to execute these code sections in interrupt service routines with a high priority.

Figure 4 illustrates a possible scenario with a DSP core and several co-processors handling specific functions. Each of the co-processors is controlled by the DSP core, and each co-processor can request a communication channel with the DSP core by using an interrupt line. Data transfer between co-processors can be initiated without interaction of the DSP core, which prevents the core from becoming a bottleneck in the data flow. In Figure 4 an external interrupt source (e.g. a relay on a board), generating interrupts at a low frequency, is indicated by an arrow. The interrupt source has a low priority, because the relay works at a much lower frequency than the core or the co-processors.

Figure 4: application example

But if the interrupt request is not served before the next one of the same interrupt source arrives, an interrupt is lost. To overcome this problem, the above-mentioned priority morphing can be used: before information is lost, the priority level of the source is increased and the interrupt request is served earlier. No automatic wrap-around of the priority is supported.

During run-time the assigned priority level can be changed, i.e. increased or decreased. The change can be driven by a predefined time base (which can itself be changed during run-time). For example, each millisecond the priority of the external interrupt in Figure 4 is increased by one.
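As an illustration, the time-driven morphing described above can be modeled in a few lines. This is a behavioral sketch with assumed parameter names, not part of the xICU implementation; note that, as stated above, the priority saturates at the highest level instead of wrapping around.

```python
# Sketch of xICU-style time-based priority morphing (illustrative model,
# not the actual RTL): every `time_base` ticks a pending source's priority
# is raised by one until it is served; there is no automatic wrap-around.

def morph_priority(start_prio, max_prio, time_base, ticks_pending):
    """Priority of a still-pending source after `ticks_pending` ticks."""
    return min(max_prio, start_prio + ticks_pending // time_base)

# A low-priority source (level 1 of 7) pending for 3500 ticks with a
# 1000-tick time base has been promoted three levels:
assert morph_priority(1, 7, 1000, 3500) == 4
```

With a one-millisecond time base this reproduces the relay example: each elapsed millisecond adds one priority level until the request is served.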

The priority level can also be changed by the source itself in order to be served earlier. This "self-tuning" can be useful if the interrupt source indicates that it is losing data or wants to activate a data transfer with another co-processor as mentioned before. Because this feature influences the program flow, it has to be explicitly activated by the DSP core.

Finally, the DSP core itself can change the priority of an interrupt source; an increased priority level leads to an earlier execution of the related ISR.

4. xICU Architecture

This section introduces the architecture of xICU. The first part gives an overview of the structure; the second part discusses topics like interfacing to the DSP core and data consistency aspects.

4.1. Overview

Figure 5 gives an overview of the xICU architecture. Four main blocks can be identified: the synchronization unit, the scheduling unit, the interrupt setup unit and the feedback unit. The configuration parameters for each source can be assigned via pin-strapping and are also software programmable; the default after reset is pin-strapping.

Figure 5: xICU Overview

• Synchronization unit
The synchronization unit is used to synchronize the asynchronous interrupt sources. Each of the interrupt sources can be edge or level triggered (stored in a configuration register). For each interrupt source only one interrupt can be pending; the next interrupt of the same source is accepted after the already pending one has been served. To reduce interrupt latency it is optionally possible to use both clock edges for synchronization.

• Scheduling unit
The scheduling unit is used to prioritize pending interrupt requests. The interrupt with the highest priority is served first. If two pending interrupts are assigned the same priority, the time of occurrence decides the execution order: first in, first out.
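A minimal model of this selection rule, assuming each pending request carries a priority and an arrival timestamp (masking, discussed below, is ignored in the sketch):

```python
# Illustrative model of the scheduling unit: among pending requests the
# highest priority wins; equal priorities are served first-in, first-out
# using the time of occurrence as a tie-breaker.

def pick_next(pending):
    """pending: list of (source_id, priority, arrival_time); returns source_id."""
    best = max(pending, key=lambda r: (r[1], -r[2]))  # high prio, early arrival
    return best[0]

pending = [(0, 3, 100), (1, 5, 250), (2, 5, 180)]  # two sources at priority 5
assert pick_next(pending) == 2  # same priority: earlier arrival (180) served first
```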

As mentioned in Section 3, xICU supports priority morphing: the priority of an interrupt source can be changed during run-time by the source itself, by the core, or on a defined time base. After reset priority morphing is deactivated and the priority level indicated by the pin-strapping is used. After setting the related configuration register to the morphing mode (time, source, master), a separate bit is used to activate this feature (for each interrupt source separately).

This unit is also responsible for the masking information. Masked interrupts are not considered for scheduling as long as the mask bit is set. After reset all mask bits are set.

• Interrupt Setup Unit
The chosen interrupt source has to be translated into an interrupt vector address. There are no predefined memory addresses for interrupt vectors; the address space of the interrupt vector table is configurable.

An interrupt is handled like a call of a subroutine with an externally provided address. Therefore xICU has to provide the start address of the ISR and an interrupt request signal, which is executed like a branch-subroutine instruction. Due to core restrictions the branch delays of the "external" branch subroutine cannot be used (two clock cycles for the five-stage pipeline). For the same reason, no further interrupt requests (e.g. by an NMI) can be served during the execution of branch delays.

Figure 6: Control Signals

The interrupt setup unit is also responsible for nested interrupts. The scaleable core architecture supports any nesting level. The acknowledge signals are assigned to the related interrupt source.

• Feedback Unit
The feedback unit is responsible for gathering feedback from the other units and from the core and for providing control signals to the interrupt sources. Some of the signals are illustrated in Figure 6, e.g. the ack feedback signal, indicating that the related interrupt has been served, or the error signal, indicating that an interrupt is pending while another interrupt of the same source has been activated.

4.2. Core Integration

A DSP core using xICU enables smooth handling of interrupts with low latency (about 4 core clock cycles, depending on the chosen synchronization mode and the chosen pipeline structure). The core handles an ISR like a branch subroutine with an externally provided branch address.

The DSP core supports split execution, which means that more than one clock cycle can be used to execute instructions. To increase the usage of the available hardware resources the pipeline is visible, which leads to data consistency problems during the execution of interrupt service routines. A solution is illustrated in [7].

Due to the support of nested interrupts, no shadow registers as in the OAK DSP core are available. A complete software task switch (which is quite often not necessary for an ISR) consumes about ten clock cycles.

5. Results

The described interrupt control unit (xICU) has been implemented in VHDL-RTL. The main architectural features influencing the consumed silicon area, such as the number of interrupt sources, the size of the control registers and the supported interrupt priorities, are configurable. The timing-critical part is the interface to the DSP core (running at core frequency); therefore the interface is decoupled from the remaining implementation.

6. Conclusion

This paper describes the architecture of xICU, a scaleable interrupt control unit. Compared with available ICUs, the main architectural features can be configured to meet application-specific requirements. Priority morphing enables changing the execution order of the interrupt service routines during run-time and prevents loss of interrupts due to starvation. xICU is part of a project for a configurable DSP core.

7. References

[1] P.Lapsley, J.Bier, A.Shoham and E.A.Lee, DSP Processor Fundamentals, Architectures and Features, IEEE Press, New York, 1997.

[2] D.Sima, T.Fountain, P.Kacsuk, Advanced Computer Architectures: A Design Space Approach, Addison Wesley Publishing Company, Harlow, 1997.

[3] Siemens, OAK DSP Core, Programmer's Reference Manual, Siemens AG, Munich, 1998.

[4] Motorola, SC140 DSP Core Reference Manual, Motorola, Rev.0, 1999.

[5] Motorola, CPM Interrupt Controller, Motorola, 2002.

[6] J. L. Hennessy, D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, San Mateo, CA, 1996.

[7] C.Panis, R.Leitner, J.Nurmi, “Scaleable Shadow Stack for a Configurable DSP Concept”, IWSOC 2003, Calgary, Canada, June 2003, pp 222-227.


PUBLICATION 10 C. Panis, G. Laure, W. Lazian, A. Krall, H. Grünbacher, J. Nurmi, "DSPxPlore – Design Space Exploration for a Configurable DSP Core", in Proceedings of the International Signal Processing Conference (GSPx), Dallas, Texas, USA, March 31 – April 3, 2003, CD-ROM.

©2003 Global Technology Conferences. Reprinted, with permission, from proceedings of the International Signal Processing Conference.


1 Europastrasse …, A-… Villach, Austria; 2 Siemensstrasse …, A-… Villach, Austria; 3 Argentinerstrasse …, A-… Vienna, Austria; 4 P.O. Box …, FIN-… Tampere, Finland

DSPxPlore – Design Space Exploration for a Configurable DSP Core

Christian Panis1 Carinthian Tech Institute +43 4242 90500 2124

[email protected]

Herbert Grünbacher1 Carinthian Tech Institute +43 4242 90500 2100

[email protected]

Gunther Laure2 Infineon Technologies

+43 4242 305 0

[email protected]

Andreas Krall3 Vienna University of Technology

+43 1 58801 18511

[email protected]

Wolfgang Lazian2 Infineon Technologies

+43 4242 305 0

[email protected]

Jari Nurmi4 Tampere University of Technology

+358 331153884

[email protected]

ABSTRACT
SOC (System-on-Chip) applications map complex system functions onto a single die. To cover the increasing importance of flexibility in SOC applications, a growing portion will be implemented in software. Therefore the importance of embedded processors like microcontrollers, protocol processors and DSPs is increasing.

But which one is the right core to cover the demands of a certain application? Most of the time this decision is made by the most experienced engineers, mainly focusing on the aspects "what is already available?" and "what has already been proven in silicon?". Due to the different requirements of different applications, one core cannot fit optimally everywhere, and this quite often leads to solutions with overhead concerning silicon area and power consumption. In the price-critical consumer IC market this can be crucial for a company's market position and revenues.

DSPxPlore is a design space exploration method for an embedded configurable DSP processor, which can be adapted to application-specific requirements. The major pillars of DSPxPlore are an optimizing C compiler and an instruction set simulator, based on a configurable component framework and on a configurable DSP core architecture.

DSPxPlore enables the evaluation of architectural aspects of a DSP core and their influence on the overall system performance at an early stage of the project. This can be used for a better utilization of the silicon area and to reduce the power dissipation of the application. DSPxPlore is part of a development project for a configurable DSP concept.

Categories and Subject Descriptors: … SOC, IP Design

General Terms: Performance, Design, Experimentation

Keywords: DSP, Design space exploration, Configurable Core, DSPxPlore

1. Introduction
The increasing importance of SW leads to an increasing importance of embedded processors like micro-controllers, protocol processors and DSPs. Choosing the right one is quite difficult. Each of the core vendors claims to have the best one, but the best for "what"? The question has to be "which is the best core for my application requirements?". The SOC requirements concerning silicon area and power dissipation have led in recent years to more application-specific core architectures and also to configurable and application-adaptable embedded cores.

An application-specific adaptable core, e.g. a DSP core, makes it possible to optimize the consumed silicon area and power dissipation of the DSP subsystem and therefore to design competitive SOC products. To make use of the available flexibility and to choose the right HW/SW partitioning, which is becoming one of the key issues in making successful products, the requirements of the application have to be understood at an early stage of the product design cycle.

This paper introduces DSPxPlore, which should help to optimize a configurable DSP core architecture for the specific requirements of a certain application, focusing on the main issues of silicon area and power dissipation. DSPxPlore is based on an evaluation C compiler and a re-configurable instruction set simulator (ISS) built on a component framework.

In the first section the configurable DSP architecture is shortly introduced. The second part focuses on the exploration parameters and their influence on silicon area and power dissipation. The third part introduces DSPxPlore and explains the analysis results of the evaluation tool chain.

2. DSP Architecture
This section shortly introduces the chosen DSP architecture. The changing requirements on DSP subsystems in the area of SOC applications lead to a strong demand for a high-level language entry like C or Java (the manual coding effort for large application codes is increasing exponentially). But the requirements concerning silicon area consumption and power dissipation do not allow overhead for code automatically generated by a C compiler. Therefore, the architecture definition has been influenced by the requirements of an efficient C compiler (efficient in the sense that the overhead compared with manual assembly coding stays below …%).

$��� "( (���7KH� ���ELW� IL[HG� SRLQW�� PRGLILHG� 'XDO� +DUYDUG� DUFKLWHFWXUH� KDV�

WZR� LQGHSHQGHQW� GDWD�PHPRU\�EXVVHV� �H�J�����ELW�ZLGH�� >�@��7KH�

LQVWUXFWLRQ�ZLOO�JHW� WKH�VRXUFH�RSHUDQGV�IURP�WKH�UHJLVWHU�ILOH�DQG�

WKH� GDWD� PRYHV� EHWZHHQ� WKH� UHJLVWHU� ILOH� DQG� GDWD� PHPRU\� DUH�

H[SOLFLWO\� FRGHG�DV� VHSDUDWH� LQVWUXFWLRQV��ORDG�VWRUH�DUFKLWHFWXUH���

7KH�5,6&�OLNH�SLSHOLQH�FRQVLVWV�RI�WKUHH�SKDVHV��LQVWUXFWLRQ�IHWFK��

LQVWUXFWLRQ�GHFRGH�DQG� LQVWUXFWLRQ�H[HFXWLRQ���ZKLFK�FDQ�EH�VSOLW�

RQWR�VHYHUDO�FORFN�F\FOHV�WR�LQFUHDVH�WKH�UHDFKDEOH�IUHTXHQF\��$Q�

H[WHQVLYH� VHW� RI�SUHGLFDWHG�H[HFXWLRQ� IHDWXUHV�� D� OHDQ� WDVN� VZLWFK�

VXSSRUW� DQG� QR� OLPLWV� RQ� WKH� QHVWLQJ� OHYHO� RI� LQWHUUXSWV� DOORZV�

HIILFLHQW�KDQGOLQJ�RI�FRQWURO�FRGH�VHFWLRQV��

Figure 1: Core Architecture Overview (program memory and instruction buffer, register files, execution units, and data memory with ports A and B)

The requirements of the evaluation C compiler, an orthogonal instruction set and support for efficient stack-frame addressing, have been considered […]. The large uniform register sets, simple issue rules and the avoidance of mode-dependent instructions enable efficient automatically generated machine code.

An instruction can consist of one or two instruction words. The second word, also called parallel word, is used to keep long offsets, immediate values or far branch targets […]. The native instruction size is 20 bit, which enables coding the whole instruction set in short instructions and using the long word only for constants. In subsection 3.4 the possibility of changing the native size and its influence on the system performance is pointed out. The execution bundle is built up of native instruction words; its size is only limited by the number of available data paths (scaleable long instruction word).

Besides the aspect of an architecture-independent description of the application, and therefore the request for a high-level language entry, the aspects of silicon area and power dissipation have been considered during the architecture definition. When analyzing a DSP subsystem, these two parameters are mainly influenced by the memory subsystems. A high code density and as few memory accesses as possible are targeted to obtain good results. Besides unaligned program memory, SIMD support and an adaptable instruction coding, a lean core architecture has been chosen. This also supports the development of the C compiler. A scaleable instruction buffer for inner loops and an optimized hardware implementation of the core architecture additionally reduce the power dissipation.

3. Exploration Parameters
This section explains the exploration parameters available for the DSP architecture described in Section 2, and discusses their influence on the overall system performance (cycle count), silicon area and power dissipation of the DSP subsystem. The following exploration parameters are available:

• Register file
• Number/kind of parallel execution units
• Memory bandwidth (data/program)
• Instruction size/binary encoding
• Pipeline stages

The influence of the core itself on the consumed silicon area is quite low compared with the influence of the memory subsystem. But for low-cost applications even fractions of a mm² of additional area are already critical, and therefore the influence on the core area is pointed out as well.

3.1. Register File
For a load-store architecture the register file plays an important role: each instruction using operands has its values stored inside the register file. As a rule of thumb, a substantial share of the core area is consumed by the register file.

The advantage of operating on values in the register file is that intermediate results do not have to be stored back into the data memory and fetched again for a consecutive operation, which reduces the power dissipation at the memory ports. However, the size of the register file has to fit the requirements of the application code. Increasing the size of the register file reduces the activity at the data memory ports, but the entries have to be encoded into the instructions. This influences the size of the instruction words and therefore the code density. The number of entries of the register file also influences the size of the crossbar at the read/write ports and therefore the reachable core frequency. To overcome these problems the register file quite often is clustered […]. Unfortunately, clustered register files restrict the compiler during instruction scheduling and require additional instructions to transfer values from one cluster to the next. In the proposed DSP concept of Section 2 all entries of the register files are handled equally and clustering has not been taken into account (data and address values are anyway stored in different register files).

Using a register file with few entries can lead to a waste of clock cycles and an increase of the program memory. If all available entries are already in use and further space for intermediate results is required, register file content has to be spilled to the data memory. The spill code consumes cycles, instruction words (and therefore influences the code density of the application) and additional data memory.
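A crude model of the resulting spill overhead, under the simplifying assumption that every simultaneously live value beyond the register-file capacity costs one store and one later reload (real allocators spill per live range, so this is only an upper-bound sketch):

```python
# Simplified spill model for a load-store architecture: values that do
# not fit into the register file are stored to data memory and reloaded.

def spill_ops(live_values, register_entries):
    """Extra memory instructions caused by register pressure."""
    excess = max(0, live_values - register_entries)
    return 2 * excess  # one store + one reload per spilled value

assert spill_ops(10, 8) == 4  # two values spilled: 2 stores + 2 reloads
assert spill_ops(6, 8) == 0   # everything fits: no spill code
```

Each of these extra instructions costs cycles, program memory and data memory, which is exactly the code-density penalty described above.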

3.2. Number/Kind of Parallel Execution Units
VLIW DSP architectures support the execution of more than one instruction per cycle. These instructions can be used well for filter operations (as can be seen in the rich set of benchmarks available from various core vendors). In low-cost applications, where not more than one core is feasible, number crunching like filter operations is quite often done in dedicated HW, and control code and configuration code dominate the code executed on the DSP. These applications cannot make use of the parallel available units.

Supporting several instructions in parallel influences the core size. Besides the units themselves, the decoder structure has to be increased and the number of read/write ports of the register file rises. Additional instructions have to be fetched and the size of the program memory port has to be increased.

Before adding an additional unit to the core architecture, e.g. an additional MAC (Multiply-Accumulate) unit, the benefit for the application has to be considered quite carefully. Analyzing the full application code can lead to different requirements concerning the number of necessary parallel data paths compared to focusing only on filter operation benchmarks.

3.3. Memory Bandwidth (Data/Program)
The usage of the available system resources is influenced by the available memory bandwidth (data and program). Adding additional computational units to the DSP subsystem without providing the possibility to fetch the necessary operands leads to unused system performance.

Having several data memory ports (e.g. supporting several independent addresses) increases the performance for algorithms like the FFT. On the other hand, the influence on the core size is the need for additional AGUs (Address Generation Units) and the necessary routing to the data memory. Assuming crossbars at the memory subsystem to reduce the probability of memory collisions (more than one address pointing to the same address space), these crossbars have to grow. The data memory access is time-critical, and therefore increasing the crossbar influences the reachable core frequency.

Figure 2: VLIW architecture (execution slots Unit 1 to Unit 5 over time, filled with instructions n to n+8; unused slots are wasted)

VLIW architectures provide the execution of several instructions in parallel. If not all of the units are used each cycle, memory space to store the long instruction word would be wasted (see Figure 2). The proposed DSP concept supports a constant fetch bundle (e.g. 4 instruction words) and a scaleable execution bundle (e.g. between 1 and 10 instruction words in parallel). To prevent stall cycles an instruction buffer is included to compensate the memory bandwidth mismatch between fetch bundle and execution bundle. The relationship between the fetch bundle size and the execution bundle size influences the system performance: stall cycles are necessary if the fetch bundle size is too small for the requirements of the application. Having a wider program memory port leads to additional routing effort to the program memory, with influence on the silicon area.

Figure 3: Instruction Buffer (constant fetch bundle, e.g. 4 instr.; scaleable execution bundle, e.g. 1 up to 10 instr.)

The size of the instruction buffer in Figure 3 influences the overall system performance for inner loops […]. Once fetched, a loop body causes no additional power dissipation at the program memory port. If the buffer is too small to store the loop body, the advantage of the instruction buffer is lost. Increasing the number of entries influences the core size and the reachable core frequency.
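The trade-off can be sketched with a simple fetch-count model; the buffer and loop sizes below are example values, and the 4-word fetch bundle follows the figure:

```python
# Simplified model: a loop body that fits into the instruction buffer is
# fetched from program memory only once, otherwise it is re-fetched on
# every iteration (fetch-bundle width of 4 words as in Figure 3).

def program_memory_fetches(body_words, buffer_words, iterations, fetch_bundle=4):
    bundles = -(-body_words // fetch_bundle)  # ceiling division
    if body_words <= buffer_words:
        return bundles               # fetched once, then served from the buffer
    return bundles * iterations      # too large: re-fetched every iteration

assert program_memory_fetches(16, 32, 100) == 4     # fits: 4 fetch bundles total
assert program_memory_fetches(48, 32, 100) == 1200  # too large: 12 bundles x 100
```

The jump from 4 to 1200 program-memory accesses illustrates why losing the buffer advantage for a hot inner loop is so costly in power.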

3.4. Instruction Size/Binary Encoding
A high code density is mandatory for an embedded DSP processor. But high code density has to be considered on the application level, which does not allow hiding problems in the microarchitecture. Frequently used instructions have to be coded more efficiently; unused instructions can even be removed.

The native instruction size of the proposed DSP concept in Section 2 is 20 bit. A parallel word is used for long immediate values, far branch targets or long offsets. The arithmetic instructions can be used with three different operands. Executing control code, as in the example of Figure 4, does not make use of the provided three-operand arithmetic instructions. For the code example in Figure 4 the 16/32-bit instruction set, where e.g. the parallel word is used for encoding the three-operand instructions, increases the code density by about 15% (1850 vs. 2185 bytes). The smaller native instruction word is an advantage for the code of this example.

Figure 4: Instruction Size (16/32-bit coding: 710 instructions, 215 long instructions, 1850 bytes, 211 delay nops; 20/40-bit coding: 710 instructions, 164 long instructions, 2185 bytes, 211 delay nops)

The power dissipation at the program memory ports is caused by the switching activity between "0" and "1" and vice versa. Reducing the switching activity, e.g. by reordering the instructions inside the execution bundle (which has no influence on the scheduling) or by changing the binary coding of the instruction set, has significant influence on the power dissipation of the DSP subsystem.
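The switching activity can be quantified as the Hamming distance between consecutively fetched instruction words, as in this sketch (a simple bit-toggle count, not a calibrated power model):

```python
# Toggle activity at the program memory port: the number of bit flips
# between consecutive instruction words. Encodings or orderings that
# lower this sum lower the dynamic power at the port.

def toggle_count(words):
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

assert toggle_count([0x0000, 0xFFFF, 0xFFFF]) == 16  # full swing, then no flips
assert toggle_count([0x00FF, 0x00FE]) == 1           # nearly identical codings
```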

3.5. Pipeline Stages
The number of pipeline stages influences the reachable clock frequency of the DSP subsystem. Splitting complex operations or timing-critical functions like memory access enables a higher clock frequency. From a system point of view the result can nevertheless be misleading. Assuming a pipeline structure as described in Section 2, with instruction fetch, instruction decode and instruction execute, a higher clock frequency can be reached when splitting these phases over several clock cycles.

Increasing the number of clock cycles for the instruction fetch phase increases the number of branch delays. In control code sections, or in systems with a lot of interrupt service routines, the frequency gain can be compensated by the unusable branch delays. Predicated execution can be used for small branch target distances, but it influences the code density of the application. Branch prediction mechanisms cannot be taken into account due to the missing predictability: the system has to be designed with worst-case parameters, and therefore a prediction has to be assumed as "not taken".

Splitting the execution phase increases the load-in-use dependency and the define-in-use dependency […]. Data forwarding circuits increase the silicon area and the design complexity.
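A back-of-the-envelope model of this trade-off; all cycle counts, branch frequencies, clock rates and the unfilled-delay ratio below are illustrative assumptions, not measured values:

```python
# Illustrative trade-off for a deeper fetch pipeline: higher clock
# frequency, but more branch-delay cycles per taken branch.

def runtime_us(instr, taken_branches, delay_slots, unfilled_ratio, freq_mhz):
    """Execution time in microseconds under a simple cycle model."""
    cycles = instr + taken_branches * delay_slots * unfilled_ratio
    return cycles / freq_mhz

# Very branchy control code, half of the delay slots unfillable:
short_pipe = runtime_us(10_000, 4_000, 1, 0.5, 100)  # lean fetch, 100 MHz
deep_pipe  = runtime_us(10_000, 4_000, 3, 0.5, 130)  # deeper fetch, 130 MHz
assert deep_pipe > short_pipe  # the 30% frequency gain is eaten by branch delays
```

For filter-dominated code with few branches the comparison flips, which is why the parameter has to be explored per application.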

4. DSPxPlore
This section introduces the flow of using DSPxPlore and discusses the results of the static and dynamic analysis. DSPxPlore supports the system designer in choosing a core architecture fitting the application. It can also be used to find an optimal HW/SW partitioning for a certain application, or to decide which parts of the application are well suited for a HW co-processor.

4.1. DSPxPlore Flow
The basis of DSPxPlore is an evaluation C compiler and a re-configurable instruction set simulator based on a component framework. Starting with the application described in C (DSP-C is preferred due to its support of fractional data types), the evaluation C compiler is used to map the application to a certain core architecture. The features of the chosen core architecture are described in an XML-based configuration file. The available exploration parameters of the DSP architecture introduced in Section 2 have been discussed in Section 3. Figure 5 illustrates the flow of DSPxPlore: the application C code is compiled with the evaluation C compiler and then simulated with the configurable ISS (Instruction Set Simulator). The result of the C compiler is an assembler description of the application and the static analysis results, which are discussed in detail in Section 4.2. The dynamic analysis results are generated by simulations with the ISS and are described in Section 4.3. The assessment results can be used to adapt the core architecture (via the configuration file) and to restart the analysis process.

Figure 5: DSPxPlore Flow (application code in C (DSP-C) → evaluation C compiler → .asm and static analysis results; ISS → dynamic analysis results; both driven by the core architecture configuration (xml))

4.2. Static Analysis
To obtain the static analysis results the evaluation C compiler is used. The automatically generated results have to be accurate compared with manual coding (less than …% overhead). Using a C compiler with an overhead of several times compared with manually optimized code, which is quite usual for today's available C compilers, would distort the results. This section introduces some static results generated by DSPxPlore to quantify the core architecture.

4.2.1. Code Size
As already pointed out in Section 3, the memories dominate the silicon area of the DSP subsystem. Therefore a high code density is mandatory for designing a successful product. The code size value provides an indication of the program memory necessary to map the application onto the chosen instruction set (and instruction coding). Figure 6 illustrates a part of the code size analysis results. The number of instructions sums up all instructions independent of their word length; the number of long instructions counts the instructions using a parallel word. (Offsets for address generation and branch targets are available at link time; therefore a linker feedback is part of the compiler backend optimizations.) Weighting the different instruction lengths and normalizing them yields the result bytes, a single number for the code size. The proposed DSP concept supports unaligned program memory, so the result value bytes is equal to the system code size, with no additional memory effort caused by the micro-architecture. The chosen DSP architecture supports delayed and non-delayed branch instructions. If the instruction and data dependencies of the application code allow filling the branch delays, the system performance can be increased and the code density improved (no NOPs necessary). If the branch delays cannot be used, a non-delayed branch instruction improves the code density (no NOPs necessary; the cycles are wasted anyway). Due to the lean pipeline structure, and therefore few branch delays, the use of branch prediction mechanisms has been neglected. The problem of prediction has already been mentioned in Section 3.5.

Figure 6: Code Size (number of instructions: 827, long instructions: 70, bytes: 2242.5, delay nops: 42)
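The reported byte counts above are consistent with a simple weighting: each short instruction is one native word and each long instruction (with parallel word) is two. The 2.5-byte native word used here corresponds to the 20-bit coding; the instruction-size comparison uses 2-byte (16/32) and 2.5-byte (20/40) words:

```python
# Reproduce the "bytes" value of the code-size analysis by weighting
# short (one native word) and long (two native words) instructions.

def code_size_bytes(instructions, long_instructions, native_bytes):
    short = instructions - long_instructions
    return short * native_bytes + long_instructions * 2 * native_bytes

assert code_size_bytes(710, 215, 2.0) == 1850    # 16/32-bit coding example
assert code_size_bytes(710, 164, 2.5) == 2185.0  # 20/40-bit coding example
assert code_size_bytes(827, 70, 2.5) == 2242.5   # code-size analysis result
```

That all three figures fall out of the same formula supports the paper's point that bytes reflects the real system code size, with no alignment overhead added by the micro-architecture.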

4.2.2. Parallelism
The parallelism result value gives an indication of how well the application code can use the provided parallel units. Due to data dependencies in the application code the parallelism can be quite small (especially in control code sections). The parallelism is analyzed on the level of execution bundles. The application code used for the example in Figure 7 is control code. The chosen DSP architecture enables the execution of 5 instructions in parallel (2 memory operations, 2 arithmetic operations, 1 program flow operation). The analysis results in Figure 7 give a first indication of the efficient usage of the parallel units.

bundles with 1 instruction: 255
bundles with 2 instructions: 105
bundles with 3 instructions: 52
bundles with 4 instructions: 5
bundles with 5 instructions: 0

Figure: Bundle Analysis

To understand the limits of the chosen architecture for the application code, a more detailed analysis is necessary. For the parallelism results below, the same control code example as for the bundle analysis has been used. The execution bundles are split into different categories, each consisting of up to five instructions but mapped to different data paths.

Nop (incl. delay fill NOPs)     80   19.2 %
MemX                            82   19.7 %
MemY                             8    1.9 %
MemX.MemY                       10    2.4 %
ALU1                            48   11.5 %
MemX.ALU1                       22    5.3 %
ALU1.ALU2                       15    3.6 %
MemX.ALU1.ALU2                   4    1.0 %
MemX.MemY.ALU1.ALU2              2    0.5 %
BrUnit                          37    8.9 %
MemX.BrUnit                     26    6.2 %
MemX.MemY.BrUnit                 1    0.2 %
ALU1.BrUnit                     32    7.7 %
MemX.ALU1.BrUnit                18    4.3 %
ALU1.ALU2.BrUnit                29    6.9 %
MemX.ALU1.ALU2.BrUnit            2    0.5 %
MemY.ALU1.ALU2.BrUnit            1    0.2 %

Figure: Parallelism
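The per-combination counts in the parallelism figure can be folded into a per-unit utilization, i.e. in how many of the 417 execution bundles each data path is occupied. A minimal sketch using the counts from the figure:

```python
# Aggregate per-unit usage from the per-combination bundle counts above.
bundle_counts = {
    ("Nop",): 80, ("MemX",): 82, ("MemY",): 8, ("MemX", "MemY"): 10,
    ("ALU1",): 48, ("MemX", "ALU1"): 22, ("ALU1", "ALU2"): 15,
    ("MemX", "ALU1", "ALU2"): 4, ("MemX", "MemY", "ALU1", "ALU2"): 2,
    ("BrUnit",): 37, ("MemX", "BrUnit"): 26, ("MemX", "MemY", "BrUnit"): 1,
    ("ALU1", "BrUnit"): 32, ("MemX", "ALU1", "BrUnit"): 18,
    ("ALU1", "ALU2", "BrUnit"): 29, ("MemX", "ALU1", "ALU2", "BrUnit"): 2,
    ("MemY", "ALU1", "ALU2", "BrUnit"): 1,
}

usage = {}
for units, count in bundle_counts.items():
    for unit in units:
        usage[unit] = usage.get(unit, 0) + count

total = sum(bundle_counts.values())  # 417 bundles
for unit in ("MemX", "MemY", "ALU1", "ALU2", "BrUnit"):
    print(f"{unit}: {usage[unit]} bundles ({100 * usage[unit] / total:.1f} %)")
```

The second memory port (MemY) and the second ALU turn out to be occupied in only a small fraction of the bundles of this control-code example, which is the raw input for the data-path discussion below.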

The parallelism results above are available for each basic block of the C code, which allows a fine-grain analysis of the application. The static analysis results have to be weighted by the results of the dynamic analysis. If the static analysis indicates that only one or two execution bundles can make use of the available parallel units, it would seem feasible to remove some of the data paths. But if these bundles are part of inner loops in which the application code spends most of its execution time, removing the units will decrease the system performance significantly.
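The weighting step can be sketched as follows; all numbers here are invented for illustration and are not measurements from the thesis:

```python
# Illustrative sketch of weighting static results by dynamic ones: static
# per-bundle parallelism is weighted by the dynamic execution frequency of
# each bundle, so that wide bundles inside hot inner loops dominate the
# decision about the number of data paths.
bundles = [
    # (static width in instructions, dynamic execution count) - invented
    (4, 5000),   # inner-loop bundle, executed very often
    (1, 120),    # control-code bundle
    (2, 80),
]

total_executed = sum(count for _, count in bundles)
weighted_width = sum(width * count for width, count in bundles) / total_executed
print(round(weighted_width, 2))  # close to 4: removing data paths would hurt
```

Statically, two of the three bundles look narrow; dynamically, the 4-wide inner-loop bundle dominates, so the wide data-path configuration is justified.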

Instruction Usage

A list of the instructions used to map the application code to the chosen core architecture enables fine-tuning of the instruction set. Frequently used instructions can be encoded more efficiently to increase the code density. For low-cost applications, unused instructions can even be removed to squeeze the instruction coding and to reduce the number of bits necessary for encoding the instruction set.
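A minimal sketch of this fine-tuning idea; the instruction names and counts are invented placeholders, not data from the thesis:

```python
# Rank instructions by static usage so that frequent ones can be given the
# densest opcodes, and identify unused ones that a low-cost variant may drop.
usage = {"move": 310, "add": 190, "mac": 120, "bitrev": 2, "div_step": 0}

ranked = sorted(usage.items(), key=lambda item: item[1], reverse=True)
short_encoding = [name for name, count in ranked[:3]]          # densest opcodes
removable = [name for name, count in usage.items() if count == 0]

print(short_encoding)  # ['move', 'add', 'mac']
print(removable)       # ['div_step'] could be dropped for a low-cost variant
```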

Immediate Values

The size of the immediate values can already be identified by static analysis. This result indicates whether the available coding space inside the instruction words fits the application requirements, or whether a lot of parallel words are necessary to code the immediate values. The value range of the immediate values influences the minimum encoding space of the native instruction word.
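A small sketch of this check; the 8-bit short-immediate field is an assumed example size, not a figure from the thesis:

```python
# Compute how many bits each immediate needs, to check whether it fits the
# coding space of the native instruction word or forces an additional
# parallel word for a long immediate.
IMMEDIATE_FIELD_BITS = 8    # assumed short-immediate field in the native word

def bits_needed(value):
    """Minimum two's-complement width for a signed immediate."""
    if value >= 0:
        return value.bit_length() + 1   # +1 for the sign bit
    return (-value - 1).bit_length() + 1

immediates = [3, -1, 127, 128, -4096]
needs_parallel_word = [v for v in immediates if bits_needed(v) > IMMEDIATE_FIELD_BITS]
print(needs_parallel_word)  # [128, -4096] need the long-immediate parallel word
```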

NOP Analysis

The most useless operation running on an embedded DSP core is the NOP (no operation), because during this time the DSP does not contribute anything to the system performance. For various reasons it cannot be completely omitted (although during this time the consumed power dissipation is normally quite low). Examples of useful NOPs are NOPs for unusable branch delays, or NOPs used to align branch targets or loop bodies. That can be necessary to prevent stall cycles if the branch target is not part of the instruction buffer and the execution bundle is spread over several fetch bundles.

NOP

Figure: NOP for Branch Target Alignment

In this figure the execution bundle of the branch target is split over two fetch bundles. If the second part of the fetch bundle is not part of the instruction buffer, a stall cycle is necessary. Aligning the branch target by adding a NOP will increase the system performance at a reasonable overhead in system code size.

There are several reasons (mostly caused by the micro-architecture of the DSP core) for NOPs. The static analysis gives an indication of why each NOP was introduced. As already mentioned, it is necessary to weight these results by the results of the dynamic analysis to prevent local optimizations that degrade the overall system performance. If the branch target of this example is part of a while loop, executed quite frequently, and the loop is too long for the instruction buffer, the alignment will have a significant influence on the system performance.
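The alignment itself is simple address arithmetic. A sketch, assuming word-based addressing and a fetch bundle of 4 instruction words (the concrete fetch-bundle size is an assumption):

```python
# Insert NOP words so that a branch target starts at a fetch-bundle boundary
# and the first execution bundle can be fetched in one access.
FETCH_BUNDLE_WORDS = 4   # assumed fetch-bundle size in instruction words

def nops_for_alignment(target_address_words):
    """Number of NOP words needed before the branch target."""
    misalignment = target_address_words % FETCH_BUNDLE_WORDS
    return 0 if misalignment == 0 else FETCH_BUNDLE_WORDS - misalignment

print(nops_for_alignment(8))   # 0 -> already aligned
print(nops_for_alignment(10))  # 2 NOP words pad up to the next fetch bundle
```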


Dynamic Analysis Results

To weight the results of the static analysis process, simulations are necessary. To simulate the assembler code generated by the C-Compiler, a cycle-true Instruction Set Simulator is used. To cover the flexibility available for the chosen DSP architecture, the simulator has to be configurable. The simulator used for DSPxPlore is based on a component framework which can be configured at run-time. The simulator is built up of different layers, which allows a higher simulation frequency when fewer debugging and evaluation features are needed.

This section contains some examples of dynamic analysis results and their influence on the overall system performance. The simulator supports visual interpretation of most of the analysis results, which enables a comfortable analysis of the application code.

Memory Accesses

As already mentioned, the number of memory accesses significantly influences the power dissipation of the DSP subsystem. The chosen DSP architecture uses an instruction buffer to balance the memory bandwidth between the fetch bundles (which have constant length) and the execution bundles (the size of an execution bundle depends on the number of instructions executed in parallel). Analyzing inner loop code sections can be used to determine the optimal size of the instruction buffer for a certain application.
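The buffer-sizing idea can be sketched as follows; the loop sizes are invented placeholders for the inner-loop analysis results:

```python
# Size the instruction buffer from the application's inner loops:
# power-efficient loop execution requires the whole loop body to fit.
inner_loop_sizes_words = [12, 48, 30, 20]   # loop bodies in instruction words

def next_power_of_two(n):
    """Buffers are typically sized in powers of two."""
    size = 1
    while size < n:
        size *= 2
    return size

required = max(inner_loop_sizes_words)
print(next_power_of_two(required))  # a 64-word buffer covers every inner loop
```

Choosing the smallest buffer that still holds the hottest loop bodies is the trade-off between loop-execution power savings and buffer area described in the design-space discussion.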

Unused Fetched Instructions

To prevent stall cycles due to missing instructions, the fetch counter is loosely decoupled from the program counter, so that instructions are pre-fetched during the execution of execution bundles with few instruction words. In control code sections with a higher rate of branch instructions, it can happen that a lot of instructions are fetched into the instruction buffer that will never be executed. However, fetching instructions has an impact on the power dissipation. Analyzing unused program memory makes it possible to identify these code sections. To reduce the fetch overhead, the proposed DSP core supports user/compiler driven handling of the instruction buffer content, and therefore control over the correlation between fetch counter and program counter.
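The fetch overhead itself is a simple ratio over the simulator's counters. A sketch with invented stand-in numbers:

```python
# Quantify the fetch overhead from the dynamic trace: the share of words
# fetched into the instruction buffer that were never executed.
fetched_words = 10_000    # invented simulator counter
executed_words = 8_300    # invented simulator counter

never_executed = fetched_words - executed_words
overhead = never_executed / fetched_words
print(f"{overhead:.1%} of the fetched words were never executed")
```

A high ratio in a hot code section marks it as a candidate for the user/compiler-driven instruction buffer handling mentioned above.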

Execution Frequency of Bundles

This parameter counts the execution frequency of the execution bundles. Together with the static analysis result parallelism, the frequently executed bundles can be identified. With this correlation the decision concerning the number of parallel data paths can be made.

The parameter can be graphically illustrated and therefore easily used to identify hot spots inside the application code. This analysis can be used for optimizing the HW/SW partitioning and for identifying code sections which make the use of a hardware co-processor feasible.

Executed Instructions

The execution frequency of each instruction is quantified. The result can be used to optimize the instruction set. Frequently fetched instructions can be coded more efficiently to increase the code density. Reduced fetch effort reduces the switching activity at the program memory ports and therefore influences the power dissipation of the DSP subsystem.

Stall Cycles

As already mentioned, the proposed DSP architecture has to compensate a memory bandwidth mismatch between fetch bundles and execution bundles. For this purpose an instruction buffer is used. Nevertheless, at branch targets or interrupt service routines a stall cycle can become necessary to fetch the first execution bundle (as in the branch target alignment example above). The number of stall cycles caused by fetch operations is counted, and together with the result of execution cycles per bundle, optimizations such as branch target alignment can take place.

The chosen DSP architecture has two data memory ports, which can be used independently in parallel. If both addresses point to the same physical memory block, a stall cycle is mandatory. The stall cycles initiated by the memory accesses are summed up, and together with the result of execution cycles per bundle the memory partitioning can be optimized.
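The bank-conflict condition behind these stall cycles can be sketched as follows; the 4 KiB block size is an assumed example, not the configuration of the thesis core:

```python
# Detect the stall condition described above: two parallel data memory
# accesses hitting the same physical memory block must be serialized.
BLOCK_SIZE = 4096   # assumed size of one physical memory block in bytes

def causes_stall(addr_x, addr_y):
    """True if both parallel accesses fall into the same physical block."""
    return addr_x // BLOCK_SIZE == addr_y // BLOCK_SIZE

accesses = [(0x0000, 0x1000), (0x0100, 0x0200), (0x2000, 0x3000)]
stalls = sum(1 for x, y in accesses if causes_stall(x, y))
print(stalls)  # 1 stall: 0x0100 and 0x0200 share one block
```

Counting these conflicts over a simulation run and re-partitioning hot data structures into different blocks is exactly the optimization loop described in the text.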

CONCLUSION

Choosing the "best" DSP core, or the "best configuration" if the DSP core is configurable, that meets the requirements of the application is quite tricky. DSPxPlore can be used to analyze the application's requirements at an early stage of the project and to quantify the influence of architectural decisions on the size and power dissipation of the DSP subsystem.

DSPxPlore is based on an evaluation C-Compiler and a configurable component framework, and is part of a development project for a configurable DSP core.

ACKNOWLEDGMENT

ATAIR, the Christian Doppler Forschungsgesellschaft and the Graz University of Technology have supported part of the work.

REFERENCES

[1] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, San Mateo, CA, 1990.

[2] P. Lapsley, J. Bier, A. Shoham and E. A. Lee, DSP Processor Fundamentals: Architectures and Features, IEEE Press, New York, 1997.

[3] C. Panis, A. Schilke, H. Habiger, J. Nurmi, "An Automatic Decoder Generator for a Configurable DSP Core", NORCHIP 2002, Copenhagen.

[4] Texas Instruments, TMS320C62xx Programmer's Guide.

[5] J. Handy, The Cache Memory Book, Academic Press, 1998.

[6] D. Sima, T. Fountain, P. Kacsuk, Advanced Computer Architectures: A Design Space Approach, Addison-Wesley Publishing Company, 1997.


PUBLICATION 11 C. Panis, U. Hirnschrott, G. Laure, W. Lazian, J. Nurmi, "DSPxPlore - Design Space Exploration Methodology for an Embedded DSP Core", in Proceedings of the 2004 ACM Symposium on Applied Computing (SAC 04), Nicosia, Cyprus, March 14-17, 2004, pp. 876-883.

©2004 IEEE. Reprinted, with permission, from proceedings of the 2004 ACM Symposium on Applied Computing.


DSPxPlore – Design Space Exploration Methodology for an Embedded DSP Core

ABSTRACT High mask and production costs for the newest CMOS silicon technologies increase the pressure to develop hardware platforms usable for different applications or variants of the same application. To provide flexibility for these platforms, the need for software programmable embedded processors is increasing. To close the gap in consumed silicon area and power dissipation between optimized hardware implementations and software based solutions, it is necessary to adapt the subsystem of the embedded processor to application specific requirements. DSPxPlore can be used to explore the design space of RISC based embedded core architectures. At an early stage of the project the main architectural requirements of the application code can be identified in order to meet the area and power dissipation requirements. During the development process DSPxPlore supports fine-tuning of the subsystem architecture (e.g. modifications of the binary coding of instructions). DSPxPlore is part of a development project for a configurable DSP core.

Keywords DSPxPlore, Design Space Exploration, embedded DSP

1. INTRODUCTION Decreasing feature size and increasing system complexity make it possible to map complex system functions onto one die (SoC, System-on-Chip) or into one package (SiP, System-in-a-Package). High mask and production costs for the newest silicon technologies increase the need for platform solutions, which allow the same silicon to be used for several applications. To provide the flexibility that lets platform solutions realize several applications with the same silicon, embedded software programmable cores can be used. Therefore the importance of embedded processors like microcontrollers, protocol processors and digital signal processors (DSP) is increasing.

One aspect of using dedicated hardware implementations instead of software based solutions is the degree of efficiency in terms of consumed silicon area and power dissipation. To overcome the efficiency drawbacks of software based solutions without losing the advantage of flexible platform architectures, providers of embedded core architectures offer the possibility to modify their core architectures to application specific requirements [1][2].

To make use of this additional degree of freedom, the requirements of the application have to be understood. Quite often the core decisions are made by the most experienced engineers focusing on the aspects "what is already available?" and "what has already been proven in silicon?" in order to reduce the risk. Using one core subsystem for applications with different requirements leads to suboptimal solutions concerning consumed silicon area and power consumption. In the price-critical consumer IC market this can be crucial for a company's market position and revenues.

This paper introduces DSPxPlore, a design space exploration methodology for an embedded configurable DSP processor. DSPxPlore can be used to understand the requirements of the application code on the processor architecture at an early stage of the project. During the development project DSPxPlore can be used to fine-tune the chosen architecture. The first part introduces the RISC based DSP core architecture used as the basis for DSPxPlore; the introduced methodology is not limited to this architecture. The second part discusses the design space of RISC based DSP core architectures. The influence of configuration parameters on consumed silicon area and power dissipation of the core subsystem is illustrated. The third part introduces the DSPxPlore methodology. DSPxPlore is based on an optimizing C-Compiler (about 5 to 10% overhead compared with manual assembly coding) and a cycle-true Instruction Set Simulator (ISS), based on a configurable component framework. An XML-based configuration file contains a description of the chosen core architecture and is used to configure the tool chain and to automatically update the documentation for the DSP core. The last section covers some exploration examples and gives an outlook on future work.

Christian Panis, Carinthian Tech Institute, Europastrasse 4, A-9524 Villach, Austria, +43 4242 90500 2124, [email protected]

Ulrich Hirnschrott, Vienna University of Technology, Argentinierstrasse 8, A-1040 Vienna, Austria, +43 1 58801 58520, [email protected]

Gunther Laure, Infineon Technologies Austria, Siemensstrasse 2, A-9524 Villach, Austria, +43 4242 305 0, [email protected]

Wolfgang Lazian, Infineon Technologies Austria, Siemensstrasse 2, A-9524 Villach, Austria, +43 4242 305 0, [email protected]

Jari Nurmi, Tampere University of Technology, P.O.Box 553, FIN-33101 Tampere, +358 3 3115 3884, [email protected]

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC'04, March 14-17, 2004, Nicosia, Cyprus. Copyright 2004 ACM 1-58113-812-1/03/04...$5.00

2. ARCHITECTURAL INTRODUCTION This section gives a short introduction to the DSP architecture DSPxPlore has been developed for. The main architectural features and the instruction set have been defined with low silicon area and power dissipation of the DSP subsystem in mind, and to enable the development of an optimizing C-Compiler (about 5-10% overhead compared with manual assembler coding). An example architecture has been chosen for this paper and is briefly introduced in this section.

Figure 1: Core Overview

The proposed DSP core features a modified dual-Harvard load-store architecture (an overview is illustrated in Figure 1) [3]. An independent data bus connects the program memory with the DSP core; an instruction buffer is used to execute loop constructs power-efficiently [4]. Data and program memory feature different address spaces [5]. The bit width of the ports in Figure 1 is scalable, which allows application specific adaptation of the memory bandwidth.

The core features a RISC-like 3-phase pipeline: instruction fetch, decode and execute. The three phases can be split over several clock cycles. The example architecture illustrated in Figure 2 uses five clock cycles for the three pipeline phases.

The instruction fetch phase is split over a fetch and an align clock cycle, the decode stage takes one clock cycle, and the execution phase is split over two clock cycles (EX1, EX2). Splitting a pipeline phase over several clock cycles makes it possible to reach higher clock frequencies. But additional pipeline stages in the fetch phase increase the number of branch delays, and additional clock cycles for the execution phase lead to increased load-in-use and define-in-use dependencies [6]. Therefore deeper pipeline structures can lead to a decreased overall system performance due to data and control dependencies in the application code.

Figure 2: Pipeline

The instructions are divided into three operation classes: load/store instructions, used to transfer data between the data memory and the register file; arithmetic/logic instructions, performing calculations on register values; and branch instructions, influencing the program flow. Each instruction consists of one or two instruction words. The size of the native instruction word for the example architecture is 20 bit; the optional second word is used for long immediate values and offsets (parallel word as in Figure 3).

Figure 3: instruction coding

All arithmetic instructions support 3 operands, which avoids data copy operations between different registers of the register file. All features of the DSP core are coded inside the instruction set; no mode bits are used, in order to increase code density. The drawback of using mode bits is the limitation imposed on instruction scheduling when moving instructions between different mode sections [7]. As illustrated in Figure 3, the first three bits of the instruction words are used for assigning the operation class and the alignment information.

Figure 4: parallelism

The number of instructions that can be executed in parallel is scalable. The example architecture enables the execution of up to five instructions in parallel. It is possible to execute two load/store, two arithmetic and one branch instruction in parallel (illustrated in Figure 4). The chosen programming model is VLIW (Very Long Instruction Word), which implies static scheduling (data and control dependencies are analyzed and resolved in software). The drawback of traditional VLIW architectures, low code density, is solved by xLIW (a scalable long instruction word) [8]. xLIW is based on VLES (Variable Length Execution Set) and additionally supports a decreased program memory port width. For this purpose the already mentioned instruction buffer is also used [9].


The example architecture supports two busses to data memory. Therefore two independent AGUs (address generation units) are available. Each AGU can make use of each of the address registers (no banked address registers). If two addresses generated in parallel access the same physical memory block, the core hardware automatically detects the hazard and serializes the memory operations. Data memory operations exceeding the physical size of the memory port are realized as consecutive memory operations on the same data bus.

All common DSP address modes like memory direct, register direct and register indirect addressing are supported. The auto in-/decrement address operation supports pre- and post-address calculation and efficient stack frame addressing. The size of the modulo buffer is programmable; the start address of the buffer has to be aligned. This is a compromise between hardware effort and supported features.
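The modulo (circular-buffer) address update can be sketched as follows. The alignment check shown here is a simplification for illustration; the exact alignment rule of the hardware is not spelled out in the text:

```python
# Sketch of a post-incremented modulo address update: the buffer start must
# be aligned, the length is programmable, and the address wraps back to the
# start when it crosses the buffer end.
def modulo_post_increment(addr, step, base, length):
    """Return the next address inside an aligned circular buffer."""
    assert base % length == 0, "buffer start has to be aligned (simplified rule)"
    offset = (addr - base + step) % length
    return base + offset

base, length = 0x100, 8          # 8-word buffer at an aligned start address
addr = base
for _ in range(10):              # ten post-incremented accesses wrap once
    addr = modulo_post_increment(addr, 1, base, length)
print(hex(addr))  # 0x102: 10 steps around an 8-entry buffer end at offset 2
```

Requiring an aligned start address lets the hardware implement the wrap with simple bit masking instead of a full comparison against arbitrary buffer bounds, which is exactly the hardware/feature compromise mentioned above.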

Figure 5: register files

A load-store architecture implies that all operands for the arithmetic instructions reside in registers. Therefore the register file has an important role. The structure of the register file and the size and number of registers are configurable; for the example architecture a register file as in Figure 5 is used. It is split into two parts: a data register file and an address register file.

Figure 6: data register file

The data register file as in Figure 6 consists of 8 accumulators, 8 long registers or 16 data registers. Two consecutive data registers can be addressed as a long register. A long register including guard bits (for higher precision calculation) can be addressed as an accumulator. The size of the operands can be modified application-specifically. The registers inside the register file are orthogonal, which means that none of them is assigned to a certain instruction. The drawback of an orthogonal register file is the crossbar needed to map the read and write ports to each of the registers.

3. DESIGN SPACE FOR RISC BASED DSP ARCHITECTURES This section is used to introduce the available design space for RISC based DSP subsystems with influence on area consumption, power dissipation and overall system performance. The example architecture is used to illustrate the main architectural features. The influence of some of the parameters is illustrated by first exploration results.

3.1 Register File The register file in load-store architectures has a central role. All arithmetic instructions are fetching their operands from the register file and store their results into the register file. Therefore the number of supported registers of the register file influences the performance parameters of the DSP subsystem.

Supporting fewer registers reduces the necessary core area but can lead to additional spill code. Spill code is added if no registers are available to store a result. In this case register file content has to be stored to data memory to free register resources. If any of the spilled data is needed again, it has to be reloaded from memory. The added spill code increases the demand on program memory and therefore decreases the code density of the application code. Furthermore, it increases execution time and therefore decreases system performance.

Supporting a larger register file with more entries increases the core area and again influences code density: more entries require more coding space to address the register entries. Moreover, considering the orthogonality required to enable the development of an optimizing C-Compiler, banking registers or supporting registers for special functions is not possible.

Figure 7: register file (64-bit accu)

It is possible to change the structure of the register file. Figure 7 illustrates an example of a 64-bit data register file (e.g. used for a 64-bit/quad MAC architecture). The register file on the left side of Figure 7 has a similar structure to the register file in Figure 5; instead of using guard bits, the accumulator supports 64 bit. The number of addressable data registers has not been doubled, because the coding space necessary for the additional data registers would influence the code density. If an application requires more than 16 data registers to reduce the spill code, a register file like in Figure 7 can support up to 32 data registers. The same register file on the right side of Figure 7 has a different structure: eight of the data registers are mapped onto the first two accumulator registers, and the remaining eight are split onto the next six accumulator registers.


3.2 Data paths Increasing the number of data paths and parallel executed instructions increases the maximum possible calculation power of core architectures. Providing the possibility to execute several instructions in parallel requires the availability of operands. Therefore a balanced relation between memory bandwidth, the number of independent load/store instructions and the number of arithmetic data paths characterizes the possible performance of core architectures.

Table 1: ILP

                                          general purpose        scientific
Study              Benchmarks             range     average      range     average
Tjaden and Flynn   31 library programs    1.2-3.2   1.9          -         -
Kuck et al.        20 Fortran programs    -         -            1.2-17    4
Riseman, Foster    7 Fortran/assembler    1.2-3     1.8          1.4-1.6   1.6
Jouppi             8 modulo2 programs     1.6-2.2   1.9          2.4-3.3   2.8
Lam, Wilson        6 SPECmarks + 4 others 1.5-2.8   2.1          2-293     -

The application program executed on the core architecture has an additional influence. Control and data dependencies can lead to a low usage of the provided core resources. In Table 1 some examples for ILP (instruction level parallelism) can be found. The benchmark examples are based on general purpose code (columns 3, 4) as well as scientific code (columns 5, 6). The average ILP in these examples is about two to three instructions.

Traditional algorithms executed on DSP cores are filtering operations. Filter algorithms are characterized by an inner loop where a significant amount of execution time is spent. These inner loops (considering software pipelining) can make efficient use of resources provided in parallel. Therefore the ILP for this kind of algorithm is higher than that for general purpose code. The MAC (multiply and accumulate) instruction is typically used for e.g. FIR filter algorithms. Therefore the performance of DSP cores is measured in the number of provided MAC instructions per second and in the number of clock cycles needed for execution (considering the define-in-use dependency).

Changing the number and kind of data paths influences the core hardware. If the changes in the data path structure affect the instruction set (by adding or removing instructions), the code density is influenced. Changes of the data path structure also affect the execution bundle. Therefore, after changing the data path structure it is necessary to verify that the average relation between the sizes of the fetch and execution bundles is still balanced and that the memory bandwidth still fits the data path structure.

3.3 Memory bandwidth The memory bandwidth is closely related to the data path parameter. Providing a lot of parallelism with insufficient memory bandwidth results in poor usage of the available core resources. The size of the memory ports influences the power dissipation and consumed silicon area of a DSP subsystem.

Data memory port: Today most of the commercially available DSP cores support two independent data memory busses. Supporting additional busses increases the flexibility of data transfers, and several algorithms, e.g. FFT algorithms, can make use of it. But the drawback of more memory ports is the hardware effort for additional AGUs (Address Generation Units) and the wiring effort to the memory subsystem.

Program memory port: For most of the commercially available DSP cores, the size of the program memory port is equal to the maximum number of parallel executed instructions. As with the data memory port, the wiring influences area and power consumption. One possibility to decouple the size of the program memory port from the provided parallelism of the execution unit is the use of an instruction buffer, as mentioned in section 2.

3.4 Instruction size/encoding The instruction set describes the functionality supported by the core architecture. The mapping of the instruction set to binary instruction words has significant influence on the area consumption of the core subsystem, because the memory used to store the instructions dominates the area consumption.

Figure 8 illustrates two different mappings of the same instruction set to two instruction layouts. In the right example, the instruction set has been mapped using instructions with a native size of 16 bit, using 32 bit for the remaining instructions which cannot be mapped to the native instruction size, like three-operand arithmetic instructions. For the example in the left column a native instruction word size of 20 bit is used, allowing all instructions to be mapped into the native instruction word size; the second word is only used for long immediate values and offsets. Considering a certain algorithm (e.g. some control code as in Figure 8), the smaller native instruction word size provides a lower overall code effort. This can be different for another code example which e.g. requires three-operand instructions, coded more efficiently in the longer native instruction word.

Figure 8: example for instruction set mapping
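The trade-off between the two layouts can be sketched numerically. The instruction mix below is invented for illustration (it is not the data behind Figure 8):

```python
# Compare a 16/32-bit encoding scheme, where some instructions need the long
# form, against a uniform 20-bit native word with an optional 20-bit
# parallel word for long immediates.
def size_16_32(n_short, n_long):
    """Code size in bytes for the 16-bit native / 32-bit long scheme."""
    return (n_short * 16 + n_long * 32) / 8

def size_20(n_instructions, n_parallel_words):
    """Code size in bytes for the uniform 20-bit scheme."""
    return (n_instructions + n_parallel_words) * 20 / 8

# 1000 instructions of control code: assume 30% need the 32-bit form, while
# in the 20-bit scheme only 8% need a parallel word for long immediates.
print(size_16_32(700, 300))   # 2600.0 bytes
print(size_20(1000, 80))      # 2700.0 bytes
```

With this control-code mix the smaller 16-bit native word wins, matching the observation in the text; a mix dominated by three-operand instructions (driving up the 32-bit share) flips the result in favor of the 20-bit word.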

The binary coding influences the switching activity at the program memory port, and therefore the mapping of the instruction set to a certain binary coding influences the power dissipation of the DSP subsystem. More frequently used instructions can be coded more efficiently, resulting in an increased code density. Also reordering of instructions inside the


same execution bundle can be performed in order to decrease power dissipation at the program memory bus [10][11][12].

3.5 Instruction buffer size The instruction buffer mentioned in section 2 is not available in every core, but shall be discussed for the core architecture introduced in section 2. For this core the instruction buffer is used to compensate the memory bandwidth mismatch between fetch and execution bundles, and also to execute loop constructs power-efficiently by reducing the number of memory accesses. To make use of this feature, the size of the instruction buffer has to be scalable to adapt it to application code specific requirements. Power-efficient loop handling can only be achieved if the loop body fits into the buffer. Therefore the chosen size of the instruction buffer influences the power dissipation of the core subsystem. On the other hand, providing a buffer with many entries leads to a significant increase in core area.

3.6 Pipeline stages Increasing the number of pipeline stages makes it possible to increase the achievable core frequency. Higher core frequencies lead to increased power dissipation due to the need for a higher supply voltage and an increased switching activity [13].

Increasing the number of pipeline stages also increases the core complexity, because additional hardware circuits such as bypass logic become necessary to resolve the increased dependencies between instructions in different pipeline stages [14][15][16].

Increasing the number of pipeline stages can even decrease system performance due to control and data dependencies. Therefore a balanced pipeline structure, considering the dependencies of the application code as well as physical aspects of the technology, is important to obtain a good trade-off between area consumption, power dissipation and system performance. Classifying core subsystems by MIPS, MOPS, MMACs or any similar parameter is misleading: for an embedded core, performance has to be judged by how efficiently the application code can make use of the available core resources.

Increasing the number of pipeline stages in the fetch phase of the pipeline relaxes the timing at the program memory but increases the number of branch delays. Additional hardware circuits have to be introduced to compensate for the unused branch delay slots [17][18]. Predicated execution can help to reduce the number of branch delays by reducing the number of conditional branch instructions [19].

Adding pipeline stages to speed up the execution phase and to relax the timing at the data memory ports increases the define-in-use and load-in-use dependencies. Bypass logic can be used to reduce these dependencies, but again at the cost of increased core complexity.

3.7 Summary This section has briefly introduced the architectural features of RISC-based core architectures (with a focus on DSP cores) which significantly influence the area consumption and power dissipation of the core subsystem. None of these parameters can be considered in isolation; changing one of them influences several others. There is no single solution that satisfies the requirements of all applications efficiently: it is the application code executed on a core architecture that makes a certain core configuration efficient. To understand the requirements of an application code, the following section introduces a design space exploration methodology for RISC-based core subsystems.

4. EXPLORATION METHODOLOGY The DSP core architecture introduced in section 2 allows the architectural features introduced in section 3 to be adapted. Providing a configurable DSP core architecture that meets application-specific requirements makes it possible to reduce area consumption and power dissipation. To find the optimal core architecture (optimal for one application), it is important to understand the application-specific requirements.

For this purpose DSPxPlore is introduced. DSPxPlore can be used to analyze the influence of a certain core subsystem configuration on the system parameters core area, power dissipation and overall system performance. During the product development process DSPxPlore supports fine-tuning of the core subsystem. The exploration methodology is based on an optimizing C compiler and a configurable instruction set simulator (ISS).

Figure 9: DSPxPlore Overview

Figure 9 illustrates an overview of the exploration methodology. An optimizing C compiler is used to generate static analysis results; a cycle-true instruction set simulator (ISS) is used to evaluate dynamic results. Together, both sets of results can be used to analyze the application-specific requirements on the core subsystem. The chosen core configuration is stored in an XML-based configuration file, which is used by both tools.

4.1 Static analysis To obtain reasonably accurate static analysis results it is necessary to use a C compiler that generates near-optimal assembly code (compared to manually optimized code). If the quality of the C compiler is poor, the generated results can be misleading and architectural decisions can lead to a suboptimal solution. The C compiler for the core architecture introduced in section 2 achieves an overhead of about 5-10% compared with manual coding. Some of the generated static analysis results are the following.


4.1.1 code size The memory of a DSP subsystem dominates the silicon area consumption; therefore a high code density reduces area consumption. An example of the code size parameter is illustrated in Figure 10. The number of instructions necessary to port the application code to the chosen core architecture is counted, and the required long instructions are summed up. The chosen instruction word length is normalized to bytes to obtain a comparable value. The example architecture uses a 20-bit native instruction word, so the number of counted instructions has to be multiplied by 2.5 to get the code size in bytes; instructions with long words are counted twice.
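The normalization described above can be written down directly; the instruction counts in the example call are invented for illustration:

```python
def code_size_bytes(n_instructions, n_long, native_bits=20):
    """Normalize instruction counts to bytes.

    Each long instruction occupies a second native word and therefore
    counts twice; a 20-bit native word corresponds to 2.5 bytes.
    """
    words = n_instructions + n_long   # long instructions add one extra word
    return words * native_bits / 8

# e.g. 10000 counted instructions, 1200 of them using a second (long) word:
print(code_size_bytes(10000, 1200))   # 28000.0 bytes
```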

Figure 10: code size analysis

4.1.2 parallelism The parallelism analysis result gives an indication of the usage of the provided core resources. Data and control dependencies in the application code restrict the execution of parallel instructions and lead to a poor use of the available processor resources. The example in Figure 11 illustrates the dependency problem (on the left side a summary, on the right side more detail).

Figure 11: bundle assignment

Only a few execution bundles can make use of the parallel units (the example architecture can execute five instructions in parallel). DSP architectures like the C62x from Texas Instruments will not achieve a higher usage of the core resources, even if their relative performance (calculated as the number of possible parallel instructions multiplied by the achievable clock frequency) suggests higher numbers [20].
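The average slot usage can be computed from a static bundle-size histogram. A sketch with invented histogram numbers, assuming the five issue slots of the example architecture:

```python
# Hypothetical bundle-size histogram: bundle_hist[k] = number of execution
# bundles containing k parallel instructions. Illustrative numbers only.
bundle_hist = {1: 9000, 2: 3500, 3: 900, 4: 250, 5: 60}

bundles = sum(bundle_hist.values())
instructions = sum(k * n for k, n in bundle_hist.items())
avg_parallelism = instructions / bundles
utilization = avg_parallelism / 5   # fraction of the 5 issue slots used

print(round(avg_parallelism, 2), round(utilization, 2))  # 1.46 0.29
```

A result like this (well below the peak issue width) is exactly the situation described above: dependencies, not the number of units, limit the achieved parallelism.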

4.1.3 instruction histogram The instruction histogram provides a list of the used instructions and their static occurrence counts inside the application code. This result can be used to optimize the instruction set during fine-tuning of the core subsystem (e.g. optimized coding of frequently used instructions).

4.1.4 immediate values The size of the immediate values can already be analyzed during static analysis. This gives an indication of the coding space needed inside the instruction set. These results (similar to those in Figure 8) can be used to choose the optimal size for the native instruction word.

4.1.5 delay slots The number of delay slots can significantly influence the overall system performance of a DSP subsystem. Delay slots are caused by branch instructions or function calls; increasing the number of pipeline cycles in the fetch phase results in more delay slots. Some of the delay slots can be filled with useful instructions, the others are lost cycles.

4.2 Dynamic analysis To weight the static analysis results, dynamic analysis is necessary. A cycle-true instruction set simulator (ISS) is used to obtain the results. xSIM, the ISS used for the core introduced in section 2, is based on a configurable component framework; an XML-based configuration file is used to define the chosen core configuration. Some of the dynamic results are the following.

4.2.1 program memory fetch Fetching instructions from program memory significantly influences the power dissipation of the DSP subsystem; therefore reducing the switching at the program memory port reduces power dissipation. The number of fetch cycles from program memory is analyzed, the fetch frequency of the different fetch bundles is counted, and the alignment of loop and branch constructs is considered.
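A common first-order model of the switching activity at the memory port is the Hamming distance between successively fetched words; this sketch (with made-up fetch words) counts the bit toggles of a fetch sequence:

```python
def toggles(words):
    """Total bit toggles on the program memory port for a fetch sequence.

    First-order switching model: the activity between two consecutive
    fetches is the Hamming distance of the two fetch words.
    """
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

seq = [0x00FF, 0x0F0F, 0x0F0F, 0xFFFF]   # illustrative fetch words
print(toggles(seq))   # 8 + 0 + 16-8 = 16 toggles
```

Under this model, re-mapping the binary coding or reordering instructions so that consecutive fetch words differ in fewer bit positions directly reduces the modelled switching activity.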

4.2.2 unused program memory The DSP core introduced in section 2 features an instruction buffer to overcome the bandwidth mismatch between the fetch bundle and the maximum execution bundle size and to execute loop constructs power-efficiently. Especially at discontinuities in the program flow, already fetched program data is not executed. This parameter identifies which code sections have been fetched but not executed, and can be used to reduce the switching at the program memory port.

4.2.3 execution count per bundle Counting the execution frequency of each execution bundle can be used to identify hot spots and to optimize the HW/SW partitioning (e.g. deciding which parts can be implemented more efficiently in hardware). Together with the static parallelism analysis, the provided parallelism can be classified and the number of data paths adjusted to the requirements of the application code. The results can be visualized by xSIM, which eases their interpretation.

4.2.4 execution count per instruction The list of used instructions generated during static analysis is extended by the execution count of each instruction. With this information the instruction set and the binary coding can be optimized, increasing code density and decreasing switching activity at the program memory port: frequently executed instructions can be coded more efficiently, and unused instructions can even be removed.
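The benefit of such a frequency-aware coding can be quantified as an execution-weighted average encoding length. The instruction names, counts and encoding lengths below are invented for illustration; only the principle (short codes for hot instructions) comes from the text:

```python
# Hypothetical dynamic execution counts per instruction (illustrative).
counts = {"mac": 52000, "ld": 31000, "add": 24000, "br": 6000, "div": 300}

def avg_length(assignment):
    """Average encoding length in bits, weighted by execution count."""
    total = sum(counts.values())
    return sum(counts[i] * bits for i, bits in assignment.items()) / total

naive = {i: 20 for i in counts}   # everything in the native word size
tuned = {"mac": 16, "ld": 16, "add": 16, "br": 20, "div": 24}  # hot = short

print(avg_length(naive) > avg_length(tuned))   # True
```

Even though the rare `div` gets a longer encoding, the weighted average drops, which translates into both higher code density and fewer bits moved per executed instruction.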

4.2.5 stall cycles During execution of application code, stall cycles can occur; during these cycles the core does not contribute to the system performance of the DSP subsystem. Stalls can be caused e.g. by simultaneous memory accesses to the same physical memory block, or by missing program data due to an empty instruction buffer (e.g. at unaligned branch targets). This information can be used to identify the possible reasons and to modify the core architecture and the application code to avoid unnecessary stall cycles.

5. RESULTS This section illustrates some of the results obtained with DSPxPlore. The set of benchmark examples consists of traditional DSP functions like the FFT and code examples from the area of cryptology, but also control code examples such as framing algorithms.

For the results in Table 2 the size of the register file has been modified. The number in the first column is the number of supported accumulator registers.

Table 2: register size evaluation

#regs  bundles  inst.   delay nops  code size
4      24008    35284   2473        47263
8      17041    26544   2722        31810
16     14507    23046   2826        26497

Increasing the number of registers reduces the register pressure during register allocation, resulting in a decreased code size. Increasing the number of available registers from four to eight reduces the code size by about a third (the four-register configuration requires almost 50% more code). Doubling the register file again from eight to 16 accumulator registers improves the code density only by an additional 17%; the algorithm examples cannot make use of the additional registers. The results in Table 2 do not include the influence of the increased number of registers on the coding space. The coding used for the comparison supports the medium-sized register file; considering the difference in coding space as well would reduce the absolute distance between the result values.
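The relative savings can be recomputed directly from the code-size column of Table 2:

```python
# Code sizes from Table 2 (bytes) for 4, 8 and 16 accumulator registers.
code_size = {4: 47263, 8: 31810, 16: 26497}

def rel_change(a, b):
    """Relative code-size reduction going from a to b registers (percent)."""
    return 100 * (code_size[a] - code_size[b]) / code_size[a]

print(round(rel_change(4, 8), 1))    # 32.7 -> 4 to 8 regs: code ~1/3 smaller
print(round(rel_change(8, 16), 1))   # 16.7 -> 8 to 16 regs: a further ~17%
```

The diminishing return from the second doubling is visible immediately: the benchmarks cannot keep 16 accumulators busy.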

Table 3: parallelism analysis

model  units     bundles  inst.   delay nops  code size
0      1M-1A-1B  21070    27962   2694        33441
1      2M-1A-1B  20783    28022   2714        33441
2      1M-2A-1B  18194    27728   2862        33196
3      2M-2A-1B  17872    27871   2866        33421
4      2M-3A-1B  17347    27890   2958        33558

The unit configuration in Table 3 (e.g. 2M-1A-1B) indicates the parallel execution units: M stands for a load/store (move) unit, A for an arithmetic/logic unit and B for a branch unit. As expected, adding more units in parallel decreases the number of necessary execution bundles (the bundles column in Table 3); data and control dependencies reduce the effect of further added units. One remark concerning the increase in the number of branch delay NOPs: during compilation, the C compiler has been configured to schedule instructions as early as possible. Providing more parallelism leads to shorter branch distances, and therefore fewer instructions are available to fill delay slots; the number of NOP instructions needed to fill delay slots increases.

Table 4: model 0, branch delay slots

branch delays  bundles  inst.   delay nops  code size
2              21051    27943   2694        33422
3              22544    29438   2686        34868
4              24118    31004   2664        36432

For the results in Table 4 (model 0) and Table 5 (model 4) the same core models as for Table 3 are used; the parameter in the left column is the number of branch delay slots. Additional branch delays, caused by further clock cycles in the fetch phase of the pipeline (e.g. to relax the timing at the program memory port), lead to an increased number of instructions and therefore to a decreased code density (e.g. due to the additional NOP instructions needed to fill delay slots).
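The per-slot cost can be read off Table 4 by computing the code-size growth for each additional branch delay slot:

```python
# Code sizes from Table 4 (model 0, bytes), keyed by branch delay slots.
code_size = {2: 33422, 3: 34868, 4: 36432}

def growth_percent(d):
    """Code-size growth (percent) of d delay slots versus d - 1."""
    return 100 * (code_size[d] - code_size[d - 1]) / code_size[d - 1]

for d in (3, 4):
    print(d, round(growth_percent(d), 1))   # about 4-5% per extra slot
```

So for this benchmark set, each extra fetch pipeline stage costs roughly 4-5% in code size, a concrete figure to weigh against the frequency gain of the deeper pipeline.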

Table 5: model 4, branch delay slots

branch delays  bundles  inst.   delay nops  code size
2              17324    27867   2959        33537
3              18897    29451   2966        35121
4              20524    31055   2947        36725

Comparing the results for model 0 and model 4, the number of necessary execution bundles decreases, as expected: the configuration used for Table 4 can execute only three instructions in parallel, while the configuration used for Table 5 can execute up to six instructions in parallel.

6. OUTLOOK DSPxPlore is used to understand the requirements of the application code on the core architecture, to identify hot spots and to optimize the HW/SW partitioning. DSPxPlore is still an expert system: interpreting the generated results and deriving the related modifications of the core architecture requires a deep understanding of the core architecture, the configuration parameters and the influence of the chosen configuration on silicon area and power consumption. The next development phase will provide feedback from DSPxPlore that is easier to understand, enabling the system architect to optimize the core subsystem for application-specific requirements and to obtain hints for further optimizations.

7. SUMMARY DSPxPlore is a design space exploration methodology for RISC-based embedded cores. Analyzing application-specific requirements at an early stage of a project makes it possible to modify the core subsystem and therefore to obtain low silicon area consumption and low power dissipation. During the design process DSPxPlore can be used for fine-tuning the core subsystem, e.g. optimizing the binary coding to reduce power dissipation. With an application-specifically optimized core subsystem it is possible to narrow the gap between a dedicated hardware implementation and a core-based solution, while providing the flexibility of software programmability. DSPxPlore is part of a project for a configurable DSP core.

8. ACKNOWLEDGMENTS The work has been supported by the Christian Doppler Lab for Compilation Techniques for Embedded Processors and by the EC through the project SOC-Mobinet (IST-2000-30094).

9. REFERENCES [1] www.arc.com

[2] www.tensilica.com

[3] Hennessy, J. L., Patterson, D. A., Computer Architecture. A Quantitative Approach, Morgan Kaufmann Publishers, San Mateo CA, 1996.

[4] Panis, C., Bramberger, M., Grünbacher, H., and Nurmi, J., A Scaleable Instruction Buffer for a Configurable DSP Core, ESSCIRC 2003, Lisbon, Portugal, 2003.

[5] Lapsley, P., Bier, J., Shoham, A., and Lee, E.A., DSP Processor Fundamentals, Architectures and Features, IEEE Press, New York, 1997.

[6] Sima, D., Fountain, T., and Kacsuk, P., Advanced Computer Architectures: A Design Space Approach, Addison Wesley Publishing Company, Harlow, 1997.

[7] Morgan, R., Building an Optimizing Compiler, Digital Press, 1998.

[8] Panis, C., Leitner, R., Grünbacher, H., and Nurmi, J., xLIW – a Scaleable Long Instruction Word, ISCAS 2003, Bangkok, Thailand, 2003.

[9] Panis, C., Leitner, R., Grünbacher, H., and Nurmi, J., Align Unit for a Configurable DSP Core, CSS 2003, Cancun, Mexico, 2003.

[10] Hirnschrott, U., and Krall, A., VLIW Operation Refinement for Reducing Energy Consumption, Proceedings of International Symposium on System-on-Chip '03, Tampere, 2003.

[11] Shin, D., Kim, J., and Chang, N., An Operation Rearrangement Technique for Power Optimization in (VLIW) Instruction Fetch, ACM, Munich, 2001.

[12] Choi, K., and Chatterjee, A., Efficient Instruction-Level Optimization Methodology for Low-Power Embedded Systems, Proceedings of International Symposium on System Synthesis ISSS 01, 2001.

[13] Chandrakasan, A., Sheng, S., and Brodersen, R., Low-Power CMOS Digital Design, IEEE Journal of Solid-State Circuits, vol. 27, no. 4, 1992.

[14] Smith, J.E., A Study of Branch Prediction Strategies, in Proc. 8th ISCA, pp. 135-148, 1981.

[15] Albert, D., and Avnon, D., Architecture of the Pentium Microprocessor, IEEE Micro, June 1993.

[16] Heinrich, J., MIPS R10000 Microprocessor User's Manual (Alpha Draft, 11 Oct.), MIPS Technologies Inc., Mountain View, CA, 1994.

[17] Motorola Inc., PowerPC 620 RISC Microprocessor Technical Summary, MPC620/D, Motorola Inc., 1994.

[18] Lee, J.K.F., and Smith, A.J., Branch Prediction Strategies and Branch Target Buffer Design, IEEE Computer, 17(1), pp. 6-22, 1984.

[19] Pnevmatikatos, D.N., and Sohi, G.S., Guarded Execution and Branch Prediction in Dynamic ILP Processors, in Proc. 21st ISCA, pp. 120-129, 1994.

[20] Texas Instruments, TMS320C6000 CPU and Instruction Set Reference Guide, Texas Instruments, October 2000.


PUBLICATION 12 C. Panis, A. Schilke, H. Habiger, J. Nurmi, "An Automatic Decoder Generator for a Scaleable DSP Architecture", in Proceedings of the 20th Norchip Conference (Norchip'02), Copenhagen, Denmark, November 11-12, 2002, pp. 127-132.

©2002 IEEE. Reprinted, with permission, from proceedings of the 20th Norchip Conference.


An Automatic Decoder Generator for a Scaleable DSP Architecture



[Figure: instruction coding area of the 20-bit instruction word, bit19 down to bit0]

IC IC DP I I I I 0 D1 D1 D1 D1 D2 D2 D2 D2 D3 D3 D3 D3
IC IC DP 0 0 0 o o o o o o 1 1 1 o o o o o


Register operand encoding (A a d d d):
0 0  Data registers 0 to 7
0 1  Data registers 8 to 15
1 0  Long registers 0 to 7
1 1  Address registers 0 to 7


[Figure: block diagram of the decoder generator; blocks: Database, register configuration, Spreadsheet, Decode Tree Generation, Output Generation, and the generated VHDL package ("Container") containing the computational, load/store and branch instruction decoders]

[Excerpt of the instruction family grouping processed by the generator:]
LD_Family: DLA, ...
SR_Family: DAR, LSLR, LSRR, LSR32R, ASRR, ...
branch families: BSR_Family, BRCC_Family


[Figure: example of a generated decode tree. Nodes test bit positions Bit1 to Bit13, starting from the root of the tree, and branch on the bit values 0, 1 and don't care ('_'); the nodes labelled a-d mark steps of the tree construction.]



case instruction(11) is
  when '1' =>                  -- ADDI_Family
    cmp_instruction := addi;
    cmp_ex1_add1.en := '1';
    cmp_ex1_add1.add_const := '1';
    cmp_ex1_write1 := setDxLxRx(instruction(4 downto 0));
    cmp_ex1_read1 := setDxLxRx(instruction(4 downto 0));
    cmp_ex1_cntrl1.const := signExtend16(instruction(10 downto 5));
  when others =>               -- MOVR_Family
    cmp_instruction := movr;
    case instruction(10 downto 8) is
      when "000" =>
        cmp_ex1_write1 := setData(instruction(3 downto 0));
        cmp_ex1_read1 := setData(instruction(7 downto 4));
        ...

[Figure: excerpt of the generated VHDL instruction decoder]
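The decode-tree construction that produces such nested case statements can be sketched compactly: encodings are bit patterns over 0, 1 and don't care ('_'), and the tree splits on successive bit positions, with don't-care patterns following both branches. The 4-bit encodings below are invented for illustration (only the family names addi, movr and bsr echo the generated VHDL excerpt):

```python
# Minimal sketch of recursive decode-tree construction over bit patterns
# with don't cares. patterns maps an encoding string to an instruction name.
def build_tree(patterns, bit=0):
    """Return a nested dict branching on bit values, or a leaf name."""
    if len(patterns) == 1:
        return next(iter(patterns.values()))          # leaf: fully decoded
    zero = {p: n for p, n in patterns.items() if p[bit] in "0_"}
    one = {p: n for p, n in patterns.items() if p[bit] in "1_"}
    return {"bit": bit,                               # bit tested at this node
            "0": build_tree(zero, bit + 1),
            "1": build_tree(one, bit + 1)}

# Hypothetical 4-bit encodings, not the real instruction set:
tree = build_tree({"0___": "addi", "10__": "movr", "11__": "bsr"})
print(tree["0"])        # 'addi' (bit 0 = 0 already decides)
print(tree["1"]["0"])   # 'movr' (bits 0 and 1 needed)
```

Each internal node of the resulting dict corresponds to one level of case statement in the emitted decoder; leaves correspond to the instruction-family branches.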

Page 237: Scalable DSP Core Architecture Addressing Compiler ...edu.cs.tut.fi/panis483.pdf · Christian Panis Scalable DSP Core Architecture Addressing Compiler Requirements ... my time in