
Instruction-Level Parallelism for Low-Power

Embedded Processors

Thesis No. 2110

Presented to the Département d'informatique (Computer Science Department)

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE
for the degree of Docteur ès sciences techniques

by

Jean-Michel Puiatti
Computer Engineer (Ingénieur Informaticien) of the École Polytechnique Fédérale de Lausanne, Switzerland

Presented to the jury:

Prof. Eduardo Sánchez, thesis director
Prof. Christian Piguet, co-examiner
Prof. Wen-mei Hwu, co-examiner
Prof. Alain Wegmann, co-examiner

Lausanne, EPFL
1999

Abstract

In recent years, the market for special-purpose devices designed for advanced applications has grown at a tremendous rate. As a result, the demand for embedded microprocessors, a necessary component of these devices, is stronger than ever. The nature of devices such as Personal Digital Assistants (PDAs), mobile phones, printers, and networking equipment requires that these embedded processors meet high performance levels while simultaneously satisfying strong constraints on power consumption and cost.

Instruction-Level Parallelism (ILP) is one of the major forces increasing the performance of high-end workstation processors. Such ILP architectures are highly complex and exhibit a large amount of power dissipation. However, parallelism is also a well-known power-saving technique that can be used to improve the energy efficiency of a system. ILP can thus be a very attractive technique for embedded processors that require increased performance at a low energy consumption.

This work focuses on the design of synergistic hardware-compiler ILP architectures, such as EPIC or VLIW machines, for low-power embedded processors. Such synergism minimizes the hardware overhead of multiple-issue pipelines while maintaining the performance benefits of ILP. Introducing parallelism into a processor drastically alters its architecture. To understand and quantify how such modifications can reduce or nullify the expected benefits, and also to assess where the tradeoffs should be made, a new EPIC-like low-power processor, DEVIL, is proposed. Its implementation is the subject of a detailed experimental evaluation.

DEVIL includes a fetch mechanism that supports variable instruction lengths and allows the compiler to explicitly encode parallelism within an instruction bundle. It will be shown that this mechanism reduces code size by 50% on average with respect to a standard VLIW fetch mechanism while keeping performance unchanged.

DEVIL, with its 2-issue pipeline, achieves an average speed-up of 1.5 over a 1-issue processor. This performance gain allows DEVIL to run at a lower voltage and a lower clock frequency while delivering the same level of performance as a scalar processor. It will be demonstrated that DEVIL can execute a task at the same speed as a scalar processor while consuming approximately 38% less energy.
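The trade-off between speed-up, clock frequency, and supply voltage can be illustrated with a rough calculation. The sketch below assumes the usual first-order model in which dynamic energy per task scales with the switched capacitance times the square of the supply voltage. The function names, the voltage ratio, and the capacitance overhead are hypothetical values chosen for illustration; they are not taken from this thesis and are not meant to reproduce the 38% figure.

```python
def required_frequency_ratio(speedup: float) -> float:
    """Clock frequency, relative to the scalar baseline, at which a processor
    with the given speed-up completes a task in the same time as the baseline."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


def relative_dynamic_energy(voltage_ratio: float, capacitance_ratio: float = 1.0) -> float:
    """Dynamic energy per task relative to the baseline, using E ~ C * Vdd^2.

    voltage_ratio: scaled supply voltage divided by the baseline supply voltage.
    capacitance_ratio: switched capacitance per task relative to the baseline
        (values above 1 model the hardware overhead of a wider pipeline).
    """
    if voltage_ratio <= 0 or capacitance_ratio <= 0:
        raise ValueError("ratios must be positive")
    return capacitance_ratio * voltage_ratio ** 2


# The average speed-up of 1.5 is the figure quoted in the abstract.
freq = required_frequency_ratio(1.5)

# Hypothetical assumption: the slower clock permits a supply voltage of 0.8x,
# and the 2-issue hardware switches 1.2x the capacitance per task.
energy = relative_dynamic_energy(voltage_ratio=0.8, capacitance_ratio=1.2)

print(f"clock frequency relative to baseline: {freq:.2f}")
print(f"energy per task relative to baseline: {energy:.2f}")
```

The quadratic dependence on voltage is what makes the approach attractive in this model: a moderate voltage reduction can outweigh a noticeable hardware overhead, provided the speed-up is large enough to absorb the slower clock.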

ILP architectures generally suffer from a large amount of code expansion. DEVIL's instruction fetch mechanism reduces this negative effect. However, DEVIL still suffers from a code size penalty compared to a scalar processor. To counter this, a step is made towards predication techniques. It will be shown that full predication support, combined with an adequate instruction fetch mechanism, makes it possible to generate parallel code that is 12% faster and 25% smaller.


Version abrégée (Abridged Version)

The strong growth of the embedded systems market in recent years has created a significant need for high-performance processors subject to severe constraints on power consumption and cost. These embedded processors are found in mobile phones, personal digital assistants (PDAs), printers, and computer networking equipment.

Instruction-level parallelism is one of the main techniques for increasing the performance of workstation processors. These circuits are, however, highly complex, and their power dissipation is very high. Parallelism is also a technique for reducing the energy consumption of a circuit. This dual characteristic makes instruction-level parallelism very attractive for embedded processors that require a high level of performance and low power consumption.

This thesis focuses on the design of low-energy parallel architectures that offer a synergy between the compiler and the hardware, such as EPIC and VLIW processors. This compiler-processor interaction moves a large part of the complexity of superscalar architectures into the compiler while preserving the same level of performance. However, introducing parallelism into a processor's datapath strongly modifies its architecture. To understand and quantify the repercussions of these modifications, we developed a new EPIC-type processor, called DEVIL, whose implementation was analyzed in detail.

DEVIL includes a special fetch unit that encodes the parallelism of an instruction bundle and supports variable-length instructions. This mechanism yields code that is 50% more compact than that of a standard VLIW processor while maintaining the same level of performance.

DEVIL can execute up to two instructions in parallel on each clock cycle. This feature increases performance by an average factor of 1.5 compared to a scalar processor. This gain offsets the loss in execution speed caused by the low-frequency, low-voltage operating modes required for low power consumption. We show that DEVIL executes tasks at the same speed as a scalar processor while consuming 38% less energy.

Processors that exploit instruction-level parallelism generally suffer from an increase in code size. This undesirable effect is limited by the fetch mechanism included in DEVIL. DEVIL's code size nevertheless remains larger than that of a scalar processor. Predicated execution (or conditional execution) is a solution to this problem. Our results establish that, with predicated execution and a suitable fetch unit, it is possible to generate code that is 12% faster and 25% smaller.


Acknowledgments/Remerciements

Throughout this work I was able to count on the support of many people. With all my heart, I would like to say THANK YOU to them.

First of all, I would like to express my deepest gratitude to my family, who have always been by my side and who passed on to me their love, their craziness, and their joy of living.

Charo, thank you for giving me all your love during these last four years, and for never hesitating to support me and stand by me in the most difficult moments.

This thesis would not exist without the contribution of several people who made it possible for me to work on this fascinating subject. Eduardo, my thesis director, thank you for guiding me, for always listening to my problems, and for sharing with me your culture and your passion, music. Daniel, thank you for welcoming me into your laboratory and for giving me so much freedom. I am especially grateful to the Centre Suisse d'Electronique et de Microtechnique (CSEM), which funded this thesis project. Thank you to Christian and his whole team, who were always available when I needed them. A particularly warm thank you to my old friend Flavio, who helped me a great deal. I also thank Michel Benard for his advice and his generosity. Thank you to the thesis jury, composed of Professors J.-D. Nicoud, C. Piguet, W.-M. Hwu, and A. Wegmann, for its suggestions.

I would like to thank my colleagues at the LSL, who were always available and in unfailingly good spirits. They created a working atmosphere that will be hard to find again. Marlyse, the ray of sunshine of the LSL, thank you for all the help you gave me. André, alias "Chico", my cheerful partner in crime (quite a team!), thank you for all the favors you did for me, and especially for all the small kindnesses you showed me during the last months; they really lifted my spirits in some rather trying moments. Fabio, known as "El flaco", thank you for your help and your friendship. Jean-Luc, O great god of LaTeX, thank you for all your tips and for all the corrections you made to my documents. Jacques-Olivier, "Jacô" to his friends, your kindness and your availability were a great help to me. A very special thank you for helping me bring this work to completion. A big thank you to Gianluca, who devoted a large part of his time to correcting my English and to organizing well-lubricated evenings. Thank you also for passing on your family's recipes. Dom, known as "Dominonski", we have known each other for quite a while now; thank you for your friendship and your advice. Eméka, the merry fellow, thank you for your good humor and your enthusiasm; may the PàF be with you. Andrés, thank you for lending me a hand every time I needed it, and always with a smile. Carlos-Andrés, "vos sos un fenomeno", thank you for sharing a little of your Latin American culture with me. Moshe, known as "lucky luke", the one who writes faster than his shadow, thank you for your help and for correcting my documents. Mathieu the philosopher, Enrico "le bô", Jacques known as "J.K.", André "le hardeux", thank you for sharing all your knowledge.

I would also like to thank everyone I met at the LSL who helped make this period unforgettable. Marco, Christian, Serge, Maxime, it was a pleasure to work with you; thank you for your help. Thanks also to Georges and Peter of the ACORT. Christof, thank you for your contribution.

During these four years I had the opportunity to work with different groups. These collaborations enriched me in every respect: culturally, scientifically, and emotionally.

A thousand thanks to all the people I met at the Centro Nacional de Microelectrónica in Barcelona. This stay was an unforgettable human experience. Jordi, thank you for allowing me to spend three months in your group. Lluis, thank you for everything you did for me; see you at the next party! Rosa, you are a delight; thank you for taking such good care of me. Elena, thank you for taking me to work every day, even when it was so early. And thank you to all the members of the group with whom I shared moments of madness! Unforgettable!

Mateo, who opened the doors of his group to me from the very first contact. The stay at the DAC was a key moment in my thesis. Thank you for your help and your support every time I needed you. Josep and Eduard, thank you for your help, and thank you to the whole group of the Departament de Arquitectura de Computadors for all the moments of joy and good living we shared.

Dear Sabrina and Wen-mei, a big THANKS for your friendship, for allowing me to work with your group, and for all your help. You taught me a lot about life and did a lot for me. Without your help, this work would probably not have been achieved. I would like to thank all the IMPACT members for their help. A particular thank you to my friends David and Dan, who allowed me to work with them. Dan, thank you also for your help in the writing of this thesis. John and Liesle, thank you for your friendship and for allowing me to share your toys and candies.

I would also like to thank all my friends, who made my life a most pleasant one. Thank you to the many members of Clubmax for organizing all those activities. Catherine and Arnaud, thank you for your friendship and for proofreading part of my thesis. Thank you to my housemates: Mari Carmen and Joaquin, for introducing me to the rodaeta and to the woman of my life; Carmen and Dimitri, for putting me up and for letting me know I could always count on them; Dani, Elvira, and Gonçalo, for the evenings spent together; Eduardo "Maestro", thank you for opening your home to me and for showing me the nightlife of your city. Thank you to the whole GIGI team (EPFL league champions 1999) for the good matches we played.

Contents

Abstract
Version abrégée
Acknowledgments/Remerciements

1 Introduction

2 Instruction-Level Parallelism
  2.1 Performance Metric
  2.2 Instruction-Level Parallelism: Concepts and Limitations
    2.2.1 Data Dependences
    2.2.2 Control Dependences
    2.2.3 Resource Conflicts
  2.3 Pipelining
    2.3.1 Effect of Control Dependences in Pipelined Execution
    2.3.2 Effect of Data Dependences in Pipelined Execution
    2.3.3 Resource Conflict in Pipelined Execution
  2.4 Superscalar Architectures
    2.4.1 In-order Issue with In-order Completion
    2.4.2 In-order Issue with Out-of-order Completion
    2.4.3 Out-of-order Issue with Out-of-order Completion
    2.4.4 Exception Recovery and Register Dataflow in Superscalar Processors
  2.5 Very Long Instruction Word Architectures
  2.6 Compiler Techniques to Extract ILP
    2.6.1 Basic Block Scheduling
    2.6.2 Superblock Scheduling
    2.6.3 Predicated Execution
  2.7 Conclusion

3 Power Consumption in CMOS Circuits
  3.1 Sources of Power Dissipation in CMOS Circuit
    3.1.1 Static Power Dissipation
    3.1.2 Dynamic Power Dissipation
  3.2 Metrics for Energy Efficiency
  3.3 Parallelism for Energy Efficiency
  3.4 Conclusion

4 Mobile and VLIW Processors: a State of the Art
  4.1 The Advanced RISC Machine (ARM) Family
    4.1.1 The ARM7 Generation
    4.1.2 The StrongARM
    4.1.3 The ARM Thumb Option
    4.1.4 The ARM Piccolo Option
    4.1.5 The ARM9 and the ARM10
  4.2 The Motorola M·Core
  4.3 The LSI TinyRisc
  4.4 The Hitachi SuperH Family
  4.5 VLIW Architectures
    4.5.1 The Texas Instruments TMS320C6201
    4.5.2 The Motorola-Lucent Star*Core
  4.6 The Philips Trimedia
  4.7 The HP/Intel IA-64
  4.8 Comparison
  4.9 Conclusion

5 Low-Power VLIW Processors: A High-Level Evaluation
  5.1 Description of the Experiment
  5.2 CoolRISC 816: A Low-power 8-bit Processor
    5.2.1 The CoolRISC 816 Architectural Characteristics
    5.2.2 The Performance of CoolRISC 816
    5.2.3 The Energy Consumption of the CoolRISC 816
  5.3 Compared Architectures
    5.3.1 Scalar Architectures
    5.3.2 VLIW Architectures
  5.4 Consumption Model
    5.4.1 Estimate of E_oper
    5.4.2 Estimate of E_code and E_data
    5.4.3 Estimate of E_conn and E_RF
  5.5 Benchmarks
  5.6 Results
  5.7 Conclusion

6 The DEVIL Low-power Processor
  6.1 Where Is The Complexity in VLIW Architectures?
    6.1.1 Hardware Duplication
    6.1.2 Code Memory
  6.2 Definition of the DEVIL Processor
  6.3 DEVIL's Registers
  6.4 DEVIL's Instruction Set
    6.4.1 Arithmetical Operations
    6.4.2 Logical Operations
    6.4.3 Compare Operations
    6.4.4 Move Operations
    6.4.5 Branch Operations
    6.4.6 Data Memory Operations
  6.5 The DEVIL Instruction Fetch Mechanism
  6.6 DEVIL's Pipeline
    6.6.1 Pipelined Execution for ALU Operations
    6.6.2 Pipelined Execution for Memory Operations
    6.6.3 Pipelined Execution for Branch Operations
    6.6.4 DEVIL's Branch Prediction Mechanism
  6.7 Evaluation of the DEVIL Architecture
    6.7.1 Experimental Setup
    6.7.2 Benchmarks
    6.7.3 DEVIL's Performance
    6.7.4 DEVIL's memory utilization
  6.8 Comparison With Existing Mobile Processors
    6.8.1 Instruction Set Comparison
    6.8.2 Code Size Comparison
  6.9 Conclusion

7 Implementation of the DEVIL Processor
  7.1 Technology and Synthesis Methodology
  7.2 Design Methodology
  7.3 The DEVIL Latch-Based Pipeline
  7.4 DEVIL Implementation
    7.4.1 DEVIL's Datapath
    7.4.2 Fetch and Dispatch Unit
    7.4.3 Program Counter Datapath
    7.4.4 Register File
    7.4.5 Arithmetic and Logic Unit
    7.4.6 Load/Store Unit
  7.5 DEVIL Features
    7.5.1 DEVIL's Circuit Speed
    7.5.2 DEVIL's Circuit Complexity
    7.5.3 DEVIL's Circuit Power Consumption
  7.6 Comparison With Existing Processors
  7.7 Conclusion

8 A Step Towards Predicated Execution
  8.1 Architecture Support for Full Predicated Execution
  8.2 Compiler Techniques for Reducing Predicated Code Size
    8.2.1 Reduction of Number of Control Instructions
    8.2.2 Predicate Promotion and Instruction Merging
    8.2.3 Instruction Reduction for Advanced Code Transformation
  8.3 Introducing Predication Support into Embedded Processors
    8.3.1 Effect on Code Size of Full Predication Support
    8.3.2 Predication Code Size and Execution Characteristics
    8.3.3 Prefix-Based Predication
      8.3.3.1 Architecture Model
      8.3.3.2 Microarchitecture support
    8.3.4 Experimental Evaluation
      8.3.4.1 Methodology
      8.3.4.2 Results and Analysis
  8.4 Control flow optimization using predication
    8.4.1 Previous Work
    8.4.2 Limitations of PlayDoh
    8.4.3 Overview of Compiler Techniques
    8.4.4 Minimization of Program Decision Logic
    8.4.5 Architecture Support for Synthesis
    8.4.6 Experimental Results
  8.5 Conclusion

9 Conclusion

A The DEVIL's Instruction Set Summary
  A.1 Functions Definition
  A.2 Arithmetical Operations
  A.3 Logical Operations
  A.4 Compare Operations
  A.5 Move Operations
  A.6 Branch Operations
  A.7 Data Memory Operations

Bibliography

List of Figures

2.1 Di�erent type of data dependencies: (a) Flow dependence, (b) Anti-dependence,

(c) Output dependence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.2 Register renaming suppresses anti and output dependences. (a) code before

register renaming, (b) dependence graph before register renaming, (c) code after

register renaming, (d) dependence graph after register renaming. . . . . . . . . 5

2.3 Arithmetic transformation for critical path reduction. . . . . . . . . . . . . . . . 6

2.4 Control dependences: (a) C code, (b) corresponding control �ow graph. . . . . 7

2.5 Instruction timing for a non-pipelined processor. . . . . . . . . . . . . . . . . . 7

2.6 Instruction timing for a four-stage pipelined processor . . . . . . . . . . . . . . 8

2.7 (a) Instructions executed in a 2-stage pipeline, (b) instructions executed in a

4-stage pipeline, (c) instructions executed in a 8-stage pipeline. . . . . . . . . . 9

2.8 Illustration of the control dependencies in a three-stage pipeline. . . . . . . . . 10

2.9 Pipeline stall due to a RAW data dependence. . . . . . . . . . . . . . . . . . . . 10

2.10 Result bypassing avoids the pipeline stall due to one-cycle RAW data dependences. 11

2.11 Delay slot due to a one-cycle load latency. . . . . . . . . . . . . . . . . . . . . . 11

2.12 Execution timing of a generic two-issue superscalar processor. . . . . . . . . . . 12

2.13 Block diagram of a generic four-unit superscalar processor. . . . . . . . . . . . . 13

2.14 Superscalar pipeline with in-order issue and in-order completion. . . . . . . . . 14

2.15 Superscalar pipeline with in-order issue and out-of-order completion. . . . . . . 14

2.16 Superscalar pipeline with out-of-order issue and out-of-order completion. . . . . 15

2.17 In-order, lookahead, and architectural state for an out-of-order issue superscalar

processor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.18 Execution timing of a generic two-issue VLIW processor. . . . . . . . . . . . . . 17

2.19 Block diagram of a generic four-unit VLIW processor. . . . . . . . . . . . . . . 18

2.20 Example of formation of VLIW instructions: (a) sequential code, (b) the corre-

sponding dependence graph, (c) the corresponding VLIW code. . . . . . . . . . 18

2.21 Control �ow graph with basic blocks: (a) original C code, (b) corresponding

assembly code, (c) corresponding control �ow graph. . . . . . . . . . . . . . . . 19

2.22 Superblock formation: (a) weighted �ow graph, (b) trace formation, (c) tail

duplication. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.23 Loop enlarging optimizations: (a) original loop, (b) loop peeling, (c) loop unrolling. 20

2.24 Dependence removing: (a!b) accumulator variable expansion, (b!c) induction

variable expansion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.25 (a) A simple if-then-else C code construct, (b) unpredicated code, (c) predicated

code, and (d) optimized predicated code. . . . . . . . . . . . . . . . . . . . . . . 22

3.1 Static CMOS inverter: (a) gate, (b) transistors, and (c) switches representation. 25

3.2 Short-circuit current in a static CMOS inverter. . . . . . . . . . . . . . . . . . . 26

xiii

xiv List of Figures

3.3 Charge and discharge of the load capacitance in a static CMOS inverter. . . . . 27

3.4 Relative circuit delay (left) and relative energy consumption (right) as function

of Vdd. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.5 Energy distribution of several processor con�gurations executing the same task. 30

4.1 Comparison: (a) MIPS vs. Power; (b) MIPS vs. mw/MIPS. . . . . . . . . . . . 39

5.1 Block diagram of the experimental framework. . . . . . . . . . . . . . . . . . . . 42

5.2 Parallel execution of a loop using software pipelining. . . . . . . . . . . . . . . . 43

5.3 Energy consumption distribution in the CoolRISC 816. . . . . . . . . . . . . . . 45

5.4 VLIW architecture: NOP elimination. . . . . . . . . . . . . . . . . . . . . . . . 46

5.5 Speed-up comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5.6 Energy comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5.7 Energy-Delay Product comparison. . . . . . . . . . . . . . . . . . . . . . . . . . 50

6.1 ROM code memory die area as a function of code size. . . . . . . . . . . . . . . 55

6.2 Power consumption of the ROM code memory as a function of code size. . . . . 56

6.3 Block diagram of the DEVIL architecture. . . . . . . . . . . . . . . . . . . . . . 57

6.4 Instruction bundle formation in the DEVIL processor. . . . . . . . . . . . . . . 61

6.5 DEVIL's pipeline: ALU operations. . . . . . . . . . . . . . . . . . . . . . . . . . 62

6.6 DEVIL's pipeline: memory operations. . . . . . . . . . . . . . . . . . . . . . . . 62

6.7 DEVIL's pipeline: conditional branch operations. . . . . . . . . . . . . . . . . . 63

6.8 DEVIL's branch prediction mechanism. . . . . . . . . . . . . . . . . . . . . . . 64

6.9 DEVIL's compile-time branch prediction bene�ts. . . . . . . . . . . . . . . . . . 64

6.10 The IMPACT compiler framework. . . . . . . . . . . . . . . . . . . . . . . . . . 65

6.11 DEVIL performance with and without superscalar optimizations compared to

1-issue and 4-issue architectures. . . . . . . . . . . . . . . . . . . . . . . . . . . 67

6.12 E�ect of superscalar optimizations on code size. . . . . . . . . . . . . . . . . . . 68

6.13 E�ect of superscalar optimizations on the number of accesses to the code memory. 68

6.14 E�ect of NOP elimination on code size. . . . . . . . . . . . . . . . . . . . . . . 69

6.15 E�ect of NOP elimination on the number of accesses to the code memory. . . . 70

6.16 E�ect of the variable instruction length mechanism on code size. . . . . . . . . 71

6.17 E�ect of the variable instruction length mechanism on number of accesses to the

code memory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

6.18 E�ect of the DEVIL instruction fetch mechanism on the code size. . . . . . . . 72

6.19 E�ect of the DEVIL instruction fetch mechanism on the number of accesses to

the code memory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

6.20 Code size comparison between DEVIL and some other mobile processors. . . . 74

7.1 A two-phase non overlapping pipeline using latches. . . . . . . . . . . . . . . . . 79

7.2 DEVIL's pipeline implementation with non-overlapping clocks. . . . . . . . . . 80

7.3 DEVIL datapath block diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . 81

7.4 Fetch and dispatch datapath. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

7.5 Program counter datapath. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

7.6 Register �le . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

7.7 ALU datapath. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

7.8 Data Memory Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

8.1 Predication example: (a) original, (b) optimized, and (c) predicated. . . . . . . 92

List of Figures xv

8.2 Merging example: (a) source code, (b) original, and (c) predicated. . . . . . . . 93

8.3 Loop optimization example: (a) original, (b) unrolled superblock, and (c) unrolled predicated. . . . . 94

8.4 Relative number of predicated instructions. . . . . . . . . . . . . . . . . . . . . 95

8.5 Code expansion considering predication source operand. . . . . . . . . . . . . . 96

8.6 Code reductions due to predicated execution. . . . . . . . . . . . . . . . . . . . 97

8.7 Prefix-based predication decoding of normal and predicated instructions. . . . . 98

8.8 Performance of varying instruction cache size for prefix-based predicated architecture relative to non-predicated architecture. . . . . 99

8.9 Code expansion of superscalar relative to traditional optimization. . . . . 100

8.10 A portion of the inner loop of the UNIX utility wc. The control flow graph (a), and the corresponding hyperblock formed after complete if-conversion (b). . . . . 104

8.11 The wc hyperblock after speculation but before logic minimization (a) and its

corresponding logic diagram (b). The hyperblock after logic minimization (c)

and its corresponding logic diagram (d). . . . . . . . . . . . . . . . . . . . . . . 105

8.12 Comparison of the static schedules for the wc hyperblock before and after logic

minimization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

8.13 Example: optimization of wc predicate network. . . . . . . . . . . . . . . . . . . 108

8.14 Pseudo-code for performing optimization of predicate expressions . . . . . . . . 109

8.15 Factorized predicate define optimization. . . . . 112

8.16 Various methods of predicate expression regeneration. . . . . 113

8.17 Speedup from minimization of program decision logic. . . . . . . . . . . . . . . 115

List of Tables

3.1 Summary of the benefits of parallelization and voltage down-scaling. . . . . 31

4.1 Mobile, Embedded, and ILP processor comparison. . . . . 39

5.1 Characteristics of CoolRISC's low-power ROM (Vdd=3V). . . . . . . . . . . . . 44

5.2 Relative utilization of the core, the code memory, and the data memory . . . . 45

6.1 Execution modes of DEVIL's instruction bundles. . . . . . . . . . . . . . . . . . 60

6.2 Benchmark list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

7.1 Transistor count breakdown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

7.2 Power consumption breakdown. . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

7.3 Summary of the benefits of ILP for low-power. . . . . 86

7.4 Mobile, Embedded, and ILP processor comparison. . . . . 86

8.1 Predicate definition truth table. . . . . 90

8.2 Instruction merging and predicate promotion characteristics. . . . . 97

8.3 Extended predicate definition truth table. . . . . 102

8.4 Speedup and predicate define count for selected functions. . . . . 116

8.5 Effects of conjunctive-type predicate defines on speedup and instruction count. 118

A.1 DEVIL's arithmetical instructions . . . . . . . . . . . . . . . . . . . . . . . . . . 124

A.2 DEVIL's logical instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

A.3 DEVIL's compare instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

A.4 DEVIL's move instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

A.5 DEVIL's branch instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

A.6 DEVIL's load/store instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

A.7 DEVIL's load/store instructions (second part) . . . . . . . . . . . . . . . . . . . 130


Chapter 1

Introduction

In recent years, the market for special-purpose devices designed to perform advanced applications has grown at a tremendous rate. As a result, the demand for embedded microprocessors, a necessary component in these devices, is stronger than ever. The nature of devices such as Personal Digital Assistants (PDAs), mobile phones, printers, and networking equipment requires that these embedded processors meet high performance levels while simultaneously satisfying constraints on power consumption and cost.

To be competitive, manufacturers offer a wide range of products that meet these strong design constraints and place a high priority on energy efficiency, a crucial feature of such processors. Minimizing power consumption increases the autonomy of portable systems, such as mobile phones, and increases the product's worth to the consumer. In addition, energy efficiency affects total system cost, which may be even more important in some applications. Reducing the power dissipation of an integrated circuit lowers the cost of the packaging, of the power supply, and of the heat dissipation mechanism, and also increases the chip's reliability.

In the design of processors, a trade-off is routinely made between performance, power consumption, and cost. In fact, most techniques developed to enhance performance in high-end systems increase the cost of the system and its power consumption. For example, instruction caches, data caches, sophisticated branch predictors, hardware duplication, and dynamic instruction schedulers increase a circuit's complexity, which may imply a relatively large amount of extra power dissipation.

Conversely, some performance-enhancing hardware features can also reduce power consumption. Using such features in conjunction with clock frequency and voltage down-scaling may lower the total energy required to complete a task; the overall performance remains the same even though the clock frequency has been reduced [19]. Such a feature adds value to the product by increasing performance, energy efficiency, or both. Clearly, designers should embrace these techniques whenever possible.

Parallelism is one such technique. Currently, Instruction-Level Parallelism (ILP) is one of the major forces increasing the performance of high-end workstation processors. The resulting architectures are highly complex and exhibit a large amount of power dissipation. A well-known example is the DEC Alpha 21264 which, with more than 15.2 million transistors, dissipates as much as 70 W [2]. However, parallelism is also a well-known low-power technique that can be used to improve the energy efficiency of a system [19].

Investigations into the overall energy efficiency of pipelined and superscalar architectures in general-purpose processors demonstrated that superscalar techniques do not significantly improve the energy efficiency of a processor [24]. This is due mostly to the overhead introduced by the superscalar approach. However, parallelism and pipelining can still be employed to tune the power consumption versus performance trade-off [21, 19]. The key is to employ parallelism and pipelining while reducing the overhead found in superscalar architectures through the use of advanced compiler techniques. With combined hardware and compiler techniques, much of the work performed by standard superscalar processors can be moved from run time to compile time. Specifically, explicitly encoding the parallelism found at compile time in the instructions significantly reduces the overhead found in superscalar processors. Explicitly Parallel Instruction Computing (EPIC) is the term used to describe architectures using this approach [25].

Currently, ILP has been introduced into embedded processors only through pipelined architectures and, generally, such mobile processors do not include any multiple-issue mechanism. There are a few exceptions. The Hitachi SH7750 [30], for example, is based on a 2-issue superscalar architecture, but there are limitations on the available machine parallelism and the processor exhibits a rather heavy power consumption of 1.6 W at 200 MHz and 1.8 V. Also, the new generation of DSP architectures is based on multiple-issue Very Long Instruction Word (VLIW) architectures, to exploit the large amount of instruction parallelism that can be found in digital signal processing [74] [4].

The present thesis aims to address the lack of energy-efficient multiple-issue embedded architectures. The benefits of ILP in low-power mobile processors are investigated in order to determine whether it can improve the trade-off between performance and energy consumption. More precisely, this work focuses on the design of synergistic hardware-compiler architectures such as EPIC or VLIW machines. Such synergism minimizes the hardware overhead of multiple-issue pipelines while maintaining the performance benefits of ILP. However, introducing parallelism into a processor drastically alters its architecture. It is necessary to understand and quantify how such modifications can reduce or nullify the expected benefits and also to assess where the trade-offs should be made. Accordingly, a new EPIC-like low-power processor, DEVIL, is proposed. Its implementation is the subject of a detailed experimental evaluation.

The thesis is structured as follows. Chapter 2 provides an introduction to ILP techniques and describes the concepts relevant to this work. An introduction to power consumption in CMOS circuits is given in Chapter 3, along with an explanation of how parallelism can be used to improve the energy efficiency of a digital circuit. Chapter 4 presents the state of the art in mobile and VLIW processor design, highlighting the lack of energy-efficient multiple-issue architectures. Chapter 5 provides a high-level evaluation of the benefits of VLIW architectures for low-power processors. DEVIL, a new low-power VLIW architecture, is proposed in Chapter 6; its detailed architectural evaluation follows. Chapter 7 presents DEVIL's VLSI implementation, together with an analysis of DEVIL's features in terms of speed, power consumption, and complexity. The benefits of predicated execution support for embedded processors are analyzed in Chapter 8. Finally, Chapter 9 draws concluding remarks.

Chapter 2

Instruction-Level Parallelism

The performance of modern processors is becoming highly dependent on their ability to execute multiple instructions per cycle. These processors extract performance from programs by exploiting Instruction-Level Parallelism (ILP). ILP is extracted either at compile time or at run time from a program composed of sequential instructions. Thus an important feature of ILP techniques is that, like circuit speed improvements, they are generally transparent to users. Pipelined, superscalar, and Very Long Instruction Word (VLIW) processors are examples of processor architectures that derive their benefit from ILP. Superblock and hyperblock formation are examples of compilation techniques that expose the parallelism that these processors can use.

First, this chapter briefly introduces the main performance metric in order to identify the factors that contribute to a high level of performance. Then, it describes the most significant concepts of ILP as well as their limitations. Furthermore, it gives an insight into both hardware and software techniques that exploit ILP. Readers who are interested in ILP concepts can refer to the extensive literature, such as [29] [58] [33] [53].

2.1 Performance Metric

The most common and reliable metric used for performance comparison between processors is the execution time, $T_{exec}$, needed to execute a given task. $T_{exec}$ depends on three parameters: the number of executed instructions, $N$, needed to execute the given task; the number of instructions executed per clock cycle, $IPC$; and the processor clock frequency, $f$:

$$T_{exec} = \frac{N}{f \times IPC} \qquad (2.1)$$

For a given task, the processor with the lowest execution time is the best processor in

terms of performance. Generally, the comparison is established by computing the speed-up, S,

achieved by an architecture A compared to an architecture B:

$$S = \frac{T^{B}_{exec}}{T^{A}_{exec}} \qquad (2.2)$$
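As a minimal sketch, Equations (2.1) and (2.2) can be evaluated numerically as follows. The function names and all numeric values (instruction count, IPC, clock frequency) are hypothetical and chosen only for illustration; they are not measurements from this work.

```python
def execution_time(n_instructions: int, ipc: float, freq_hz: float) -> float:
    """Equation (2.1): T_exec = N / (f * IPC), in seconds."""
    if ipc <= 0 or freq_hz <= 0:
        raise ValueError("IPC and f must be positive")
    return n_instructions / (freq_hz * ipc)


def speedup(t_exec_b: float, t_exec_a: float) -> float:
    """Equation (2.2): speed-up of architecture A over architecture B."""
    return t_exec_b / t_exec_a


# Hypothetical values: same program and clock, different IPC.
t_b = execution_time(1_000_000, ipc=1.0, freq_hz=50e6)  # scalar baseline B
t_a = execution_time(1_000_000, ipc=1.6, freq_hz=50e6)  # multiple-issue machine A
print(f"T_B = {t_b * 1e3:.1f} ms, T_A = {t_a * 1e3:.1f} ms, "
      f"S = {speedup(t_b, t_a):.2f}")
```

With N and f held constant, the speed-up reduces to the ratio of the two IPC values, which is why multiple-issue techniques target IPC directly.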

The ILP techniques that are described below aim to improve one or more of the parameters $N$, $f$, and $IPC$, to enhance the processor performance. For example, pipelining increases the processor throughput by boosting $f$ and $IPC$; superscalar processors increase the number of instructions executed per clock cycle; and VLIW machines reduce the number of instructions required to execute a task. Each of these techniques has its advantages, disadvantages, and limitations; the following sections describe these features.

2.2 Instruction-Level Parallelism: Concepts and Limitations

The traditional way to code a program is to express an algorithm in a sequential language such

as C. After compilation, this results in an ordering of assembly instructions that are executed

sequentially. Although the merit of this approach is its simplicity, such sequences result in a

relatively poor level of performance. To overcome this limitation, ILP techniques are used to

expose independent instructions in a sequential program. With adequate hardware support, the

execution of such independent instructions can be parallelized, reducing the program execution

time.

The performance improvement provided by instruction-level parallelism strongly depends on the ability to find independent instructions. Data dependences, control dependences, and resource conflicts are the fundamental limitations that bound the amount of available parallelism, and therefore the potential increase in performance. The next subsections describe these dependences and the ways to reduce their impact on performance.

2.2.1 Data Dependences

Data dependences occur between instructions that use the same operands, either registers or memory. Data dependences are classified into three categories:

• Flow dependence, or Read After Write (RAW) dependence, happens when an instruction i2 has a source operand that is the result of a previous instruction i1, forcing i1 to be executed before i2. This is the only true dependence (see Figure 2.1(a)).

• Anti-dependence, or Write After Read (WAR) dependence, occurs in the opposite case, when an instruction i4 writes its result to an operand that is a source of a previous instruction i3. Consequently, i3 must read its operands before i4 writes its result (see Figure 2.1(b)).

• Output dependence, or Write After Write (WAW) dependence, happens when two instructions i5 and i6 write to the same destination operand. In this case i5 must be scheduled in such a way that it writes its result before i6 (see Figure 2.1(c)).

(a) Flow dependence        (b) Anti-dependence        (c) Output dependence
i1: add r2, r3, 3          i3: add r3, r2, r1         i5: add r4, r1, 5
i2: mul r4, r2, r1         i4: mov r2, 10             i6: mov r4, 10

Figure 2.1: Different types of data dependences: (a) flow dependence, (b) anti-dependence, (c) output dependence.
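The three categories can be checked mechanically by comparing the destination and source registers of two instructions. The sketch below is illustrative only: the `Instr` type and the `dependences` function are hypothetical names, and only register operands are modelled (immediates are omitted, memory operands are ignored).

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str      # register written by the instruction
    srcs: tuple    # registers read by the instruction (immediates omitted)


def dependences(first: Instr, second: Instr) -> set:
    """Return the data dependences of `second` on the earlier `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")   # flow: second reads what first wrote
    if second.dest in first.srcs:
        found.add("WAR")   # anti: second overwrites a source of first
    if first.dest == second.dest:
        found.add("WAW")   # output: both write the same register
    return found


# The three instruction pairs of Figure 2.1.
i1, i2 = Instr("r2", ("r3",)), Instr("r4", ("r2", "r1"))
i3, i4 = Instr("r3", ("r2", "r1")), Instr("r2", ())
i5, i6 = Instr("r4", ("r1",)), Instr("r4", ())

print(dependences(i1, i2), dependences(i3, i4), dependences(i5, i6))
```

A pair of instructions can carry more than one kind of dependence at once, which is why the function returns a set rather than a single label.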


These dependences limit code motion and optimizations at both compile time and run time. This is why several techniques have been proposed to break such constraints. Register renaming is the main technique that eliminates anti- and output dependences. Figure 2.2 shows how register renaming works. Before register renaming (Figure 2.2(a)(b)), the code may contain artificial dependences due to register allocation, thus limiting the parallelism. For example, i1 and i3 cannot be executed in parallel because of the output dependence caused by register r3. When register renaming is applied (Figure 2.2(c)(d)), all the WAR and WAW dependences are suppressed by giving each register a new name at each of its redefinitions. For example, instruction i1 redefines register r3; therefore r3 is renamed to r3a until the next redefinition of r3 (i3), where it becomes r3b.

(a) Code before register renaming:
i1: sub r3, r3, r5
i2: add r4, r3, 1
i3: add r3, r5, 1
i4: div r7, r3, r4

(c) Code after register renaming:
i1: sub r3a, r3, r5
i2: add r4, r3a, 1
i3: add r3b, r5, 1
i4: div r7, r3b, r4

[Dependence graphs (b) and (d): nodes i1 to i4, with edges labelled by the register involved (r3 and r4 before renaming; r3a, r3b, and r4 after) and marked as flow, anti-, or output dependences.]

Figure 2.2: Register renaming suppresses anti- and output dependences. (a) Code before register renaming, (b) dependence graph before register renaming, (c) code after register renaming, (d) dependence graph after register renaming.
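The renaming of Figure 2.2 can be sketched for straight-line code as below. This is an illustrative model, not the mechanism of any particular compiler or processor: the `rename` function and the tuple encoding of instructions are assumptions, only registers defined more than once receive a suffix (to mirror the figure), and at most 26 definitions per register are supported.

```python
import string
from collections import Counter


def rename(code):
    """Rename each redefined register (r3 -> r3a, r3b, ...) in straight-line code.

    `code` is a list of (opcode, dest, sources) tuples. Sources that were never
    defined in the sequence, including immediates, pass through unchanged.
    """
    n_defs = Counter(dest for _, dest, _ in code)
    current = {}        # architectural register -> name of its latest version
    seen = Counter()    # definitions already renamed, per register
    renamed = []
    for op, dest, srcs in code:
        # Sources read the latest version, which preserves flow dependences.
        new_srcs = tuple(current.get(s, s) for s in srcs)
        if n_defs[dest] > 1:
            # A fresh name per definition removes WAR and WAW dependences.
            new_dest = dest + string.ascii_lowercase[seen[dest]]
            seen[dest] += 1
        else:
            new_dest = dest
        current[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


before = [
    ("sub", "r3", ("r3", "r5")),
    ("add", "r4", ("r3", "1")),
    ("add", "r3", ("r5", "1")),
    ("div", "r7", ("r3", "r4")),
]
for op, dest, srcs in rename(before):
    print(op, dest, *srcs)
```

After renaming, i3 no longer writes a register that i1 writes or that i2 reads, so only the flow dependences (i1 to i2, and i2 and i3 to i4) remain.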

Flow dependences cannot be eliminated with register renaming; they inherently define the data flow of the program and are much more difficult to eliminate. However, some optimizations can modify the program data flow. For example, arithmetic properties can be used to re-express the data flow of a sequence of instructions. Figure 2.3 illustrates how associativity can be used to generate more parallel code by better balancing the dependence graph. Other techniques that eliminate flow dependences are described in Subsection 2.6.2.

All the previous examples show data dependences between register operands; however, there can also be memory data dependences. The latter occur between accesses to the same data memory location, and introduce a new problem known as memory disambiguation. Although it is easy to detect dependences between accesses to the same variable, it is much harder to know whether accesses made through pointers are independent. Indeed, at compile time, it is not always possible to know the memory location addressed by a pointer, and therefore some memory dependences cannot be resolved. In this case, to maintain program correctness, the scheduler must conservatively assume that there is a data dependence, which may limit the parallelism of the program.

a = b + c + d + e

a = ((b + c) + d) + e      chain of three dependent additions: 3 clock cycles
a = (b + c) + (d + e)      balanced tree of additions: 2 clock cycles

Figure 2.3: Arithmetic transformation for critical path reduction.
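The critical-path reduction of Figure 2.3 can be verified with a short sketch that measures the depth of an expression tree. It assumes, for illustration only, that every addition takes one clock cycle and that enough adders are available to execute independent additions in parallel; the nested-tuple encoding and the `depth` function are hypothetical.

```python
def depth(expr) -> int:
    """Critical-path length, in additions, of a nested (left, right) tuple."""
    if isinstance(expr, str):   # a leaf operand such as "b"
        return 0
    left, right = expr
    return 1 + max(depth(left), depth(right))


chain = ((("b", "c"), "d"), "e")      # a = ((b + c) + d) + e
balanced = (("b", "c"), ("d", "e"))   # a = (b + c) + (d + e)

print(depth(chain), depth(balanced))
```

Both forms perform three additions; only the length of the longest chain of dependent additions changes, from three to two.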

2.2.2 Control Dependences

Branch operations create another type of dependence, the control dependence, which occurs between a branch operation and the instructions ordered after the branch. Figure 2.4 shows the control flow graph of an if-then-else structure, where the inc a, dec a, and jmp instructions are control dependent on beq x, 0 because their execution depends on the outcome of the branch. The main difference between control and data dependences is that the former are characterized by run-time uncertainty, since the target of the branch is not known until the conditional branch has been executed. This is why exploiting ILP in the presence of branches has been the subject of much research. The two most commonly used techniques are control speculation and predication. Control speculation is most commonly performed in superscalar processors using a combination of branch prediction and dynamic scheduling [32][59][81]. Control speculation increases ILP by guessing the outcome of a branch and executing instructions along the predicted path. In this manner, control dependences are broken so that instructions can execute before the branch outcome is determined. Given an instruction set that supports speculative operations, control speculation can also be performed statically by an aggressive compile-time scheduler that moves instructions across branches [7][40].

Predication has become a popular instruction set architecture feature for expressing program control by conditionally executing instructions [31][54]. A compiler can employ if-conversion to convert a sequence of code containing branches into an equivalent branch-free sequence of conditionally executed instructions [6]. Predicated execution increases ILP by allowing the compiler to schedule operations from multiple paths of control for simultaneous execution.
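The effect of if-conversion can be sketched on the if-then-else of Figure 2.4. The model below is illustrative only: it assumes that a is set to 0 before the branch (as the mov a, 0 preceding beq x, 0 suggests), and it imitates predicated instructions with conditional expressions. The function names are hypothetical.

```python
def branchy(x: int) -> int:
    """Original form: the update of `a` is control dependent on the branch."""
    a = 0
    if x == 0:
        a += 1
    else:
        a -= 1
    return a


def predicated(x: int) -> int:
    """If-converted form: no branch, every instruction is guarded by a predicate."""
    a = 0
    p = (x == 0)           # predicate define, replaces the compare-and-branch
    q = not p              # complementary predicate for the else path
    a = a + 1 if p else a  # "inc a" executes only when p is true
    a = a - 1 if q else a  # "dec a" executes only when q is true
    return a


print([(x, branchy(x), predicated(x)) for x in (-1, 0, 1)])
```

In the predicated form both guarded operations depend only on their predicate, so a scheduler is free to issue them in the same cycle.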

(a) C code:
a = 0;
if (x == 0) {
    a++;
} else {
    a--;
}

(b) Control flow graph: the block "mov a, 0; beq x, 0" branches (taken / not taken) to the blocks "inc a; jmp" and "dec a", which rejoin at the following code.

Figure 2.4: Control dependences: (a) C code, (b) corresponding control flow graph.

2.2.3 Resource Conflicts

The number of available resources, such as functional units and register file ports, also limits the level of parallelism. Resource conflicts occur between two instructions that require the same hardware resource at the same time. For example, if a processor has only one memory port, two independent memory load operations must be executed sequentially. Hardware duplication (adding an extra memory port in the above example) suppresses such a conflict. Obviously, a trade-off exists between circuit complexity and performance enhancement.

2.3 Pipelining

The first generation of microprocessors generally issued and executed instructions in a purely sequential way, and required several clock cycles to execute each instruction (see Figure 2.5). The overall effect of sequential execution was a very low instruction throughput.

instr. 1   Fetch, Dec, Exec, W.B.
instr. 2                            Fetch, Dec, Exec, W.B.
instr. 3                                                     Fetch, Dec, Exec, W.B.

Figure 2.5: Instruction timing for a non-pipelined processor.

To reduce the performance penalty due to sequential execution, pipelining has been introduced in processor architectures. Pipelining exploits instruction-level parallelism by dividing instruction execution into independent stages (i.e., pipeline stages) and overlapping their execution. Therefore, several instructions are executed in parallel, but are still issued sequentially. Figure 2.6 shows the execution timing of a four-stage pipeline (Fetch, Decode, Execute, and Write Back).


Cycle      1    2    3    4    5    6
instr. 1   F    D    E    WB
instr. 2        F    D    E    WB
instr. 3             F    D    E    WB

F = Fetch, D = Decode, E = Execute, WB = Write Back

Figure 2.6: Instruction timing for a four-stage pipelined processor.

Ideally, increasing the number of pipeline stages increases the number of instructions that are executed in parallel. Dividing execution into more stages also reduces the processor's cycle time, and generally results in a large performance improvement. Figure 2.7 shows how the throughput (i.e., performance) increases with the number of pipeline stages, motivating the use of deep pipelines. However, the benefits of pipelining are bounded by data and control dependences, and their effect becomes stronger as the number of pipeline stages increases. The next subsections describe how dependences affect pipelined execution.
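The ideal gain from pipelining can be estimated with a simple cycle count. The sketch below assumes a stall-free pipeline in which every stage takes one clock cycle, and a non-pipelined machine that spends one cycle per stage on each instruction; both are simplifying assumptions made for illustration, and the function names are hypothetical.

```python
def cycles_sequential(n_instr: int, n_stages: int) -> int:
    """Non-pipelined: an instruction finishes all stages before the next starts."""
    return n_instr * n_stages


def cycles_pipelined(n_instr: int, n_stages: int) -> int:
    """Ideal pipeline: n_stages cycles for the first result, then one per cycle."""
    return n_stages + n_instr - 1


# Three instructions on a four-stage machine, as in Figures 2.5 and 2.6.
print(cycles_sequential(3, 4), cycles_pipelined(3, 4))

# Relative run time of 100 instructions when doubling the pipeline depth also
# doubles the clock frequency (the scaling drawn in Figure 2.7); unit is 1/f.
for stages, freq in ((2, 1), (4, 2), (8, 4)):
    print(stages, cycles_pipelined(100, stages) / freq)
```

For three instructions this gives 12 cycles without pipelining and 6 with it, matching the six cycles spanned by Figure 2.6. The dependences discussed in the following subsections add stall cycles to this ideal count.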

2.3.1 Effect of Control Dependences in Pipelined Execution

Pipelined architectures generally need to fetch one instruction per clock cycle. When a conditional branch is fetched, there is a delay before the direction of the branch (i.e., the address of the next instruction) is resolved. Therefore, one or more instructions that are fetched after the conditional branch instruction can come from the wrong path of execution. Figure 2.8 illustrates this phenomenon when there is no branch prediction, i.e., the branches are always predicted as not taken. In cycle T2, when the conditional branch instruction is fetched, the processor must compute the address of the next instruction. As the conditional branch, jt/jnt, is not yet decoded and the comparison, test.eq, is not yet executed, the processor fetches the next sequential instruction. During cycle T3 the result of the comparison is computed and the program address is updated with a one-cycle delay. If the conditional branch is not taken, the pipeline can continue the execution normally. However, if the branch is taken, the instruction fetched during cycle T3 comes from the wrong path and must therefore be nullified.

A simple way to address this problem is to always execute a fixed number of instructions, referred to as branch delay slots, immediately following every control operation. For example, in Figure 2.8 this corresponds to always executing Instr. 3. In this case the compiler or programmer should fill the branch delay slot with instruction(s) that do not depend on the branch outcome. When no such instruction is available, NOP operations must be inserted, which reduces pipeline performance. This approach is simple and works for scalar processors, where there are only one or two instructions in the delay slot. However, when this technique is applied to wide-issue processors, the number of instructions in the branch delay slot becomes too high. Consequently, it is impossible to find a sufficient number of instructions that can be moved into the delay slot, resulting in a severe loss in performance and in an increase in code size due to extra NOP insertion. In this case, another technique should be used to reduce the penalty.

[Timing diagrams: (a) 2-stage pipeline, clock frequency f, up to 2 instructions executed in parallel; (b) 4-stage pipeline, clock frequency 2f; (c) 8-stage pipeline, clock frequency 4f.]

Figure 2.7: (a) Instructions executed in a 2-stage pipeline, (b) instructions executed in a 4-stage pipeline, (c) instructions executed in an 8-stage pipeline.

Branch prediction enables the processor to speculate on the target of a branch instruction before the true branch target has been resolved. For correct predictions, the speculated operations are useful, and the processor's pipeline is not adversely affected. However, for incorrect predictions, a costly performance penalty is incurred since the speculated instructions must be removed from the pipeline and the processor pipeline restarted. There are several ways to make a prediction. Fixed prediction always makes the same guess, either taken or not taken, and is considered a one-outcome guess. True prediction has two possible outcomes and can be static, if the prediction depends only on the code in question, or dynamic, if the prediction depends on the execution history.
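The difference between fixed and dynamic prediction can be illustrated by counting mispredictions on a branch trace. The sketch below is an assumption-laden example: the "last outcome" scheme is just one simple history-based predictor chosen for illustration (the text does not prescribe a particular scheme), and the trace is hypothetical.

```python
def mispredictions_fixed(trace, guess=False):
    """Fixed prediction: always the same guess (False means not taken)."""
    return sum(1 for taken in trace if taken != guess)


def mispredictions_last_outcome(trace, initial=False):
    """Dynamic prediction: predict that the branch repeats its last outcome."""
    misses, prediction = 0, initial
    for taken in trace:
        if taken != prediction:
            misses += 1
        prediction = taken   # the execution history is one bit: the last outcome
    return misses


# Hypothetical trace: a loop-closing branch taken 9 times, then falling through.
trace = [True] * 9 + [False]
print(mispredictions_fixed(trace), mispredictions_last_outcome(trace))
```

On this trace the fixed not-taken guess is wrong on every taken iteration, whereas the history-based predictor is wrong only on the first and last iterations. Each misprediction costs the pipeline restart penalty described above.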

2.3.2 Effect of Data Dependences in Pipelined Execution

Each stage of execution receives data from its previous stage and furnishes data to its next stage. Such data flow introduces data dependences between instructions being executed in the pipeline, limiting the processor's performance. For example, Figure 2.9 illustrates the case of a RAW dependence in a four-stage pipeline. The ADD instruction adds 4 to R0 and stores the result in R1. The next instruction, MUL, reuses R1 as a source operand. At the end of cycle T3, the MUL instruction has to read its source operands from the register file; however, the result of the addition has not yet been written, so the hardware must stall the pipeline for one cycle.

           T1      T2       T3       T4       T5
test.eq    Fetch   Decode   Alu-WB
jt/jnt             Fetch    New PC
Instr. 3                    Fetch    Decode   Alu-WB

If the conditional branch is taken, Instr. 3 is squashed.

Figure 2.8: Illustration of control dependences in a three-stage pipeline.

                 T1   T2   T3   T4      T5   T6
ADD R1, R0, 4    F    D    E    WB
MUL R2, R1, 12        F    D    Stall   E    WB

The MUL instruction waits for R1.

Figure 2.9: Pipeline stall due to a RAW data dependence.

Some pipeline stalls due to RAW dependences can be eliminated by a mechanism called bypassing, illustrated in Figure 2.10. When a RAW dependence is detected by the hardware, the result is transmitted directly to the execution unit instead of being read from the register file. In our example, the hardware detects the RAW dependence during T3, and the result of the addition is bypassed to the multiplication at the end of cycle T3.

Bypassing can eliminate the stalls caused by all RAW dependences having a one-cycle distance, as in Figure 2.10. However, when an instruction has a latency of several clock cycles, which is generally the case for loads, bypassing alone cannot hide the RAW dependence. In this case the pipeline must be stalled or delay slots must be inserted. Figure 2.11 shows how a delay slot can be used to avoid pipeline stalls. An instruction, Instr. 1, is inserted between the two dependent instructions, LD.32 and MUL, on the condition that Instr. 1 does not use the destination of the load. In this way, the load latency is masked. However, it is not always possible to move a useful instruction into the delay slot, and sometimes the compiler or the programmer must add a NOP. Note that bypassing is still used to reduce the data latency by one cycle.
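The stall behaviour of Figures 2.9 to 2.11 can be reproduced with a small timing model. The sketch makes several assumptions for illustration: stages are numbered from 0 (Fetch), one instruction is fetched per cycle, the consumer needs its operands when it enters stage 2 (Execute), and a value is usable in the cycle after the stage that makes it available. The function and parameter names are hypothetical.

```python
def stall_cycles(distance: int, result_stage: int, wb_stage: int,
                 bypass: bool) -> int:
    """Stall cycles for a consumer fetched `distance` cycles after its producer.

    With bypassing, the value is usable the cycle after `result_stage`;
    without it, only the cycle after the producer's write back (`wb_stage`).
    """
    ready_after = result_stage if bypass else wb_stage
    nominal_execute = distance + 2   # consumer's Execute cycle if never stalled
    return max(0, ready_after + 1 - nominal_execute)


ALU = dict(result_stage=2, wb_stage=3)    # F, D, E, WB
LOAD = dict(result_stage=3, wb_stage=4)   # F, D, Addr, Mem, WB

print(stall_cycles(1, bypass=False, **ALU))   # ADD then MUL, Figure 2.9
print(stall_cycles(1, bypass=True, **ALU))    # ADD then MUL, Figure 2.10
print(stall_cycles(1, bypass=True, **LOAD))   # load immediately followed by its use
print(stall_cycles(2, bypass=True, **LOAD))   # one delay slot filled, Figure 2.11
```

Under these assumptions the model gives one stall cycle without bypassing and none with it for the ADD/MUL pair, and shows that a load still needs one independent instruction (or a NOP) between itself and its first use.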

                 T1   T2   T3   T4   T5
ADD R1, R0, 4    F    D    E    WB
MUL R2, R1, 12        F    D    E    WB

The result of the ADD is bypassed to the MUL.

Figure 2.10: Result bypassing avoids the pipeline stall due to one-cycle RAW data dependences.

                 T1   T2   T3     T4    T5   T6
LD.32 R1, (R0)   F    D    Addr   Mem   WB
Instr. 1              F    D      E     WB
MUL R2, R1, 12             F      D     E    WB

Instr. 1 should not use R1. The load value is bypassed to the MUL.

Figure 2.11: Delay slot due to a one-cycle load latency.

2.3.3 Resource Conflict in Pipelined Execution

Figure 2.11 illustrates a common case of resource conflict occurring between two instructions. As the load operation has a latency one clock cycle higher than other instructions, the load and the instruction that follows it perform their write back at the same time. If the processor has a limited number of register write ports, this can result in a resource conflict. In the case of Figure 2.11, this is solved by allowing two simultaneous write backs, i.e., by having two register write ports instead of one. The other solution is to stall the pipeline when there is a resource conflict, implying a loss in performance.

2.4 Superscalar Architectures

Pipelined machines use ILP in a horizontal way: they execute several instructions in parallel, but instructions are still issued sequentially. Superscalar architectures enhance the basic pipeline model by allowing the issue and execution of multiple instructions per clock cycle. Superscalar processors fetch several instructions each clock cycle, analyze dependences between operations, and dispatch the instructions to several functional units. All these operations are performed by the processor's hardware, resulting in an overhead in circuit complexity, but preserving compatibility with code generated for non-superscalar processors. Figure 2.12 shows the execution timing of a generic superscalar processor able to fetch, decode, and execute up to two instructions in parallel.

Figure 2.13 shows a block diagram of a generic four-unit superscalar processor. At each

clock cycle the processor fetches up to four instructions into the instruction cache, and transmits

them to the decoder. Also, during the fetch the processor predicts the address of the next block


CLK LLL�HH�LL�HH�LL�HH�LL�HH�LL�HH�LL�H

Instr. 1 ...�VVVVVV�VVVVVV�VVVVVV�VVVVVV�.........F D E WB


Instr. 3 ...........�VVVVVV�VVVVVV�VVVVVV�VVVVVV�.F D E WB


Figure 2.12: Execution timing of a generic two-issue superscalar processor.

of instructions. When the decoder receives the operations, it decodes the four instructions and sends them to an instruction buffer called the Instruction Window or Reservation Station. Data dependences are computed between all the operations stored in the Instruction Window. Depending on the dependences and on resource availability, the processor sends the different instructions to the functional units. Finally, the results are written into the register file through a reorder buffer. The latter is used to support out-of-order execution, which is described in the next sections.

With the superscalar model of execution, the processor's performance depends not only on the width of the pipeline (i.e., the number of functional units), but also strongly on the processor's policy for fetching, decoding, and executing instructions, called the instruction-issue policy. The instruction-issue policy limits or enhances the lookahead capability, and therefore the ability to find independent instructions beyond the current point of execution.

The following sections describe and compare three different instruction-issue policies. The comparison is made through a single example, taken from [33], in which six instructions are executed according to the different instruction-issue policies. The instruction sequence, from i1 to i6, has the following constraints on parallelism:

• i1 requires two cycles to execute,
• i3 and i4 conflict for a functional unit (edge 1 in Figures 2.14, 2.15, and 2.16),
• i5 depends on the value produced by i4 (edge 2 in Figures 2.14, 2.15, and 2.16),
• i5 and i6 conflict for a functional unit (edge 3 in Figures 2.14, 2.15, and 2.16).

To help visualize the operation of the superscalar processor, the figures show the processor pipeline stages horizontally and the clock cycles vertically.

2.4.1 In-order Issue with In-order Completion

The simplest instruction-issue policy is to issue instructions in the original program order (in-order issue) and to write the results in the same order (in-order completion). To accomplish


[Block diagram: instruction memory, instruction cache, fetch stage with branch prediction, decoder, instruction window (or reservation station) with data-dependence analysis and scheduling, two ALUs, a load/store unit, a branch unit, register file, reorder buffer, data cache, and data memory.]

Figure 2.13: Block diagram of a generic four-unit superscalar processor.

this, instruction issuing stalls when there is a resource conflict or when a functional unit has a result latency greater than one cycle. Figure 2.14 illustrates the in-order issue with in-order completion policy. During cycle 3 the pipeline stalls because i1 requires two cycles to execute, and during cycles 5 and 7 the stalls are caused by the resource conflicts between i3 and i4 and between i5 and i6. These stalls prevent the processor from fetching new instructions, and therefore limit its lookahead capabilities.

In-order issue with in-order completion has an inherent simplicity; however, the achievable performance is relatively low. This is why this scheduling policy is generally not used, even in scalar processors.

2.4.2 In-order Issue with Out-of-order Completion

A first step to improve performance is to allow out-of-order completion. Thanks to this new degree of freedom, the pipeline need not stall when a functional unit needs more than one cycle to generate a result. Figure 2.15 illustrates how instruction i2 completes out of order, allowing the execution of i1 and i3 to overlap during cycle 3.

For processors supporting in-order issue with out-of-order completion, the pipeline stalls when there is a resource conflict or when an issued instruction depends on a result that is not yet computed. Furthermore, output dependences must be taken into account, because two instructions having the same destination register cannot be completed out of order.

Out-of-order completion yields higher performance than in-order completion but requires more hardware. Dependences must be checked between the decoded instruction and all instructions in all pipeline stages. Also, the hardware must ensure that the results are written in the correct order, to keep the register file coherent.

A new problem introduced by out-of-order completion is exception handling. Sometimes, an instruction raises an exception. Once the exception routine has completed, program execution must be restarted so that it can continue as usual. The problem is that the exception may have been detected while an instruction produced its result out of order. Therefore, it is not possible to simply restart the program at the instruction following the excepting instruction, because subsequent instructions may have already completed, and doing so would cause them to be executed twice.

[Pipeline diagram: columns Decode, Execute, and Writeback; rows are cycles 1 to 8; stalls are marked, and edges 1, 2, and 3 denote a resource conflict, a flow dependence, and a resource conflict, respectively.]

Figure 2.14: Superscalar pipeline with in-order issue and in-order completion.

[Pipeline diagram: columns Decode, Execute, and Writeback; rows are cycles 1 to 7; a stall is marked, and edges 1, 2, and 3 denote the constraints between i3 and i4, i4 and i5, and i5 and i6.]

Figure 2.15: Superscalar pipeline with in-order issue and out-of-order completion.

2.4.3 Out-of-order Issue with Out-of-order Completion

With in-order issue the processor's lookahead abilities are limited because the decoder stalls when there is a resource conflict, a flow dependence, or an output dependence between uncompleted instructions. Therefore, the processor is not able to look beyond instructions with the conflict or dependence, even though subsequent instructions might be independent.


To surmount this problem, an instruction buffer called the instruction window is inserted between the decode and execute stages. The instruction window is used as a pool of instructions, allowing the processor to fetch instructions until the instruction window is full. The lookahead capability is then constrained only by the width of the instruction fetch and by the size of the instruction window. Operations can be issued from the instruction window and executed out of order. The only constraint is to ensure correct program behavior.


[Pipeline diagram: columns Decode, Window, Execute, and Writeback; rows are cycles 1 to 6; edges 1, 2, and 3 denote the constraints between i3 and i4, i4 and i5, and i5 and i6.]

Figure 2.16: Superscalar pipeline with out-of-order issue and out-of-order completion.

Figure 2.16 shows the operation of a superscalar pipeline with out-of-order issue. Note that the instruction window is not an extra pipeline stage; it is simply a buffer where the decoder can store instructions. By buffering instructions, the decoder is able to operate at its maximum rate. This allows the processor to find more independent instructions. In our example, the independent instruction i6 is issued out of order, concurrently with i4.

Compared to in-order issue with out-of-order completion, out-of-order issue has to deal with one more type of dependence, the anti-dependence. The processor therefore has to ensure that an instruction executed out of order does not prematurely modify a register that an earlier instruction still needs to read.
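The three register dependences mentioned so far (flow, anti, and output) can be told apart by comparing the destination and source registers of two instructions. The following is a minimal sketch, not taken from the original text: the `Instr` record and the `dependences` helper are hypothetical names, and each instruction is assumed to write exactly one register.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical register-to-register operation with a single destination."""
    dest: str
    srcs: tuple


def dependences(first: Instr, second: Instr) -> set:
    """Return the register dependences of `second` on the earlier `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("flow")    # second reads the value produced by first
    if second.dest in first.srcs:
        found.add("anti")    # second overwrites a register that first reads
    if second.dest == first.dest:
        found.add("output")  # both write the same destination register
    return found


mul = Instr(dest="R2", srcs=("R1",))       # R2 := R1 * 12
add = Instr(dest="R1", srcs=("R3", "R4"))  # R1 := R3 + R4
print(dependences(mul, add))               # prints {'anti'}
```

In this example the later instruction may not write R1 before the earlier one has read it, which is exactly the constraint that out-of-order issue has to enforce.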

2.4.4 Exception Recovery and Register Dataflow in Superscalar Processors

Aggressive scheduling policies are required to increase the performance of superscalar processors. However, techniques such as out-of-order issue or completion introduce new problems in exception handling and in the handling of instruction dependences.

High instruction throughput is obtained in superscalar processors by fetching and issuing operations under the assumption that branches are correctly predicted. Such techniques require a recovery and restart mechanism to ensure correct execution when a branch is mispredicted or when an instruction causes an exception. To handle such cases, the processor maintains an execution history with the following states [33]:

• The in-order state, composed of the most recent assignments performed by the longest continuous sequence of completed instructions.
• The lookahead state, composed of all the assignments, starting with the first uncompleted instruction, to the end of the sequence.
• The architectural state, composed of the most recently completed and pending assignments to each register, relative to the end of the known instruction sequence (i.e., fetched instructions).


Figure 2.17 illustrates these three processor states. Note that instructions (2) and (6) were deleted from the in-order state and the architectural state, respectively, because they are not the most recent assignment to their register in the corresponding state.
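The three state definitions can be turned into a few lines of code. The sketch below is an illustration only: `processor_states` is a hypothetical helper, and it assumes that the first four instructions of the eight-instruction sequence of Figure 2.17 form the completed prefix.

```python
def processor_states(sequence, completed_count):
    """Split a fetched instruction sequence into the three processor states.

    sequence: list of (tag, register) assignments in program order.
    completed_count: length of the longest continuous completed prefix.
    """
    completed = sequence[:completed_count]
    lookahead = sequence[completed_count:]

    # In-order state: most recent completed assignment to each register.
    in_order = {}
    for tag, reg in completed:
        in_order[reg] = tag

    # Architectural state: most recent assignment to each register,
    # whether completed or still pending.
    architectural = dict(in_order)
    for tag, reg in lookahead:
        architectural[reg] = tag

    return in_order, lookahead, architectural


sequence = [(1, "R3"), (2, "R7"), (3, "R8"), (4, "R7"),
            (5, "R4"), (6, "R3"), (7, "R8"), (8, "R3")]
in_order, lookahead, architectural = processor_states(sequence, completed_count=4)
```

With these inputs, assignment (2) is superseded by (4) in the in-order state and (6) is superseded by (8) in the architectural state, matching the deletions described above.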

[Diagram: an instruction sequence of eight register assignments, R3 := ...(1), R7 := ...(2), R8 := ...(3), R7 := ...(4), R4 := ...(5), R3 := ...(6), R8 := ...(7), R3 := ...(8), with the completed instructions marked, shown alongside the resulting in-order, lookahead, and architectural states.]

Figure 2.17: In-order, lookahead, and architectural state for an out-of-order issue superscalar

processor.

One classical approach to store these different states is to add a reorder buffer [60] (see Figure 2.13) in the processor. In this case, the register file contains the in-order state and the reorder buffer stores the lookahead state. The architectural state is obtained by combining the in-order and the lookahead state. Other variants of the recovery mechanism, such as a history buffer, or a reorder buffer with a future file, can be found in [33].

The other problem introduced by the out-of-order policy is that anti- and output dependences can limit the performance of the processor. As described in Subsection 2.2.1, register renaming can eliminate these kinds of dependences and can be implemented in hardware. For example, processors that have a reorder buffer and use an associative lookup table to form the architectural state provide a straightforward implementation of register renaming [33].

2.5 Very Long Instruction Word Architectures

Superscalar processors with out-of-order execution achieve higher performance than scalar processors. However, the drawback of the superscalar technique is the increase in circuit complexity. Indeed, dependence checking, dispatch unit, instruction window, exception recovery mechanism, branch predictor, multi-ported register files, and reorder buffer introduce a substantial circuit overhead.

Very Long Instruction Word (VLIW) architectures are an alternative solution to exploit ILP with a lower circuit overhead than superscalar processors. Like superscalar processors, VLIW architectures issue and execute more than one operation per clock cycle. However, scheduling and data dependence analysis are moved from the hardware level to the compiler level, resulting in a significant decrease in circuit complexity. Figure 2.18 shows the execution timing of a two-issue VLIW pipeline. The main difference from a superscalar pipeline (Figure 2.12) is that VLIW architectures fetch only one (large) instruction per clock cycle, which encodes parallel operations designated for the different functional units.

Figure 2.19 illustrates the block diagram of a generic four-issue VLIW processor. Compared to the superscalar processor (Figure 2.13), there is no need for dependence analysis, an instruction window, or a hardware scheduler. The VLIW processor has exactly the same behavior as a


[Timing diagram: CLK waveform with instruction traces passing through the pipeline stages, ending in E and WB.]

Figure 2.18: Execution timing of a generic two-issue VLIW processor.

scalar processor: it fetches an instruction, decodes, and then executes it. However, the execution

can involve several functional units.

An example of the VLIW compiler tasks is shown in Figure 2.20 using the VLIW processor of Figure 2.19. The dependence graph (Figure 2.20(b)) is computed from the sequential code (Figure 2.20(a)), and shows how operations can be executed in parallel:

• At first, the two load operations can be executed in parallel,
• then, the shl and the add,
• and finally, the sub.

However, VLIW processors have a limited number of resources, and their schedulers have to take such constraints into account. Figure 2.20(c) shows how four large instructions are formed from the original code sequence according to the dependence graph and the resource constraints. For example, the two loads are scheduled in two different instructions because there is only one load/store unit, and therefore the loads must be executed in sequence. Also, when the scheduler is not able to find a sufficient number of independent operations, NOPs are inserted explicitly, resulting in an increase in code size. Current VLIW processors, such as the Texas Instruments TMS320C6201 [74] or the HP/Intel IA-64 [25], have special encoding mechanisms that reduce the cost of inserting extra NOPs.
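The formation of VLIW instructions described above can be sketched as a greedy list scheduler. This is an illustration under stated assumptions rather than the algorithm of any particular compiler: every operation has unit latency, only flow dependences are tracked, each instruction offers one load/store slot, two ALU slots, and one branch slot, and the names `UNITS`, `OPS`, and `schedule` are hypothetical.

```python
# Functional-unit slots assumed available in every VLIW instruction.
UNITS = {"ls": 1, "alu": 2, "br": 1}

# (text, unit, destination register, source registers), in program order.
OPS = [
    ("ld r0, label_x", "ls", "r0", ()),
    ("ld r1, label_y", "ls", "r1", ()),
    ("add r2, r0, 1", "alu", "r2", ("r0",)),
    ("shl r3, r1, 1", "alu", "r3", ("r1",)),
    ("sub r4, r2, r3", "alu", "r4", ("r2", "r3")),
]


def schedule(ops, units):
    """Greedy list scheduling with unit latency; returns a list of bundles."""
    ready = set()          # registers whose values are already available
    pending = list(ops)
    bundles = []
    while pending:
        free = dict(units)
        bundle, produced = [], set()
        for op in list(pending):
            text, unit, dest, srcs = op
            if free[unit] > 0 and all(s in ready for s in srcs):
                free[unit] -= 1
                bundle.append(text)
                produced.add(dest)
                pending.remove(op)
        if not bundle:
            raise ValueError("no operation can be scheduled")
        # Unused slots are filled with explicit NOPs.
        bundle += ["nop"] * sum(free.values())
        bundles.append(bundle)
        ready |= produced  # results become visible in the next instruction
    return bundles


for number, bundle in enumerate(schedule(OPS, UNITS), start=1):
    print(number, bundle)
```

Under these assumptions the five operations end up in four instructions, with the second load sharing an instruction with the add. The position of an operation inside a bundle is not meant to identify a particular slot.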

The performance of VLIW processors strongly depends on the capability of the compiler to extract parallelism from a sequential program. Such compiler techniques also play an important role for superscalar architectures, by breaking instruction dependences and therefore giving the processor more opportunities to find parallelism between instructions. The next section gives a brief description of these ILP techniques.

2.6 Compiler Techniques to Extract ILP

This subsection introduces some major compiler techniques to generate code for ILP architectures. The main goal of such techniques is to break the barriers introduced by the instruction


[Block diagram: instruction memory, instruction cache, fetch stage with branch prediction, decoder of the very large instruction, two ALUs, a load/store unit, a branch unit, register file, data cache, and data memory.]

Figure 2.19: Block diagram of a generic four-unit VLIW processor.

dependences. Although there is an abundant amount of research in ILP, this section only gives

an introduction to the concepts relevant to this work.

2.6.1 Basic Block Scheduling

In the traditional sequential representation of programs, the code is composed of basic blocks (BB). A basic block is a sequence of instructions that does not contain a branch (except for the last operation) or a branch target (except for the first operation), and has the property that if one instruction of the BB is executed, all other instructions are also executed. Figure 2.21 shows how a program can be divided into basic blocks and represented by a control flow graph.
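The definition above leads directly to the usual "leader" method for partitioning code: a block starts at the first instruction, at every branch target, and after every branch. The sketch below is illustrative only. The `basic_blocks` helper and the tuple encoding are hypothetical, every label is assumed to be a branch target, and the instruction list is loosely modeled on the assembly code of Figure 2.21(b).

```python
def basic_blocks(instrs):
    """Partition instructions into basic blocks.

    instrs: list of (label, text, branch_target) tuples, where label is a
    string or None and branch_target is a label or None for non-branches.
    """
    if not instrs:
        return []
    leaders = {0}                      # the first instruction starts a block
    for i, (label, _text, target) in enumerate(instrs):
        if label is not None:
            leaders.add(i)             # a branch target starts a block
        if target is not None and i + 1 < len(instrs):
            leaders.add(i + 1)         # so does the instruction after a branch
    starts = sorted(leaders)
    ends = starts[1:] + [len(instrs)]
    return [[text for _l, text, _t in instrs[s:e]] for s, e in zip(starts, ends)]


PROGRAM = [
    (None, "A", None),
    ("_L1", "B", None),
    (None, "br C, _L2", "_L2"),
    (None, "D", None),
    (None, "E", None),
    (None, "jmp _L3", "_L3"),
    ("_L2", "F", None),
    ("_L3", "G", None),
    (None, "H", None),
    (None, "br I, _L1", "_L1"),
]

for block in basic_blocks(PROGRAM):
    print(block)
```

Each resulting block would become one node of the control flow graph, with edges given by the branch targets and the fall-through paths.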

Basic block scheduling consists in limiting the compiler scope to a basic block for the parallelization of the instructions. This is a very simple algorithm, and the performance improvement is generally limited. Indeed, basic blocks contain only a few instructions, limiting the opportunities to find independent instructions. To overcome this limitation, other techniques that enlarge the compiler scope have been proposed, such as trace scheduling [23], superblock scheduling [45], and hyperblock scheduling [44].

[Figure, three panels. (a) Sequential code:
    (1) ld r0, label_x
    (2) ld r1, label_y
    (3) add r2, r0, 1
    (4) shl r3, r1, 1
    (5) sub r4, r2, r3
(b) Dependence graph linking the two ld operations, the add, the shl, and the sub through registers R0 to R4. (c) Four VLIW instructions, numbered (1) to (4), holding the ld, ld, add, shl, and sub operations, with all remaining slots filled by nop.]

Figure 2.20: Example of formation of VLIW instructions: (a) sequential code, (b) the corresponding dependence graph, (c) the corresponding VLIW code.

[Figure, three panels. (a) A C fragment in which statement A is followed by a do-while loop on condition I whose body contains B, an if-then-else on C selecting either D; E or F, and then G; H. (b) The corresponding assembly code, using the labels _L1, _L2, and _L3 and the instructions br C, _L2, jmp _L3, and br I, _L1. (c) The control flow graph with one node per basic block.]

Figure 2.21: Control flow graph with basic blocks: (a) original C code, (b) corresponding assembly code, (c) corresponding control flow graph.

2.6.2 Superblock Scheduling

Superblock scheduling, like trace scheduling, focuses on applying global optimization in favor of the most frequently executed path. Trace scheduling divides functions into a set of traces that represent the frequently used paths. These traces may contain several conditional branches that go out of the trace (side exits) and several branch targets in the middle of the trace (side entrances). Instructions are scheduled within each trace, ignoring these control-flow transitions. After scheduling, bookkeeping is required to ensure the correct execution of the off-trace code. The major disadvantage of this technique is the increase in compiler complexity due to bookkeeping.

Superblock scheduling is derived from trace scheduling and aims to reduce the compiler's complexity while offering an effective technique to extract ILP from a program. A superblock is a trace with no side entrance. Figure 2.22 shows how superblocks are formed from the original weighted flow graph (2.22(a)). From this graph, the most frequently executed trace is formed (2.22(b)). Finally, the side entrance is eliminated using tail duplication [20] (2.22(c)).

Before superblock scheduling is performed, ILP optimizations are applied to enlarge the compiler scope and to remove dependences. The enlarging optimizations are:

• Branch Target Expansion: branch target expansion expands the likely taken control transfer which ends the superblock. The target superblock is copied and appended to the end of the original superblock.
• Loop Peeling: superblock loop peeling is applied to loops that iterate, according to profiling information, only a few times. The loop body is replaced by straight-line code consisting of the expected number of iterations. The original body of the loop is moved to


[Figure, three panels: flow graphs over the blocks A, B, D, E, F, G, and H with the branches brn C, _L2, jmp _L3, and brn I, _L1, and edge weights of 90% and 10%. Panel (b) marks the side entrance into the most frequent trace; panel (c) removes it by duplicating the tail of the trace.]

Figure 2.22: Superblock formation: (a) weighted flow graph, (b) trace formation, (c) tail duplication.

the end, to handle the case when the loop should be executed more times than expected (see Figure 2.23(b)).
• Loop Unrolling: superblock loop unrolling is applied to loops that tend to iterate many times. To unroll a loop N times, N-1 copies of the superblock are appended to the original superblock (see Figure 2.23(c)).

[Diagram, three panels built from the basic blocks BB1 and BB2: the block BB2 appears once in (a) and is replicated in (b) and (c).]

Figure 2.23: Loop enlarging optimizations: (a) original loop, (b) loop peeling, (c) loop unrolling.
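Loop peeling and loop unrolling can be illustrated at source level. The sketch below is only an analogy for what a compiler does on superblocks: it sums a list, the peeled version assumes that profiling predicted two iterations, and the unrolled version uses an unroll factor of two. The function names are hypothetical.

```python
def original(a):
    total, i = 0, 0
    while i < len(a):
        total += a[i]
        i += 1
    return total


def peeled(a):
    """Two expected iterations as straight-line code, original loop at the end."""
    total, i = 0, 0
    if i < len(a):                 # expected iteration 1
        total += a[i]
        i += 1
        if i < len(a):             # expected iteration 2
            total += a[i]
            i += 1
            while i < len(a):      # original loop, for any extra iterations
                total += a[i]
                i += 1
    return total


def unrolled(a):
    """Unrolled twice: one extra copy of the body, each with its own exit test."""
    total, i = 0, 0
    while i < len(a):
        total += a[i]
        i += 1
        if i >= len(a):            # side exit between the two copies
            break
        total += a[i]
        i += 1
    return total
```

All three functions compute the same result for any input length; only the amount of straight-line code between branches changes, which is what gives the scheduler a larger scope.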

Once the superblocks are enlarged, some optimizations are applied to eliminate dependences between instructions. Some of these superblock dependence removing optimizations are:

• Register renaming: eliminates artificial dependences such as anti and output dependences (see Section 2.2.1).
• Accumulator variable expansion: an accumulator variable accumulates a sum or a product at each iteration of a loop. Anti, output, and flow dependences between instructions which accumulate a total are eliminated by replacing each definition of the accumulator variable with a new accumulator variable (see Figure 2.24(a→b), variable s).
• Induction variable expansion: induction variables are used within loops to index through loop iterations and through regular data structures such as arrays. Due to the dependence on the induction variable computation, ILP is typically limited when loops are unrolled. Induction variable expansion eliminates redefinition of induction variables by creating a new variable for each definition of the induction variable, thereby eliminating all anti, output, and flow dependences among the induction variable definitions (see Figure 2.24(b→c), variable i).

(a)
      i=0
      s=0
L1:   if (i > n) goto exit      (iter. 1)
      s=s+a[i]
      i=i+1
      if (i > n) goto exit      (iter. 2)
      s=s+a[i]
      i=i+1
      goto L1
exit: m = s/i

(b)
      i=0
      s1=0
      s2=0
L1:   if (i > n) goto exit      (iter. 1)
      s1=s1+a[i]
      i=i+1
      if (i > n) goto exit      (iter. 2)
      s2=s2+a[i]
      i=i+1
      goto L1
exit: s=s1+s2
      m = s/i

(c)
      i1=0
      i2=1
      s1=0
      s2=0
L1:   if (i1 > n) goto exit     (iter. 1)
      s1=s1+a[i1]
      i1=i1+2
      if (i2 > n) goto exit     (iter. 2)
      s2=s2+a[i2]
      i2=i2+2
      goto L1
exit: i = i1 + i2
      s=s1+s2
      m = s/i

Figure 2.24: Dependence removing: (a→b) accumulator variable expansion, (b→c) induction variable expansion.

Note that induction and accumulator variable expansion add extra instructions outside of

the loop body.
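The effect of the two expansions can be checked on a small mean computation. The sketch below is an assumption-laden illustration and not a transcription of Figure 2.24: it uses Python list indexing with a non-empty input, hypothetical function names, and, for simplicity, takes the element count from `len(a)` instead of recombining the induction variables after the loop.

```python
def mean_original(a):
    s, i = 0, 0
    while i < len(a):
        s = s + a[i]          # every iteration depends on the previous s
        i = i + 1             # and on the previous i
    return s / i


def mean_expanded(a):
    n = len(a)
    s1 = s2 = 0               # one accumulator per unrolled copy
    i1, i2 = 0, 1             # one induction variable per unrolled copy
    while i1 < n:
        s1 = s1 + a[i1]       # copy 1 only touches s1 and i1
        i1 = i1 + 2
        if i2 >= n:
            break
        s2 = s2 + a[i2]       # copy 2 only touches s2 and i2
        i2 = i2 + 2
    s = s1 + s2               # extra instruction outside the loop body
    return s / n


print(mean_original([1, 2, 3, 4, 5]), mean_expanded([1, 2, 3, 4, 5]))
```

Inside the expanded loop the two copies of the body no longer share a variable, so a scheduler is free to overlap them; the price is the recombination code after the loop.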

After the ILP optimizations are applied, superblock scheduling is performed, subject to dependences and resource availability. The scheduler can move instructions above a preceding conditional branch within a superblock using a technique called speculation. Instruction speculation breaks some of the control dependences that are in the superblock, resulting in an increase in ILP. However, there are restrictions that limit speculation. If I is the speculated instruction and B is the conditional branch above which I is moved, these restrictions are:

• Restriction 1: the result of I must not be used before it is redefined when B is taken.
• Restriction 2: I must never cause an exception that may terminate program execution when branch B is taken.

Restriction 2 is probably the most important constraint: exceptions caused by speculative instructions which would not have been executed in the original program must be ignored. Several hardware support mechanisms have been proposed to handle speculation of potentially trapping instructions such as loads, stores, or divides. The restricted percolation model includes no support for disregarding the exceptions generated by the speculative instructions. Therefore, the compiler cannot move instructions that can potentially cause an exception above a branch. The main limitation of the restricted percolation model is the inability to move potential trap-causing instructions with long latency, such as load operations, above branches. To overcome this limitation, the general percolation model eliminates Restriction 2 by providing a non-trapping version of instructions that can cause exceptions. The non-trapping version is used when the instruction is speculated. For programs in which detection of exceptions is important, sentinel scheduling [41] makes it possible, with additional hardware and compiler support, to handle exceptions generated by speculated instructions.

2.6.3 Predicated Execution

Conditional branch instructions introduce control dependences (see subsection 2.2.2) that are recognized as a major impediment to exploiting ILP. Branch prediction and instruction speculation are techniques that reduce the effects of control dependences; however, conditional branches can still result in severe performance penalties due to mispredicted branches.

Predicated execution allows conditional execution of instructions based upon a computed condition and may be supported by several different architectural models [43]. Each model must support a method of expressing the condition and a method for the condition to affect instruction execution. Full predication supports this using new instruction set and microarchitecture extensions.

The full predication model consists of four components: a predicate register file for holding 1-bit predicate values, an additional source operand for each instruction to specify a predicate for instruction execution, a conditional-execution stage to nullify instructions, and a set of predicate defining instructions for generating conditions. The values in the predicate register file are associated with each instruction through the use of an additional source operand, or predicate operand. This operand specifies which predicate register will determine whether the instruction should execute. A predicate register value of 1, or true, indicates that the instruction is executed; a value of 0, or false, indicates that the instruction is suppressed. An unconditional instruction is designated by a predicate register that is always true. Architectural support for predicated execution can be found in the HPL PlayDoh Architecture Specification [22].

[Figure, four panels: (a) the C source of an if-then-else on (A == B); (b) its control flow graph, with a beq A, B branch selecting one of two paths; (c) the same code as a single block in which instructions are guarded by the predicates p1 = (A == B) and p2 = (A != B); (d) the optimized predicated version.]

Figure 2.25: (a) A simple if-then-else C code construct, (b) unpredicated code, (c) predicated

code, and (d) optimized predicated code.

Predication support allows the compiler to use an if-conversion algorithm to convert conditional branches into predicate defining instructions, and instructions along alternative paths of each branch into predicated instructions [48]. Figure 2.25 demonstrates the limitation of the traditional control flow graph when applied to predicated code. A simple if-then-else construct is shown in Figure 2.25(a). The code generated for this segment without predication is shown in Figure 2.25(b). Here the control flow graph clearly shows that one and only one side of the if-statement may execute. The predicated code control flow graph is shown in Figure 2.25(c).


In this case all the code falls into one basic block because there is no possibility of branching

until the end of the set of instructions.
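If-conversion can be emulated in a few lines. The sketch below is loosely modeled on the kind of code shown in Figure 2.25, with hypothetical statements: the branching version takes one of two paths, while the predicated version is a single straight-line instruction list in which each instruction names a guarding predicate. The always-true predicate `p0` and the lambda-based encoding are assumptions made for the illustration, and the `if preds[guard]` test stands in for the conditional-execution stage.

```python
def run_branching(A, B, X):
    B = B + 1
    if A == B:
        X = X + 1
    else:
        A = A + 1
    D = A + X
    return A, B, X, D


def run_predicated(A, B, X):
    regs = {"A": A, "B": B, "X": X, "D": 0}
    preds = {"p0": True}      # p0 is always true: guards unconditional code

    # One basic block: (guarding predicate, effect on registers/predicates).
    program = [
        ("p0", lambda r, p: r.update(B=r["B"] + 1)),
        ("p0", lambda r, p: p.update(p1=r["A"] == r["B"], p2=r["A"] != r["B"])),
        ("p1", lambda r, p: r.update(X=r["X"] + 1)),
        ("p2", lambda r, p: r.update(A=r["A"] + 1)),
        ("p0", lambda r, p: r.update(D=r["A"] + r["X"])),
    ]
    for guard, effect in program:
        if preds[guard]:      # a false predicate nullifies the instruction
            effect(regs, preds)
    return regs["A"], regs["B"], regs["X"], regs["D"]


print(run_branching(3, 2, 10), run_predicated(3, 2, 10))
```

Both versions leave the registers in the same state; in the predicated version the instructions guarded by `p1` and `p2` carry no control dependence on a branch and could be issued together.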

The most notable modification of predication to the instruction set encoding format is the addition of the predicate operand source for every instruction. The predicate operand increases the instruction size and has significant effects on overall program code size. One model [51] proposes a new set of predicate guarding instructions that would reduce the drawback of existing methods of specifying predicated execution through the use of predicate mask-setting instructions. Although the mechanism is useful in reducing the predicate operand overhead, the general mechanism constrains several aspects of predicated execution and dramatically alters the instruction issue logic of microprocessors.

There are two major benefits associated with applying if-conversion. First, a compiler can eliminate problematic branches from the program. In doing so, all the associated overhead with these branches is removed, including misprediction penalties, penalties for redirecting sequential instruction fetch, and branch resource contention. Second, predication facilitates increased ILP and speedup by allowing separate control flow paths to be simultaneously executed.

2.7 Conclusion

This chapter gives a background of instruction-level parallelism. First, the main concepts of

ILP have been introduced. Second, several architectures that exploit ILP have been described.

Finally, the required compiler support for such architectures has been presented. This chapter

represents only a brief survey of some of the major ILP techniques: it describes the main notions

required to understand the rest of this work.

Instruction-level parallelism is mainly used to increase processor performance; however, parallelism can also be used to increase the energy efficiency of a system. The following chapter describes how parallelism can be used in a low-power context.

Chapter 3

Power Consumption in CMOS Circuits

CMOS design exhibits a good trade-off between circuit area and power consumption. This is why the majority of current processors are implemented in CMOS technology. However, as circuits become more and more complex, there is a steady increase in power consumption, making power consumption the new major constraint of circuit design.

This chapter gives a general background of the power consumption in CMOS circuits. First, the different sources of power consumption and their relative contributions are described. Second, several metrics are introduced and their meaning is explained. Finally, it is explained how parallelism can be used to improve the energy efficiency of an architecture.

3.1 Sources of Power Dissipation in CMOS Circuits

The sources of power consumption of a CMOS circuit can be classified as: the static power dissipation, which is related to the logic state of the circuit and is due to the leakage currents and other static currents; and the dynamic power dissipation, which is caused by the switching activity of the circuit and is due to the short-circuit currents and to the charge and discharge of the load capacitance.

These sources of power consumption are described in the following subsections through the example of a static CMOS inverter. Figure 3.1 shows different representations of such an inverter.

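[Schematics of an inverter with input X: gate symbol, PMOS/NMOS transistor pair between Vdd and Gnd driving Cload, and an equivalent pair of switches between Vdd and Gnd driving Cload.]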

Figure 3.1: Static CMOS inverter: (a) gate, (b) transistors, and (c) switches representation.


3.1.1 Static Power Dissipation

Ideally, CMOS circuits have no static power dissipation because there is no direct path from Vdd to Gnd. However, CMOS transistors do not behave as perfect switches, and generate leakage currents that can arise from reverse bias diode currents and sub-threshold effects. These effects are primarily determined by fabrication technology considerations.

Another source of static dissipation can appear when deviations from CMOS style circuit design are used. For example, pseudo NMOS logic can be useful in register file design due to its efficient area usage. Indeed, a pseudo NMOS circuit does not require a P-transistor network and saves half the transistors required for logic computation compared to CMOS logic. The main drawback of such a technique is that, depending on the output value, there is a direct path from Vdd to Gnd. Therefore, a trade-off between area and power consumption should be made.

Static power dissipation represents less than 10% of the total power dissipation [19], and therefore it does not represent the main target for power consumption reduction. However, current microprocessors tend to have a very low power supply voltage, resulting in a low transistor threshold voltage, Vt. Such a reduction implies an increase of the static currents, and consequently the static dissipation contribution can be much more significant, especially during sleep modes.

3.1.2 Dynamic Power Dissipation

Dynamic power dissipation is the main source of power consumption in CMOS circuits (around

90% [19]). Dynamic dissipation comes from the switching activity of the circuits and has two

main components caused by the short-circuit currents, and by the charge and discharge of the

load capacitance.

Short-circuit power dissipation: Since NMOS and PMOS transistors do not behave as perfect switches and do not commute exactly at the same time, there is a direct path from Vdd to Gnd during a change of state. For example, during the transition of a CMOS inverter (Figure 3.1), both transistors are ON for a small amount of time. This phenomenon is illustrated in Figure 3.2. The contribution of the short-circuit current strongly depends on the time that both

[Plot: input voltage and short-circuit current versus time, marking the intervals during which the P-transistor and the N-transistor are ON.]

Figure 3.2: Short-circuit current in a static CMOS inverter.

P and N transistors are ON, and therefore depends on the signal slope. Generally this mode of


power dissipation is 10-60% of the total power dissipation [80]; however, with a careful design

it can be kept below 15% [21].

Charge/Discharge capacitance power dissipation: The power consumption due to the charge and discharge of the load capacitance dominates the total power dissipation. Figure 3.3 shows an example of such a situation for a static CMOS inverter. When the input changes from 1 to 0 (Figure 3.3(a)), the load capacitance (Cload) is charged through the PMOS transistor by a charging current (Icharge). The power supply has to deliver the required energy to charge the capacitance:

\[ E = C_{load} \cdot V_{dd}^{2} \qquad (3.1) \]

Half of this energy is dissipated by the PMOS transistor and the other half is stored in the

Cload capacitance. Then, when there is a transition of the input from 0 to 1, the energy stored

in Cload is dissipated through the NMOS transistor by a discharge current, Idischarge. In this

case the power supply does not need to furnish any additional energy. Therefore, the average

energy of a transition is:

\[ E_{avg} = \frac{1}{2} \cdot C_{load} \cdot V_{dd}^{2} \qquad (3.2) \]

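[Figure 3.3: two CMOS inverter schematics between Vdd and Gnd: (a) input transition 1 to 0, output transition 0 to 1, with Cload charged by Icharge; (b) input transition 0 to 1, output transition 1 to 0, with Cload discharged by Idischarge.]

Figure 3.3: Charge and discharge of the load capacitance in a static CMOS inverter.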

Considering that the system works at frequency f and that the output has an activity α, corresponding to the average number of times that Cload is charged or discharged per clock cycle, the resulting power consumption is the average dynamic power dissipation:

\[ P_{avg} = \frac{1}{2} \cdot C_{load} \cdot \alpha \cdot f \cdot V_{dd}^{2} = \frac{1}{2} \cdot C_{sw} \cdot f \cdot V_{dd}^{2} \qquad (3.3) \]

where Csw = Cload · α is the switched capacitance.
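As a minimal numerical sketch of Equations 3.1 to 3.3, the following Python fragment evaluates the transition energy and the average dynamic power. The function names and the operating point (100 pF, activity 0.5, 100 MHz) are illustrative assumptions and not values taken from this work.

```python
def transition_energy(c_load, v_dd):
    """Energy drawn from the supply when Cload is charged (Equation 3.1), in joules."""
    return c_load * v_dd ** 2


def average_transition_energy(c_load, v_dd):
    """Average energy per transition (Equation 3.2): only charging draws supply energy."""
    return 0.5 * transition_energy(c_load, v_dd)


def average_dynamic_power(c_load, activity, freq, v_dd):
    """Average dynamic power (Equation 3.3), in watts."""
    c_sw = c_load * activity  # switched capacitance Csw
    return 0.5 * c_sw * freq * v_dd ** 2


# Hypothetical operating point: 100 pF total load, activity 0.5, 100 MHz.
p_ref = average_dynamic_power(100e-12, 0.5, 100e6, 2.5)
p_half = average_dynamic_power(100e-12, 0.5, 100e6, 1.25)
print(f"P(2.5 V) = {p_ref * 1e3:.3f} mW, P(1.25 V) = {p_half * 1e3:.3f} mW")
```

With these assumed values, halving Vdd at a constant frequency divides the dynamic power by four, which is the quadratic dependence exploited in the rest of this chapter.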

As the charge/discharge capacitance power dissipation is known as the main source of

power dissipation [80][21], the following sections focus on this part of the power consumption.

3.2 Metrics for Energy Efficiency

There are many different ways to measure power or energy efficiency, and they lead to different results when comparing systems. This section introduces these metrics and explains their meaning.


Power Dissipation, P: Power dissipation, measured in watts, is probably the most straightforward way to measure the power efficiency of a circuit. This metric can be useful for packaging considerations, power supply dimensioning, cooling requirements, signal noise, and reliability of the system. However, power dissipation depends on the clock frequency: a chip running at a higher frequency improves its performance, but also increases its power dissipation. Therefore, this metric is not useful for comparing the energy efficiency of chips, because a circuit does not become more energy efficient if it changes its clock frequency.

Energy Consumption E, or Power-Delay Product: As power dissipation represents the rate at which energy is consumed, energy consumption, measured in joules, is an alternative metric. This metric is useful when a system has to work at a fixed throughput [16]. Indeed, as shown in Section 3.3, the power supply voltage and the clock frequency are correlated parameters, and can be scaled to meet the time requirements of the application. In this case, the energy consumption of the system can be used for comparisons, because an architecture A that consumes less energy than an architecture B to execute a task in a fixed amount of time also dissipates less power.

The energy consumption of a microprocessor is often measured in µW/MHz, which represents the energy dissipated per clock cycle. However, such a metric can be misleading when the compared processors have a different instruction set or architecture, because the number of instructions and the number of clock cycles needed to execute each instruction can be very different. Million Instructions Per Second (MIPS) can be used to normalize the energy and compare processors that have the same instruction set. The corresponding unit is µW/MIPS or, more commonly, its inverse MIPS/µW.

MIPS are not suitable for comparing the performance of processors with different instruction set architectures, because the number of instructions needed to execute a task can be very different. Therefore, another performance metric should be used to normalize the energy: for example SPEC numbers, resulting in µW/SPEC or its inverse SPEC/µW. SPEC numbers correspond to the time needed to execute the SPEC benchmark suite. For other benchmarks, the metric can be specified as the power-delay product E = Pavg × Texec, where Texec is the time needed to execute a task, and Pavg is the average power dissipation during the execution of this task.

Energy-Delay Product, EDP: When a task needs to be executed at maximum speed, the power-delay product (or energy) becomes a misleading metric for comparing microprocessors. Indeed, two processors may need the same energy to execute a task while having a different energy consumption distribution, meaning that one of them can be n times faster while dissipating n times more power. The latter is better in terms of energy efficiency when maximum performance is required. This is why the energy-delay product, EDP = E × Texec, was proposed in [24]. A commonly found equivalent metric, which corresponds to the inverse of the EDP, is MIPS²/µW or SPEC²/µW.
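The difference between the power-delay product and the energy-delay product can be illustrated with a small sketch. The two processors below are hypothetical (their power and MIPS figures are assumptions chosen for the example), and the task length is an arbitrary 100 million instructions.

```python
N_MINSTR = 100.0  # hypothetical task length, in millions of instructions


def exec_time(mips):
    """Execution time in seconds for the fixed task."""
    return N_MINSTR / mips


def energy(p_avg, mips):
    """Power-delay product E = Pavg x Texec, in joules."""
    return p_avg * exec_time(mips)


def edp(p_avg, mips):
    """Energy-delay product EDP = E x Texec, in joule-seconds."""
    return energy(p_avg, mips) * exec_time(mips)


def mips2_per_watt(p_avg, mips):
    """MIPS^2/W, proportional to the inverse of the EDP for a fixed task."""
    return mips ** 2 / p_avg


slow = (1.0, 50.0)   # hypothetical: 1 W, 50 MIPS
fast = (2.0, 100.0)  # hypothetical: 2 W, 100 MIPS

for name, cpu in (("slow", slow), ("fast", fast)):
    print(name, energy(*cpu), edp(*cpu), mips2_per_watt(*cpu))
```

Both processors need the same energy for the task, so the power-delay product cannot separate them, whereas the EDP (and, equivalently, MIPS²/W) favours the faster one.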

3.3 Parallelism for Energy Efficiency

Since the main source of power dissipation is quadratically related to the power supply voltage Vdd, an often employed power consumption reduction technique is to scale down Vdd. However, signal delays in CMOS circuits also depend on Vdd. Equation 3.4 shows that a reduction in Vdd will cause a decrease in the working frequency, which in turn will degrade the performance of the overall system:


\[ T_{delay} = K \cdot \frac{V_{dd}}{(V_{dd} - V_t)^{\alpha}}, \qquad (3.4) \]

where Tdelay is the circuit delay, K is a constant that depends on the technology and on the circuit implementation, Vt is the threshold voltage, and α is equal to two for micron-scale technologies and decreases as the technology becomes submicron (α is around 1.5 for a 0.25 µm technology).

Parallelism [19] is one technique that can compensate for the loss of performance due to a reduced clock speed. Indeed, parallelism enables a system to work at a lower frequency while having the same performance as the equivalent non-parallel system running at a higher frequency.

Figure 3.4 shows the result of a Spice simulation in terms of circuit delay and energy consumption of a simple circuit implemented in a 0.25 µm TSMC CMOS technology. The circuit delay and energy consumption are reported for different values of Vdd and are relative to a reference voltage of Vdd = 2.5 volts. These graphs show that if parallelism can compensate for a loss in performance of a factor of two, the voltage can be scaled down from 2.5 volts to around 1.4 volts, resulting in a power saving of around 70% if the overhead due to the parallelization is neglected.
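The order of magnitude of this result can be checked analytically with Equation 3.4 instead of a Spice simulation. The sketch below is only an approximation under stated assumptions: α = 1.5 as quoted above for a 0.25 µm technology, a threshold voltage Vt = 0.5 V (an assumed value, not given in the text), and an energy that scales purely with the square of Vdd.

```python
V_REF = 2.5   # reference supply voltage from the text, in volts
V_T = 0.5     # ASSUMED threshold voltage, in volts (not given in the text)
ALPHA = 1.5   # value quoted in the text for a 0.25 um technology


def relative_delay(v_dd):
    """Delay at v_dd relative to the delay at V_REF (Equation 3.4; K cancels out)."""
    def delay(v):
        return v / (v - V_T) ** ALPHA
    return delay(v_dd) / delay(V_REF)


def relative_energy(v_dd):
    """Switching energy at v_dd relative to V_REF, assuming E is proportional to Vdd^2."""
    return (v_dd / V_REF) ** 2


def voltage_for_slowdown(slowdown, tol=1e-6):
    """Bisection for the supply voltage whose relative delay equals `slowdown`."""
    lo, hi = V_T + 1e-3, V_REF
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if relative_delay(mid) > slowdown:
            lo = mid  # circuit too slow at mid: the voltage must be higher
        else:
            hi = mid
    return (lo + hi) / 2


v_half_speed = voltage_for_slowdown(2.0)
print(f"Vdd for a 2x delay: {v_half_speed:.2f} V, "
      f"relative energy: {relative_energy(v_half_speed):.2f}")
```

Under these assumptions the model places the half-speed operating point between 1.3 and 1.4 volts, and at 1.4 volts the relative energy is (1.4/2.5)² ≈ 0.31, which is consistent with the saving of around 70% read from the simulation.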

[Figure 3.4: two plots against the power supply voltage Vdd (1 to 2.5 volts): relative delay (left, scale 0.5 to 4.5) and relative energy consumption (right, scale 0.1 to 1), both relative to Vref = 2.5 volts.]

Figure 3.4: Relative circuit delay (left) and relative energy consumption (right) as a function of Vdd.

To illustrate this concept, Figure 3.5 shows a qualitative example of how parallelism, in conjunction with voltage and clock frequency down-scaling, can improve the energy efficiency of a processor while keeping performance at the same level. The energy distribution of several processor configurations is represented to show how this kind of optimization works. The different configurations are:

• Configuration P1+: a one-issue processor with Vdd = V+ and f = f+ (Figure 3.5(a))
• Configuration P1−: a one-issue processor with Vdd = V− and f = f− (Figure 3.5(b))
• Configuration P2+: a two-issue processor with Vdd = V+ and f = f+ (Figure 3.5(c))
• Configuration P2−: a two-issue processor with Vdd = V− and f = f− (Figure 3.5(d))

where V− < V+ and f− < f+.

[Figure 3.5: four panels plotting power dissipation versus execution time: (a) Configuration P1+ (Pa, Ta, Ea); (b) Configuration P1− (Pb, Tb, Eb); (c) Configuration P2+ (Pc, Tc, Ec); (d) Configuration P2− (Pd, Td, Ed). Arrows labelled "Voltage and frequency down-scaling" lead from the f+, V+ panels to the f−, V− panels, and arrows labelled "Parallelism" lead from the P1 (one-issue processor) panels to the P2 (two-issue processor) panels.]

Figure 3.5: Energy distribution of several processor configurations executing the same task.

If no circuit overhead is associated with the parallel architecture, P1+ and P2+ expend the same energy when executing the same task, but the energy distribution is very different. P2+ dissipates twice as much power as P1+, but P1+ requires twice as much time to execute the selected task (Figure 3.5(a) and (c)), resulting in a better EDP for P2+ (faster with the same energy). When voltage scaling is applied to the processors to reduce their energy consumption, their clock frequency must also decrease. For P1 (Figure 3.5(a → b)) this decreases both energy consumption and power dissipation, at the cost of a loss in performance. Consequently, to a first approximation there is no gain in EDP (less energy, but also less performance). Exactly the same phenomenon occurs for P2 (Figure 3.5(c → d)).

However, when parallelism is used in conjunction with frequency and voltage down-scaling (Figure 3.5(a → d)), one can observe that P2− is much more energy efficient than P1+. Indeed, P2−, with a lower frequency and a lower power supply voltage, has the same performance level as P1+ (Ta = Td), while consuming less energy than P1+ (Ed < Ea) and also dissipating less power (Pd < Pa).

Table 3.1 qualitatively summarizes how parallelism and voltage scaling affect the execution time, the power dissipation, the energy consumption, and the energy-delay product. This comparison is relative to P1+. The signs + and - indicate an improvement or a degradation, respectively, of the compared parameter.

Configuration  Time of Execution  Power Dissipation  Energy Consumption  Energy-Delay Product
P1−            increased (-)      decreased (+)      decreased (+)       equal
P2+            decreased (+)      increased (-)      equal               decreased (+)
P2−            equal              decreased (+)      decreased (+)       decreased (+)

Table 3.1: Summary of the benefits of parallelization and voltage down-scaling.
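The time, power, and energy columns of Table 3.1 can be reproduced with a first-order model built on Equation 3.3. The sketch below is a simplification under stated assumptions: no parallelization overhead, a two-issue processor that always sustains two instructions per cycle, a switched capacitance per cycle proportional to the issue width, and hypothetical operating points (2.5 V at 100 MHz, 1.4 V at 50 MHz). The energy-delay product column is deliberately left out, since the table treats it only to a first approximation.

```python
from dataclasses import dataclass

N_INSTR = 1e6        # hypothetical task length, in instructions
C_PER_ISSUE = 1e-9   # hypothetical switched capacitance per issue slot and cycle, in farads


@dataclass
class Config:
    name: str
    issue_width: int  # instructions issued per cycle (ideal ILP assumed)
    v_dd: float       # supply voltage, in volts
    freq: float       # clock frequency, in hertz


def exec_time(c):
    return N_INSTR / (c.issue_width * c.freq)


def power(c):
    # Equation 3.3 with Csw proportional to the issue width (no overhead).
    return 0.5 * c.issue_width * C_PER_ISSUE * c.freq * c.v_dd ** 2


def energy(c):
    return power(c) * exec_time(c)


p1_plus = Config("P1+", 1, 2.5, 100e6)
p1_minus = Config("P1-", 1, 1.4, 50e6)
p2_plus = Config("P2+", 2, 2.5, 100e6)
p2_minus = Config("P2-", 2, 1.4, 50e6)

for c in (p1_plus, p1_minus, p2_plus, p2_minus):
    print(f"{c.name}: T = {exec_time(c) * 1e3:.1f} ms, "
          f"P = {power(c) * 1e3:.1f} mW, E = {energy(c) * 1e3:.3f} mJ")
```

In this model P2+ halves the execution time at twice the power and equal energy, while P2− matches the execution time of P1+ with both lower power and lower energy.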

The above explanations do not take into account the circuit overhead introduced by the use of parallel execution. An increase in complexity can dramatically reduce the benefits of hardware duplication in terms of energy efficiency. In low-power microprocessor design, it has been demonstrated that pipelining is an effective way to improve a processor's energy efficiency [24], because of its inherent simplicity. Similarly, it was shown in [24] that the overhead for superscalar general-purpose architectures limits a processor's energy efficiency. However, some studies have suggested that EPIC and VLIW architectures, due to their hardware simplicity, can execute a task with the same energy as an analogous scalar architecture [15][52].

3.4 Conclusion

This chapter gave an introduction to power consumption in CMOS circuits. First, the different sources of power dissipation were described, and it was shown that the switching activity accounts for around 90% of the total power dissipation. Second, different metrics were introduced and their meaning was explained in order to understand in which context they should be used. Finally, it was explained through one Spice simulation and one qualitative example how parallelism and voltage down-scaling can improve the energy efficiency of a circuit. These examples strongly motivate the use of parallelism for the design of an energy-efficient microprocessor. However, at this point the power consumption added by the circuit overhead introduced by the parallelization of the architecture was neglected. These negative effects are investigated in the following chapters, but first, the next chapter describes the state of the art in low-power and ILP processor design.

Chapter 4

Mobile and VLIW Processors:

a State of the Art

The embedded processor market offers a wide range of products that meet different requirements of performance, cost, and power consumption. This chapter gives an overview of some of these embedded processors with a main focus on low-power mobile processors. Also, since parallelism relates to both performance and energy efficiency, several VLIW architectures, such as DSP or high-performance processors, are described. The main goal of this chapter is to point out the main characteristics and trade-offs of mobile processor designs, as well as the lack of ILP exploitation in such processors.

4.1 The Advanced RISC Machine (ARM) Family

Exhibiting various desirable features, such as low power consumption, tiny core size, and several flexible modular options, the ARM [71] architecture has become one of the most popular products for ASIC design. With a 32-bit load/store architecture and a fixed-length 32-bit instruction word, the ARM architecture follows the RISC standards. ARM is built around a scalar pipeline that allows most of the instructions to be executed in one cycle, with the exception of memory and branch operations.

From a programming point of view, the ARM family offers sixteen 32-bit integer registers and an instruction set that has some original aspects. First, every instruction can be conditionally executed depending on the value of four condition codes, which reduces the code size in conditional-branch-intensive code. Second, all arithmetic and logic operations can intrinsically shift or rotate one of their source operands.

4.1.1 The ARM7 Generation

Currently, ARM7 [71] is the low-end product of the ARM family. Based on a 3-stage scalar pipeline (fetch, decode, execute), it exhibits a small die area and a low power consumption. Such features make ARM7 well suited to low-power, low-cost applications. However, this simple architecture, and particularly the short pipeline, results in a slow clock speed, and therefore in a poor level of performance. As an example of an ARM7 implementation, the ARM710 from VLSI, with an 8K unified cache and an MMU, implemented in a 0.8 µm technology, runs at 25 MHz at 3.3 V, delivers 30 MIPS, and consumes 120 mW [1].

To overcome this performance limitation, the ARM's next generation introduced a longer

pipeline. The ARM8 has a conventional 5-stage pipeline which allows clock speeds of over


100 MHz in a 0.35 µm technology. The ARM8's other major contributions to greater performance were static branch prediction and a double-speed cache that made transfers on both rising and falling clock edges. Unfortunately, the double-speed cache of the ARM8 created a new problem: to efficiently fetch instructions at 100 MHz, a custom physical layout of the processor's instruction fetch logic is required. Such design practices go against ARM's premise of providing easy-to-integrate, portable CPU cores. This is why the ARM8 is no longer on the ARM roadmap.

4.1.2 The StrongARM

The StrongARM comes from a collaboration between ARM and Digital. The StrongARM SA-110 [65] is a 32-bit embedded processor which exhibits a very desirable balance of performance and power consumption. It is built around a 5-stage scalar pipeline and is implemented in a 0.35 µm technology. These parameters, in conjunction with a power-efficient design of the processor core and cache memories, allow the SA-110 to achieve a high level of performance while keeping the power consumption low. The SA-110 consumes around 500 milliwatts at 1.65 volts with a frequency of 160 MHz, and delivers 185 Dhrystone MIPS.

A new version of the StrongARM SA-110 was implemented by Intel, reaching better performance and a lower power dissipation. Intel's SA-110 runs at 233 MHz at 2 V, delivers 268 MIPS, and consumes only 360 mW including the two 16K cache memories¹ [1]. These features make the SA-110 one of the best low-power processors on the market.

4.1.3 The ARM Thumb Option

Code size has strong repercussions on system cost, power consumption, and instruction cache performance, which makes it an important architectural issue. For this reason ARM designed the Thumb architecture [63]. The ARM design addresses the code size issue by introducing a 16/32-bit variable instruction width (ARM has a 32-bit fixed instruction width). Thumb can switch between two modes of execution: one where it executes 16-bit instructions that map the most frequently executed ARM instructions, and one where it executes 32-bit instructions corresponding to the ARM instruction set. This variable instruction length mechanism results in a 25% to 35% code size reduction.

To support this new mode of execution, ARM introduces a second decoder in parallel with the original one. A program-visible bit directs incoming instructions toward the ARM instruction decoder or the Thumb instruction decoder. The mode bit can be changed through a new branch-and-exchange instruction, implying that 16-bit and 32-bit instructions cannot be randomly mixed.

4.1.4 The ARM Piccolo Option

The Piccolo [66] option adds DSP capabilities to the ARM by adding a DSP core to the ARM architecture. Piccolo and ARM have separate registers and separate instruction memory, and communicate through a kind of reorder buffer. With this option the ARM core controls the chip and fetches the operands in memory, while Piccolo concentrates on the signal processing computation. One of the major drawbacks of this approach is that when Piccolo is running, a lot of the bandwidth of the ARM is lost in feeding Piccolo. Another problem with Piccolo is that it has no X and Y memory banks like traditional DSPs. Piccolo's operands are supplied by its register file, and it suffers from a low operand bandwidth in the case of a data-intensive algorithm.

¹ The cache memories account for around 30% of the total power consumption.


4.1.5 The ARM9 and the ARM10

The ARM9 [69] is the bridge between the ARM7 and the StrongARM. ARM9, as compared to the ARM7, extends its pipeline to five stages, allowing it to run at 150 MHz. In addition, ARM9 splits ARM7's unified internal bus and cache in two, giving the new core a Harvard architecture. ARM9 does not include branch prediction (ARM8 does) and has a branch penalty of three cycles whenever a branch is taken.

Implemented in VLSI's 0.35 µm process, the ARM940T (the final T means that the Thumb option is included) has a core die area of 4 mm², runs at 150 MHz, and consumes 675 mW at 3 V (including caches and MMU). Such features represent a significant improvement over the ARM7; however, the StrongARM still remains better in terms of performance and power consumption.

This is why the ARM10 [26] pushes the ARM instruction set to a new performance level. ARM10 has the same basic 5-stage pipeline as the ARM9, but it was reoptimized, allowing it to reach 300 MHz in a 0.25 µm technology within a 1 W power budget. ARM10 implements a simple static branch prediction technique (backward taken, forward not taken). As in the ARM9, there is a 3-cycle misprediction penalty; however, mispredictions occur less frequently. ARM10 allows several units to work in parallel; however, it can issue only one instruction per cycle. The ARM10's core can also be paired with a floating-point unit.
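The effect of the "backward taken, forward not taken" rule can be sketched on a small branch trace. The trace, the addresses, and the function names below are hypothetical; only the 3-cycle penalty and the prediction rule come from the text, and the model ignores every other pipeline effect.

```python
PENALTY = 3  # cycles lost per taken branch (no prediction) or per misprediction


def predict_taken(branch_pc, target_pc):
    """Static rule: backward branches are predicted taken, forward ones not taken."""
    return target_pc < branch_pc


def penalty_without_prediction(trace):
    """ARM9-like behaviour: every taken branch costs the full penalty."""
    return sum(PENALTY for _, _, taken in trace if taken)


def penalty_with_static_prediction(trace):
    """ARM10-like behaviour: only mispredicted branches cost the penalty."""
    return sum(PENALTY for pc, target, taken in trace
               if predict_taken(pc, target) != taken)


# Hypothetical trace of (branch address, target address, actually taken):
# a loop-closing backward branch taken 9 times and then falling through,
# followed by one forward branch that is taken.
trace = ([(0x120, 0x100, True)] * 9
         + [(0x120, 0x100, False)]
         + [(0x130, 0x150, True)])

print(penalty_without_prediction(trace), penalty_with_static_prediction(trace))
```

On this loop-dominated trace the static rule mispredicts only the loop exit and the taken forward branch, which is why mispredictions are less frequent than taken branches in typical code.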

4.2 The Motorola M·Core

Motorola is well known in the embedded market for its 68xxx family; however, this family is becoming old, and Motorola introduced the new M·Core family [70] to compete in the burgeoning market for portable hand-held devices. The M·Core family is designed to be a low-power 32-bit architecture [57], and has sixteen 32-bit general-purpose registers. In addition, it has an alternate register file composed of 16 other registers that can be used for interrupt handlers or other time-critical routines.

M·Core addresses the code size issue by using a 16-bit fixed-length instruction set. As a consequence of this limitation, all register-to-register operations are destructive, with the result replacing one of the source operands. Also, the immediate values are generally limited to 5-7 bits. The branch displacement is coded as an 11-bit value, which covers 98% of branches [61]. This 16-bit coding approach has a very attractive code density: the code is 50% smaller than ARM7 code and 11% smaller than Thumb code [70].

The MMC2001 [73] is one of the first implementations of the M·Core family and is intended as an industrial controller. This microcontroller is based on an M·Core core and integrates on the die 256K of ROM, 32K of SRAM, and several modules such as pulse-width modulation (PWM), UARTs, and a serial peripheral interface (SPI). Implemented in a 0.35 µm technology and with a 2 V power supply voltage, M·Core runs at 34 MHz, delivers 31 MIPS, and consumes 80 mW [46].

The M300 [72] M·Core generation makes a step toward a higher level of performance by including a better branch handling technique (forward not taken, backward taken) and optional single-precision floating-point support. Implemented in a 0.25 µm technology, this new generation runs at 100 MHz, with a power supply voltage of 2 V.



4.3 The LSI TinyRisc

TinyRisc [68] is similar to ARM's Thumb option, but for the MIPS instruction set. The TinyRisc includes two decoders to handle 16-bit and 32-bit instructions having different opcodes. As with Thumb, 16-bit and 32-bit instructions cannot be mixed, and a jump or call instruction must be used to change mode. The 16-bit mode has several limitations: (1) only 8 of the 32 registers are available; (2) most of the register-to-register operations are destructive; and (3) immediate values are coded with one byte. In order to avoid a size limitation on indirect branch offsets, branch instructions automatically concatenate the next 16-bit instruction word, resulting in a 26-bit branch offset as in the 32-bit instruction set. Additionally, an EXTEND instruction can be used to expand some of the immediate fields of the 16-bit instructions, eliminating some of the switching between 16-bit and 32-bit modes.

TinyRisc exhibits the same code size as the ARM7 with the Thumb option. However, there is a cost in terms of performance, because the extra logic added in the first pipeline stage to support the 16-bit instruction set lengthens the critical path, thus reducing the clock frequency from 80 to 70 MHz. Note that the operating frequency is still significantly higher than the conventional 40 MHz of the ARM7 with the Thumb option.

LSI implemented the TinyRisc TR4102 in a 0.25 µm fabrication technology; it runs at 80 MHz, consumes 0.5 mW/MHz at 1.8 V, and has a die area smaller than 1.5 mm².

4.4 The Hitachi SuperH Family

The SuperH [71] family became very popular when Sega chose a first-generation Hitachi SuperH for its Genesis game console.

The SuperH family passed through several generations, from the SH-1 to the current SH-4. Again, the code size issue is addressed through a 16-bit fixed-length instruction set. As there is no possible extension for the instruction word, there are some limitations: the size of the immediate value is limited to 8 bits and the register-to-register operations are destructive. The SH-1 generation, with its 16-bit external bus, low clock speed, on-chip ROM, peripheral functions, and lack of cache, is the lowest-performance device of the family and can be classified as a microcontroller. The SH-2 generation introduces only minor changes: a wider 32-bit external data bus, a better multiplication unit, and a 4K unified cache. The most popular processor of this family is the SH7604 that was used in the Sega Genesis 32X. The SH-3 family makes a step toward a higher level of performance and targets applications such as PDAs. Compared to the previous SH-2 generation there are no major architectural changes; however, the chips run approximately four times faster and include an MMU and larger unified caches. SH-3 chips such as the SH7708 can be found in several Windows CE units. The SH7708 typically dissipates 700 mW at 3.3 V, 100 MHz, with a level of performance of 100 Dhrystone MIPS [64].

The latest SH-4 generation, with the SH7750 [67], makes a substantial architectural change by supporting two-way superscalar execution and adding acceleration for floating-point 3D geometric processing. There are some restrictions on parallelizing instructions: for example, the SH-4 cannot dispatch two similar operations (ADD with ADD, float with float, etc.), and it cannot mix certain multicycle instructions with others. However, the chip can mix integer and floating-point operations with no conflict. The chip is implemented in a 0.25 µm technology, has a 5-stage pipeline, runs at 200 MHz, and consumes around 1.6 W when powered at 1.8 V [30]. It delivers 300 MIPS.


4.5 VLIW Architectures

Currently, VLIW architectures are not commonly found in the processor market. In high-performance workstation processors, VLIW architectures are beginning to appear with the future HP/Intel IA-64 [25][27] and the Transmeta x86/VLIW [10]. These designs are not yet available on the market: for example, Merced, the first-generation IA-64, is expected in the year 2000. In contrast, VLIW machines have not yet been introduced in the low-power embedded processor market. For the moment, only Fujitsu is planning to design a low-power VLIW processor [28]. The only domains where VLIW architectures can be found are multimedia and DSP processor systems, for example the Texas Instruments TMS320C6201 DSP [74], the Motorola/Lucent StarCore [78], and the Philips Trimedia [17]. This section gives a brief overview of these VLIW processors.

4.5.1 The Texas Instruments TMS320C6201

The TMS320C6201 [74] is a VLIW-like DSP processor. It runs at 200 MHz and can issue up to eight instructions per clock cycle. The core has an eight-way multi-issue 11-stage pipeline that is divided into two clusters of four units. Each of the clusters contains a 40-bit integer ALU, a 40-bit shifter, a 16-bit multiplier, and a 32-bit adder. The register file is composed of 32 general-purpose 32-bit registers that are divided into two banks of 16 registers, one bank for each cluster.

The instruction fetch mechanism includes a NOP elimination technique that reduces the penalty due to the explicit NOP insertion required in conventional VLIW architectures. The processor fetches a 256-bit meta-instruction, which is composed of eight 32-bit instructions. The least significant bit of each instruction is used to form execution packets among the eight fetched instructions. An execution packet defines a group of instructions that can be executed in parallel. The next meta-instruction fetch is made once all eight instructions contained in the current meta-instruction have been sent to a functional unit. One important feature is that all the instructions can be conditionally executed based on the status of five condition registers.
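The formation of execution packets from a fetched meta-instruction can be sketched as follows. The text only states that the least significant bit of each instruction is used; the chaining convention assumed here (a set bit means that the next instruction issues in the same cycle, a cleared bit closes the packet) and the instruction words themselves are assumptions made for illustration.

```python
def split_execute_packets(fetch_packet):
    """Group the instruction words of one meta-instruction into execution packets.

    ASSUMED convention: bit 0 set means the next instruction executes in
    parallel with this one; bit 0 cleared ends the current execution packet.
    """
    packets, current = [], []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:
            packets.append(current)
            current = []
    if current:  # tolerate a trailing open packet in this simplified model
        packets.append(current)
    return packets


# Hypothetical meta-instruction: eight placeholder words whose bit 0 is the
# parallel bit, here following the pattern 1, 1, 0, 1, 0, 0, 1, 0.
p_bits = [1, 1, 0, 1, 0, 0, 1, 0]
fetch_packet = [(index << 1) | p for index, p in enumerate(p_bits)]

for cycle, packet in enumerate(split_execute_packets(fetch_packet)):
    print(f"cycle {cycle}: {len(packet)} instruction(s) issued")
```

With this encoding a meta-instruction never needs explicit NOPs to pad unused issue slots: a cycle with little parallelism simply produces a short execution packet.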

Even with the NOP elimination technique that reduces the code size penalty of the VLIW architecture, the TMS320C6201 has significant code expansion due to its deep pipeline, lack of branch prediction, and fixed-length 32-bit instructions. The fast 11-stage pipeline causes the complex operations to have different latencies, making the programming task much more difficult since there are several delay slots to fill. For example, the 'C6201 has no branch prediction; therefore, all taken branches introduce a 5-cycle penalty, which corresponds to a 40-instruction (5 cycles times 8 instructions per cycle) branch delay slot. The number of delay slots in the TMS320C6201 is unconventionally large, and it is very difficult to find a sufficient number of delay slot instructions.

The TMS320C6201 exhibits a high power dissipation. In a 0.25 µm technology it consumes, including cache accesses, 4.65 W at 2.5 V, 200 MHz [62].

4.5.2 The Motorola-Lucent Star*Core

The Star*Core [78] is considered a new generation of VLIW DSP processor. It targets a wide range of applications by offering a scalable, high-performance, low-power VLIW DSP architecture. The Star*Core uses 16-bit instructions and introduces optional instruction prefixes that enable the full power of a 32-bit instruction set. This variable instruction length mechanism yields a code density that is much better than that of conventional DSPs, and comparable to those of M·Core and ARM7 with the Thumb option [78].



The Star*Core SC140 [4][3] is the first core of the SC100 family. It is implemented in Motorola's HIP6 0.13 µm process, and it delivers up to 1.2 billion MAC (multiply-accumulate) operations per second, or 3000 MIPS. It has a total of 16 functional units, including MAC units, ALUs, bit field units, and address computation units. Also, it has different sizes for the datapaths: 16 bits for the data, 32 bits for the addresses, and 40 bits for the accumulators. Star*Core's pipeline is composed of only five stages. With a power supply voltage of 1.5 volts, it runs at 300 MHz and consumes 0.1 mA/MIPS.

4.6 The Philips Trimedia

The Philips Trimedia processor [17] is a 32-bit VLIW multimedia machine that targets digital TV and full-speed DVD decoding. The Trimedia architecture provides 128 general-purpose registers and 25 execution units, including constant generators, several ALUs, DSP execution units, integer multipliers, integer shifters, branch units, load-store units, and floating-point units. DSP units have instructions with special functions, such as a single-cycle 8-bit motion estimation that works on 32-bit operands. Trimedia also supports conditional execution. In addition to the VLIW architecture, there is compression hardware that avoids wasting memory space and bandwidth on NOPs. The TM-1000 is built in a 0.35 µm technology, runs at 100 MHz, and consumes around 4 W (typ) at 3.3 V [49].

4.7 The HP/Intel IA-64

Intel and Hewlett-Packard work together to design the new generation of high-performance workstation processors, the IA-64 [25]. The Merced will be the first chip of this family, and it is based on a VLIW-like architecture called EPIC (Explicitly Parallel Instruction Computing). Merced has a 64-bit datapath, and is supposed to have 128 integer registers and 128 floating-point registers. Furthermore, Merced is a fully predicated execution architecture and has strong support for speculative execution.

To avoid NOP insertions, IA-64 groups operations in 128-bit bundles that contain three instructions and one template. The template is used to explicitly describe the available parallelism between instructions within a bundle.
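A bundle of this kind can be modelled as a single 128-bit integer. The text does not give the field widths; the sketch below assumes a 5-bit template in the least significant bits followed by three 41-bit instruction slots (5 + 3 × 41 = 128), which is an assumption of this example, as are the function names and the placeholder slot values.

```python
TEMPLATE_BITS = 5   # ASSUMED template width
SLOT_BITS = 41      # ASSUMED instruction slot width
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Pack one template and three instruction slots into a 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS) or len(slots) != 3:
        raise ValueError("expected a 5-bit template and exactly three slots")
    bundle = template
    for index, slot in enumerate(slots):
        bundle |= (slot & SLOT_MASK) << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Recover the template and the three instruction slots from a bundle."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
             for index in range(3)]
    return template, slots


# Placeholder values: these are not real IA-64 encodings.
bundle = pack_bundle(0x10, [0x123, 0x456, 0x789])
print(hex(bundle), unpack_bundle(bundle))
```

Because the template travels with every group of three instructions, the hardware can read the dependence information directly from the bundle instead of rediscovering it at run time.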

The Merced chip should be released in the year 2000, and it will be built in a 0.18 µm technology. The chip is expected to run at around 800 MHz, and to have a performance advantage of 20-30% over a RISC-like architecture. In terms of power consumption, a Merced module containing 4M of full-speed cache is estimated to dissipate more than 70 W [27].

4.8 Comparison

Table 4.1 summarizes the features of each of the processors described in this chapter. The upper part of the table lists the mobile processors, and the lower part lists the VLIW processors. The MIPS figures are Dhrystone MIPS for the mobile processors, and peak MIPS for the VLIW machines (numbers in italic). Furthermore, numbers beginning with a '?' are estimates.

This table shows that defining the best processor is a hard task because of the many design parameters involved. Considering the trade-off between performance and power dissipation of the mobile processors, the TR4102, the StrongARM, and the SH7750 dominate² all the other

² A processor dominates another processor when both its performance and its power dissipation are better.


No.  Model      Vendor       Techno.  Vdd    Freq.    Power   MIPS   MIPS/W  MIPS²/mW
 1.  StrongARM  Intel        0.35 µm  2 V    230 MHz  360 mW   268     744       200
 2.  ARM710     VLSI         0.8 µm   3.3 V  25 MHz   120 mW    30     250         8
 3.  ARM940T    VLSI         0.35 µm  3.3 V  150 MHz  675 mW  ?160    ?237       ?38
 4.  MMC2001    Motorola     0.35 µm  2 V    34 MHz   80 mW     31     387        12
 5.  TR4102     LSI          0.25 µm  1.8 V  80 MHz   40 mW    ?90   ?2250      ?203
 6.  SH7708     Hitachi      0.5 µm   3.3 V  25 MHz   95 mW     25     263         7
 7.  SH7750     Hitachi      0.25 µm  1.8 V  200 MHz  1.6 W    300     188        56

 8.  'C6201     TI           0.25 µm  2.5 V  200 MHz  4.6 W   1600     348       557
 9.  SC140      Mot./Lucent  0.13 µm  1.5 V  300 MHz  500 mW  3000    6000     18000
10.  TM1000     Philips      0.35 µm  3.3 V  100 MHz  4 W     2500     625      1563
11.  Merced     HP, Intel    0.18 µm  ?      800 MHz  ?70 W   6400      91       585

Table 4.1: Mobile, embedded, and ILP processor comparison.

[Figure 4.1: two scatter plots of the mobile processors 1 to 7. Panel (a) axes: MIPS and Power [mW]. Panel (b) axes: MIPS and mW/MIPS.]

Figure 4.1: Comparison: (a) MIPS vs. Power; (b) MIPS vs. mW/MIPS.

mobile processors (see Figure 4.1(a)). The TR4102 exhibits very low power consumption and has the best MIPS/W rating, meaning that it is the processor that requires the least energy to execute the Dhrystone benchmark. However, in terms of performance the StrongARM and the SH7750 are much better.

For the energy-versus-performance trade-off, the same three processors are dominant (see Figure 4.1(b)). The TR4102 and the StrongARM exhibit roughly the same MIPS²/mW figure, a metric that is inversely proportional to the energy-delay product. The SH7750 has a significantly smaller figure due to its high power dissipation. The only factor that makes the SH7750 a dominant processor is its high level of performance, which might be needed in some time-critical applications.
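The derived columns of Table 4.1 and the dominance relation can be checked with a few lines of Python. This is only a sketch: the power and MIPS values are copied from the table (those of the ARM940T and the TR4102 are estimates there), while the function names are hypothetical and dominance is taken to mean strictly lower power together with strictly higher MIPS.

```python
# (name, power in mW, Dhrystone MIPS) for the mobile processors of Table 4.1
MOBILE = [
    ("StrongARM", 360, 268),
    ("ARM710", 120, 30),
    ("ARM940T", 675, 160),   # MIPS value is an estimate in the table
    ("MMC2001", 80, 31),
    ("TR4102", 40, 90),      # MIPS value is an estimate in the table
    ("SH7708", 95, 25),
    ("SH7750", 1600, 300),
]


def mips_per_watt(power_mw, mips):
    """Energy efficiency: instructions executed per unit of energy."""
    return mips / (power_mw / 1000.0)


def mips2_per_mw(power_mw, mips):
    """Inverse of the energy-delay product, up to a constant factor."""
    return mips ** 2 / power_mw


def dominates(a, b):
    """a dominates b when a has both lower power and higher performance."""
    return a[1] < b[1] and a[2] > b[2]


def non_dominated(processors):
    """Names of the processors that no other processor dominates."""
    return [p[0] for p in processors
            if not any(dominates(q, p) for q in processors)]
```

Under these assumptions, `non_dominated(MOBILE)` returns the StrongARM, the TR4102, and the SH7750, which matches the three dominant processors named in the text.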

The power dissipation of the VLIW architectures is much higher than that of the mobile processors, with the exception of the StarCore, which has a very low power consumption of 500 mW while delivering a peak performance of up to 3000 MIPS. These numbers are difficult to compare with those of the mobile processors, because these architectures are dedicated to very different types of applications. Nevertheless, the StarCore exhibits a very attractive trade-off between performance and power dissipation.

4.9 Conclusion

This chapter provided an overview of the best low-power 32-bit mobile processors available on the market. Their architecture, design, instruction set, performance, power consumption, and code size were described. In embedded processors, ILP is generally exploited only through pipelining. There are a few exceptions. For example, the ARM10 allows several units to work in parallel; however, only one instruction can be issued per cycle. The SH7750 has introduced a superscalar pipeline; however, it places several restrictions on which instructions can be issued in parallel, and although it reaches a very good level of performance, it consumes much more power than the other mobile processors. Surprisingly, VLIW architectures have not yet been introduced in the mobile processor market, even though their inherent simplicity can offer low power consumption and improved performance relative to scalar architectures. Current VLIW architectures are mostly found in DSP multimedia processors, which exhibit a high instruction throughput and can have a very low power consumption, like the Motorola/Lucent StarCore.

The previous chapters outlined that parallelism can be used either to speed up execution or to reduce power dissipation. Both aspects of parallelism are very useful for low-power mobile processors. The rest of this work investigates the trade-offs between energy consumption and performance in VLIW machines, and how VLIW techniques can be introduced into a low-power architecture. The next chapter gives a high-level evaluation of the benefits of VLIW architectures for low-power processors.

Chapter 5

Low-Power VLIW Processors:

A High-Level Evaluation

Previous chapters showed that parallelism can either speed up execution or, when used in conjunction with clock frequency and voltage down-scaling, reduce the total energy consumed to complete a task with no loss of performance. Clearly, designers should embrace techniques with this favorable characteristic whenever possible.
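The energy argument can be made concrete with the usual first-order CMOS model, in which the dynamic energy per operation scales with the square of the supply voltage. The symbols below ($C_{eff}$ for the effective switched capacitance, $N_{ops}$ for the number of operations in the task, $V_1$ and $V_2$ for the supply voltages before and after scaling) and the example voltages are assumptions chosen for illustration, not values from this work; the hardware overhead of the parallel datapath is ignored.

```latex
E_{task} \approx N_{ops} \, C_{eff} \, V_{dd}^{2}
\qquad\Rightarrow\qquad
\frac{E_{task}(V_2)}{E_{task}(V_1)} = \left( \frac{V_2}{V_1} \right)^{2},
\qquad \text{e.g.} \quad
\left( \frac{2\,\mathrm{V}}{3\,\mathrm{V}} \right)^{2} \approx 0.44
```

If a two-issue machine sustains the same throughput at half the clock frequency, and the relaxed timing allows the supply to drop from 3 V to 2 V, the task energy falls to roughly 44% of its original value under these assumptions.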

However, Chapter 4 showed that although pipelining is commonly integrated into low-power embedded architectures, superscalar or VLIW techniques are generally not used in embedded processors. Investigations into the overall energy efficiency of pipelined and superscalar architectures in general-purpose processors demonstrated that superscalar execution does not significantly affect the energy efficiency of such processors [24]. This is mostly due to the hardware overhead introduced by the superscalar architecture. The key to solving this problem is to exploit parallelism and use pipelining while reducing the overhead found in superscalar architectures through the use of advanced compiler techniques. With combined hardware and compiler techniques, much of the work performed by traditional superscalar processors can be moved from run time to compile time. New developments in the VLIW field, such as the HP/Intel IA-64 architecture [25], the TI 'C6201 processors [74], and the future Fujitsu FR-V architecture [28], give a strong motivation for the use of VLIW architectures in low-power processors.

This chapter gives a first high-level quantitative evaluation of the benefits of VLIW architectures for energy-efficient processors. To this end, several implementations of scalar and VLIW architectures are compared in terms of both performance and energy consumption. The remainder of this chapter describes the experiments that have been carried out for this evaluation.

5.1 Description of the Experiment

These experiments compare several scalar and VLIW architectures in order to determine whether VLIW architectures can improve processor energy efficiency. To do so, several VLIW architectures are derived from the existing CoolRISC family [50], and a comparison is made in terms of performance and energy consumption. This comparison relies on high-level estimates of the energy consumption and on the performance achieved on selected pieces of code. As most of the execution time of a program is spent in inner loops, the performance achieved and the energy consumed in inner loops can be considered representative of the execution of the entire program. For example, on an HP-PA 7100 processor, 78% [38] of the execution time of the Perfect Club Benchmark Suite [11] is spent in inner loops. This experiment therefore focuses on the execution of a benchmark suite composed of the inner loops of several programs. This approach loses some precision compared to real code execution on real circuits; however, the evaluation provides an initial validation of the use of VLIW architectures in low-power processors before a complete framework composed of a compiler and a circuit is designed.

To run this experiment, the functionalities of a framework developed at the Universitat Politecnica de Catalunya (DAC, Barcelona, Spain) were extended in order to generate code for our different architectures. Figure 5.1 shows the block diagram of this enhanced framework. The inner loops are extracted from the benchmarks and optimized by the ICTINEO tool [9], which extracts the inner loops of a FORTRAN program and provides an optimized dependence graph for each inner loop. ICTINEO performs several optimizations in order to eliminate unnecessary dependences (which limit instruction-level parallelism) and unnecessary instructions. Unnecessary dependences are eliminated by keeping high-level information about the data dependences: for example, for an access to an array element, ICTINEO keeps the index of the element, which allows the memory dependences to be identified. Indeed, if this information were not kept at the assembly instruction level, it would be impossible to know whether the indirect memory accesses to the various elements of the array are independent. Unnecessary instructions are eliminated using common subexpression elimination and invariant extraction. The first method eliminates groups of instructions that produce the same result. The second moves out of the loop those expressions whose result does not depend on the iteration. These optimizations reduce the number of instructions to be executed.
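The two instruction-elimination techniques can be illustrated on a small loop. The example below is a hand-written sketch in Python (not FORTRAN, and not actual ICTINEO output); the loop body and the names `loop_original` and `loop_optimized` are hypothetical.

```python
def loop_original(a, b, scale, offset):
    """Naive loop: recomputes the same values in every iteration."""
    out = []
    for i in range(len(a)):
        # a[i] + b[i] is evaluated twice (common subexpression), and
        # scale * offset does not depend on i (loop invariant).
        out.append((a[i] + b[i]) * (scale * offset) + (a[i] + b[i]))
    return out


def loop_optimized(a, b, scale, offset):
    """Same result with fewer operations per iteration."""
    k = scale * offset           # invariant extraction: computed once
    out = []
    for i in range(len(a)):
        s = a[i] + b[i]          # common subexpression: computed once
        out.append(s * k + s)
    return out
```

Per iteration, the optimized body performs one addition instead of two for `a[i] + b[i]` and no multiplication for `scale * offset`, which is the kind of reduction in executed instructions described above.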

[Figure 5.1 block diagram. Blocks: Benchmark, Inner Loop Extraction, Loop Optimizations, Code Generator, Machine Description, Scheduling (SMS: Swing Modulo Scheduling). Tool labels: ICTINEO, CoolRISC. Outputs: Performance, Energy Consumption.]

Figure 5.1: Block diagram of the experimental framework.

After that, the code and its corresponding dependence graph are generated for the CoolRISC 8-bit and 16-bit instruction sets. Then, using the dependence graph and a machine description, the operations are scheduled with software pipelining, because it is the most effective compilation technique for loop parallelization. The software pipelining technique used is Swing Modulo Scheduling (SMS) [39]. SMS tries to maximize performance and, in addition, includes heuristics to decrease the high register requirements of software-pipelined loops [36]. When the number of registers required is higher than the number available, spill code [18] (i.e., instructions which temporarily save the contents of some registers into the data memory) has to be introduced, increasing energy consumption. When required, spill code was added to the software-pipelined loops using the heuristics described in [37].

Figure 5.2: Parallel execution of a loop using software pipelining.

Figure 5.2 shows the principle of software pipelining. In the sequential execution of a loop, each iteration and each instruction are executed sequentially. Software pipelining rearranges the instructions, according to the dependences and architectural constraints, in order to obtain a loop divided into SC stages (three in our example) which can be executed in parallel. Every stage is executed in II (Initiation Interval) cycles, and multiple instructions can be executed in parallel.
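The effect of SC and II on execution time can be estimated with the standard cycle count for a software-pipelined loop. The sketch below assumes that one iteration scheduled on its own takes SC x II cycles and that a new iteration starts every II cycles; the function names and the example values are hypothetical.

```python
def sequential_cycles(n_iter, sc, ii):
    """Cycles when iterations run back to back, each taking sc * ii cycles."""
    return n_iter * sc * ii


def pipelined_cycles(n_iter, sc, ii):
    """Cycles when a new iteration starts every ii cycles.

    The last iteration starts after (n_iter - 1) * ii cycles and needs
    sc * ii more cycles to drain, giving (n_iter + sc - 1) * ii in total.
    """
    return (n_iter + sc - 1) * ii


def speed_up(n_iter, sc, ii):
    return sequential_cycles(n_iter, sc, ii) / pipelined_cycles(n_iter, sc, ii)
```

For 100 iterations with SC = 3 and II = 2, this gives 600 cycles sequentially against 204 cycles when pipelined; for long loops the speed-up approaches SC.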

The following sections describe the compared architectures and the consumption model, and finally present and discuss the results.

5.2 CoolRISC 816: A Low-power 8-bit Processor

The CoolRISC 816 [50] has been developed by the Centre Suisse d'Electronique et Microtechnique (CSEM, Neuchâtel, Switzerland). As this thesis aims at extending the features of the CoolRISC family, the CoolRISC 816 is the baseline processor for this experiment.

The following subsections present the CoolRISC 816, focusing on its main architectural features, its performance limitations, and the distribution of its energy consumption.

5.2.1 The CoolRISC 816 Architectural Characteristics

The CoolRISC 816 is designed to be an ultra-low-power embedded 8-bit microcontroller, and has the following characteristics (core only):

• Harvard architecture: separate code and data memory

• Three-stage non-blocking pipeline (IPC = 1.0)

• Sixteen 8-bit registers

• 22-bit wide instructions

• A maximum of 64k x 22 bits of ROM code memory

• A maximum of 64k x 8 bits of RAM data memory

• 8b x 8b parallel-parallel multiplier

• Clock frequency of up to 18 MHz

• Typical consumption of 105 µW/MHz at 3 volts

• 19,000 transistors

• 0.5 µm three-metal-layer CMOS technology (Mietec)

• 0.8 mm² area

The CoolRISC instruction set contains low-power instructions such as FREQ and HALT, which respectively reduce the microcontroller's clock frequency and stop all activity in the processor. The CoolRISC addressing modes include direct addressing, indirect addressing with offset, and pre-decrementation or post-incrementation. The ALU's operands may be registers, immediate values, or memory data. The ALU's result is always stored in a register, which can be different from the operand registers.

These features allow the CoolRISC 816 to obtain an ultra-low power dissipation while achieving a good level of performance compared to the other 8-bit microcontrollers available on the market.

5.2.2 The Performance of CoolRISC 816

The CoolRISC 816 has a non-blocking pipeline which allows it to execute an instruction every cycle without extra delay due to pipeline stalls. From a performance point of view, the primary limitation of the CoolRISC architecture is its clock frequency: the maximum clock frequency of the CoolRISC 816 core is 18 MHz. Generally, the maximum working frequency is limited by the access time of the code memory, and the CoolRISC sacrifices access time for low power. Table 5.1 shows the energy consumption and the access time of the code memory used by the CoolRISC 816 for a power supply voltage of 3 V. As the access time of the code memory must not exceed one quarter of the clock period, at 18 MHz the required memory access time is about 15 ns. This means that for code memories of 4k words or more, the maximum clock frequency is imposed by the memory access time.

ROM Size   Energy (typ) [µW/MHz]   Access Time [ns]

256 x 22   75                      5
4k x 22    205                     20
16k x 22   375                     40

Table 5.1: Characteristics of CoolRISC's low-power ROM (Vdd=3V).
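The relation between the ROM access time and the achievable clock frequency can be sketched as follows. The code takes the quarter-period rule literally (clock period = 4 x access time) and caps the result at the 18 MHz limit of the core; the function name `max_clock_mhz` is hypothetical, and the access times are those of Table 5.1.

```python
CORE_MAX_MHZ = 18.0  # maximum clock frequency of the CoolRISC 816 core

# ROM access times in nanoseconds, from Table 5.1
ROM_ACCESS_NS = {"256 x 22": 5, "4k x 22": 20, "16k x 22": 40}


def max_clock_mhz(access_ns, core_max_mhz=CORE_MAX_MHZ):
    """Highest clock frequency allowed by the core and by the code memory.

    The memory access must fit in one quarter of the clock period,
    so the shortest usable period is 4 * access_ns nanoseconds.
    """
    memory_limit_mhz = 1000.0 / (4 * access_ns)
    return min(core_max_mhz, memory_limit_mhz)
```

With the values of Table 5.1, this rule gives 18 MHz (limited by the core) for the 256-word ROM, 12.5 MHz for the 4k-word ROM, and 6.25 MHz for the 16k-word ROM.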


5.2.3 The Energy Consumption of the CoolRISC 816

The energy consumption of the CoolRISC 816 can be divided into three parts: the core, the data memory, and the code memory. Figure 5.3 shows the typical distribution of the energy consumption when the CoolRISC is executing a program. This data was obtained by executing a set of programs and extracting the relative utilization of the core, data memory, and code memory. The set of programs consisted of a quicksort, a stringsort, an FFT, and a sine/cosine computation. The average resource utilization is presented in Table 5.2.

[Figure 5.3: stacked bar chart, one bar per memory configuration (code memory size, data memory size), vertical axis from 0% to 100%.]

Configuration    Core   Code memory   Data memory

256x22, 128x8    49%    35%           16%
4kx22, 2kx8      26%    50%           24%
16kx22, 8kx8     16%    56%           29%

Figure 5.3: Energy consumption distribution in the CoolRISC 816.

Table 5.2: Relative utilization of the core, the code memory, and the data memory

Core          100%, the core is used every time
Code memory   100%, one instruction is fetched at each cycle
Data memory   40% of instructions access the data memory

Figure 5.3 shows that the energy consumed by the processor core corresponds to less than

50% of the total energy consumption and that the major sources of energy consumption are the

memories.

5.3 Compared Architectures

This section introduces the different architectures that are compared in this experiment. All of these architectures are based on the CoolRISC architecture and use the same low-power memories as the CoolRISC 816. These memories have no sense amplifiers; consequently, their static energy consumption can be neglected. Therefore, there is no additional penalty due to the width of the instruction words.

The evaluated architectures are divided into two groups: the 8-bit and 16-bit scalar architectures, and the 8-bit and 16-bit VLIW architectures.



5.3.1 Scalar Architectures

The compared scalar architectures are the 8-bit and 16-bit members of the CoolRISC family. The CoolRISC 816 (C8) is the baseline processor of this experiment and is described in Section 5.2. The CoolRISC 1616 (C16) processor is a 16-bit version of the CoolRISC 816, the only difference being that all the data are 16 bits wide.

5.3.2 VLIW Architectures

VLIW architectures may suffer from an increase in code size due to explicit NOP insertion. To solve this problem, the new generation of VLIW processors, such as the TI 'C6201 [74] and HP/Intel IA-64 [25], use special encoding techniques which eliminate the extra NOP instructions. Figure 5.4 illustrates this technique. Each VLIW instruction encodes several operations (four in our example) which may be dependent or independent. An additional field is added to specify the group of operations that will be executed in parallel. The unit number field specifies which unit must execute the operation, and the separator bit between two operations within an instruction is set to '0' if the two can be executed in parallel and to '1' if they must be executed sequentially. The hardware costs of NOP elimination are the extra bits added to the code memory (3 bits per operation in our example) and the crossbar needed to send the operations to their corresponding units. However, this technique prevents the increase in code size (and therefore in consumption) caused by the insertion of extra NOPs. For example, in our experiment a VLIW processor with four units has a speed-up of about 2. This means that 50% of the operations would be extra NOPs. Therefore, a VLIW architecture with NOP elimination reduces the code size by a factor of two compared to a VLIW processor without it. This NOP elimination technique is used in all the compared architectures.

Figure 5.4: VLIW architecture: NOP elimination.
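The encoding can be sketched in a few lines of Python. This is an illustrative model only, not the actual CoolRISC-based encoding: operations are represented by strings, each encoded operation carries a unit number and a separator bit as described above, every cycle is assumed to contain at least one real operation, and the 22-bit operation width used in the usage note is an assumption borrowed from the scalar CoolRISC instruction width.

```python
NOP = "nop"


def padded_size(schedule, n_units, op_bits):
    """Code size in bits when every cycle fills all n_units slots with NOPs."""
    return len(schedule) * n_units * op_bits


def compact(schedule):
    """Drop NOPs and tag each remaining operation with (unit, separator).

    separator = 0: the next operation executes in the same cycle;
    separator = 1: the next operation starts a new cycle.
    Assumes every cycle holds at least one real operation.
    """
    encoded = []
    for cycle in schedule:
        ops = [(unit, op) for unit, op in enumerate(cycle) if op != NOP]
        for k, (unit, op) in enumerate(ops):
            sep = 1 if k == len(ops) - 1 else 0
            encoded.append((op, unit, sep))
    return encoded


def expand(encoded, n_units):
    """Rebuild the per-cycle schedule, as the dispatch crossbar would."""
    schedule, cycle = [], [NOP] * n_units
    for op, unit, sep in encoded:
        cycle[unit] = op
        if sep == 1:
            schedule.append(cycle)
            cycle = [NOP] * n_units
    return schedule


def compact_size(encoded, op_bits, extra_bits=3):
    """Code size in bits with extra_bits of overhead per operation."""
    return len(encoded) * (op_bits + extra_bits)
```

For a three-cycle schedule on four units that contains five real operations, the padded form needs 3 x 4 x 22 = 264 bits while the compact form needs 5 x 25 = 125 bits, despite the 3 extra bits per operation.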

Heterogeneous VLIW architectures. Heterogeneous VLIW architectures are the most common among existing VLIW architectures. The term heterogeneous indicates that the units are different, which in turn means that an operation must be dispatched to a unit capable of executing it. The compared architectures are the following:

• V8E1: 8-bit VLIW, 1 Branch unit, 2 ALUs, and 1 Load/Store unit;

• V8E2: 8-bit VLIW, 1 Branch unit, 2 ALUs, and 2 Load/Store units;

• V16E1: 16-bit VLIW, 1 Branch unit, 2 ALUs, and 1 Load/Store unit;

• V16E2: 16-bit VLIW, 1 Branch unit, 2 ALUs, and 2 Load/Store units.

Homogeneous VLIW architectures. Homogeneous VLIW architectures are VLIW architectures composed of several units which are able to execute any kind of operation. We compare the following architectures:

• V8H1: 8-bit VLIW, four homogeneous units with one memory access at a time;

• V8H2: 8-bit VLIW, four homogeneous units with two memory accesses at a time;

• V16H1: 16-bit VLIW, four homogeneous units with one memory access at a time;

• V16H2: 16-bit VLIW, four homogeneous units with two memory accesses at a time.

The NOP elimination technique described above is used in all of these VLIW architectures. Nevertheless, when the units are homogeneous there is no need for a crossbar and a unit number field to dispatch the operations to their corresponding units. An operation is always executed by the same unit, determined by its position in the VLIW instruction.

5.4 Consumption Model

The consumption model is based on the utilization of resources. The energy needed to execute a task is computed by adding the energy consumed by the different resources:

$E_{oper}$: energy needed by the processor core to execute an operation;

$E_{code}$: energy needed for an access to the code memory;

$E_{data}$: energy needed for an access to the data memory;

$E_{conn}$: energy consumed in the interconnection (e.g., crossbar);

$E_{RF}$: extra energy consumption (overhead) due to the increase in the number of register file access ports.

After executing a loop, the number of accesses to each resource, $N_{resource\_name}$, is known, and an estimate of the energy consumption can therefore be computed:

$$E_T = N_{oper} \cdot E_{oper} + N_{code} \cdot E_{code} + N_{data} \cdot E_{data} + N_{conn} \cdot E_{conn} + N_{RF} \cdot E_{RF} \qquad (5.1)$$
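Equation 5.1 is a weighted sum and can be evaluated directly. The following Python sketch only shows the structure of the computation; the resource keys, the function name `total_energy`, and all numerical values in the usage note are hypothetical and are not measurements from this work.

```python
RESOURCES = ("oper", "code", "data", "conn", "rf")


def total_energy(counts, energies):
    """Estimate E_T as the sum over resources of N_resource * E_resource.

    counts:   number of accesses to each resource while executing the loop
    energies: energy per access for each resource (any consistent unit)
    """
    return sum(counts[r] * energies[r] for r in RESOURCES)
```

As a usage example with made-up numbers, 1000 operations at 2.0 units each, 250 code fetches at 5.0, 400 data accesses at 3.0, and 1000 interconnect and register file overhead events at 0.5 and 0.25 give a total of 5200 energy units.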

5.4.1 Estimate of Eoper

As the CoolRISC 816 is our reference processor, we base the energy consumption estimates on the energy consumption characteristics of the C8 processor, which are extracted from the real implementation of the processor.

Because the compared VLIW architectures use the NOP elimination technique, their instructions contain predecoded bits that indicate which units must work. Therefore, it is possible to halt the signal activity of all unused units. As a consequence, the units which do not execute an operation do not consume any energy.



For the heterogeneous VLIW architectures, the energy needed to execute an operation, $E_{oper}$, is estimated as the energy consumption of the operational part of the C8 or C16 processor (pipeline, decoder, register file accesses, ALU operation).

For the homogeneous VLIW architectures, the energy needed to execute an operation, $E_{oper}$, is estimated as the energy needed to execute a scalar instruction in the C8 or C16.

This high-level modeling of the energy consumed during the execution of an instruction implies a certain loss of precision. However, one factor limits the impact of the estimation error: as described in Subsection 5.2.3, the energy consumed in the processor core represents only a small part (about 30% to 50%) of the total energy consumption.

5.4.2 Estimate of Ecode and Edata

The energy needed for a memory access (code or data) is estimated through a statistical energy consumption model of the memory architecture. This model takes into account the type of memory (RAM or ROM), its size (in words), its geometry (number of rows and columns), the width of the word, and the power supply voltage. The technological parameters are extracted from a 0.5 µm CMOS process. In our experiment we use the typical value of the energy consumption per memory access.

5.4.3 Estimate of Econn and ERF

The extra energy consumption due to the interconnection is estimated using a statistical model of the energy consumption of the crossbar and of the circuit overhead due to the additional access ports of the register file.

5.5 Benchmarks

For the experimental evaluation, we used a set of 25 integer loops, divided into three groups. The first includes five integer loops which operate on 8-bit data: FIR filter, vector-matrix multiplication, vector-vector multiplication (dot product), vector-vector addition, and function integration. The second consists of the same five integer loops operating on 16-bit data. Finally, the third group is composed of 15 16-bit integer loops from the Perfect Club Benchmark Suite [11].

5.6 Results

In this section we compare the performance and energy consumption of the architectures described in Section 5.3. The same power supply voltage (Vdd = 3 V) and clock frequency (imposed by the access time of the code memory) are used for all the compared processors, and the experiment is repeated for several memory configurations.

In Figure 5.5 we compare the performance, in terms of speed-up, of the different processors with respect to the C8 processor.

Figure 5.6 shows the ratio between the energy consumption of the different processors and that of the C8 while executing our benchmark. It also illustrates the energy consumption distribution.

Figure 5.7 shows the ratio between the energy-delay product achieved by the different processors and by the C8 processor while executing our benchmark.

From these three figures we can observe the advantage of the transition (1) from an 8-bit to a 16-bit architecture, and (2) from a scalar to a VLIW architecture.

[Figure 5.5: bar chart of the speed-up relative to C8 (vertical axis from 0.0 to 8.0) for C8, V8E1, V8E2, V8H1, V8H2, C16, V16E1, V16E2, V16H1, and V16H2.]

Figure 5.5: Speed-up comparison.

[Figure 5.6: stacked bar charts of the energy consumption relative to C8 (vertical axis from 0 to 1.2), split into core, code, data, and interconnection, for C8, V8E1, V8E2, V8H1, V8H2, C16, V16E1, V16E2, V16H1, and V16H2, one panel per memory configuration. Panel titles recovered: "CODE: 256 INSTRUCTIONS, DATA: 128 WORDS" and "CODE: 4k INSTRUCTIONS, DATA: 2k WORDS".]

Figure 5.6: Energy comparison.

[Figure 5.7: bar charts of the energy-delay product relative to C8 (vertical axis from 0.00 to 1.20) for C8, V8E1, V8E2, V8H1, V8H2, C16, V16E1, V16E2, V16H1, and V16H2, one panel per memory configuration. Panel title recovered: "CODE: 256 INSTRUCTIONS, DATA: 128 WORDS".]

Figure 5.7: Energy-Delay Product comparison.

The transition from an 8-bit to a 16-bit architecture yields a major improvement in the energy-delay product (approximately a factor of four). This is a consequence of the smaller number of instructions required to execute the benchmark, which contains a majority of 16-bit data. Performance increases by a factor of about 2.4 while energy consumption decreases by a factor of about 1.7. This result shows the importance of an architecture that can efficiently process the data of the application.
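The factor of four follows directly from the two ratios quoted above, since the energy-delay product is the product of the energy and the execution time. In the worked equation below, $E$ and $T$ denote the energy and execution time of the benchmark on each processor; the subscripts are shorthand introduced here for illustration.

```latex
\frac{EDP_{C16}}{EDP_{C8}}
= \frac{E_{C16}}{E_{C8}} \cdot \frac{T_{C16}}{T_{C8}}
\approx \frac{1}{1.7} \cdot \frac{1}{2.4}
= \frac{1}{4.08}
\approx 0.25
```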

The transition from a scalar to a VLIW architecture significantly improves the energy-delay product (by a factor varying between 2.0 and 2.8). Indeed, VLIW architectures achieve better performance while consuming approximately the same energy as scalar architectures. This observation is explained by a redistribution of the energy consumption: the increase in energy consumption of the VLIW core is compensated by a decrease in the energy consumption of the code memory. The increase is due to the circuit overhead introduced by the interconnections ($E_{RF}$ and $E_{conn}$). The decreased consumption of the code memory, on the other hand, has two causes. First, the memories employed do not have sense amplifiers and therefore do not consume static energy (i.e., there is no penalty for the larger instruction words). Second, a VLIW processor requires less energy to fetch an operation than a scalar architecture. In fact, as the energy consumed by the line decoder is independent of the width of the word, the energy needed to fetch four instructions simultaneously is less than four times the energy needed to fetch one instruction.
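The second point can be written as a simple inequality. The decomposition below is a sketch under the assumption that a fetch costs a width-independent line decoder term $E_{dec}$ plus a term $E_{op}$ for each fetched operation; both symbols are introduced here for illustration only.

```latex
E_{fetch}(n) = E_{dec} + n \, E_{op}
\quad\Rightarrow\quad
E_{fetch}(4) = E_{dec} + 4 \, E_{op}
\;<\; 4 \left( E_{dec} + E_{op} \right) = 4 \, E_{fetch}(1)
\quad \text{for } E_{dec} > 0
```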

The main difference between homogeneous and heterogeneous VLIW architectures is in terms of performance. The former reach a higher level of performance due to their higher machine parallelism; the downside is their higher core complexity, which entails a higher energy consumption. Therefore, if the ILP is insufficient, the homogeneous and heterogeneous VLIW architectures have a similar energy-delay product, since there is no significant difference in terms of speed-up. On the other hand, if sufficient ILP can be extracted, the homogeneous architecture attains a higher level of performance (as is the case for our 8-bit processors), ultimately eclipsing the heterogeneous one with respect to the energy-delay product.

5.7 Conclusion

In this chapter we have shown that adapting high-performance architectures, such as the VLIW architecture, to low-power embedded 8-bit or 16-bit microcontrollers using low-power memories yields a significant improvement in the energy-delay product compared to a scalar processor. This improvement, by a factor varying between two and three, is obtained through a redistribution of the energy consumption, which enables a higher level of performance while keeping the energy consumption at the same level. We have also shown the importance of using a processor adapted to the size of the data in order to minimize the number of instructions executed, which leads to a decrease in both the energy consumption and the execution time.

Our results are based on loop parallelization and on a high-level energy consumption model, which allow us to validate the use of VLIW architectures for high-performance low-power processors and to identify which VLIW architecture provides the best results. The next step will be to develop a complete VLIW compiler and a prototype of such a low-power VLIW processor in order to extend these results to the real world.



Chapter 6

The DEVIL Low-power Processor

The previous chapter has validated the use of VLIW architectures for low-power processors using high-level estimates. The next step in the design flow is to define and implement a low-power VLIW processor with its compiler in order to obtain more accurate results about the system features.

This chapter describes the instruction set architecture of DEVIL, our low-power VLIW architecture. First, the design trade-offs of VLIW architectures are revisited in order to highlight the important points that have to be taken into account when defining a new VLIW architecture. Second, the DEVIL architecture is described and the design decisions are motivated. Third, the DEVIL processor is evaluated in terms of performance and memory utilization. Finally, a comparison with existing instruction set architectures is made.

6.1 Where Is The Complexity in VLIW Architectures?

Introducing a multiple-issue pipeline into a processor adds complexity to the architecture. This section revisits the architectural changes that are necessary to introduce a VLIW-like pipeline into an architecture in order to better understand what the trade-offs are and where new solutions should be found.

6.1.1 Hardware Duplication

The most obvious increase in complexity is probably the hardware duplication (i.e., unit repli-

cation) needed to execute more than one instruction per cycle. Hardware duplication results in

an increase in the circuit die area, implying a higher circuit cost and potentially a higher power

consumption.

Although increasing the number of functional units (FUs) of a superscalar pipeline raises the number of instructions that can be executed in parallel, other factors limit the achievable parallelism:

• The number of registers of the architecture versus the register pressure,

• The number of register ports,

• The type of instructions,

• The ILP available in the application (dependencies).

Therefore, the amount of hardware duplication should be adapted to these constraints.

Number of FUs versus number of registers � Executing more than one instruction

per cycle increases the register requirements. When the number of live registers is greater than

the number of available registers, spill-code should be inserted to temporary save and restore

registers to and from data memory. As the number of accesses to the data memory that can be

made in parallel are generally bounded (usually no more than two), this extra code is likely to

result in a degradation of system performance.

Number of FUs versus number of register �le ports � Adding extra functional

units also means that the units must exchange data via the register �le and the data memory,

resulting in an increase in the number of port accesses to such data storage elements. However,

increasing the number of port accesses directly a�ects the complexity, access time, and power

consumption of the register (or of the memory). This explains why the number of access ports

is generally limited, bounding the machine parallelism.

Number of FUs versus types of FU � The type of instructions in a program (e.g.,

branches, ALU operations) can also limit the e�ciency of the hardware duplication. Indeed,

although it is simple to parallelize computationally-intensive code, it is much more di�cult to

parallelize conditional branches, and generally processors can issue only one branch per cycle.

As applications contain a signi�cant amount of branches (around 20% of the total number of

instructions [29]), this severely limits the machine parallelism. The same problem occurs with

memory operations.

Number of FUs versus available ILP: The machine parallelism has to be adapted
to the parallelism that can be extracted from the targeted applications.

The choice of the amount of machine parallelism should take into account all the above
factors in order to obtain the best trade-off between performance improvement and hardware
overhead.

6.1.2 Code Memory

Chapter 5 showed the importance of code memory utilization in terms of power consumption
and performance. VLIW architectures, by their nature, strongly modify the interface between
the processor and the code memory. As a result, VLIW machines can incur a large penalty
in terms of code size and memory bandwidth, directly affecting the circuit die area, the energy
consumption, the cost, and the instruction cache performance.

Originally, VLIW processors encoded in their instruction words the operations that each
functional unit should execute at the same time, resulting in the insertion of explicit no-operation
instructions (NOPs) for unused functional units. These NOP insertions increase the
code size, the memory bandwidth, and the energy consumption of the code memory. Another
factor that affects the code memory utilization is the need for superscalar optimizations to
extract more parallelism from programs. Such optimizations generally imply a large amount of
code duplication (e.g., loop unrolling, tail duplication), resulting in a non-negligible increase in
code size.

Increase in code size: Code size directly affects the code memory die area, resulting
in an increase in system cost. Figure 6.1 shows the die area as a function of the memory
size for an ultra-low-power embedded memory developed by CSEM.

Power consumption is also correlated with the code memory size. Figure 6.2 illustrates
the relation between code size and power consumption.

Increase in memory width: Accessing a wider memory implies a greater energy
consumption. Figure 6.2 shows the increase in the energy required to access a 64-bit wide
memory compared to a 32-bit one as a function of the memory size.

Figure 6.1: ROM code memory die area as a function of code size.

Increase in the number of accesses to the code memory: The energy consumption
of the code memory depends linearly (in the case of a static design) on the number of accesses
to the memory, which highlights the importance of reducing the traffic between the processor
core and the memories.

In Chapters 4 and 5, several solutions were described to reduce these negative effects
by including additional scheduling information in the instruction word. This mechanism (e.g.,
bundle formation in IA-64) requires extra hardware in order to dispatch the
instructions according to the encoded scheduling information. This approach trades off the
inherent simplicity of a VLIW's fetch mechanism for a reduction of the instruction memory
overhead, while keeping the parallelism detection much simpler than the dynamic instruction
schedulers found in superscalar processors.

6.2 Definition of the DEVIL Processor

DEVIL is a 32-bit VLIW machine that contains two ALUs, one branch unit, and one load/store

unit, implemented with a 3-stage pipeline. DEVIL can issue up to two instructions per cycle

with the restriction that neither two branch operations nor two load/store operations can be

parallelized together. Furthermore, DEVIL proposes a new encoding mechanism that combines

a NOP elimination technique (i.e., encodes scheduling information) with a variable instruction

length mechanism. Figure 6.3 depicts the block diagram of the DEVIL processor.
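The issue restriction stated above follows directly from the number of functional units of each
kind. The sketch below is one way to express it; the enum and function names are hypothetical and
are not taken from the DEVIL tool chain.

```python
from enum import Enum


class Unit(Enum):
    """Functional unit classes named in the text."""
    ALU = "alu"
    BRANCH = "branch"
    MEM = "load/store"


# Number of units of each kind in DEVIL, as given in the text.
UNIT_COUNT = {Unit.ALU: 2, Unit.BRANCH: 1, Unit.MEM: 1}


def can_issue_together(a: Unit, b: Unit) -> bool:
    """Return True if two operations may share one issue cycle.

    Two operations of the same class need two units of that class; operations
    of different classes never compete for a functional unit.
    """
    if a is b:
        return UNIT_COUNT[a] >= 2
    return True


print(can_issue_together(Unit.ALU, Unit.ALU))        # two ALU operations
print(can_issue_together(Unit.BRANCH, Unit.BRANCH))  # two branches
```

A scheduler for a 2-issue machine of this kind would apply such a check, in addition to the usual
data-dependence checks, before pairing two operations.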

DEVIL targets the 32-bit mobile processors market and aims to be used as an ASIC

core. This imposes strong constraints in terms of system cost, circuit die area, and power

consumption.


Figure 6.2: Power consumption of the ROM code memory as a function of code size.

6.3 DEVIL's Registers

In order to support parallel execution, the register file requires a greater number of registers.
Scott Mahlke et al. [42] have shown that 16 registers are sufficient to exploit ILP in multiple-issue
machines with no performance loss due to register pressure. DEVIL contains 16 32-bit
general purpose registers, like the majority of current scalar mobile processors. DEVIL also
has some dedicated registers, called macro-registers. DEVIL's available registers are:

• r0-r15: 32-bit general purpose registers,
• sp: 32-bit stack pointer (=r15),
• pc: 32-bit program counter,
• retaddr: 32-bit return address, used to save the pc during a jump to subroutine instruction,
  and also to restore the pc when a return from subroutine instruction is executed,
• retaddri: 32-bit return from interrupt address, used to save the pc while handling an
  interrupt, and also to restore the pc when a return from interrupt instruction is executed,
• sr: status register that contains the comparison flag T and the current interrupt level.

6.4 DEVIL's Instruction Set

DEVIL's instruction set is based on a standard RISC instruction set, meaning that memory
operands can only be accessed using load/store instructions. Choosing a RISC-like approach
allows a simpler and faster pipeline, and simplifies the introduction of a superscalar pipeline,
at the cost of a lower code density.

[Block diagram labels: FETCH stage, DECODE stage, ALU/MEM/WB stage; code memory (address, data),
dispatcher, decoders, register file (16 32-bit registers), left functional unit (LU) and right
functional unit (RU) with shifter and ALU, data memory unit (MAR), data memory (address, data),
macro-register unit, PC, T, interrupt controller; operation classes: Branch ops, ALU1 ops,
ALU2 ops, Load/Store ops.]

Figure 6.3: Block diagram of the DEVIL architecture.

In order to mitigate this major drawback, DEVIL introduces a variable instruction length
mechanism similar to the one found in the ARM Thumb extension or in the TinyRISC. DEVIL
instructions can be either in 15-bit format (short instruction) or in 30-bit format (large instruction).
Large instructions can encode large immediate values and offer the possibility to specify a
destination register different from the source. In short instructions the immediate value size is
limited and, for operations requiring two sources and one destination, the destination must be
the same as one of the sources.

The following subsections describe the features of DEVIL's instruction set. Appendix A

contains more detailed information about DEVIL's instructions.

6.4.1 Arithmetical Operations

DEVIL supports only simple 32-bit integer operations and does not include multiplication
and division instructions. The destination operand is always one of the 16 general purpose
registers, and the source operands can be either registers or immediate values. As a general
rule, short instructions use 5-bit immediate values and can only specify two operands, while
large instructions allow 16-bit immediate values and three operands. Furthermore, some large
operations can shift one source operand and be conditionally executed depending on the T flag
with no overhead. Table A.1 describes DEVIL's ALU operations.
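The choice between the two formats can be sketched as a small selection routine, as an assembler
or compiler back-end might implement it. This is only an illustration of the general rule stated
above: the function name is hypothetical, immediates are assumed to be unsigned, and the exact
per-instruction rules are those of Table A.1, which this sketch does not reproduce.

```python
from typing import Optional


def pick_alu_format(dest: int, src: int,
                    src2: Optional[int] = None,
                    imm: Optional[int] = None) -> int:
    """Return the instruction width in bits (15 or 30) for an ALU operation.

    Exactly one of `src2` (second source register) or `imm` (immediate)
    must be given. Register operands are register numbers.
    """
    if (src2 is None) == (imm is None):
        raise ValueError("give exactly one of src2 or imm")

    if imm is not None:
        # Short form: two operands (dest is also the source), 5-bit immediate.
        if dest == src and 0 <= imm < (1 << 5):
            return 15
        # Large form: three operands, 16-bit immediate.
        if 0 <= imm < (1 << 16):
            return 30
        raise ValueError("immediate does not fit in 16 bits")

    # Register-register: the short form requires dest to equal one source.
    return 15 if dest in (src, src2) else 30


print(pick_alu_format(1, 1, imm=3))    # fits the short format
print(pick_alu_format(1, 2, imm=3))    # distinct destination: large format
```

Preferring the short format whenever it applies is what lets the variable-length mechanism
recover code density.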


6.4.2 Logical Operations

Logical operations are described in Table A.2, and can be classified in two categories: (1)
logical operations between one register and one immediate value; (2) logical operations between
registers.

The logical operations with immediate values are only available as large instructions. The
immediate values are 16 bits wide, which implies that such operations work on half-words (the
other half remaining unchanged). An instruction extension .l or .h indicates whether the operation
applies to the two least significant or the two most significant bytes. These operations
are particularly useful for bit-field manipulations.
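The half-word behaviour described above can be modelled in a few lines. The sketch below is a
behavioural model only: the function name and the operation names "and", "or" and "xor" are
assumptions for illustration (the text itself names only ori.h; the full list is in Table A.2).

```python
MASK32 = 0xFFFFFFFF


def logic_imm_half(reg: int, imm16: int, op: str, half: str) -> int:
    """Apply a 16-bit logical immediate to one half of a 32-bit register.

    `half` is "l" for the two least significant bytes or "h" for the two
    most significant bytes; the other half is returned unchanged.
    """
    if not 0 <= imm16 <= 0xFFFF:
        raise ValueError("immediate must fit in 16 bits")
    if half not in ("l", "h"):
        raise ValueError("half must be 'l' or 'h'")
    shift = 0 if half == "l" else 16

    field = (reg >> shift) & 0xFFFF        # the half-word being modified
    if op == "and":
        field &= imm16
    elif op == "or":
        field |= imm16
    elif op == "xor":
        field ^= imm16
    else:
        raise ValueError("unknown operation: " + op)

    untouched = reg & MASK32 & ~(0xFFFF << shift)
    return untouched | (field << shift)


# Set the low byte of the upper half-word without disturbing the lower half.
print(hex(logic_imm_half(0x12345678, 0x00FF, "or", "h")))
```

Because the untouched half is preserved, a bit field in either half can be set, cleared, or
toggled with a single instruction.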

Logical operations between registers can be specified as either short or large operations.
In large operations three operands can be specified, instead of two for short operations. Large
instructions can also shift one source operand and be conditionally executed depending on the
T flag with no overhead.

6.4.3 Compare Operations

DEVIL's instruction set contains only five of the ten standard integer comparison operations.
The remaining five conditions are obtained by using the inverse of the comparison flag (T).
For example, conditional branch instructions can jump either if the comparison is true or if
it is false, allowing all kinds of conditional jumps (see subsection 6.4.5). This method reduces
the number of comparison operations from 20 to 10, and is also used in the Motorola M•CORE
family.
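The idea of halving the comparison set can be illustrated as follows. Which five comparisons
DEVIL actually implements is given in Table A.3 and is not reproduced here; the set chosen below
(eq, lt, le, ltu, leu) and all names are assumptions made for the sake of the example.

```python
from typing import Callable, Dict, Tuple

MASK32 = 0xFFFFFFFF


def unsigned(x: int) -> int:
    """Reinterpret a 32-bit two's-complement value as unsigned."""
    return x & MASK32


# Hypothetical set of five comparisons implemented in hardware.
IMPLEMENTED: Dict[str, Callable[[int, int], bool]] = {
    "eq": lambda a, b: a == b,
    "lt": lambda a, b: a < b,
    "le": lambda a, b: a <= b,
    "ltu": lambda a, b: unsigned(a) < unsigned(b),
    "leu": lambda a, b: unsigned(a) <= unsigned(b),
}

# Each remaining condition is the logical inverse of an implemented one.
INVERSE_OF = {"ne": "eq", "ge": "lt", "gt": "le", "geu": "ltu", "gtu": "leu"}


def lower_condition(cond: str) -> Tuple[str, bool]:
    """Map any of the ten conditions to (comparison to emit, branch when T is set?)."""
    if cond in IMPLEMENTED:
        return cond, True
    return INVERSE_OF[cond], False


def branch_taken(cond: str, a: int, b: int) -> bool:
    """Evaluate a conditional branch on `cond` using only the five comparisons."""
    compare, jump_if_set = lower_condition(cond)
    t_flag = IMPLEMENTED[compare](a, b)   # result that would be stored in T
    return t_flag == jump_if_set


print(lower_condition("ge"))   # emitted as the inverse of "lt"
```

A branch on "greater or equal" is thus emitted as a "less than" comparison followed by a branch
that is taken when T is cleared.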

The result of a comparison operation is always stored in the T macro register. The comparison
can be made between registers or between an immediate value and a register. Short
operations support 5-bit immediate values that can be signed or unsigned depending on the
type of comparison. Large operations support up to 20-bit immediate values. Furthermore, in
the 30-bit format, comparisons between registers can shift one operand and be conditionally
executed.

In addition to these comparison instructions, there is also a bit test operation that copies
the tested bit into the T flag. Table A.3 summarizes DEVIL's comparison instructions.

6.4.4 Move Operations

Table A.4 summarizes the set of move operations available in the DEVIL architecture.
DEVIL has a standard mov operation. It is also possible to load a 6-bit or a 20-bit (depending
on the instruction size) signed immediate value into a register using the ldi instruction. If a
register needs to be loaded with an immediate value larger than 20 bits, the most significant
part can be loaded with an ori.h operation (see subsection 6.4.2).
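The two-instruction sequence for large constants can be sketched as below. The operations are
modelled as (mnemonic, immediate) pairs rather than real DEVIL encodings, the function names are
hypothetical, and the sketch assumes that ldi sign-extends its 20-bit immediate to 32 bits.

```python
from typing import List, Tuple

MASK32 = 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Interpret the low 32 bits of `value` as a two's-complement number."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def materialize(value: int) -> List[Tuple[str, int]]:
    """Return the operations needed to load a 32-bit constant into a register."""
    signed = to_signed32(value)
    if -(1 << 19) <= signed < (1 << 19):
        return [("ldi", signed)]                # fits the 20-bit signed immediate
    bits = value & MASK32
    # The low half-word is a small positive number, so ldi leaves the upper
    # half cleared; ori.h then fills in the upper half-word.
    return [("ldi", bits & 0xFFFF), ("ori.h", bits >> 16)]


def execute(ops: List[Tuple[str, int]]) -> int:
    """Tiny behavioural model of the two operations, used to check the result."""
    reg = 0
    for op, imm in ops:
        if op == "ldi":
            reg = imm & MASK32                  # sign-extended immediate
        elif op == "ori.h":
            reg |= (imm & 0xFFFF) << 16
        else:
            raise ValueError("unknown operation: " + op)
    return reg


print(materialize(0xDEADBEEF))
print(hex(execute(materialize(0xDEADBEEF))))
```

Constants that fit in 20 signed bits cost a single large instruction; any other 32-bit constant
costs two.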

There is also a set of move instructions that allows data to be exchanged between the
register file and the macro register file, for example to allow the return address register
to be saved.

The conditional move operations add partial-predication support to DEVIL, which can be
used to reduce the penalty due to branches and also, in some cases, to avoid tail duplication
during superblock formation.

6.4.5 Branch Operations

Table A.5 describes the branch instructions of the DEVIL processor.


Branch targets can be specified as a displacement relative to the Program Counter or as
the contents of a register. The displacement is a 10-bit signed value for the short instruction
format or a 25-bit signed value for the large instruction format.
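The displacement ranges implied by these field widths can be checked with a short routine. The
function names are hypothetical, and the unit in which the displacement is counted (bytes,
instruction slots, or bundles) is not stated here, so the sketch works on raw field values.

```python
def fits_signed(value: int, bits: int) -> bool:
    """True if `value` is representable as a `bits`-wide two's-complement number."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def branch_format(displacement: int) -> int:
    """Return the width in bits (15 or 30) of the PC-relative branch to use."""
    if fits_signed(displacement, 10):      # -512 .. 511
        return 15
    if fits_signed(displacement, 25):      # -16777216 .. 16777215
        return 30
    raise ValueError("displacement out of range; use a register target instead")


print(branch_format(-512))   # still fits the short format
print(branch_format(512))    # needs the large format
```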

The outcome of conditional branches depends on the value of the flag T, and the branch
can be taken either when T is set or when T is cleared. This double state sensitivity is necessary
because DEVIL's compare operations only implement one half of all possible comparisons.

Furthermore, conditional branches specify whether or not instructions in the delay slot
have to be nullified depending on the outcome of the branch. Thanks to this nullify mechanism,
compile-time static branch prediction can be done at a negligible hardware cost. Subsection 6.6.4
provides more information about the use, the efficiency, and the negative effects of this static
branch prediction mechanism.

6.4.6 Data Memory Operations

DEVIL allows memory to be accessed only via load/store instructions. Table A.6 shows the

supported load/store operations, and their corresponding addressing modes.

Short operations allow only simple addressing modes: (1) the register indirect mode, which
uses the content of a register to address the data memory, and (2) the stack pointer relative
mode, which is used to address the elements that are in the stack frame. In the latter mode,
the memory location is computed by adding a 5-bit displacement to the stack pointer. This mode
is often used for spill/fill code.

Large operations extend the available addressing modes. The register indirect mode becomes
a register indirect plus register offset mode, and the stack pointer relative mode becomes
a register plus displacement mode, where the displacement field is 16 bits wide. Furthermore, a
new mode is introduced that allows a data memory location to be addressed directly with a label.

6.5 The DEVIL Instruction Fetch Mechanism

DEVIL's instruction fetch mechanism is designed to deliver the high level of performance of

a 2-issue processor while keeping code size and memory bandwidth to a minimum. To do so,

DEVIL supports a variable instruction length mechanism in conjunction with a NOP elimination

technique.

DEVIL fetches a 64-bit instruction bundle that is divided into five different parts:

• tag: A 4-bit instruction tag that encodes instruction width information and instruction
  scheduling.
• s0: 15 bits that can contain either a 15-bit instruction or the most significant half of a
  30-bit instruction.
• s1: 15 bits that can contain either a 15-bit instruction or the most significant half of a
  30-bit instruction.
• s2: 15 bits that can contain either a 15-bit instruction or the least significant half of the
  second 30-bit instruction in the bundle.
• s3: 15 bits that can contain either a 15-bit instruction or the least significant half of the
  first 30-bit instruction in the bundle.


This subdivision allows the instruction bundle to encode a mix of short (15-bit) and
large (30-bit) instructions that can be executed in different cycles. The 4-bit tag encodes the
size of the different instructions as well as their scheduling information. Table 6.1 summarizes the
different execution modes of a DEVIL instruction bundle. For example, when the fetch unit
decodes Tag = 1011, it sends at time 0 a short instruction composed of the 15 bits of s0 to slot
0, then sends at time 1 one large instruction made of the concatenation of s1 and s3 to slot
0 in parallel with a short instruction composed of s2 to slot 1.

Tag    Slot 0             Slot 1             Time
0000   s0 + s3 (large)    s1 + s2 (large)    0
0001   s0 (short)         nop                0
       s1 + s3 (large)    nop                1
       s2 (short)         nop                2
0010   s0 (short)         s1 + s3 (large)    0
       s2 (short)         nop                1
...
       s1 (short)         nop                1
...
       s1 (short)         nop                1
       s2 (short)         nop                2
       s3 (short)         nop                3
0101   s0 (short)         s1 (short)         0
       s2 (short)         nop                1
       s3 (short)         nop                2
0110   s0 + s3 (large)    nop                0
...
0111   s0 + s3 (large)    nop                0
       s1 (short)         s2 (short)         1
1000   s0 + s3 (large)    nop                0
       s1 (short)         nop                1
       s2 (short)         nop                2
...
       s3 (short)         nop                2
...
       s1 (short)         s2 + s3 (large)    1
...
       s1 + s3 (large)    s2 (short)         1
...
       s1 (short)         nop                1
...
       s1 (short)         nop                1
...
1110   s0 (short)         s1 (short)         0
...
1111   s0 + s3 (large)    s1 (short)         0
       s2 (short)         nop                1

Table 6.1: Execution modes of DEVIL's instruction bundles.
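The dispatch step described by Table 6.1 can be sketched as a table-driven decoder. Several
assumptions are made here and should be read as such: the bit positions of the fields inside the
64-bit word (tag in the top four bits, followed by s0, s1, s2, s3) are guessed from the field
order, the names are hypothetical, and only the tags whose entries are fully legible in Table 6.1,
plus Tag = 1011 as described in the text, are included.

```python
from typing import Dict, List, Optional, Tuple

Parts = Optional[Tuple[str, ...]]          # fields forming one operation, None = nop
Op = Optional[Tuple[int, int]]             # (width in bits, encoded value), None = nop

# One entry per issue cycle: (fields sent to slot 0, fields sent to slot 1).
SCHEDULES: Dict[int, List[Tuple[Parts, Parts]]] = {
    0b0000: [(("s0", "s3"), ("s1", "s2"))],
    0b0001: [(("s0",), None), (("s1", "s3"), None), (("s2",), None)],
    0b0010: [(("s0",), ("s1", "s3")), (("s2",), None)],
    0b0101: [(("s0",), ("s1",)), (("s2",), None), (("s3",), None)],
    0b0111: [(("s0", "s3"), None), (("s1",), ("s2",))],
    0b1000: [(("s0", "s3"), None), (("s1",), None), (("s2",), None)],
    0b1011: [(("s0",), None), (("s1", "s3"), ("s2",))],
    0b1111: [(("s0", "s3"), ("s1",)), (("s2",), None)],
}


def split_bundle(bundle: int) -> Dict[str, int]:
    """Split a 64-bit bundle into its tag and four 15-bit fields (assumed layout)."""
    fields = {"tag": (bundle >> 60) & 0xF}
    for i, name in enumerate(("s0", "s1", "s2", "s3")):
        fields[name] = (bundle >> (45 - 15 * i)) & 0x7FFF
    return fields


def dispatch(bundle: int) -> List[Tuple[Op, Op]]:
    """Return, cycle by cycle, the operations sent to slot 0 and slot 1."""
    fields = split_bundle(bundle)
    schedule = SCHEDULES.get(fields["tag"])
    if schedule is None:
        raise ValueError("tag not covered by this sketch")

    def assemble(parts: Parts) -> Op:
        if parts is None:
            return None                    # implicit nop, never stored in memory
        word = 0
        for name in parts:                 # first field is the most significant half
            word = (word << 15) | fields[name]
        return 15 * len(parts), word

    return [(assemble(a), assemble(b)) for a, b in schedule]


example = (0b1011 << 60) | (1 << 45) | (2 << 30) | (3 << 15) | 4
print(dispatch(example))
```

The point of the table-driven form is that the nops appear only in the dispatcher's output and
never occupy space in the code memory.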


Figure 6.4 shows how bundles can be formed from scheduled assembly code. This example
shows an interesting case that illustrates how alignment problems can be solved thanks to the
fact that short operations represent a subset of large operations. Another interesting fact is that
the next bundle is fetched only once all the operations of the current bundle have been issued,
meaning that bundles must sometimes be filled with NOPs so as to bundle together the operations
that are scheduled at the same time. When code size is more important than performance, this
constraint can be removed at the cost of decreased performance.

[Diagram: scheduled code (instr. 1 to instr. 9 assigned to Slot 0 and Slot 1 over cycles 0 to 5)
packed into bundles with the fields tag, s0, s1, s2, s3 (tags 0010, 0000, 0000, 1111).
Annotations: "Promoted to large operation to fill bundle", "Insertion of a large nop operation
to fill bundle", "Scheduling information". Legend: L = large operation, S = short operation.]

Figure 6.4: Instruction bundle formation in the DEVIL processor.

6.6 DEVIL's Pipeline

DEVIL's architecture is based on a simple 3-stage pipeline. This choice was made to avoid the

extra logic and buses that are required to bypass operands in deeper pipelines. Indeed, in a

VLIW architecture several units can provide a result at each clock cycle and consequently the

bypass logic should be duplicated. In the case of the DEVIL architecture, three results can be

written at the same time, requiring three bypass subsystems. Using a 3-stage pipeline avoids

this circuit overhead.

Furthermore, deepening the pipeline in a superscalar datapath increases the number of
penalty cycles for mispredicted branches, implying the need for a more efficient (and thus more
complex) branch prediction mechanism. DEVIL has a simple branch prediction mechanism
(see 6.6.4) that sometimes requires code duplication. Increasing the delay slot size would result
in greater code expansion, probably forcing the addition of a Branch Target Buffer (BTB).

The remainder of this section describes the pipelined execution of the different types of
instructions.

6.6.1 Pipelined Execution for ALU Operations

The execution of an ALU operation is decomposed into three stages: Fetch, Decode, ALU-WB.
The instruction fetch occurs at cycle T1. Cycle T2 is used to decode the instruction and to read
the operands in the register file. In the last cycle, T3, the ALU operation is executed and the
result is written into the register file at the end of the second half of T3.

[Timing diagram, cycles T1 to T5: Instr.1 goes through Fetch (T1), Decode (T2), ALU-WB (T3);
Instr.2 through Fetch (T2), Decode (T3), ALU-WB (T4); Instr.3 through Fetch (T3), Decode (T4),
ALU-WB (T5).]

Figure 6.5: DEVIL's pipeline: ALU operations.

6.6.2 Pipelined Execution for Memory Operations

Figure 6.6 shows the 4-cycle pipeline execution of a memory operation. Memory operations
require one cycle more than ALU operations because of the address computation. During cycle
T1 the instruction is fetched. In cycle T2 the instruction is decoded and the register file is
accessed. Cycle T3 is used to compute the address of the memory access. The memory access
and the writeback (in the case of a load operation) are made in cycle T4.

[Timing diagram, cycles T1 to T4: the memory operation goes through Fetch (T1), Decode (T2),
Addr (T3), Mem-WB (T4); the following Instr.1 goes through Fetch (T2), Decode (T3), ALU-WB (T4).]

Figure 6.6: DEVIL's pipeline: memory operations.

This extra cycle of latency adds a load delay slot, meaning that instructions (e.g., Instr.1)
that immediately follow a load operation cannot have the destination register of the load as a
source operand. When it is not possible to move an independent instruction into the delay slot,
a NOP must be inserted. Furthermore, the writeback of a load operation is made at the same time
as that of the instructions scheduled in the next cycle, resulting in a potential resource conflict
or the need to add an extra register file write port. This latter solution has been used in the DEVIL
implementation.
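The load delay slot rule can be illustrated with a naive fix-up pass. The sketch is deliberately
simplified: it works on a single-issue instruction stream, only inserts NOPs (a real scheduler
would first try to move an independent instruction into the slot), and the Instr type and
function name are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str                      # mnemonic, e.g. "ld", "add"
    dest: Optional[str]          # destination register, if any
    srcs: Tuple[str, ...]        # source registers


NOP = Instr("nop", None, ())


def fill_load_delay_slots(code: List[Instr]) -> List[Instr]:
    """Insert a NOP after each load whose result is used by the next instruction."""
    out: List[Instr] = []
    for instr in code:
        prev = out[-1] if out else None
        if prev is not None and prev.op == "ld" and prev.dest in instr.srcs:
            out.append(NOP)      # the loaded value is not yet available
        out.append(instr)
    return out


program = [
    Instr("ld", "r1", ("r2",)),
    Instr("add", "r3", ("r1", "r4")),   # uses r1 right after the load
]
for instr in fill_load_delay_slots(program):
    print(instr.op)
```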


6.6.3 Pipelined Execution for Branch Operations

Branch operations have a three-cycle execution time. In the first stage the branch instruction
is fetched. Then, during decoding, the next PC is computed, allowing the correct instruction
fetch to be executed in the following cycle. Therefore, there is a one-cycle branch delay slot,
implying a branch misprediction penalty of one cycle. A last phase is used to save the PC in
the retaddr macro register when a jump to subroutine is executed.

[Timing diagram, cycles T1 to T6: test.cc goes through Fetch (T1), Decode (T2), ALU-WB (T3);
the branch (jt/jnt with _nn/_nt) through Fetch (T2), Dec-PC (T3), Save PC (T4); Instr.3, in the
delay slot, through Fetch (T3), Decode (T4), ALU-WB (T5); Instr.4 through Fetch (T4), Decode (T5),
ALU-WB (T6).]

Figure 6.7: DEVIL's pipeline: conditional branch operations.

Figure 6.7 shows the pipelined execution of a conditional branch instruction. At cycle T1
a comparison instruction is fetched; the result of the comparison is computed during
cycle T3. The conditional branch instruction is fetched at cycle T2, and during cycle T3 the
branch operation is decoded and the new PC is computed according to the comparison's result
(computed in parallel). Furthermore, during this phase of execution the processor decides
whether or not it should nullify the instruction fetched during T3. This decision is made
according to the comparison's result and the branch prediction information (_nt = nullify
taken, _nn = nullify not taken). Finally, during cycle T4 the PC can be saved into the retaddr
macro register if needed.
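The nullification decision reduces to a small truth table. In the sketch below, the mnemonics jt
and jnt and the suffixes _nt and _nn are those appearing in Figure 6.7; that jt jumps when T is
set and jnt when T is cleared is inferred from the names, and the function name is hypothetical.

```python
from typing import Tuple


def resolve_branch(mnemonic: str, t_flag: bool) -> Tuple[bool, bool]:
    """Return (branch taken?, delay slot nullified?) for a mnemonic such as "jt_nn"."""
    condition, _, hint = mnemonic.partition("_")

    if condition == "jt":        # jump when T is set (assumed meaning)
        taken = t_flag
    elif condition == "jnt":     # jump when T is cleared (assumed meaning)
        taken = not t_flag
    else:
        raise ValueError("unknown branch: " + mnemonic)

    if hint == "nt":             # nullify the delay slot when the branch is taken
        nullify = taken
    elif hint == "nn":           # nullify the delay slot when it is not taken
        nullify = not taken
    else:
        raise ValueError("unknown nullify hint: " + mnemonic)

    return taken, nullify


for t in (True, False):
    print("jt_nn", t, resolve_branch("jt_nn", t))
```

With _nn the delay slot does useful work only on the taken path, which corresponds to a branch
predicted taken; with _nt it does useful work only on the fall-through path.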

6.6.4 DEVIL's Branch Prediction Mechanism

DEVIL offers a simple mechanism for static branch prediction, allowing the conditional
execution of instructions in the branch delay slot. The conditional branch instruction format
specifies whether the operations in the delay slot should be nullified when the branch is taken
or when the branch is not taken. Figure 6.8 shows how the compiler can do branch prediction
using this mechanism and profiling information.

Figure 6.9 shows the benefits of DEVIL's compile-time branch predictor in terms of
performance. Although this branch prediction technique has a negligible hardware overhead, its
major drawback is that, when the branch is predicted taken, the compiler must duplicate code
into the delay slot, resulting in code expansion.

[Assembly listings: the same loop (instructions ld r1, #base_array; ldi r0, #0; shl r2, r0, #2;
st.32 r1, r2, r0; add r0, r0, #1; test.lt r0, #128; rts) shown in two versions: one closed by
jt_nn beginning, with the delay slot annotated "Nullified when not taken", and one closed by
jt_nt beginning, with the delay slot annotated "Nullified when taken".]

Figure 6.8: DEVIL's branch prediction mechanism.

Figure 6.9: DEVIL's compile-time branch prediction benefits.

This code expansion is due to the delay slot introduced by the branch operations. This delay
slot can be avoided if a Branch Target Buffer (BTB) is added to the branch unit. The BTB
is an associative cache that stores the addresses of the branches and their predicted outcome.
The first time a branch operation is fetched, the branch address will not match any entry of
the BTB, and the next sequential instructions are fetched. Once the destination address is
known, the compile-time predicted address is stored in the BTB with the corresponding branch
address. The next time the branch is fetched, the BTB will match the branch address with one
of its entries and will return the predicted next address, avoiding the delay slot.
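The BTB behaviour described above can be modelled with a dictionary. This is a behavioural
sketch only: a real BTB has a fixed number of entries and a replacement policy, neither of which
is specified here, and the class and function names are hypothetical.

```python
from typing import Dict, Optional


class BranchTargetBuffer:
    """Unbounded, fully associative model: branch address -> predicted next address."""

    def __init__(self) -> None:
        self._entries: Dict[int, int] = {}

    def lookup(self, branch_addr: int) -> Optional[int]:
        """Return the predicted next address, or None on a miss."""
        return self._entries.get(branch_addr)

    def record(self, branch_addr: int, predicted_next: int) -> None:
        """Store the compile-time predicted address once the branch is decoded."""
        self._entries[branch_addr] = predicted_next


def next_fetch_address(btb: BranchTargetBuffer, pc: int, step: int = 1) -> int:
    """Address to fetch after `pc`: the BTB prediction on a hit, else sequential."""
    predicted = btb.lookup(pc)
    return pc + step if predicted is None else predicted


btb = BranchTargetBuffer()
print(next_fetch_address(btb, 100))   # first encounter: miss, fetch sequentially
btb.record(100, 40)                   # branch at 100 is predicted to go to 40
print(next_fetch_address(btb, 100))   # later encounters: hit, fetch the prediction
```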

6.7 Evaluation of the DEVIL Architecture

This section contains an evaluation of the DEVIL processor in terms of both performance and

code size utilization.


6.7.1 Experimental Setup

The IMPACT framework [7] was used to obtain accurate estimates of the processor's performance,
code size and memory utilization. IMPACT is a compiler framework developed at
the University of Illinois at Urbana-Champaign to study the new generation of ILP compilers.
Figure 6.10 shows the block diagram of the IMPACT framework. There are five main parts:
(1) the front-end, (2) the machine-independent optimizer, (3) the back-end, (4) the machine
description, and (5) the emulator/simulator.

[Block diagram: C program; front-end (Pcode with memory disambiguation, profiling, and inlining;
Hcode; HtoL); Lcode; machine-independent optimizer (standard optimizations, superscalar
optimizations, superblock formation, hyperblock formation, profiling); back-end (Phase 1: map
Lcode to machine instructions; Phase 2: machine-dependent optimizations, register allocation,
instruction scheduling; Phase 3: assembly code generation); machine description; assembly code;
emulator and simulator; statistics.]

Figure 6.10: The IMPACT compiler framework.

The front-end translates a program written in C into an intermediate representation called
Lcode. The C program is first converted into Pcode, another intermediate format, which is used
to perform a first profiling pass, memory disambiguation, array analysis, and code inlining. Once
these steps are done, the Pcode is converted to Hcode and finally to Lcode via the HtoL converter.

Lcode is the internal representation used in the machine-independent optimizer and it
looks like an extended RISC-like instruction set. The machine-independent optimizer includes
all the standard compiler optimizations plus a large set of superscalar optimizations such as
superblock formation, hyperblock formation (i.e., predication), loop unrolling, etc. Lcode can
also be profiled using the Lcode emulator (Lemulate). At this level, profiling plays an important
role because the majority of the optimizations use profiling information.

Once the Lcode is optimized, the back-end converts Lcode into a machine-dependent
assembly language using a three-phase algorithm. The first phase maps Lcode onto assembly
instructions that are compatible with the targeted machine. The second phase consists of the
register allocator, the scheduler, and the machine-dependent code optimizer. Finally, the third
phase generates the assembly file.

The machine description describes the targeted architecture in terms of instruction operands,
instruction latencies, resource utilization, and pipeline execution. Such information is required
in particular for code scheduling.

Benchmark      Description                                                                  Benchmark Suite
008.espresso   Generates and optimizes Programmable Logic Arrays                            SpecINT92
023.eqntott    Translates a logical representation of a Boolean equation to a truth table
052.alvinn     Trains a neural network using back propagation
129.compress   Compresses and decompresses file in memory                                   SpecINT95
130.li         LISP interpreter
132.ijpeg      Graphic compression and decompression
decode         CCITT G.711, G.721 and G.723 voice compressions decoder                      Mediabench
encode         CCITT G.711, G.721 and G.723 voice compressions encoder
gsmencode      GSM 06.10 provisional standard for full-rate speech transcoding
mpeg2dec       Video MPEG-2 decoder
mpeg2enc       Video MPEG-2 encoder
rawcaudio      ADPCM speech compression algorithm
rawdaudio      ADPCM speech decompression algorithm
dhrystone      Dhrystone v2.1                                                               Dhrystone
fib            Compute Fibonacci numbers
fir            FIR filter
wc             Word Count UNIX utility

Table 6.2: Benchmark list

The emulator and the simulator are used for code profiling and to extract statistics such
as performance and memory utilization. The emulator can probe the code for profiling or to
generate an execution trace that can be sent to the trace-driven simulator.

All these IMPACT modules are widely parameterizable via the use of parameter files,
allowing the different compiler functions to be enabled or disabled as well as the different
parameters to be fine-tuned.

The IMPACT framework was enhanced to generate code for the DEVIL instruction set,
so as to evaluate the impact of the different architectural choices (the parts that were modified
are shaded in Figure 6.10). At the front-end level, the HtoL converter has been modified to
generate library function calls for unsupported operations such as floating-point operations
or integer division. Also, a new back-end has been built to generate optimized code for DEVIL,
including several machine descriptions. Furthermore, the Lcode emulator has been modified to
emulate DEVIL's code.

All the results presented in this chapter are derived from information obtained through

dynamic emulation of the code. Note also that the conditional execution modes and free-shifting

operand possibilities were not used, meaning that the compiler can potentially generate better

code in terms of both performance and memory utilization.

6.7.2 Benchmarks

All the results presented in this chapter were obtained on a selection of programs extracted
mainly from the SpecINT92, SpecINT95, and Mediabench [35] benchmark suites. These
different types of benchmarks were chosen as representative of a wide range of applications.
Table 6.2 briefly describes this selection of benchmarks. The Mediabench benchmarks represent
multimedia programs that can be found in embedded systems (e.g., GSM encoder), while the
SpecINT suites represent non-numerical applications. Some other smaller applications were
added, such as a FIR filter, which is used in several embedded programs. This variety of applications
shows the sensitivity of the results to the type of program.


6.7.3 DEVIL's Performance

Figure 6.11 shows the performance of the DEVIL processor without superscalar optimization
(DEVIL O) and with superscalar optimization (DEVIL S), as well as the performance of a 4-issue
processor that executes DEVIL's instruction set and that has one branch unit, one
load/store unit, and two ALUs. Superscalar optimizations were used to generate the code for
the 4-issue processor. All results are relative to the best performance that can be achieved with
a single-issue processor that executes the DEVIL instruction set.

Figure 6.11: DEVIL performance with and without superscalar optimizations compared to

1-issue and 4-issue architectures.

Effect of superscalar optimizations: Figure 6.11 shows the importance of superscalar
optimization to extract parallelism from programs. On average, DEVIL with superscalar
optimization is 30% faster than without superscalar optimization, meaning that superscalar
optimizations must be used to achieve a significant speed-up. However, as will be shown later,
such optimizations have a negative effect on code size.

DEVIL's speed-up: The multiple-issue pipeline introduced in the DEVIL architecture
increases the performance by 29% to 78% (50% on average) with respect to a scalar machine
using the same instruction set. This speed-up will allow a significant voltage and clock frequency
reduction. These benefits are investigated in Chapter 7.

Effect of the limitation of the number of issued operations per cycle: The
DEVIL instruction fetch mechanism limits the number of issued instructions per cycle to two,
even though DEVIL contains four units. This choice was mainly motivated by code compaction
issues. Figure 6.11 shows that reducing the number of issued instructions from four to two
reduces the performance by 5% on average.

It is interesting to note that there is no significant change in the results between the
different benchmark suites, meaning that the results are not sensitive to the type of application.

The performance comparison graph (Figure 6.11) illustrates the need to use superscalar

optimizations in order to extract a good level of performance.


Figure 6.12: Effect of superscalar optimizations on code size.

Figure 6.13: Effect of superscalar optimizations on the number of accesses to the code memory.


Figures 6.12 and 6.13 depict how the memory utilization is affected by this kind of
optimization. The increase in code size due to superscalar optimization is 58% on average,
the worst case being 052.alvinn, which exhibits an increase in code size by a factor of 3.4. On
average, superscalar optimizations do not significantly modify the number of accesses to the
code memory.

6.7.4 DEVIL's Memory Utilization

The previous subsection showed that the use of a VLIW machine implies a penalty in terms of
memory utilization. In order to reduce this negative effect, DEVIL's architecture offers an
instruction fetch mechanism that includes NOP elimination and variable instruction length
support. This subsection quantifies the effect of such techniques on memory utilization. Note
that an important feature of these mechanisms is that they do not affect performance.

Figures 6.14 and 6.15 show the benefits, in terms of code size and number of memory accesses,
of the NOP elimination technique. To obtain these measures, the variable instruction length
mechanism was disabled so that only the large operations of DEVIL's instruction set were
available. The targeted architecture is a 2-issue machine that fetches a 64-bit instruction word
including two DEVIL large operations and a tag that encodes the scheduling information
necessary for the NOP elimination. The figures are relative to the same 2-issue machine without
NOP elimination support.

Figure 6.14 shows that the NOP elimination mechanism reduces the code size by 27% on

average. Figure 6.15 shows that the number of accesses to the code memory is decreased by 20%

on average. These results show the importance of eliminating unnecessary NOP instructions.

Figure 6.14: Effect of NOP elimination on code size.

The same kind of experiment was run to quantify the efficiency of the variable instruction length mechanism. To obtain these results, a 2-issue machine that includes the NOP elimination technique but can only execute DEVIL's large operations is compared to the DEVIL processor, which includes both the NOP elimination and the variable instruction length mechanisms.

Figure 6.15: Effect of NOP elimination on the number of accesses to the code memory.

Figures 6.16 and 6.17 summarize the results that were obtained and show that the variable

instruction length mechanism allows a saving of 20% to 30% (26% on average) of the code size.

Furthermore, the number of accesses to the code memory is reduced by 20%.

To summarize the efficiency of the DEVIL instruction fetch mechanism, Figures 6.18 and 6.19

show the memory utilization of the DEVIL processor as compared to a 2-issue machine with no

NOP elimination and with the ability to execute only DEVIL's large operations. All the results

presented here are relative to the code size of a scalar architecture that executes only DEVIL's

large operations.

These results show that DEVIL has a code size on average 22% smaller than a scalar

processor that executes DEVIL's large operations. Compared to the standard 2-issue VLIW,

the DEVIL instruction fetch mechanism saves 47% of the code size on average.

Figure 6.19 shows the number of accesses to the code memory. As the number of accesses to the code memory is independent of the bus width, these numbers should be weighted, knowing that the bus of the scalar processor is 32 bits wide and the buses of the VLIW architectures are 64 bits wide. The comparison between the two VLIW architectures shows that DEVIL's fetch mechanism reduces the number of accesses by 36% on average. Compared to the scalar processor, DEVIL reduces the number of accesses to the code memory by 50% to 75%, but DEVIL's instruction word is twice as wide as the scalar processor's. Therefore, DEVIL achieves an average reduction of 16% in terms of the number of accessed bytes.
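The averages quoted in this subsection can be cross-checked with a few lines of arithmetic. The following Python sketch is only an illustration: it assumes that the two mechanisms act independently, so that their average savings compose multiplicatively, which the measurements do not guarantee, and the function names are hypothetical.

```python
def combined_reduction(*reductions):
    """Compose independent fractional reductions (0.27 means 27% smaller)."""
    remaining = 1.0
    for r in reductions:
        remaining *= 1.0 - r
    return 1.0 - remaining


def relative_bytes(relative_accesses, bus_bits, ref_bus_bits=32):
    """Weight a relative access count by the ratio of the bus widths."""
    return relative_accesses * bus_bits / ref_bus_bits


# Code size: NOP elimination (27%) followed by variable-length encoding (26%).
code_size_saving = combined_reduction(0.27, 0.26)

# Code memory accesses: 20% from each mechanism.
access_saving = combined_reduction(0.20, 0.20)

# Bytes fetched by DEVIL (64-bit bus) relative to the scalar machine (32-bit bus)
# at the two ends of the reported 50% to 75% range of access reduction.
bytes_at_50 = relative_bytes(0.50, bus_bits=64)
bytes_at_75 = relative_bytes(0.25, bus_bits=64)

print(f"combined code size saving: {code_size_saving:.0%}")
print(f"combined access saving:    {access_saving:.0%}")
print(f"relative bytes fetched:    {bytes_at_50:.2f} to {bytes_at_75:.2f}")
```

Under this independence assumption, the composed code size saving comes out near 46%, close to the measured 47%, and the composed access saving agrees with the measured 36%. The byte weighting also shows why a 50% reduction in accesses brings no gain in fetched bytes on a bus that is twice as wide, whereas a 75% reduction halves them.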


Figure 6.16: Effect of the variable instruction length mechanism on code size.

Figure 6.17: Effect of the variable instruction length mechanism on the number of accesses to the code memory.

Figure 6.18: Effect of the DEVIL instruction fetch mechanism on the code size.

Figure 6.19: Effect of the DEVIL instruction fetch mechanism on the number of accesses to the code memory.


6.8 Comparison With Existing Mobile Processors

In this section we compare DEVIL and existing mobile processors in terms of instruction set and code size. Such a comparison is of course difficult, as DEVIL is currently only a prototype and can be optimized much further. Furthermore, the tools (i.e., the compiler) used for generating code are not the same for each processor, meaning that the quality of the generated code depends not only on the architectural features but also on the quality of the tools. Therefore, the goal of this section is not to determine once and for all whether DEVIL is better than other processors, since the comparison is not fair at this stage. However, this analysis provides insight into how DEVIL's features can be situated with respect to current processors. This comparison also highlights the original points of the DEVIL architecture.

6.8.1 Instruction Set Comparison

Compared to the state of the art in mobile processors (see Chapter 4), DEVIL offers several new features. First of all, DEVIL bundles explicitly encoded parallel operations, while current mobile processors offer only a sequential instruction representation. Even in the SH-4 architecture, instructions are still sequential and the parallelization is performed by a hardware scheduler, implying a large hardware overhead. Second, DEVIL has a variable operation length encoding that supports 15-bit and 30-bit instruction lengths. This technique is similar to the ARM Thumb or TinyRisc operation encoding, but there are several differences, notably that performance is not decreased compared to a processor with a fixed-length instruction set. ARM Thumb and TinyRisc reduce the code size at the cost of degraded performance. The difference is mainly due to the fact that DEVIL allows short and large operations to be mixed with no restrictions, while with Thumb or TinyRisc the processor has to choose between executing large or short instructions, and a special branch operation controls the mode of execution. Also, DEVIL's short instructions can access all of the 16 registers, while in the short execution mode Thumb and TinyRisc allow only a subset of the registers to be accessed.

Existing and future VLIW architectures also offer instruction fetch mechanisms that compact the VLIW instruction word. The HP/Intel IA-64 includes a bundle formation mechanism that explicitly encodes parallelism into the instruction bundle, eliminating the NOP insertion required in the original VLIW architectures. The TMS320C6201 has a similar mechanism. Both solutions are implemented for a fixed instruction length. DEVIL extends this concept by introducing a variable instruction length encoding within a bundle. The future StarCore DSP will use an approach similar to DEVIL's, with 16-bit instructions in conjunction with instruction prefixing (which extends the instruction length), and with parallelism encoded within an instruction packet. However, no precise information is available to date on how this fetch mechanism works. This industrial development nevertheless supports our design decision.

6.8.2 Code Size Comparison

Figure 6.20 shows a comparison of the relative average code density of several processors for

the benchmarks described in section 6.7.2. The results presented for the market processors

were generated using the GNU gcc compiler with level 3 optimization (-O3). The results for

DEVIL are generated with the IMPACT compiler. The results correspond to the average code

expansion of the processors compared with the code generated for the DEVIL architecture

without superscalar optimization (i.e. devil (O)). The devil (S) measurement shows the code

size when superscalar optimizations were used.


The DEVIL variable length instruction set allows the compiler to generate quite compact code. Indeed, when the compiler does not apply superscalar optimizations, DEVIL's code is around 18% smaller than that of the ARM7, 30% smaller than that of the i386, and 10% smaller than that of the SH. However, DEVIL's code is about 20% to 25% larger than the code of Thumb and M-core. The larger code can be explained by the bundle filling that is required to group instructions into a single bundle. These results show that the DEVIL instruction set is well designed in terms of code density. It should be noted that the IMPACT compiler was not optimized for minimum code size and that, for the moment, the conditional operations and free-shifting operands are not used. Therefore, future compiler development may further improve these results. Another important point is that DEVIL lacks the move multiple¹ operation, which can potentially save code size when applied to spill and fill code insertion.

Figure 6.20: Code size comparison between DEVIL and some other mobile processors.

When the compiler applies superscalar optimizations, there is a 58% increase in code size.

This result illustrates the cost of using a VLIW-like architecture. Note that this increase in code size due to superscalar optimizations can be reduced, at the cost of decreased performance.

6.9 Conclusion

This chapter defined a new VLIW architecture called DEVIL, targeted at the mobile processor market. The architectural decisions were motivated and evaluated with an enhanced IMPACT compiler. DEVIL offers an instruction fetch mechanism that explicitly encodes the parallelism within an instruction bundle and supports variable instruction lengths. It was shown that this mechanism saves close to 50% of the code size (47% on average) with respect to a standard VLIW processor, with no impact on performance. A significant reduction of the number of accesses to the code memory was also observed.

In terms of performance, DEVIL speeds up the execution by a factor of 1.5 on average as

compared to a scalar processor. This performance enhancement allows lower frequencies and

power supply voltages to be used, reducing the circuit's power consumption.

¹ This operation allows several registers to be loaded from or stored to the stack frame in a single instruction.


A comparison was made between DEVIL and current mobile processors, in order to roughly determine where the DEVIL features can be situated. DEVIL offers an instruction set that allows a good code density while offering a parallel operation representation. However, when superscalar optimizations are used, there is a large code size penalty. The effects of code size expansion are minimized thanks to the compaction technique offered by the DEVIL architecture.

The next chapter describes the VLSI implementation of the DEVIL processor, allowing a

good estimation of its features in terms of complexity, circuit speed, and power consumption.

Chapter 7

Implementation of

the DEVIL Processor

The DEVIL processor was defined in the previous chapter and evaluated at the architectural level. It was shown that the introduction of parallelism speeds up execution by a factor of 1.5 on average compared to the scalar architecture. This speed-up can be used to compensate for the loss of performance due to a low-power execution mode (i.e., low clock frequency and low power supply voltage). However, the hardware cost due to the introduction of a multiple-issue pipeline has not been evaluated. This potential increase in complexity could nullify the benefits of parallelism.

This chapter describes the implementation of the DEVIL processor in order to estimate

design features such as complexity, circuit speed, and circuit power consumption. The DEVIL

processor was implemented using a hardware description language and synthesized with a low-power technology. The following sections give details on the design methodology, the DEVIL implementation, and the DEVIL features.

7.1 Technology and Synthesis Methodology

The DEVIL processor was implemented using the VHDL hardware description language and was synthesized using the Synopsys 1998.08 tool. The synthesis targeted the CSL 4.1 low-power library developed by XEMICS¹, which is characterized for circuit delay and power consumption estimation at 1.6 volts slow-slow (worst case) and is mapped onto a TSMC 0.25 µm technology.

Synthesis-based methodologies have several advantages: reduced design time, fewer resource requirements, quick migration to different technologies, and the possibility of marketing the design as intellectual property (IP), which is the current trend in the market. However, J. Scott et al. [57] showed that when moving from a custom to a synthesized adder, transistor count increased by 60%, area increased by 175%, and power consumption increased by 40%. To counter these effects, the VHDL description of DEVIL was written at a low level, close to a structural description. Nevertheless, it should be noted that a full custom design can be optimized much more thoroughly.

¹ http://www.xemics.ch


7.2 Design Methodology

DEVIL targets the low-power mobile processor market and aims to be used as an ASIC core. This implies a simple and fast synthesis methodology, and the possibility of operating at different power supply voltages so that power consumption requirements can be met by running at a low voltage.

One of the most sensitive elements in a microprocessor design is generally the system-wide clock tree, which must meet strong timing constraints in order to avoid clock skew problems. This phenomenon gets worse with deep submicron technologies, generally requiring a manual optimization of the clock tree and huge clock line buffers that increase power consumption. For example, in the M-core [57], clock power represents around 36% of the total power dissipation. These design constraints are directly opposed to the goals of the DEVIL project. To address these issues, the DEVIL implementation is based on a non-overlapping dual-phase system clock used in conjunction with latches and aggressive clock gating. DEVIL does not contain any flip-flop elements.

A dual-phase, non-overlapping system clock is the most robust scheme available to avoid system-wide clock skew problems. It is always possible to find a clock frequency for which the design works correctly [77], even at different power supply voltages. In a conventional single-clock flip-flop system, a design working at 3 volts may not work at 2 volts because of clock skew problems, even if the clock frequency is scaled down. Therefore, dual-phase, non-overlapping clock systems offer a great advantage for IP, where the core must be synthesizable for different applications and power supply voltages, and they reduce the need for the huge clock line drivers that conventional designs use to meet clock timing requirements.

Gating clocks is an efficient way to save power: every unnecessary, power-consuming signal transition can be prevented. This approach is quite efficient for large buses in the chip's datapath and is particularly well suited to VLIW architectures, which include duplicated datapaths with significant idle times. Furthermore, gated latch techniques can be easily integrated in a dual-phase, non-overlapping system.

Figure 7.1 illustrates the DEVIL system clock based on dual-phase, non-overlapping clocks. Figure 7.1(a) shows how the two non-overlapping clocks CLK1 and CLK2 are built from a clock signal (Fast CLK) whose frequency is double that of the original pipeline clock (Orig. CLK). When a clock skew problem appears, it can be solved by decreasing the frequency of Orig. CLK. This increases the non-overlapping time T_NOT, i.e., the clock skew tolerance.

With this clocking scheme, each stage of the DEVIL 3-stage pipeline is divided into two substages that are separated by latches. These latches are synchronized on CLK2, while the inputs of each stage are synchronized on CLK1. The maximum critical path of each substage now lies between one quarter of the Orig. CLK cycle (T_MIN) and three quarters of the Orig. CLK cycle (T_MAX), depending on whether the substage can borrow time from its neighbors (see Figure 7.1(b)). This can result in a better balance of the pipeline timing.
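As an illustration of the clocking scheme, the following Python sketch derives two non-overlapping phases from a fast clock running at twice the pipeline clock frequency. The gating used here (CLK1 = Fast CLK AND Orig. CLK, CLK2 = Fast CLK AND NOT Orig. CLK) is an assumption chosen to be consistent with the quarter-cycle and three-quarter-cycle bounds given above; it is not taken from the DEVIL netlist, and the function name is hypothetical.

```python
def two_phase_clocks(n_cycles):
    """Return (orig, fast, clk1, clk2), sampled once per quarter of an Orig. CLK cycle."""
    orig, fast, clk1, clk2 = [], [], [], []
    for q in range(4 * n_cycles):
        o = 1 if q % 4 < 2 else 0   # Orig. CLK: high during the first half cycle
        f = 1 if q % 2 == 0 else 0  # Fast CLK: twice the frequency of Orig. CLK
        orig.append(o)
        fast.append(f)
        clk1.append(f & o)          # phase 1 pulse, first quarter of the cycle
        clk2.append(f & (1 - o))    # phase 2 pulse, third quarter of the cycle
    return orig, fast, clk1, clk2


orig, fast, clk1, clk2 = two_phase_clocks(2)
for name, wave in (("Orig CLK", orig), ("Fast CLK", fast),
                   ("CLK1", clk1), ("CLK2", clk2)):
    print(f"{name:8s}", "".join("#" if v else "_" for v in wave))
```

In this idealized model the two phases are never high together and are separated by one quarter of the Orig. CLK cycle. Since that gap is a fixed fraction of the cycle, lowering the clock frequency lengthens it in absolute time, which is the skew tolerance argument made above.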

Figure 7.1(b) shows the implementation of a pipeline stage. The inputs of the first pipeline substage (the Latch1 output) are guaranteed to have settled by the time CLK1 goes low. The outputs of that block must have settled by the time CLK2 goes low for the proper values to be stored in Latch2. When Latch2 is open (note that when Latch2 is open, Latch1 is always closed), the second substage begins its computation. Figure 7.1(b) also shows the clock gating implementation, where a given pipeline stage can be controlled by the previous stage. In the DEVIL pipeline, for example, a dirty bit is used to indicate whether the instructions in the pipeline are valid. When the instruction is not valid (pipeline bubble), this dirty bit directly gates the clocks of the next pipeline stage.
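The effect of gating a latch clock with a validity bit can be sketched with a small behavioral model. The Python code below is a simplified illustration, not the DEVIL implementation: the class `GatedLatch`, the helper `run`, and the sample data stream are hypothetical, and output toggles are used as a rough proxy for switching activity.

```python
class GatedLatch:
    """Level-sensitive latch whose clock is ANDed with an enable (gating) signal."""

    def __init__(self, value=0):
        self.q = value
        self.transitions = 0  # number of times the stored value changed

    def update(self, clk, enable, d):
        # The latch is transparent only while the gated clock is high.
        if clk and enable and d != self.q:
            self.q = d
            self.transitions += 1
        return self.q


def run(stream, gated):
    """Feed (valid, data) pairs to a latch and return how often its output toggled."""
    latch = GatedLatch()
    for valid, data in stream:
        enable = valid if gated else 1
        latch.update(clk=1, enable=enable, d=data)
    return latch.transitions


# Every second slot is a pipeline bubble carrying meaningless data.
stream = [(1, 0xA), (0, 0x5), (1, 0xA), (0, 0x3), (1, 0xB)]
print("ungated transitions:", run(stream, gated=False))
print("gated transitions:  ", run(stream, gated=True))
```

With gating, the bubbles never reach the latch output, so the logic downstream of the latch sees fewer transitions for the same useful work.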

[Figure 7.1 (diagram): (a) waveforms of Orig CLK, Fast CLK, CLK1, and CLK2, showing the non-overlapping time, over the substages Fetch0, Fetch1, Dec0, Dec1, Exec0, Exec1, and WB; (b) a pipeline stage built from Latch1, Latch2, and Latch3, clocked by CLK1, CLK2, and CLK1 and gated by gate1, gate2, and gate3, with logic and control blocks between the latches and the path lengths Tmin and Tmax.]

Figure 7.1: A two-phase non-overlapping pipeline using latches.

Designing a latch-based architecture is quite unconventional, and the design has to be carefully conceived from the bottom up as a dual-clock latch-based design. It is not recommended to simply transform a register-based design into a latch-based design by replacing each register with two latches.

7.3 The DEVIL Latch-Based Pipeline

DEVIL's pipeline is mapped to the double clock structure by dividing each of its pipeline

stages into two functional parts. Figure 7.2 shows the execution of several instructions in the

DEVIL dual-phase pipeline.

The first group of operations, (1) and (2), illustrates the execution of two consecutive ALU operations. The writeback of the result of operation (1) is made at the same time as operation (2) requires its source operands. In some cases this requires the data to be bypassed. However, as DEVIL is based on a latch implementation, the bypass is made directly through the latch elements of the register file, avoiding any kind of bypass logic.

The second group of operations illustrates the execution of a conditional branch instruction. Instruction (3) computes a comparison and stores the result in the T flag. The conditional branch (4) is fetched right after operation (3). During the DEC0 phase of operation (4), the conditional branch is detected and the information is sent to the DEC1 stage (PC1). The DEC1 stage then sets the PC according to the branch outcome defined by the value of the T flag.

The remaining operations illustrate a pipeline stall due to a data memory access that requires a one-cycle wait state (operation (7)).

[Figure 7.2 (timing diagram): waveforms of Orig CLK, CLK1, and CLK2 above the substages (PC0/PC1, FETCH0/FETCH1, DEC0/DEC1, ALU0/ALU1, MEM0 to MEM3, WB) of operations (1) to (10), in three groups: two consecutive ALU operations (1, 2); an ALU operation (3) followed by a conditional branch (4); a memory access (7) and ALU operations (8) to (10). Annotations: Bypass, gcDISP, DISP, PCsel, KILL, Bypass of T, WB stall.]

Figure 7.2: DEVIL's pipeline implementation with non-overlapping clocks.

7.4 DEVIL Implementation

7.4.1 DEVIL's Datapath

Figure 7.3 shows the datapath of the DEVIL processor. DEVIL has a Harvard architecture (i.e., separate data and code memories). The interface with the data memory is 32 bits wide, while the code memory is 64 bits wide. The instruction fetch from memory occurs during the first phase of the Fetch stage (F0), and the 64-bit instruction bundle is stored in an instruction register (IR) on CLK2. During the second phase of the Fetch stage (F1), a state machine uses the tag information to control two instruction dispatchers that send operations to the four functional units. The instruction dispatchers work in parallel, and each can send operations to a subset of the functional units. One dispatcher sends operations to the branch unit and ALU1, while the other sends operations to ALU2 and the load/store unit. This subdivision simplifies the dispatcher implementation without restricting parallelism. Note that the dispatch is made to four separate functional unit datapaths so that the datapaths that are not used can be gated efficiently.

The decode stage (D0, D1) consists of four decoders that decode operations and read the register operands. The register read occurs in the second phase to avoid spurious power-consuming reads. The register file has four read ports. The branch decoder has the particularity of also executing the branch operations.

The execution stage (E0 to E3) is composed of four units that can generate a result. These results are forwarded to the register file via a writeback unit. DEVIL issues two instructions per cycle. In order to support this operation throughput, the register file has three write ports. The third write port is required because load operations have a latency one cycle greater than that of the other units, meaning that three instructions can complete at the same time. The destination storage elements of the register file are sensitive to CLK1.

[Figure 7.3 (block diagram): the 64-bit code memory feeds the fetch substages F0 and F1, which dispatch through Slot 0 and Slot 1 to the branch, ALU1, ALU2, and load/store decoders (D0-D1); the execute substages (E0 to E3) contain the branch, ALU1, ALU2, and load/store (address and execute) units, the latter connected to the 32-bit data memory; the writeback unit, the register file, and an interruption handling and register saving block complete the datapath.]

Figure 7.3: DEVIL datapath block diagram.

7.4.2 Fetch and Dispatch Unit

Figure 7.4 shows the block diagram of the DEVIL dispatch unit, which corresponds to the second substage of the Fetch pipeline stage (F1). DEVIL fetches 64-bit instruction bundles composed of five different parts (tag, s0, s1, s2, s3). The four-bit tag is used by a simple 4-state finite state machine (FSM) that controls four groups of multiplexers. The four states of the FSM correspond to the maximum bundle execution time of four cycles. The four groups of multiplexers are subdivided into two slots (Slot 0 and Slot 1). Each of these slots can send a 15-bit or a 30-bit operation toward a subset of two of the four functional units.
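The bundle format accounts for exactly 64 bits: a 4-bit tag plus four 15-bit fields. The Python sketch below packs and unpacks such a bundle. The bit ordering (tag in the most significant bits, followed by s0 to s3) and the function names are assumptions made for illustration, since the exact layout is not given here, and the tag encoding itself is not modeled.

```python
TAG_BITS = 4
SLOT_BITS = 15
N_SLOTS = 4
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(tag, slots):
    """Pack a 4-bit tag and four 15-bit fields into one 64-bit word.

    Assumed layout (illustrative only): tag in the top bits, then s0..s3.
    """
    assert 0 <= tag < (1 << TAG_BITS) and len(slots) == N_SLOTS
    word = tag
    for s in slots:
        assert 0 <= s <= SLOT_MASK
        word = (word << SLOT_BITS) | s
    return word


def unpack_bundle(word):
    """Inverse of pack_bundle: return (tag, [s0, s1, s2, s3])."""
    slots = []
    for _ in range(N_SLOTS):
        slots.append(word & SLOT_MASK)
        word >>= SLOT_BITS
    slots.reverse()
    return word, slots


bundle = pack_bundle(0b1010, [0x1234, 0x0001, 0x7FFF, 0x0000])
assert bundle < (1 << 64)
print(unpack_bundle(bundle))
```

A 30-bit operation would occupy two adjacent 15-bit fields (an instruction and its extension), which is why the field width is half of the large operation width.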

[Figure 7.4 (block diagram): the 64-bit bundle is split into tag (4 bits), s0, s1, s2, and s3 (15 bits each); an instruction sequencer driven by the tag steers a 15-bit instruction and a 15-bit extension to Slot 0 (Branch + ALU1) and to Slot 1 (ALU2 + Load/Store).]

Figure 7.4: Fetch and dispatch datapath.

7.4.3 Program Counter Datapath

Figure 7.5 shows the datapath that computes the program counter, i.e., the core of the branch functional unit. The first substage of the program counter datapath increments the PC (computing PC+1 and PC+displacement), while the second substage selects the new PC among, for example, the two possible outcomes of a conditional branch operation. The main particularity of the program counter datapath is the duplication of the PC, which is required to gate the transitions of the 32-bit adder that computes the PC-relative addresses.

[Figure 7.5 (block diagram, codeMemAddrUnit.vhd): the duplicated program counters programCounter1 and programCounter2 feed an incrementer (incrPC) and an adder (pcAdder, computing PC = PC + 1 + Disp) that takes the sign-extended displacement pcDispSmall[9:0] or pcDispLarge[14:0]; the results are held in pcIncrLatch and pcAddLatch. A multiplexer controlled by selPCSource[2:0] selects the new PC among these values, retAddr[31:0], retAddrI[31:0], interruptAddr[31:0], and registerAddr[31:0], and drives codeMemAddr[31:0] with 64-bit memory access alignment. The latches are clocked by clk1 or clk2 and gated by individual gc signals such as gcPCAdd, gcPCIncr, and gcSelPC.]

Figure 7.5: Program counter datapath.

7.4.4 Register File

The register file (Figure 7.6) contains sixteen 32-bit registers implemented with latches. Register r15 contains the stack pointer (SP), but it can also be used as a general-purpose register. The register file has three input ports and four output ports. The latch implementation means that the bypass logic comes for free (see operations (1) and (2) in Figure 7.2).

7.4.5 Arithmetic and Logic Unit

Figure 7.7 illustrates the datapath of the ALU. The first ALU substage integrates a barrel shifter that can shift one operand by up to 32 positions in either direction, with or without sign extension. Furthermore, a logic operation module filters the operands in order to implement the different ALU functionalities. The second substage always executes an addition of the two operands as modified in the previous substage. For example, a subtraction is performed by inverting operand B in the logic block and forcing the input carry of the adder to one ($A - B = A + \mathrm{NOT}(B) + 1$).
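The subtraction identity is the usual two's complement rule and can be checked directly. The Python sketch below models the second substage as a 32-bit adder with a carry input and the logic block as a bitwise inversion of operand B; the function names are hypothetical and the model ignores flags and timing.

```python
MASK32 = 0xFFFFFFFF


def adder(a, b, carry_in):
    """32-bit adder with carry-in; the result wraps modulo 2**32."""
    return (a + b + carry_in) & MASK32


def alu_sub(a, b):
    """Subtract by inverting B in the logic block and forcing the carry-in to 1."""
    not_b = ~b & MASK32
    return adder(a, not_b, 1)


for a, b in [(10, 3), (3, 10), (0, 0), (0x80000000, 1)]:
    assert alu_sub(a, b) == (a - b) & MASK32
print(hex(alu_sub(3, 10)))
```

Because every operation ends in the same adder, the first substage only has to prepare the operands (shift, invert, or mask them), which keeps the second substage identical for all ALU functions.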

In parallel to the ALU, a flow-through unit executes move operations at a low power cost.

7.4.6 Load/Store Unit

The Load/Store execution unit (see Figure 7.8) is implemented in two pipeline stages, corresponding to four substages. In the first substage the data memory address is computed by an adder. The remaining substages are used to perform the memory access. The last substage also includes a cast and alignment mechanism to support accesses of different sizes.
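What a cast and alignment step has to do for sub-word loads can be sketched as follows. This Python function is a generic illustration rather than DEVIL's actual mechanism: little-endian byte numbering, naturally aligned accesses, and the name `extract` are all assumptions.

```python
def extract(word, byte_offset, size, signed):
    """Extract a 1-, 2-, or 4-byte value from a 32-bit memory word.

    Assumes little-endian byte numbering and naturally aligned accesses;
    neither is specified in the text.
    """
    assert size in (1, 2, 4)
    assert byte_offset % size == 0 and byte_offset + size <= 4
    bits = 8 * size
    value = (word >> (8 * byte_offset)) & ((1 << bits) - 1)  # alignment
    if signed and value & (1 << (bits - 1)):                 # cast: sign extension
        value -= 1 << bits
    return value & 0xFFFFFFFF                                # 32-bit register value


word = 0x80FF7F01
print(hex(extract(word, 0, 1, signed=True)))
print(hex(extract(word, 2, 1, signed=True)))
print(hex(extract(word, 2, 2, signed=False)))
```

The alignment step moves the addressed bytes to the least significant positions, and the cast step fills the upper bits with zeros or with copies of the sign bit, so that the register always receives a full 32-bit value.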

[Figure 7.6 (block diagram, regFile.vhd): registers reg0 to reg15 clocked by clk1; write side with the data inputs regIn0[31:0] to regIn2[31:0], the destination selects selDest0[3:0] to selDest2[3:0], the write enables enWR0 to enWR2, and the decGCRegFile decoder; read side with the source selects selSrc0[3:0] to selSrc3[3:0], the decSourceRegFile decoder, the muxOutRegFile multiplexers, and the outputs regSrc0[31:0] to regSrc3[31:0].]

Figure 7.6: Register file.

[Figure 7.7 (block diagram, funcUnit.vhd with control_funcUnit.vhd and funcUnitDatapath.vhd): substage ALU0 latches aluOpA[31:0] and aluOpB[31:0] on clk1 and contains the barrel shifter (shl/shr/ashr) and the logic unit (and/or/xor/not); substage ALU1 latches the modified operands on clk2 and contains the adder (aluadd) and the flag generation (aluFlag), producing aluOut[31:0] and aluControlOut[5:0]. The control inputs are aluOperation[3:0], ShifterControl[6:0], and aluControlIn[5:0]. Each latch has its own gating signal, such as gcAlu0ALatch and gcAlu1BLatch.]

Figure 7.7: ALU datapath.

[Figure 7.8 (block diagram, datamemunit.vhd with control_datamemunit.vhd and datamemunitdatapath.vhd): substages MEM0 to MEM3, alternately latched on clk1 and clk2 and gated by the stall and n_stall signals; MEM0 adds dmBaseAddr and dmDisplAddr to form the address; the data memory bus signals are dmbAddress, dmbDataOut, dmbDataIn, dmbRndWr, dmbSelect, and dmbWait; the outputs are dmDataOut, dmControlOut, and dmStallOut.]

Figure 7.8: Data Memory Unit

7.5 DEVIL Features

This section details the features of the current implementation of the DEVIL processor. The reported numbers are estimates computed using the Synopsys 1998.08 tool.

7.5.1 DEVIL's Circuit Speed

The current implementation of the DEVIL processor runs at an estimated 50 MHz at 1.6 volts, giving DEVIL an estimated performance level of 75 Dhrystone v2.1 MIPS. The critical path is in the ALU datapath.

This is a low circuit speed for a 0.25 µm technology, which can be explained by several reasons. First, as stated earlier in this chapter, the synthesized VHDL implementation implies a loss in circuit speed, meaning that a full custom design would reach higher circuit speeds. Furthermore, the first implementation of DEVIL is based on a 3-stage pipeline, which would have to be aggressively optimized to reach high clock frequencies. These optimizations are time-consuming and have strong resource requirements that are beyond the scope of this work.


7.5.2 DEVIL's Circuit Complexity

The DEVIL processor contains approximately 125,000 transistors. The cost of duplicating hardware is acceptable, since DEVIL's transistor count is at the lower bound of the current mobile processor transistor budget. For example, the StrongARM has 2.1 million transistors, including the cache [71].

Module                         Transistor count breakdown
Fetch unit                      4%
Decoder ALU1                    8%
Decoder ALU2                    8%
Decoder Load/Store              6%
Branch unit (incl. decoder)     8%
ALU1                            9%
ALU2                            9%
Load/Store unit                 8%
Register File                  38%
Writeback unit                  2%

Table 7.1: Transistor count breakdown.

Table 7.1 shows the breakdown of the transistor count. The support for the variable length instruction bundle represents only approximately 4% of the total circuit complexity. This increase in complexity can be considered negligible considering the benefits in terms of code size and memory traffic that this system confers.

7.5.3 DEVIL's Circuit Power Consumption

The power consumption of DEVIL is estimated at 60 mW for a power supply voltage of 1.6 V when running at 50 MHz. Table 7.2 shows the power consumption breakdown of the DEVIL processor. The main source of power consumption is the register file, with around 20% of the total power dissipation.

Module                         Relative power consumption
Fetch unit                      8%
Decoder ALU1                   11%
Decoder ALU2                   11%
Decoder Load/Store              7%
Branch unit (incl. decoder)     9%
ALU1                           12%
ALU2                           12%
Load/Store unit                 6%
Register File                  21%
Writeback unit                  3%

Table 7.2: Power consumption breakdown.

Thanks to the implementation of DEVIL and to the extraction of the design features, it is possible to estimate the extra energy consumption caused by the introduction of the multiple-issue pipeline and of the dispatch mechanism. The two main sources of extra power consumption are the dispatch unit and the register file. The cost of introducing the VLIW architecture, in terms of additional energy consumption, is estimated at 30%. Table 7.3 summarizes the benefits of DEVIL compared to a 1-issue processor that executes DEVIL's instruction set. These numbers are based on the average speed-up of 1.5 achieved by the DEVIL processor (see Chapter 6). The parallelism allows DEVIL to run at 50 MHz at 1.6 volts while reaching the same performance as the 1-issue processor powered at 2.2 volts and running at 75 MHz. As a result, DEVIL requires less energy than the 1-issue machine to execute a given task in the same amount of time. The gain is around 38% (for the average speed-up of 1.5). These results validate the advantage of VLIW architectures in terms of energy efficiency.

Processor   Vdd     Frequency   MIPS   Power    MIPS/W   MIPS²/mW
DEVIL       1.6 V   50 MHz      75     60 mW    1250     94
1-issue     1.6 V   50 MHz      50     31 mW    1613     81
1-issue     2.2 V   75 MHz      75     98 mW    765      57

Table 7.3: Summary of the benefits of ILP for low-power.
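The derived columns of Table 7.3, as well as the energy figures quoted in the text, follow from the MIPS and power columns alone. The Python sketch below recomputes them; the row labels and function names are chosen here for illustration, and the only inputs are the values printed in the table.

```python
# (Vdd in volts, frequency in MHz, MIPS, power in mW), copied from Table 7.3
rows = {
    "DEVIL":           (1.6, 50, 75, 60),
    "1-issue @ 1.6 V": (1.6, 50, 50, 31),
    "1-issue @ 2.2 V": (2.2, 75, 75, 98),
}


def mips_per_watt(mips, power_mw):
    return mips / (power_mw / 1000.0)


def mips2_per_mw(mips, power_mw):
    return mips ** 2 / power_mw


for name, (vdd, mhz, mips, mw) in rows.items():
    print(f"{name:16s} {mips_per_watt(mips, mw):6.0f} MIPS/W "
          f"{mips2_per_mw(mips, mw):4.0f} MIPS^2/mW")

# Same throughput (75 MIPS), so the power ratio is also the energy ratio per task.
saving_vs_scaled = 1 - rows["DEVIL"][3] / rows["1-issue @ 2.2 V"][3]

# Same clock and voltage: DEVIL draws more power but finishes 1.5 times sooner.
speedup = rows["DEVIL"][2] / rows["1-issue @ 1.6 V"][2]
extra_energy = rows["DEVIL"][3] / (rows["1-issue @ 1.6 V"][3] * speedup) - 1

print(f"energy saving at equal performance: {saving_vs_scaled:.0%}")
print(f"energy overhead at equal clock:     {extra_energy:.0%}")
```

With the rounded table values, the saving at equal performance comes out near 39% and the energy overhead at equal clock near 29%, in line with the figures of around 38% and 30% given in the text.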

7.6 Comparison With Existing Processors

Table 7.4 summarizes the features of the DEVIL processor compared to existing low-power processors available on the market today. As stated in the previous chapter, this comparison only indicates where the DEVIL features are situated, considering that DEVIL's design can be optimized much further. DEVIL's estimated features attain good MIPS/W and MIPS²/mW values, which suggests that, if optimized, the DEVIL architecture can offer very attractive features. At the moment, the major limitation is the clock frequency.

Model       Vendor     Techno.    Vdd     Freq.     Power    MIPS   MIPS/W   MIPS²/mW
ARM710      VLSI       0.8 µm     3.3 V   25 MHz    120 mW   30     250      8
SH7708      Hitachi    0.5 µm     3.3 V   25 MHz    95 mW    25     263      7
StrongARM   Intel      0.35 µm    2 V     230 MHz   360 mW   268    744      200
ARM940T     VLSI       0.35 µm    3.3 V   150 MHz   675 mW   ?160   ?237     ?38
MMC2001     Motorola   0.35 µm    2 V     34 MHz    80 mW    31     387      12
TR4102      LSI        0.25 µm    1.8 V   80 MHz    40 mW    ?90    ?2250    ?203
SH7750      Hitachi    0.25 µm    1.8 V   200 MHz   1.6 W    300    188      56
DEVIL                  0.25 µm    1.6 V   50 MHz    60 mW    76     1266     96

Table 7.4: Mobile, Embedded, and ILP processor comparison.

7.7 Conclusion

This chapter described the VHDL implementation of the DEVIL processor. This implementation made it possible to estimate the circuit complexity, speed, and power consumption, and thus to evaluate the benefits of VLIW architectures for low-power processors.

In terms of circuit speed, DEVIL runs at 50 MHz, which is quite slow for a 0.25 µm technology. This is due to the synthesis methodology, as well as to the lack of resources for optimizing the DEVIL datapath.

The complexity of DEVIL was estimated at around 125,000 transistors, which makes DEVIL a simple circuit that should have a small die area. Furthermore, it was shown that the dispatch unit introduced to handle the variable instruction length increases the circuit complexity by only 4%, which is negligible considering the benefits of such a mechanism.

It was also shown that ILP improves energy efficiency by around 38% on average. This gives DEVIL the attractive ability to execute code at the same speed as a scalar processor while consuming less power.

This chapter justified the use of VLIW architectures in low-power processors. The next step will be to optimize DEVIL's datapath according to the feedback from this first prototype and to build the first chip in order to obtain the exact circuit features.


Chapter 8

A Step Towards Predicated Execution

Introducing instruction-level parallelism into processors requires strong compiler support. High-Level Languages (HLLs) are generally used to reduce the time to market of a product. Unfortunately, the use of compilers and HLLs can have severe repercussions on code quality compared to traditional hand-coded programs. First, compiler technology affects instruction memory utilization and code size. Although classic code optimizations decrease the number of executed instructions, superscalar optimization, inline expansion, loop unrolling, and superblock formation often increase execution performance at the cost of a larger overall code size (see subsection 6.7.4). Second, although HLLs express algorithms in systematic ways that are good for maintenance and debugging, the resulting extremely sequential control flow can limit the performance of the machine. Such problems can seriously impact the processor's performance and cost (i.e., code size), both of which are critical in embedded systems.

As the use of HLLs becomes inescapable in embedded systems, new compilation techniques and hardware support should be used to overcome the barriers imposed by HLLs. Predication has several features in terms of control flow representation, performance, and code size that make it very attractive for both embedded and high-performance systems. Taking advantage of these features requires new compiler and architectural support. Full predication support has been introduced in the new generation of high-performance processors such as the HP/Intel IA-64 architecture [25]. In embedded architectures, predicated execution is generally supported through a conditional move instruction. This partial predication support reduces the benefits of predication compared to full predication support [43]. However, although full predication improves code quality more, it requires significant changes to the instruction set architecture, namely the addition of a new source operand to each instruction.

This chapter investigates how full predication support can be introduced into embedded architectures while meeting their strong constraints. Section 8.1 introduces the predicate define instructions, one of the most important components of a predicate architecture. Section 8.2 gives an overview of the benefits of predication in terms of code size (i.e., system cost). Section 8.3 proposes a new way to introduce predicated execution support into embedded processors. Section 8.4 addresses the control flow optimization problem and presents a general compiler framework that uses predication to optimize the control flow of a program. Note that the latter is valuable for both embedded and high-performance processors. Finally, Section 8.5 concludes.


8.1 Architecture Support for Full Predicated Execution

Predicated execution (see Section 2.6.3), the central architectural feature examined in this chapter, is a mechanism that facilitates the conditional execution of individual instructions [54]. Predicates are registers that store a single-bit value, representing either TRUE or FALSE. Each instruction is associated with a particular predicate, known as its guard predicate, that determines its execution. When an instruction's guard predicate is TRUE, the instruction executes normally. Conversely, when its guard predicate is FALSE, the instruction is nullified.

The most important component of a predicate architecture is the instruction set support for computing predicates, that is, the predicate define instructions. Predicate defines are inserted by the compiler to generate the values that control conditional execution. The PlayDoh predicate define instruction set [22] provides the baseline for this work and is summarized below.

              PlayDoh types
pSRC  Comp    UT  UF  OT  OF  AT  AF
0     0       0   0   -   -   -   -
0     1       0   0   -   -   -   -
1     0       0   1   -   1   0   -
1     1       1   0   1   -   -   0

Table 8.1: Predicate definition truth table.

PlayDoh is a parameterized Explicitly Parallel Instruction Computing (EPIC) architecture intended to support public research on ILP architectures and compilation. PlayDoh predicate define instructions generate two Boolean values (pD0 and pD1) using a comparison of two source operands (src0 and src1) and a source predicate (pSRC). A PlayDoh predicate define instruction has the form:

pD0_type0, pD1_type1 = (src0 cond src1) <pSRC>

The instruction is interpreted as follows: pD0 and pD1 are the destination predicate registers; type0 and type1 are the predicate types of each destination; src0 cond src1 is the comparison, where cond can be equal (==), not equal (!=), greater than (>), etc.; pSRC is the source predicate register. The value assigned to each destination depends on the predicate type. PlayDoh defines three predicate types: unconditional (UT or UF), wired-or (OT or OF), and wired-and (AT or AF). Each type can be in either normal mode or complement mode, as distinguished by the T or F appended to the type specifier (U, O, or A). Complement mode differs from normal mode only in that the condition evaluation is treated in the opposite logical sense.

For each destination predicate register, a predicate define instruction can either deposit a 1, deposit a 0, or leave the contents unchanged. The predicate type specifies a function of the source predicate and the result of the comparison that is applied to derive the resultant predicate. Table 8.1 shows the deposit rules for each of the PlayDoh predicate types in both normal and complement modes. Each entry corresponds to the result assigned to the destination predicate. Note that a "-" means that the destination is left unchanged.

As shown in the table, the unconditional types are always assigned a value. For the UT-type, the value corresponds to the logical conjunction of the source predicate and the comparison result. Conversely, the or-type and the and-type each assign a value in only one circumstance. The OT-type conditionally writes a 1 if both its source predicate and comparison result are TRUE. The or-type can be used to efficiently compute the disjunction of multiple compare conditions by accumulating terms into an initially cleared predicate register. Since the operations computing terms conditionally write the same value, they can execute in any order or even in parallel. Similarly, the and-type can be used to compute the conjunction of multiple compare conditions by accumulating terms into an initially set predicate register.
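The deposit rules of Table 8.1 can be expressed compactly in software. The sketch below is an illustrative model only, not PlayDoh's definition: the names `deposit`, `define`, and `UNCHANGED` are assumptions, and predicate registers are modeled as a plain dictionary.

```python
UNCHANGED = None  # stands for the "-" entries of Table 8.1

TYPES = ("UT", "UF", "OT", "OF", "AT", "AF")

# (pSRC, Comp) -> values deposited for (UT, UF, OT, OF, AT, AF), from Table 8.1.
TABLE_8_1 = {
    (0, 0): (0, 0, None, None, None, None),
    (0, 1): (0, 0, None, None, None, None),
    (1, 0): (0, 1, None, 1, 0, None),
    (1, 1): (1, 0, 1, None, None, 0),
}


def deposit(ptype, psrc, comp):
    """Return the bit written to a destination predicate, or UNCHANGED."""
    kind, mode = ptype[0], ptype[1]            # e.g. "OT" -> ("O", "T")
    cond = bool(comp) if mode == "T" else not comp  # complement mode inverts
    if kind == "U":                            # unconditional: always writes
        return int(bool(psrc) and cond)
    if kind == "O":                            # wired-or: can only set
        return 1 if (psrc and cond) else UNCHANGED
    if kind == "A":                            # wired-and: can only clear
        return 0 if (psrc and not cond) else UNCHANGED
    raise ValueError(f"unknown predicate type: {ptype}")


def define(regs, dest, ptype, psrc, comp):
    """Apply one destination of a predicate define to the register dict."""
    value = deposit(ptype, psrc, comp)
    if value is not UNCHANGED:
        regs[dest] = value


# Check the model against every entry of Table 8.1.
for (psrc, comp), expected in TABLE_8_1.items():
    got = tuple(deposit(t, psrc, comp) for t in TYPES)
    assert got == expected, (psrc, comp, got)

# Disjunction of three compare results, accumulated with OT into a
# predicate that is cleared first; the order of the terms does not matter.
regs = {"p3": 0}
for comp in (False, True, False):
    define(regs, "p3", "OT", 1, comp)
print(regs["p3"])
```

Because every OT define either writes a 1 or leaves the register untouched, the accumulation gives the same result for any ordering of the terms, which is the property the text relies on for parallel execution.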

8.2 Compiler Techniques for Reducing Predicated Code Size

One very attractive feature of predication is that it reduces the code size penalty introduced by ILP optimizations and by the traditional conditional branch representation of the control flow, while enabling a better level of performance. To show how predication can be used to reduce code size, this section presents some examples extracted from the MediaBench suite [35]. The compilation techniques used in these examples to exploit predicated execution are based on hyperblock formation [44].

8.2.1 Reduction of Number of Control Instructions

Predicated execution offers a fundamentally different method of expressing program execution to the architecture. By design, instructions are guarded with predicates rather than by directing the instruction execution stream to a particular path. The first benchmark example illustrates the way predicated execution support in the ISA can reduce the number of control instructions in a program. Figure 8.1(a) shows a control graph of code for the function reflect1 from the benchmark expic.

The instruction sequence contains 13 basic blocks with a total of 18 instructions, 8 of which are branches. There are four conditional branches, with only two unique branch conditions, B1 and B2, defined by the source code. The control overhead in the instruction sequence is 8/18 = 44%. The same code after optimization is shown in Figure 8.1(b). The inefficiencies of the code of Figure 8.1(a) are reduced by performing branch outcome propagation and tail duplication from the first instance of branch B1 to the other occurrences. The optimized code contains 19 instructions, six of which are branches. The control instruction overhead is reduced to 6/19 = 32%, but at the cost of increasing the overall code size.

The instruction sequence with predicated execution is shown in Figure 8.1(c). The instruction count reduction for the predicated code comes from eliminating the unconditional jump instructions required to represent the control flow of the program to the architecture. As a result, only two predicate defining instructions are used to control the sequence of execution. The total number of instructions is 10, and the control instruction overhead is only 2/10 = 20%. With predicated execution, the number of control instructions is reduced by 75%, from 8 to 2. Similar reductions in control instructions are observed for the whole expic benchmark.
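The overhead figures quoted for the three versions of reflect1 follow directly from the instruction counts given in the text. The short sketch below only restates that arithmetic; the function and variable names are assumptions made for illustration.

```python
def control_overhead(control, total):
    """Fraction of the instructions that are control instructions."""
    return control / total


# (control instructions, total instructions), as quoted in the text.
versions = {
    "original": (8, 18),
    "optimized": (6, 19),
    "predicated": (2, 10),
}

for name, (ctrl, total) in versions.items():
    print(f"{name:10s} {ctrl}/{total} = {control_overhead(ctrl, total):.0%}")

# Reduction in control instructions from the original to the predicated code.
reduction = 1 - versions["predicated"][0] / versions["original"][0]
print(f"control instruction reduction: {reduction:.0%}")
```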

8.2.2 Predicate Promotion and Instruction Merging

Predicate promotion refers to speculation performed by removing the predicate from a pred-

icated instruction [44]. Promotion results in the instruction being unconditionally executed,

essentially reducing the number of predicated instructions. Predicate instruction merging is a

form of promotion that allows identical instructions on complementary or intersecting predi-

cate conditions to be combined. Instruction merging thereby removes one instruction copy, and

promotes the remaining instruction to an earlier predicate condition.


[Figure: control-flow graphs of the reflect1 code in three versions, panels (a), (b), and (c).]

Figure 8.1: Predication example: (a) original, (b) optimized, and (c) predicated.

Figure 8.2(a) shows the source code for part of a switch statement in the function gl_DrawBuffer from the benchmark mesa, an application using the OpenGL graphics library. Several aspects of the code allow predicated execution to reduce the instruction count. First, several case values activate the same program statements in the switch construct. Second, the different groupings of case values have statements in common. In fact, the only difference across the three switch groupings is the source operand of the second statement, which selects either the FrontAlpha, BackAlpha, or NULL value. The instruction and control flow of the switch are illustrated in Figure 8.2(b). The traditional way of executing the switch statement is by executing several sequential branch instructions, illustrated by the sequence of B instructions. Other case values are not shown to keep the example concise; however, it is important to note that subexpression elimination of the common instruction sequences is not possible for all case values of the switch construct.

With predicated instruction support, the compiler is able to if-convert all the instructions of the portion of the switch statement illustrated. After if-conversion, merging and predicate promotion optimizations can be applied to the predicated instructions. Figure 8.2(c) illustrates the predicated code after optimization. The instructions that are common to several paths were merged and are executed unconditionally. Only the dark-shaded instructions require predicate operands. The final predicated code of Figure 8.2(c) illustrates the effectiveness of instruction merging and predicate promotion. Since many of the instructions in the three switch groupings are identical, they can be merged into a single copy. Only the individual, non-shared instructions, illustrated by the dark shading, are predicated. Their predicates are computed by the predicate defining instructions, indicated with a P. Additional predicate and jump instructions are also used in the second and third rows of the predicated code to direct execution to the other case value statements. Although the number of control instructions is slightly reduced, the real code size reduction comes from the sharing of instructions from different control paths while preserving performance. Overall, instruction merging causes a significant reduction in the total number of instructions.
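The idea behind instruction merging can be illustrated with a deliberately simplified model. The sketch below handles only the two-way case of identical instructions guarded by complementary predicates, and it ignores the data dependence checks that a real compiler must perform before merging. The names `Instr` and `merge_complementary`, the two dictionaries, and the predicate names are hypothetical; the instruction strings are loosely borrowed from Figure 8.2.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    text: str              # opcode and operands, compared verbatim
    guard: Optional[str]   # guard predicate; None means "always executes"


def merge_complementary(code, complements, parent):
    """Merge identical instructions guarded by complementary predicates.

    complements maps a predicate to its complement (both produced by the
    same predicate define); parent maps a predicate to the source predicate
    of that define, or None when the define itself is unguarded.
    Assumption: dependence checking is omitted in this sketch.
    """
    out = []
    for ins in code:
        twin = complements.get(ins.guard)
        match = None
        if twin is not None:
            for i, prev in enumerate(out):
                if prev.text == ins.text and prev.guard == twin:
                    match = i
                    break
        if match is None:
            out.append(ins)
        else:
            # Keep one copy and promote it to the enclosing condition.
            out[match] = Instr(ins.text, parent[ins.guard])
    return out


code = [
    Instr("st ctx->Color.DrawBuffer, mode", "p1"),
    Instr("ld r2, r1->FrontAlpha", "p1"),
    Instr("st ctx->Color.DrawBuffer, mode", "p2"),
    Instr("ld r2, r1->BackAlpha", "p2"),
]
complements = {"p1": "p2", "p2": "p1"}
parent = {"p1": None, "p2": None}

merged = merge_complementary(code, complements, parent)
for ins in merged:
    print(ins.text, f"<{ins.guard}>" if ins.guard else "")
```

In this example the shared store is kept once and loses its guard, while the two loads, which differ in their source operand, remain predicated.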


[Figure: (a) C source of the switch statement on mode, with case values GL_FRONT, GL_FRONT_LEFT, GL_FRONT_AND_BACK, GL_BACK, GL_BACK_LEFT, and GL_NONE; (b) instruction and control flow of the original code, Blocks 1 to 3; (c) predicated code.]

Figure 8.2: Merging example: (a) source code, (b) original, and (c) predicated.

8.2.3 Instruction Reduction for Advanced Code Transformation

Predication's ability to reduce the number of instructions can also limit the code growth caused by high-performance optimizations. Consider the loop example of function extend_image of expic in Figure 8.3(a). The loop is dominated by conditional branches in blocks A, B, C, and D, while the only computation of the loop is in block E. The conditional branches of blocks A and C are loop invariant, but program variant. Without predication, the only way to take advantage of this invariance is loop versioning. However, versioning several instances of the loop causes a significant amount of code growth. The highlighted path of Figure 8.3 indicates the frequently taken path of the loop, ACDEF. Superblock ILP and unrolling compilation techniques are applied to construct a superblock of the frequently taken path that is loop-unrolled twice. The resulting control flow representation is shown in Figure 8.3(b). Several code blocks are tail duplicated, leading to code expansion.

With full predicate support, the compiler is able to perform several optimizations that reduce the code size of the loop. First, instead of tail duplicating the code to form a superblock, a hyperblock is constructed by if-converting all of the basic blocks of the loop. All conditional branches of the original loop are replaced by predicate defining instructions. The predicate defining instructions corresponding to the loop invariant, program variant branches of blocks A and C are removed from the body of the loop and placed in the header of the loop. The conditions computed by these instructions are kept in predicate registers for the duration of the loop. It is unnecessary to replicate the predicate define instructions in each iteration since their results are loop invariant. This is one fundamental advantage of using predication to convert control flow dependences into data dependences.

[Figure: control-flow graphs of the extend_image loop (blocks A to F). Panels: (a) Basic blocks, (b) Superblock with loop unrolling, (c) Hyperblock.]

Figure 8.3: Loop optimization example: (a) original, (b) unrolled superblock, and (c) unrolled predicated.

8.3 Introducing Predication Support into Embedded Processors

The previous section illustrated how predication can be used to reduce code size. However, predicated execution requires several changes to existing ISAs, which can affect program code size. Indeed, there is a major tradeoff in the design of the instruction set, namely the addition of a predicate source operand to all instructions. This section proposes a new framework for introducing predication into embedded processors. The first part of this section presents the effect on program code size of the ISA modification required by full predicated execution support. The second part proposes a new instruction issue mechanism that reduces the impact of the ISA modification on code size by supporting both predicated and non-predicated versions of instructions.

8.3.1 Effect on Code Size of Full Predication Support

Although the performance increase of full predicated execution is significant, it comes at the cost of adding a predicate source operand to every instruction. In the full predication model, all instructions have a predicate source operand, even those that are not conditionally executed. Figure 8.4 illustrates the percentage of static instructions with conditional predicates relative to the overall number of instructions. The percentage of conditional instructions averages around 40% of the total instructions, meaning that a large portion of instructions do not require a predicate operand. Since the percentage of unconditional instructions is significant, the unnecessary increase in instruction format size can dramatically impact embedded system designs.

[Bar chart. Vertical axis: predicated instructions, 0% to 70%. Horizontal axis: expic, g721, ghostscript, gsm, jpeg, mesa, mpeg, pegwit, pgp, rasta, rawaudio, wc, cmp, grep, lex, qsort, yacc, compress, AVERAGE.]

Figure 8.4: Relative number of predicated instructions.

Figure 8.5 shows the code size expansion attributed to the predicate operand for three distinct models on the same predicated benchmarks. First, Zero Size shows the code size for predication when the predicate representation has zero cost. Next, Predicate Only shows the effect when the instruction size growth of the predicate operand is attributed to only the conditional instructions. Finally, Full Size shows the size of the operand added to every static instruction as designed in an architecture supporting full predication. All of the predicated code sizes are compared to a base architecture without predication support. Note that compilation for predication alone has some effect on code size. The size of the predicate operand was evaluated assuming a 24-bit base instruction format and a 5-bit predicate operand field.
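The three accounting models can be written as a small formula. The sketch below is a simplified illustration under stated assumptions: it treats all instructions as having the same base width, and the inputs used in the example (an unchanged instruction count and 40% predicated instructions) are illustrative values, not measurements from Figure 8.5. The function name and the model keys are hypothetical.

```python
BASE_BITS = 24   # base instruction format, as stated in the text
PRED_BITS = 5    # predicate operand field, as stated in the text


def relative_code_size(instr_ratio, pred_fraction, model):
    """Code size relative to the non-predicated base architecture.

    instr_ratio:   static instruction count of the predicated code divided
                   by that of the non-predicated code
    pred_fraction: fraction of static instructions that carry a predicate
    model:         "zero", "predicate_only", or "full"
    """
    extra_bits = {
        "zero": 0.0,                                   # predicate is free
        "predicate_only": pred_fraction * PRED_BITS,   # only guarded instrs pay
        "full": float(PRED_BITS),                      # every instruction pays
    }[model]
    return instr_ratio * (BASE_BITS + extra_bits) / BASE_BITS


# Illustrative inputs only (assumptions, not data from the thesis).
for model in ("zero", "predicate_only", "full"):
    size = relative_code_size(instr_ratio=1.0, pred_fraction=0.40, model=model)
    print(f"{model:15s} {size:.3f}")
```

Under this simple model, the cost of the Full Size format is fixed at 5/24 of the base width per instruction, whereas the Predicate Only cost scales with the fraction of instructions that are actually guarded.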

Figure 8.5 indicates that predicated execution increases program code size by an average of 23%, and often by as much as 30%. The results of the Zero Size model indicate that, for a large number of programs, the predicated code effectively has fewer instructions and a smaller code size. An interesting pattern is observed in Figure 8.5 for the Predicate Only model. As a general rule, the code size for this model is significantly smaller than the Full Size code size, and averages near the base non-predicated code size. The difference between predicated and non-predicated results occurs because predication has a fundamental ability to remove numerous control instructions and because compiler support for predicated execution can perform optimizations that allow the code to share instructions that are on different execution conditions. For example, the instruction D = A + X in Figure 2.25(d) does not require a predicate operand since the compiler guarantees that it executes unconditionally in the block.


[Bar chart. Vertical axis: relative code size, 0.80 to 1.25. Horizontal axis: expic, g721, ghostscript, gsm, jpeg, mesa, mpeg, pegwit, pgp, rasta, rawaudio, wc, cmp, grep, lex, qsort, yacc, compress, AVERAGE. Series: Zero Size, Predicate Only, Full Size.]

Figure 8.5: Code expansion considering predication source operand.

This optimization shows an important ability of the compiler to reduce the use of predication in code, thereby increasing the disadvantage that traditional predicated architectures have by requiring conditional and unconditional instructions to include a predicate operand. Further details on the compiler's ability to affect code size and the percentage of predicated instructions are presented in the next subsection.

8.3.2 Predication Code Size and Execution Characteristics

This subsection presents the static and dynamic characteristics of code size using predicated execution and the effects of predicate optimization on code size. Figure 8.6 indicates that predication reduces the total number of instructions for traditionally optimized code by 6.3%. A significant portion of the instructions eliminated were control instructions, which were reduced by 13%, where control instructions include predicate defining instructions and any traditional branch instructions. Other characteristics include a 7% reduction in the number of dynamically executed instructions in general code, and a 31% reduction in the number of dynamic instructions for code with superscalar optimizations.

Table 8.2 summarizes the amount of predicate optimization that the compiler is able to perform on the hyperblocks. The optimizations are broken down into two categories, instruction merging and predicate promotion. For the instruction merging category, the first column gives the percentage of static predicated instructions that can be merged, which averages 8%. The additional code reduction attributed to merging is shown in the next column, and indicates an additional reduction in overall code size. The percentage of predicated instructions that are promoted to unconditionally executed instructions is shown in the next column. These numbers indicate that, on average, as many as 28% of the originally predicated instructions of a hyperblock may be promoted by compiler optimization. Since both merging and promotion can affect the same operation, the exact occurrence


[Bar chart. Vertical axis: code size reduction, 0% to 30%. Horizontal axis: expic, g721, ghostscript, gsm, jpeg, mesa, mpeg, pegwit, pgp, rasta, rawaudio, wc, cmp, grep, lex, qsort, yacc, compress, AVERAGE. Series: Overall, Control, Operation.]

Figure 8.6: Code reductions due to predicated execution.

              Code Merging                              Hyperblock      Predicate-Optimization
Benchmark     Merging %   Reduction %   Promotion %   Static Pred %   Static Pred %
expic         6.03        1.45          37.59         22.68           14.92
g721          1.60        0.85          43.31         52.64           29.77
ghostscript   0.26        1.01          32.68         41.31           24.79
gsm           3.21        1.80          51.44         44.78           23.28
jpeg          29.97       1.96          39.38         53.88           34.78
mesa          8.62        3.55          37.49         37.96           22.07
mpeg          5.05        2.40          34.52         46.03           26.13
pegwit        3.72        0.75          15.08         18.60           14.95
pgp           2.48        1.52          14.12         60.12           49.32
rasta         3.38        1.75          17.48         50.60           39.94
rawaudio      2.17        0.61          26.09         27.71           21.21
wc            16.92       7.91          10.77         43.33           40.29
cmp           22.12       11.57         16.81         46.89           37.04
grep          10.52       6.89          14.43         60.85           50.74
lex           11.87       5.44          14.97         43.15           32.75
qsort         8.00        1.61          48.00         20.49           11.90
yacc          7.30        2.48          26.26         32.83           21.11
compress      5.85        1.82          30.70         30.37           14.60
average       8.28        3.08          28.40         40.79           28.31

Table 8.2: Instruction merging and predicate promotion characteristics.

of which optimization has occurred is difficult to collect within the compiler's infrastructure. Nevertheless, the results of Table 8.2 indicate that relevant amounts of both optimizations affect the percentage of predicated code. The final two columns give the percentage of static predicated instructions relative to total program instructions, listed for the original hyperblocks and the predicate-optimized hyperblocks. Although some instances of optimization on predicated code lead to increases in the number of predicated instructions, the results of Table 8.2 show that, in general, predicate optimizations significantly reduce the percentage of predicated instructions.

The most important characteristic in the results of Table 8.2 is that only 28% of the static instructions remain predicated after aggressive compiler optimization. This indicates that a large number of instructions execute unconditionally and do not require a predicate operand. Thus, when every instruction carries a predicate operand, the memory system of an embedded microprocessor is sacrificed for the potential performance gains. This analysis strongly supports the utility of an architecture framework that takes advantage of predication's performance benefits while only adding size to those instructions that are actually predicated.

8.3.3 Prefix-Based Predication

The previous subsections showed how a compiler using predication can reduce the program code size, which is valuable for embedded systems. However, this benefit can be diminished or lost if the modification of the ISA implies an increase in code size. Using the fact that only a small percentage of instructions are predicated after optimization, this section details the addition of predication to a 24-bit instruction word for embedded processors.

8.3.3.1 Architecture Model

Prefix-based predication uses opcode prefixing to add sufficient instruction bits to indicate that a predicate operand exists for instructions which the compiler has designated to conditionally execute. As illustrated in the previous section, a significant amount of code size can be saved when only the predicated instructions incur the predicate operand overhead. Figure 8.7 illustrates the base 24-bit instruction format that includes an operation code, a destination register index, and two source operands (potentially register indexes or immediate data).

[Diagram: bytes 0 to 11 fetched from the I-cache feed a length decoder and steering stage, followed by a decode stage with three decoders. Instruction formats shown: normal instruction (OP-CODE, DEST, SRC1, SRC0); predicated instruction (PREFIX, OP-CODE, DEST, SRC1, SRC0, PRED); predicate defining instruction (OP-CODE, P_DEST, PRED, SRC1, SRC0).]

Figure 8.7: Prefix-based predication decoding of normal and predicated instructions.

Figure 8.7 illustrates how a prefix opcode of the 24-bit instruction can designate that an additional byte containing supplementary instruction information follows. The complete 32-bit instruction can then be decoded into a 26-bit instruction with a 6-bit operation code, a 5-bit predicate register index, a destination register index, and two source operands. The prefix opcode is then discarded. In this example architecture, the 5-bit predicate index can be used to access a 32-entry predicate register file. New predicate defining instructions for expressing predicate conditions are also added using the prefixing mechanism.
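The length-decoding step can be sketched in software. The bit layout used below is an assumption made purely for illustration, since the text does not give the exact encoding: a normal instruction is taken to be a 9-bit opcode followed by three 5-bit fields, and a predicated instruction is taken to be a 6-bit prefix marker followed by the 26-bit instruction described above. The `PREFIX` value and the function names are hypothetical.

```python
PREFIX = 0b111111  # hypothetical 6-bit marker in the top bits of the first byte


def encode(opcode, dest, src1, src0, pred=None):
    """Encode one instruction as 3 bytes (normal) or 4 bytes (predicated)."""
    if pred is None:
        # Assumed normal layout: opcode(9) | dest(5) | src1(5) | src0(5)
        assert 0 <= opcode < (1 << 9) and (opcode >> 3) != PREFIX
        word = (opcode << 15) | (dest << 10) | (src1 << 5) | src0
        return word.to_bytes(3, "big")
    # Assumed predicated layout:
    # prefix(6) | opcode(6) | dest(5) | src1(5) | src0(5) | pred(5)
    assert 0 <= opcode < (1 << 6)
    word = ((PREFIX << 26) | (opcode << 20) | (dest << 15)
            | (src1 << 10) | (src0 << 5) | pred)
    return word.to_bytes(4, "big")


def decode_stream(code):
    """Length-decode a byte stream, then decode each instruction's fields."""
    out = []
    i = 0
    while i < len(code):
        if code[i] >> 2 == PREFIX:              # prefix detected: 4-byte form
            word = int.from_bytes(code[i:i + 4], "big")
            out.append({
                "opcode": (word >> 20) & 0x3F,  # the prefix bits are discarded
                "dest": (word >> 15) & 0x1F,
                "src1": (word >> 10) & 0x1F,
                "src0": (word >> 5) & 0x1F,
                "pred": word & 0x1F,
            })
            i += 4
        else:                                   # normal 3-byte form
            word = int.from_bytes(code[i:i + 3], "big")
            out.append({
                "opcode": word >> 15,
                "dest": (word >> 10) & 0x1F,
                "src1": (word >> 5) & 0x1F,
                "src0": word & 0x1F,
                "pred": None,
            })
            i += 3
    return out


stream = encode(5, 1, 2, 3) + encode(7, 4, 5, 6, pred=9) + encode(5, 7, 8, 9)
for ins in decode_stream(stream):
    print(ins)
```

In this sketch only the middle instruction pays for the predicate operand, so the three instructions occupy 10 bytes rather than the 12 bytes that a uniform 32-bit format would need.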

8.3.3.2 Microarchitecture support

The primary microarchitecture component affecting prefix-based predication is the instruction decode methodology. Most prefix architecture designs integrate an additional instruction decode stage into the original pipeline design. In this model, the first stage determines instruction lengths (prefix detection) and steers the instructions to the second stage, where the actual instruction decoding is performed. Figure 8.7 illustrates this process. The multiple pipelined decode method is successful for several reasons. First, the design places the focus on resources other than instruction memory. A second reason for using an additional decode stage is that the number of branch instructions executed in a predicated architecture is significantly reduced, so the number of mispredictions is reduced as well. This limits the negative effect that additional pipeline stages before branch resolution have on the misprediction penalty. The branch prediction accuracy for predicated architectures is about 7% higher than for traditional architectures.

8.3.4 Experimental Evaluation

8.3.4.1 Methodology

The IMPACT compiler and emulation-driven simulator were enhanced to support the proposed architecture framework. The base architecture modeled uses a 5-stage pipeline that can issue 6 operations per cycle in order (up to the limit of the available functional units: four integer ALUs, two memory ports, two floating point ALUs, and one branch unit). The instruction latencies match those of the HP PA-7100 microprocessor (integer operations have a 1-cycle latency, and load operations have a 2-cycle latency). The processor contains 32 integer and 32 floating point registers. To support prefix-based predication, 32 predicate registers and an additional decoding stage were modeled. The simulated memory system was either perfect or used a 2K, 4K, or 8K direct-mapped instruction cache and an 8K direct-mapped, blocking data cache, both with 64-byte blocks and a miss penalty of 12 cycles. A static branch prediction strategy was employed.

8.3.4.2 Results and Analysis

[Bar chart. Vertical axis: performance, 1.0 to 1.7; three bars exceed the axis and are labeled 2.14, 2.02, and 1.86. Horizontal axis: expic, g721, ghostscript, gsm, jpeg, mesa, mpeg, pegwit, pgp, rasta, rawaudio, wc, cmp, grep, lex, qsort, yacc, compress, AVERAGE. Series: 2K, 4K, 8K.]

Figure 8.8: Performance of varying instruction cache size for prefix-based predicated architecture relative to non-predicated architecture.

Figure 8.8 shows the results of varying the instruction cache size for the non-predicated and prefix-based predicated architectures. The performance improvement is substantial at small cache sizes; as the instruction cache grows, however, the base architecture improves more, and the relative performance saturates. This indicates that the base model is more dependent on instruction cache resources than the prefix-based predicated architecture. The results of cache simulations show that, compared to the non-predicated model, prefix-based predication has an average 7% higher hit rate for 2K instruction caches and 2.5% for 8K caches. Experiments also indicate that prefix-based predication has an average 10% higher speedup over traditional predicated architectures for small instruction cache models.

[Figure 8.9: bar chart. Vertical axis: Relative Code Size (scale 0 to 6). Horizontal axis: one
group of bars per benchmark plus AVERAGE. Legend: Non-predicated, Full Predication, Prefix
Predication.]

Figure 8.9: Code expansion of superscalar relative to traditional optimization.

With a perfect memory system, superscalar optimization (superblock formation, loop unrolling)
performs on average 63% better than general levels of optimization for both the prefix-based
predicated and the non-predicated architectures. Under superscalar optimization, the average
speedup of the predicated architecture is only 12% higher than that of the non-predicated
architecture, so the performance gains of predicated execution do not greatly exceed those of
the non-predicated version. However, the code size of the predicated high-performance code is
significantly reduced. Figure 8.9 shows the code expansion of the superscalar optimization
for the non-predicated, full-predicated, and prefix-based predicated architectures. The 12%
performance improvement is therefore substantial, since it is obtained with a significantly
smaller code size. The full-predicated architecture has an average 11% smaller code size and
the prefix-based predicated architecture has an average 25% smaller code size.

8.4 Control flow optimization using predication

The previous section described a way to introduce full predicated execution support into
embedded processors. Such support gives new opportunities to generate more optimized code,
especially in the control flow domain.

One fundamental limitation of most branch handling techniques is that they do not
significantly alter the program's control flow logic. As the compiler translates high-level
language control constructs into assembly-level branches, it does not alter the basic control
structure. Instead, it applies techniques focused on exposing and increasing ILP within a
fixed control structure. With control speculation, this is obvious: control dependences are
removed to enable the motion of instructions above branches, but the branches themselves are
not altered. Likewise, when predication is applied by the process of if-conversion, branches
are transformed into predicate computations and control-dependent instructions are rendered
conditional by the addition of guarding predicates. This process converts control flow and
control dependences into data flow and data dependences, but preserves the original program's
control structure.

Restricting a compiler to use the program's unaltered control structure is undesirable for
several reasons. First, a high-level language such as C or C++ represents program control flow
in an extremely sequential manner through the use of nested if-then-else statements, switch
statements, and loop constructs. Each control construct is fully evaluated before proceeding to
the next. This sequential computation often defines the program critical paths that constrain
the available ILP. Second, programmers represent control flow for understandability or for ease
of debugging rather than for efficient execution on the target architecture. As a result,
software often contains redundant control constructs that are difficult to detect with
traditional compiler techniques. These may involve evaluating the same conditions multiple
times or evaluating conditions that partially overlap. An effective ILP compiler should be
capable of transforming the program control structure to eliminate these problems.

The ability to restructure code aggressively is a critical feature of an effective ILP
compiler. The most obvious situation where aggressive transformation is regularly applied is on
arithmetic expressions. Compilers often completely restructure the programmer's arithmetic
computations into more parallel forms using a variety of transformations. These include
expression re-association, tree height reduction [34], and blocked back substitution [55].
Although ILP compilers may aggressively restructure computation, they typically preserve the
program's original control structure. This conservative approach can seriously limit the level
of efficiency as well as the level of ILP achieved in branch-intensive programs.

Motivated by the potential of aggressive techniques for transforming arithmetic expressions,
this section introduces a new approach to optimizing program control flow. The goal of this
work is to develop a systematic methodology for reformulating program control flow for more
efficient execution on an ILP processor. Control expressed in branches and predicate define
instructions is first extracted and represented as a program decision logic network. Then, a
new, more efficient network is synthesized with the goals of reducing dependence height and
redundancy. To accomplish the desired optimization and synthesis, the program decision logic
network is modeled as a Boolean equation. Boolean minimization techniques are then applied to
simplify and optimize the equation. Finally, the optimized network is re-expressed in the form
of predicated assembly code. One unique feature of this approach is that all branches and
predicates within a segment of code are treated jointly in a systematic manner.

This section focuses on compiler techniques and architecture support for effective
optimization of programmatic control flow. In particular, it highlights the aspects of the HPL
PlayDoh predicate define instructions [22] that are most useful for this purpose. During the
development of this compiler support for programmatic logic optimization, a new class of
predicate define instructions was designed to extend the PlayDoh architecture to support the
optimizer more effectively. The key idea behind this extension is presented, and its
effectiveness is shown through simulation of compiled codes that use it. These experiments
show that programmatic logic optimization indeed results in substantial performance
improvements in functions where control flow is the major impediment to exploiting ILP.


8.4.1 Previous Work

Previous research in the area of control flow optimization can be classified into three major
categories: branch elimination, branch reordering, and control height reduction. Branch
elimination techniques identify and remove those branches whose direction is known at
compile-time. The simplest form of branch elimination is loop unrolling, in which instances of
backedge branches are removed by replicating the body of the loop. More sophisticated
techniques examine program control flow and data flow simultaneously to identify correlations
among branches [12][47]. When a correlation is detected, a branch direction is determinable by
the compiler along one or more paths, and the branch can be eliminated. In [47], an algorithm
is developed to identify correlations and to perform the necessary code replication to remove
branches within a local scope. This approach is generalized and extended to the program-level
scope in [12]. The second category of control flow optimization work is branch reordering. In
this work, the order in which branches are evaluated is changed to reduce the average depth
traversed through a network of branches [79].

The final category of control flow optimization research focuses on the reduction of control
dependence height. This work attempts to collapse the sequential evaluation of linear chains of
branches in order to reduce the height of program critical paths [56]. In an approach analogous
to a carry lookahead adder, a lookahead branch is used to calculate the taken condition of a
series of branches in a parallel form. Subsequent operations dependent on any of the branches
in the series need only wait for the lookahead branch to complete. The control dependence
height of the branch series is thus reduced to that of a single branch. The mechanisms
introduced herein also serve to reduce control dependence height. This work, however,
introduces an approach to minimization and re-expression of control flow networks that is far
more general than those proposed in previous work.

8.4.2 Limitations of PlayDoh

Section 8.1 describes the predicate define instructions of PlayDoh. However, our new strategy
for the generation of predicated code reveals several limitations of the PlayDoh instruction
set. This subsection describes these limitations and presents our proposed extensions to the
PlayDoh predicate define instruction set.

               PlayDoh types               New types
pSRC  Comp     UT  UF  OT  OF  AT  AF      ∨T  ∨F  ∧T  ∧F
 0     0       0   0   -   -   -   -       -   1   0   0
 0     1       0   0   -   -   -   -       1   -   0   0
 1     0       0   1   -   1   0   -       1   1   0   -
 1     1       1   0   1   -   -   0       1   1   -   0

Table 8.3: Extended predicate definition truth table.

The major limitation of the PlayDoh predicate types is that logical operations can only be
performed efficiently amongst compare conditions. There is no convenient way to perform
arbitrary logical operations on predicate register values. While these operations could be
accomplished using the PlayDoh predicate types, they often require either a large number of
operations or a long sequential chain of operations, or both.

With traditional approaches to generating predicated code, these limitations are not serious,
as there is little need to support logical operations amongst predicates. The Boolean
minimization strategy described in the next subsection, however, makes extensive use of logical
operations on arbitrary sets of both predicates and conditions. In this approach, intermediate
predicates are calculated that contain logical subexpressions of the final predicate
expressions to facilitate reuse of terms or partial terms. The intermediate predicates are then
logically combined with other intermediate predicates or other compare conditions to generate
the final predicate values. Without efficient support for these logical combinations, gains of
the Boolean minimization approach are diluted or lost.

Predicate Define Extensions. Two new predicate types are introduced to facilitate generating
efficient code using our minimization techniques. These are referred to as disjunctive-type
(∨T or ∨F) and conjunctive-type (∧T or ∧F). Table 8.3 (right-hand portion) shows the deposit
rules for the new predicate types. The ∧T-type define clears the destination predicate to 0 if
either the source predicate is FALSE or the comparison result is FALSE. Otherwise, the
destination is left unchanged. Note that this behavior differs from that of the and-type
predicate define, which leaves the destination unaltered when the source predicate evaluates to
FALSE. The conjunctive-type thus enables the compiler to form the logical conjunction of an
arbitrary set of conditions and predicates easily and efficiently.

The disjunctive-type behavior is analogous to that of the conjunctive-type. With the ∨T-type
define, the destination predicate is set to 1 if either the source predicate is TRUE or the
comparison result is TRUE (FALSE for ∨F). Otherwise, the destination is left unchanged. The
disjunctive-type is thus used to compute the disjunction of an arbitrary set of predicates and
compare conditions into a single predicate.
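The deposit rules of Table 8.3 can be made concrete with a small lookup-table model. The
following Python sketch is only an illustration of the truth table, not part of the PlayDoh
specification or of the IMPACT tools; the type names `dt`, `df`, `ct`, and `cf` (for the
disjunctive and conjunctive types) and the helper names are assumptions chosen for this example.

```python
# Deposit rules transcribed from Table 8.3.
# Each entry maps (p_src, comp) to the value written to the destination,
# or None when the destination is left unchanged ("-" in the table).
RULES = {
    "ut": {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1},
    "uf": {(0, 0): 0, (0, 1): 0, (1, 0): 1, (1, 1): 0},
    "ot": {(0, 0): None, (0, 1): None, (1, 0): None, (1, 1): 1},
    "of": {(0, 0): None, (0, 1): None, (1, 0): 1, (1, 1): None},
    "at": {(0, 0): None, (0, 1): None, (1, 0): 0, (1, 1): None},
    "af": {(0, 0): None, (0, 1): None, (1, 0): None, (1, 1): 0},
    # Hypothetical short names for the new types of Table 8.3:
    "dt": {(0, 0): None, (0, 1): 1, (1, 0): 1, (1, 1): 1},  # disjunctive, T
    "df": {(0, 0): 1, (0, 1): None, (1, 0): 1, (1, 1): 1},  # disjunctive, F
    "ct": {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): None},  # conjunctive, T
    "cf": {(0, 0): 0, (0, 1): 0, (1, 0): None, (1, 1): 0},  # conjunctive, F
}


def pred_define(dest, ptype, p_src, comp):
    """Return the destination predicate value after one predicate define."""
    written = RULES[ptype][(int(p_src), int(comp))]
    return dest if written is None else written


def conjunction(pairs):
    """AND together (predicate, compare result) pairs with conjunctive-T defines."""
    p = 1  # the accumulated conjunction starts from TRUE
    for p_src, comp in pairs:
        p = pred_define(p, "ct", p_src, comp)
    return p
```

For instance, `pred_define(1, "at", 0, 0)` leaves the destination at 1, whereas
`pred_define(1, "ct", 0, 1)` clears it to 0, which is the difference between the and-type and
the conjunctive-type described above.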

8.4.3 Overview of Compiler Techniques

This subsection presents a conceptual overview of the program decision logic minimization
process, starting with the conversion of code to the predicated representation for subsequent
optimization. In order to simplify the extraction and manipulation of control expressions, the
compiler applies if-conversion and reformulation of non-branch control constructs to transform
all programmatic control flow into the predicated representation. In the IMPACT compiler, this
conversion is fully performed within acyclic code regions formed using hyperblock formation
heuristics [44]. To a great extent, the ability of our control logic optimization techniques to
improve performance depends on the scope of these regions, as only the control structure
transformed into the predicate domain is available for subsequent optimization. In order to
promote effective hyperblock formation, aggressive function inlining is performed.

An example extracted from the UNIX utility wc illustrates the application and benefit of the
described techniques. Figure 8.10 shows the code segment before and after complete
if-conversion. As shown in Figure 8.10(a), the code before if-conversion consists of basic
blocks and conditional branches (shown in bold) which direct the flow of control through the
basic blocks. As shown in Figure 8.10(b), the code after if-conversion consists of only a
single block of sequential instructions, a hyperblock [43]. The conditional branches have been
replaced with predicate define instructions (shown in bold) and the predicate registers defined
have been placed as source operands on all guarded instructions in accordance with their
execution conditions.

After if-conversion, control speculation is performed to increase opportunities for
optimization. Control speculation is a means of breaking a control dependence by allowing an
instruction to execute more frequently than is necessary. In a predicated representation, this
is performed by predicate promotion, the process by which predicate flow dependences are broken
and instructions are made to execute speculatively by changing an instruction's guard predicate
to another predicate whose expression subsumes that of the original [44]. When instructions are
aggressively promoted, some predicates may no longer be utilized as guards on computation. When
a predicate is no longer necessary, the program decision logic is simplified. Figure 8.11(a)
shows the wc hyperblock segment after predicate promotion. Comparison with Figure 8.10(b) shows
that four instructions (12, 13, 16, and 17) have had their predicates promoted to the TRUE
predicate, denoted in the figure by the absence of a source predicate. However, no predicates
were rendered completely unused by this process.

[Figure 8.10, panel (a): control flow graph of the loop, in which basic blocks are connected by
the T/F edges of the conditional branches 32 >= r4, r4 >= 127, r2 == 0, r4 != 10, r4 != 32,
and r4 != 9. Panel (b): the hyperblock listed below; guarding predicates appear in angle
brackets.]

    Loop:
     1  p4 = 0
     2  p5 = 0
     3  r24 = MEM[r3]
     4  r23 = r24 + 1
     5  MEM[r3] = r23
     6  r4 = MEM[r24]
     7  p4_ot, p1_uf = (32 >= r4)
     8  p4_ot, p2_uf = (r4 >= 127)   <p1>
     9  p3_ut = (r2 == 0)            <p2>
    10  p5_of, p6_uf = (r4 != 10)    <p4>
    11  p7_ut = (r4 != 10)           <p4>
    12  r27 = MEM[r72]               <p3>
    13  r26 = r27 + 1                <p3>
    14  MEM[r72] = r26               <p3>
    15  r2 = r2 + 1                  <p3>
    16  r62 = MEM[r71]               <p6>
    17  r61 = r62 + 1                <p6>
    18  MEM[71] = r61                <p6>
    19  p5_of, p8_ut = (r4 != 32)    <p7>
    20  p5_of = (r4 != 9)            <p8>
    21  r2 = 0                       <p5>
    22  Jump Loop

Figure 8.10: A portion of the inner loop of the UNIX utility wc. The control flow graph (a),
and the corresponding hyperblock formed after complete if-conversion (b).

Next, the program decision logic network is constructed. Since predicates can only assume
Boolean values, predicates and predicate defines can be viewed as a combinational logic
circuit. To derive the Boolean function from a hyperblock, the compiler needs only to examine
the predicate define instructions. Consider instructions 7 and 8 in Figure 8.11(a), in which
the expression for p1 can be written as $p_1 = \overline{C_0}$ and the expression for p2 as
$p_2 = p_1\overline{C_1}$, where $C_0$ is the condition $(32 \geq r4)$ and $C_1$ is the
condition $(r4 \geq 127)$. The expression for p2, in terms of conditions, is
$p_2 = \overline{C_0}\,\overline{C_1}$. In the course of this complete back substitution,
expressions based on condition variables are formulated for all predicate define instructions.
The composition of all these expressions is the program decision logic network. This network
can be modeled as a logic circuit that represents all the decisions made in the program. The
logic circuit has conditions as its inputs and the predicates which control computation as its
outputs. The multiple-output Boolean logic circuit for the wc code segment is shown in
Figure 8.11(b).

Once the logic circuit has been derived, many CAD techniques can be employed to simplify the
program decision logic network. In the IMPACT compiler, the derived Boolean function is
represented with a Binary Decision Diagram (BDD) [5]. The BDD algorithms used are described in
[13]. The predicate BDD contains the relationship among predicates as defined by the network of
predicate define operations. The predicate BDD is used throughout the compiler as a database
for queries made by optimizations when operating on predicated code. For example, one common
query is to determine if one instruction executes only when another instruction has executed.
This query is equivalent to the dominance relationship in the control flow domain. Here, the
BDD is queried to determine if the predicate expression of one instruction subsumes the
predicate expression of another. Queries to the BDD are made in IMPACT by the optimizer, the
scheduler, and dataflow analysis.

[Figure 8.11, panels (a) to (d): (a) the 22-instruction wc hyperblock of Figure 8.10(b) with
instructions 12, 13, 16, and 17 unguarded; (b) a gate network with inputs 32 >= r4 (C0),
r4 >= 127 (C1), r2 == 0 (C2), r4 != 10 (C3), r4 != 32 (C4), and r4 != 9 (C5), intermediate
signals p1, p2, p4, p7, and p8, and outputs p3, p5, and p6; (c) the hyperblock after
minimization, which initializes p3 = 1 and p5 = 0 and uses the predicate defines shown in
Figure 8.13(d); (d) the corresponding gate network with outputs p3, p5, and p6.]

Figure 8.11: The wc hyperblock after speculation but before logic minimization (a) and its
corresponding logic diagram (b). The hyperblock after logic minimization (c) and its
corresponding logic diagram (d).

For the purposes of decision logic minimization, the BDD provides a simple method by which
expressions describing the hyperblock logic can be derived. The only expressions requested from
the BDD are those describing the essential predicates. Essential predicates are those
predicates that guard real computation instructions (any instruction that is not a predicate
define). In Figure 8.11(a), the essential predicates are p3, p5, and p6. Predicates p1, p2, p4,
p7, and p8 are non-essential predicates as they are used only as intermediates in evaluation of
the essential predicates.

The BDD maintains a canonical representation of the decision logic functions, from which a
Boolean sum-of-products expression can be produced for any represented function. Note that the
expression thus generated reflects the canonical nature of the BDD's internal representation,
and is usually not optimal for expressions with multiple product terms. Therefore, it is
necessary to optimize the derived expression before attempting to synthesize a predicate
defining structure.

The expressions describing the evaluation of the essential predicates are optimized using
techniques which eliminate redundant terms in the function and which re-express the Boolean
function in a more parallel form. The resulting expression is reformulated back into predicate
define instructions in the hyperblock. Section 8.4.4 presents the details of the Boolean logic
optimizers and reformulators studied in this work. These optimizers and reformulators must
balance the reduction of dependence height with the number of predicate defines that can be
accommodated in the code schedule. This involves making an accurate estimate of how much time
is available for computation of control functions based on the availability times of conditions
and when predicates need to be consumed. These and other considerations make the design of an
optimizer and a reformulator nontrivial.

Figures 8.11(c) and 8.11(d) show the reformulated hyperblock and corresponding logic circuit
after the minimization process is complete. The number of logic gates in the circuit
implementation is reduced from ten to three. In addition, the six-level gate network in
Figure 8.11(b) is reduced to a single-level gate network in Figure 8.11(d). All non-essential
predicates were also eliminated as part of this process. An example optimization performed on
the logic circuit takes the form $C_0 + C_1\overline{C_0} \rightarrow C_0 + C_1$. An
application of this optimization occurs between instructions 7 and 8 when computing p4.
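The equivalence of the two predicate define networks of the wc example can be checked by
exhaustive simulation. The Python sketch below is an illustration only: the `define` helper
models the U, O, and A rows of Table 8.3, the two functions transcribe the predicate defines of
the original and the minimized hyperblocks, and the tested value ranges for r4 and r2 are
arbitrary assumptions made for this example.

```python
def define(preds, dest, ptype, comp, guard=1):
    """Apply one predicate define of type ut, uf, ot, of, at, or af."""
    if ptype == "ut":
        preds[dest] = int(bool(guard and comp))
    elif ptype == "uf":
        preds[dest] = int(bool(guard and not comp))
    elif ptype == "ot" and guard and comp:
        preds[dest] = 1
    elif ptype == "of" and guard and not comp:
        preds[dest] = 1
    elif ptype == "at" and guard and not comp:
        preds[dest] = 0
    elif ptype == "af" and guard and comp:
        preds[dest] = 0
    # In every other case the destination is left unchanged.


def original(r4, r2):
    """Predicate defines before minimization (instructions 7-11, 19, 20)."""
    p = {"p4": 0, "p5": 0}
    define(p, "p4", "ot", 32 >= r4)
    define(p, "p1", "uf", 32 >= r4)
    define(p, "p4", "ot", r4 >= 127, p["p1"])
    define(p, "p2", "uf", r4 >= 127, p["p1"])
    define(p, "p3", "ut", r2 == 0, p["p2"])
    define(p, "p5", "of", r4 != 10, p["p4"])
    define(p, "p6", "uf", r4 != 10, p["p4"])
    define(p, "p7", "ut", r4 != 10, p["p4"])
    define(p, "p5", "of", r4 != 32, p["p7"])
    define(p, "p8", "ut", r4 != 32, p["p7"])
    define(p, "p5", "of", r4 != 9, p["p8"])
    return p["p3"], p["p5"], p["p6"]


def minimized(r4, r2):
    """Predicate defines after minimization; p3 starts at 1, p5 at 0."""
    p = {"p3": 1, "p5": 0}
    define(p, "p3", "af", 32 >= r4)
    define(p, "p3", "af", r4 >= 127)
    define(p, "p3", "at", r2 == 0)
    define(p, "p5", "of", r4 != 10)
    define(p, "p6", "uf", r4 != 10)
    define(p, "p5", "of", r4 != 32)
    define(p, "p5", "of", r4 != 9)
    return p["p3"], p["p5"], p["p6"]


def networks_agree(r4_values=range(-128, 256), r2_values=(0, 1, 2)):
    """True if both networks yield the same essential predicates."""
    return all(original(a, b) == minimized(a, b)
               for a in r4_values for b in r2_values)
```

Only the essential predicates p3, p5, and p6 are compared, since the non-essential predicates
no longer exist after minimization.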

The values of variables in the decision logic network are supplied by evaluating conditions on
predicate define instructions. It is important to recognize that these variables are not
necessarily independent, and that knowledge of the relationships between these variables can
allow for significant further optimization of the predicate define structure. Consider the
computation of p6 in Figure 8.11(a). Instruction 10 computes p6_uf = C3 <p4>. Logically, this
leads to the expression $p_6 = \overline{C_3}(C_0 + C_1)$, where $C_0 = (32 \geq r4)$,
$C_1 = (r4 \geq 127)$, and $C_3 = (r4 \neq 10)$. Here, since $\overline{C_3}$ implies $C_0$ and
excludes $C_1$, the expression for p6 can be simplified to $p_6 = \overline{C_3}$. In our
approach, the relationships between conditions are represented in a BDD, termed the condition
BDD, which can be queried to determine if logical implications exist between conditions and, if
so, what they are. The current implementation of this mechanism identifies "families" of
integer register-constant comparisons which are based on the same definition of a given
register. Then, within each family, a number line is created and divided into disjoint segments
from which the set of register values yielding a "TRUE" evaluation for any member condition can
be composed by union. Finally, the relationships between the comparisons are described in BDD
form using a finite domain technique [14]. Various elements of the optimizer query this BDD to
determine the inherent relationships between conditions, which are the decision network's input
variables.
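The number-line idea can be illustrated without a BDD. In the Python sketch below, which is a
simplified stand-in and not the finite domain technique of [14], the integer number line of one
comparison family is cut at each constant, every resulting segment is represented by one sample
value, and implication and exclusion reduce to subset and disjointness tests on sets of
segments. The encoding of a condition as an `(operator, constant)` pair on the shared register
and all function names are assumptions of this example.

```python
import operator

# Comparison of the family's register against a constant: register OP constant.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}


def segments(constants):
    """Cut the integer number line at each constant.

    Every disjoint segment is represented by one sample value: the region
    below the smallest constant, each constant itself, and each non-empty
    gap above a constant.
    """
    cs = sorted(set(constants))
    samples = [cs[0] - 1]
    for lo, hi in zip(cs, cs[1:] + [None]):
        samples.append(lo)
        if hi is None or hi - lo > 1:
            samples.append(lo + 1)
    return samples


def truth_set(cond, samples):
    """Segments (by sample value) on which the condition evaluates to TRUE."""
    op, const = cond
    return frozenset(s for s in samples if OPS[op](s, const))


def implies(a, b, samples):
    """a implies b if every segment satisfying a also satisfies b."""
    return truth_set(a, samples) <= truth_set(b, samples)


def excludes(a, b, samples):
    """a excludes b if no segment satisfies both."""
    return not (truth_set(a, samples) & truth_set(b, samples))
```

With the constants of the wc example, `segments([32, 127, 10])`, the condition `("==", 10)`
implies `("<=", 32)` and excludes `(">=", 127)`, which is the relationship used above to reduce
the expression for p6.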

Cycle   Instructions issued
0       op1 op2 op3 op12 op16
1       op4 op6 op13 op17
2       op5 op7
3       op8
4       op9 op10 op11
5       op14 op15 op18 op19
6       op20
7       op21 op22

(a) Schedule for the hyperblock in Figure 8.11(a).

Cycle   Instructions issued
[rows for cycles 0 and 1 are missing]
2       op5 op7 op8 op10 op19 op20
3       op14 op15 op18 op21 op22

(b) Schedule for the hyperblock in Figure 8.11(c).

Figure 8.12: Comparison of the static schedules for the wc hyperblock before and after logic
minimization.


The overall effectiveness of the program decision logic minimization process on the wc example
is best shown by comparing the schedules of the code before and after optimization. For
illustration purposes, a six-issue processor with no restrictions on the combination of
instructions that may simultaneously be issued is assumed. Furthermore, all instructions are
assumed to have a latency of one cycle. Figure 8.12 presents the schedules for the example
hyperblock before and after optimization. The instructions in bold correspond to the predicate
defines in each hyperblock. The schedule for the pre-optimization hyperblock (Figure 8.12(a))
is relatively sparse due to the sequentiality of the predicate defines. The overall schedule
length is eight cycles. The schedule after logic minimization is reduced by a factor of two.
The chain of predicate define instructions in the original hyperblock is replaced by a
parallel, more efficient computation in the optimized hyperblock. The reformulated hyperblock
requires only a single level of predicate defines to compute the essential predicates as
opposed to the five-level network used in the original code, yielding a significant increase in
performance.
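The schedule lengths can be reproduced with a simple cycle-by-cycle list scheduler. The Python
sketch below is a minimal model under the stated assumptions (six-issue, unit latency, no
restrictions on instruction mix); it is not the IMPACT scheduler. The dependence lists `BEFORE`
and `AFTER` are an interpretation of the register and predicate flow dependences in the two
hyperblocks and are assumptions of this example, as is the rule that the final jump (op 22) is
placed in the last cycle.

```python
def schedule_length(deps, issue_width=6, branch=22):
    """Greedy list scheduling with unit latencies.

    deps maps each operation number to the operations whose results it
    needs. The branch is excluded from scheduling and placed in the last
    cycle afterwards.
    """
    cycle_of = {}
    pending = sorted(op for op in deps if op != branch)
    cycle = 0
    while pending:
        # An operation is ready once all its producers issued in earlier cycles.
        ready = [op for op in pending
                 if all(cycle_of.get(d, cycle) < cycle for d in deps[op])]
        issued = ready[:issue_width]
        for op in issued:
            cycle_of[op] = cycle
        pending = [op for op in pending if op not in issued]
        cycle += 1
    last = max(cycle_of.values())
    in_last = sum(1 for c in cycle_of.values() if c == last)
    # The branch shares the last cycle if an issue slot is free there.
    return last + 1 if in_last < issue_width else last + 2


# Assumed flow dependences of the hyperblock before minimization.
BEFORE = {
    1: [], 2: [], 3: [], 4: [3], 5: [4], 6: [3],
    7: [1, 6], 8: [7], 9: [8], 10: [2, 8], 11: [8],
    12: [], 13: [12], 14: [9, 13], 15: [9],
    16: [], 17: [16], 18: [10, 17],
    19: [2, 11], 20: [19], 21: [10, 19, 20], 22: [],
}

# Assumed flow dependences after minimization (op 11 no longer exists).
AFTER = {
    1: [], 2: [], 3: [], 4: [3], 5: [4], 6: [3],
    7: [1, 6], 8: [1, 6], 9: [1], 10: [2, 6],
    12: [], 13: [12], 14: [7, 8, 9, 13], 15: [7, 8, 9],
    16: [], 17: [16], 18: [10, 17],
    19: [2, 6], 20: [2, 6], 21: [10, 19, 20], 22: [],
}
```

Under these assumptions, the chain 6, 7, 8, 11, 19, 20, 21 determines the length of the original
schedule, while after minimization all predicate defines depend only on the load of r4 and the
predicate initializations.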

Once the decision component has been optimized and reformulated back into the predicated
representation, further compiler transformations need to be performed. For machines without
real predication support, complete reverse if-conversion must be performed [76]. For machines
which support predication, partial reverse if-conversion can be employed to create the proper
balance of control flow and predication for the target architecture [8].

8.4.4 Minimization of Program Decision Logic

The previous subsection provided an overview of the process of program control height
minimization through the optimization of the predicate define network. This subsection
describes in detail the mechanisms by which the predicate define optimizer generates new
predicate define instructions to evaluate the program's essential predicate functions more
efficiently. The discussion assumes that the program's decision logic has been represented by
the predicate BDD and the condition BDD, and that Sum-Of-Products (SOP) expressions for the
essential predicates have been extracted as described in the previous subsection. Once the
program decision logic has been extracted, program control is optimized and re-expressed in
four steps. First, sum-of-products expressions are formed to represent predicate functions in
terms of program conditions. These expressions are then optimized using condition analysis and
traditional Boolean logic minimization techniques. The resulting optimized expressions are then
optionally factorized based on condition availability times and resource constraints. Finally,
program control is re-expressed in predicate define instructions, either in a two-level network
or in a multi-level network, depending on whether or not factorization was performed.

The generation of an efficient predicate define network begins with the extraction and
subsequent optimization of the sums-of-products for the predicate functions. Figure 8.13(b)
shows the expressions extracted for the essential predicates in the wc example, as well as the
conditions to which the variables in the expressions correspond. Figure 8.13(a) shows the
original predicate define network for reference. Since the control expressions are completely
represented by the predicate BDD in terms of conditions, the non-essential predicates are
eliminated from consideration. This process maps the predicate define structure, in this case
five stages of predicate define instructions, into a sum-of-products which can be synthesized
into a two-cycle sequence of predicate define instructions. However, this expression can
exhibit a large number of redundant and constant-FALSE products, and must be refined before use
in define regeneration. From Figure 8.13(b), two-level regeneration of the unoptimized
expressions of the wc example would require thirteen predicate defines in the first level and
six in the second, far more than the seven required in the initial network.

    p4_ot, p1_uf = (32 >= r4)
    p4_ot, p2_uf = (r4 >= 127)   <p1>
    p3_ut = (r2 == 0)            <p2>
    p5_of, p6_uf = (r4 != 10)    <p4>
    p7_ut = (r4 != 10)           <p4>
    p5_of, p8_ut = (r4 != 32)    <p7>
    p5_of = (r4 != 9)            <p8>

(a) Original predicate define structure.

    C0   (32 >= r4)
    C1   (r4 >= 127)
    C2   (r2 == 0)
    C3   (r4 != 10)
    C4   (r4 != 32)
    C5   (r4 != 9)

    $p_3 = \overline{C_0}\,\overline{C_1}C_2$
    $p_6 = C_0\overline{C_3} + \overline{C_0}C_1\overline{C_3}$
    $p_5 = C_0\overline{C_3} + \overline{C_0}C_1\overline{C_3} + C_0C_3\overline{C_4} + \overline{C_0}C_1C_3\overline{C_4} + C_0C_3C_4\overline{C_5} + \overline{C_0}C_1C_3C_4\overline{C_5}$

(b) Conditions and original predicate expressions.

    $p_3 = \overline{C_0}\,\overline{C_1}C_2$
    $p_6 = \overline{C_3}$
    $p_5 = \overline{C_3} + \overline{C_4} + \overline{C_5}$

(c) Optimized predicate expressions.

    ...
    p3_af = (32 >= r4)
    p3_af = (r4 >= 127)
    p3_at = (r2 == 0)
    p5_of, p6_uf = (r4 != 10)
    p5_of = (r4 != 32)
    p5_of = (r4 != 9)
    ...

(d) Optimized predicate define structure.

Figure 8.13: Example: optimization of wc predicate network.


    Simplify_funcs(func_list)
        simplified_func_list = Empty_list();
        FOREACH func IN func_list DO
            reduced_func = Reduce_using_condition_BDD(func);
            simplified_func = Minimize_SOP(reduced_func);
            List_append(simplified_func_list, simplified_func);
        RETURN simplified_func_list;

    Minimize_SOP(func)
        product_list = func.product_list;
        new_product_list = product_list;
        WHILE NOT List_empty(new_product_list) DO
            new_insertion_list = Empty_list();
            FOREACH product_x IN new_product_list DO
                FOREACH product_y IN product_list DO
                    consensus = Consensus(product_x, product_y);
                    IF consensus THEN
                        List_insert_last(new_insertion_list, consensus);
            product_list = List_append(product_list, new_insertion_list);
            new_product_list = new_insertion_list;
            product_list = Eliminate_subsumed_products(product_list);
        product_list = Select_covering_subset(product_list);
        RETURN product_list;

    Factorize(func_list, sched)
        factor_list = Empty_list();
        FOREACH func_x IN func_list DO
            FOREACH func_y IN func_list BEFORE func_x DO
                IF Factor_simplifies(func_y, func_x) THEN
                    IF Resource_constrained(func_x.id) THEN
                        IF NOT List_member(func_y, factor_list) THEN
                            List_insert_last(factor_list, func_y);
                        func_x = Factor_SOP(func_x, func_y);
        FOR cycle = sched.min_cycle TO sched.max_cycle DO
            FOREACH func IN func_list DO
                FOREACH product IN func DO
                    ready_prod = Ready_product(product, cycle);
                    match_prod = Match_term(ready_prod, factor_list);
                    IF match_prod THEN
                        ready_factor = match_prod;
                    ELSE
                        ready_factor = ready_prod;
                        ready_factor.id = Unique_token();
                        List_insert(factor_list, ready_factor);
                    Factor_term(product, ready_factor);
        List_insert_last(factor_list, func_list);
        Factor_common_disjoint_subexpr(factor_list, func_list);
        RETURN func_list, factor_list;

    Factor_common_disjoint_subexpr(factor_list, func_list)
        FOREACH func IN func_list DO
            product_factor_list = Extract_ready_products(func);
            fact_func = Find_factor(product_factor_list, func);
            IF fact_func THEN
                match_fact = Match_factor(fact_func, factor_list);
                IF NOT match_fact THEN
                    fact_func.id = Unique_token();
                    List_insert(factor_list, fact_func);
                    match_fact = fact_func;
                Factor_term(func, match_fact);

Figure 8.14: Pseudo-code for performing optimization of predicate expressions.


Optimization of predicate expressions. Predicate expressions are optimized in two steps, as
indicated in Figure 8.14 in the description of Simplify_funcs. First, expressions are reduced
using condition BDD information. Conditions which imply or exclude each other (e.g., (r1 < 4)
implies (r1 < 5) and excludes (r1 >= 7)) can cause predicate expressions to contain redundant
or constant-FALSE products, as well as redundant literals in useful products. These extraneous
features are removed in this phase. One such case from the benchmark wc was examined in
Section 8.4.3.

Once redundant and constant-FALSE products and literals have been removed from the predicate expressions, the iterative-consensus method is applied to produce a complete sum, and then to select a subset of prime implicants for a simplified two-level logic implementation [75]. Pseudo-code for this algorithm is shown in Figure 8.14 (Minimize_SOP). The heart of this iterative algorithm is the consensus-taking routine, which applies the Boolean theorem $x + \bar{x}y \rightarrow x + y$. After each pass through the product list, products subsumed (covered) by other products are removed. The iterative-consensus algorithm generates a complete sum for the input expression. Non-essential products can then be removed to generate a minimal covering sum.
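A textbook form of iterative consensus can be sketched in Python as follows. This is an illustration only: it is not the Minimize_SOP routine of Figure 8.14, it omits both the heuristic approximation and the selection of a covering subset, and the representation (a product as a frozenset of `(variable, polarity)` literals) and function names are assumptions made here.

```python
from itertools import combinations

def consensus(p, q):
    """Consensus of two products, or None when it is undefined.

    The consensus exists only if p and q oppose each other in exactly
    one variable; it is the union of the remaining literals.
    """
    opposed = [(v, s) for (v, s) in p if (v, not s) in q]
    if len(opposed) != 1:
        return None
    v, s = opposed[0]
    return frozenset((p | q) - {(v, s), (v, not s)})

def absorb(products):
    """Remove every product subsumed (covered) by a smaller product."""
    return {p for p in products if not any(q < p for q in products)}

def complete_sum(products):
    """Repeat consensus and absorption until no new product appears."""
    current = absorb(set(products))
    while True:
        new = set()
        for p, q in combinations(current, 2):
            c = consensus(p, q)
            # Keep the consensus only if no existing product covers it.
            if c is not None and not any(r <= c for r in current):
                new.add(c)
        if not new:
            return current
        current = absorb(current | new)

# x + x'y: the consensus of the two products is y, which absorbs x'y.
x = frozenset({("x", True)})
not_x_y = frozenset({("x", False), ("y", True)})
y = frozenset({("y", True)})
result = complete_sum([x, not_x_y])
```

For the input x + x'y the sketch returns the two products x and y, which is the reduction stated by the theorem above.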

In this application, the Boolean predicate expressions can be composed of a large number of variables and products (more than thirty in some instances), rendering a direct implementation of the iterative-consensus algorithm, which is exponential, intolerably slow. For this reason, when operating on large functions we apply a heuristic approximation to the iterative-consensus method. This heuristic dramatically decreases the number of intermediate products, and therefore keeps the compile time reasonable. Furthermore, using this heuristic, the selection of the minimal sum-of-products expression (covering subset), also ordinarily an expensive procedure, is reduced to a linear form.

The cost of this heuristic is that the result could be suboptimal, which could cause the generation of expressions with more predicate define instructions than necessary. Depending on the order in which the comparisons are made, the heuristic may eliminate some products that are necessary to generate other simpler products. To minimize this problem, the heuristic includes a step which sorts the products in order to reduce the likelihood of a non-optimal solution.

Figure 8.13(c) shows the expressions to which the essential predicates of the wc example

are reduced in the logic optimization phase. These expressions are both less complex and more

parallel than the original functions.

Two-level predicate synthesis. Following optimization of the predicate expressions, the control logic can be synthesized most intuitively as a two-level predicate define network which directly evaluates the minimized sum-of-products expression. In this approach, two levels of predicate define instructions are used for each predicate. The first level consists of and-type predicate defines of the form pi at = Ci ⟨T⟩, where one predicate pi is defined for each product term in the predicate expression, and T is the TRUE predicate, which always has the value 1. The second level consists of or-type predicate defines of the form pj ot = (condT) ⟨pi⟩, where there is one such predicate define for each product (pi) and condT is an invariant TRUE condition (e.g. (0 == 0)). Thus, a predicate expression having L literals and M products consumes M + 1 predicates and performs L + M predicate assignments. Continuing the wc example in Figure 8.13(d), note that the two special cases of two-level predicate synthesis occur, in which the computation of functions containing a single product and functions that are disjunctions of single-literal products can be performed in a single cycle. Note also that predicates which have products in common can share intermediate predicates, allowing for some savings through reuse. In most cases, however, two-level synthesis generates an enormous number of predicate define instructions, since redundancy between products is not reduced. Furthermore, since the evaluation of such a predicate define network usually takes at least two cycles after the last condition becomes available (one for the and-level and one for the or-level), the result may also be suboptimal in latency, even when scheduled for infinite issue. Results demonstrating both these phenomena are presented in Section 8.4.6. Clearly, a more sophisticated technique is required.
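The predicate and instruction counts of two-level synthesis can be reproduced with a short Python sketch. It is only an illustration: the function name `two_level_defines`, the naming of intermediate predicates, and the textual instruction format are assumptions, and only uncomplemented literals are handled (none of the special single-cycle cases is applied).

```python
def two_level_defines(target, products):
    """Emit a two-level predicate define network for a sum of products.

    `products` is a list of products, each a list of condition names.
    Returns (defines, predicates): the emitted instructions as strings
    and the set of predicate registers they write.
    """
    defines = []
    predicates = {target}
    for i, product in enumerate(products):
        intermediate = f"{target}_prod{i}"  # hypothetical register name
        predicates.add(intermediate)
        # First level: one and-type define per literal of the product.
        for cond in product:
            defines.append(f"{intermediate} at = {cond} <T>")
        # Second level: one or-type define per product.
        defines.append(f"{target} ot = (0 == 0) <{intermediate}>")
    return defines, predicates

# p2 = C0C1C3 + C0C2C3 has L = 6 literals and M = 2 products.
defines, predicates = two_level_defines(
    "p2", [["C0", "C1", "C3"], ["C0", "C2", "C3"]]
)
```

For this expression the sketch emits L + M = 8 defines and uses M + 1 = 3 predicates, in agreement with the counts given above.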

Factorization. In the example of the previous section, the code sample from wc exhibited a large ratio of control height to computation height, and the computation was nearly completely dependent on the outcome of the decision mechanisms. Thus, it was important to compress the height of the entire decision structure as much as possible, as any reduction in the decision height improved performance. Furthermore, since the predicate conditions were strongly related, the resulting predicate define structure actually reduced the predicate and predicate define count. In many other situations, however, predicates are based on more independent conditions and the number of predicate define instructions required to generate a two-level network may be quite large. Factorization seeks to use the code's computation or datapath height to hide some portions of the decision latency which are not on the critical path. Thus, the optimizer is free to focus on reducing implementation size rather than delay when implementing these non-critical sections, saving valuable predicate registers and instruction issue resources.

The factored generation method determines how much factoring can be performed at no cost. The availability times of conditions and the time at which predicate values are needed by the computation component drive the factorizer. If parallel computation height, rather than predicate define height, is the critical path through the code segment, then it is beneficial to perform factorization instead of full expression flattening.

To measure the availability times of conditions and the time at which predicate values are needed, a special version of the code is scheduled. This version of the code has all the predicate dependences between predicate defines removed. For each condition, a predicate destination is added for each predicate whose function depends on that condition. In the resultant code, predicate define instructions are placed as early in the schedule as their condition availability will allow. Also, all uses of a predicate are placed as early as possible, but after all the conditions which may be needed to compute it. By extracting the issue time of these predicate defines and predicate uses, the amount of time the new predicate network has to compute predicates without performance penalty is ascertained. This information is then used together with the previously extracted predicate expressions in later stages of optimization.

With factorization, the goal is to form intermediate predicates as the conditions to compute them become available, and then to reuse these intermediate predicates in the computation of the essential predicates. This activity factors the optimized sum-of-products expression or its products so that the resulting define structure may take more cycles, but can reuse more intermediate predicates, thus saving predicate defines and predicate registers.

In certain cases, when resource utilization is very high and predicate functions are very complex, factorization becomes critical for performance. In some cases, generation of code which would optimally generate the predicate results on an infinitely wide machine could actually degrade performance in a real machine due to excessive width. In these situations, an additional factorization preprocessing stage is applied, in which predicates are selectively factored on subexpressions available in essential predicates generated earlier in the original code. This activity, shown in lines 2 through 8 of Factorize in Figure 8.14, has the effect of moderating the restructuring of control in cases where reordering of the predicate expressions would generate a define network too wide for the target architecture.

Figure 8.15 shows an example extracted from the function cofactor of the 008.espresso benchmark. The minimal sum-of-products is computed for each of the final predicates, as shown in Figure 8.15(a). Next, with the help of the predicate use times and condition availability from Figures 8.15(a) and 8.15(b), all useful predicates are factorized, and common expressions are shared. Figure 8.15(c) shows the result of this method. This factoring results in the reduction of the number of predicate define instructions from 37 to 13. Furthermore, the useful predicates (p1 and p2) are available a single cycle after the last condition is evaluated, sooner than would be possible using a two-level synthesis of the predicate expressions, which would deliver them two cycles after the last condition evaluation.

Pred   Expression                             Use Cycle
p1     C0C2C4C5 + C0C2C3C5 + C0C1C5           6
p2     C0C2C4C5C6 + C0C2C3C5C6 + C0C1C5C6     7

(a) Optimized predicate expressions.

Condition      C0   C1   C2   C3   C4   C5   C6
Availability    1    1    2    3    4    5    6

(b) Condition availability.

Time   Predicate expression
1      p3 ut = C0
       p4 at = C0
       p4 at = C1
2      p5 ut = C2
       p6 ut = C2 ⟨p3⟩
3      p7 ut = C3 ⟨p6⟩
4      p8 ut = C4 ⟨p6⟩
5      p1 of = C5 ⟨p7⟩
       p1 of = C5 ⟨p8⟩
       p1 of = C5 ⟨p4⟩
6      p2 ut = C6 ⟨p1⟩

(c) Factoring with schedule time information.

Figure 8.15: Factorized predicate define optimization.

In the direct sum-of-products conversion, the computation of p1 and p2 begins at cycle 5 and cycle 6, respectively, at the availability time of their latest conditions; results are available two cycles later. With the factorization method, however, predicates p1 and p2 can be evaluated in a single cycle after the availability of C5 and C6. Thus, in some cases, the factorization method is able to reduce predicate latency by one cycle compared to the result of the direct sum-of-products conversion.
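The timing argument for this example can be written out as a small Python sketch using the availability and use cycles of Figure 8.15. The names `budget` and `stall`, and the modeling of each synthesis method by a single latency figure (two cycles for two-level synthesis, one cycle for the factored network), are simplifying assumptions made for illustration.

```python
# Condition availability (Figure 8.15(b)) and predicate use cycles
# (Figure 8.15(a)).
AVAILABLE = {"C0": 1, "C1": 1, "C2": 2, "C3": 3, "C4": 4, "C5": 5, "C6": 6}
USE_CYCLE = {"p1": 6, "p2": 7}
CONDITIONS = {
    "p1": ["C0", "C1", "C2", "C3", "C4", "C5"],
    "p2": ["C0", "C1", "C2", "C3", "C4", "C5", "C6"],
}

TWO_LEVEL_LATENCY = 2  # and-level plus or-level after the last condition
FACTORED_LATENCY = 1   # single define after the last condition

def last_condition(pred):
    """Cycle in which the latest condition of `pred` becomes available."""
    return max(AVAILABLE[c] for c in CONDITIONS[pred])

def budget(pred):
    """Cycles available to compute `pred` without delaying its first use."""
    return USE_CYCLE[pred] - last_condition(pred)

def stall(pred, latency):
    """Cycles by which a network of the given latency delays the use."""
    return max(0, latency - budget(pred))

stalls = {
    pred: (stall(pred, TWO_LEVEL_LATENCY), stall(pred, FACTORED_LATENCY))
    for pred in USE_CYCLE
}
```

Under these assumptions both predicates have a budget of one cycle, so the two-level network delays each use by one cycle while the factored network does not.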


8.4.5 Architecture Support for Synthesis

Pred   Expression           Use Cycle
p1     C1 + C2              3
p2     C0C1C3 + C0C2C3      4

(a) Optimized predicate expressions.

Condition      C0   C1   C2   C3
Availability    1    2    2    3

(b) Condition availability.

Time   Predicate define
1      p2 ^t = C0
2      p1 ot = C1
       p1 ot = C2
3      p2 ^t = C3 ⟨p1⟩

(c) Factorization with conjunctive-type predicate defines.

Time   Predicate define
1      p3 at = C0
       p4 at = C0
2      p1 ot = C1
       p1 ot = C2
       p3 at = C1
       p4 at = C2
3      p3 at = C3
       p4 at = C3
4      p2 ot = TRUE ⟨p3⟩
       p2 ot = TRUE ⟨p4⟩

(d) No factorization.

Time   Predicate define
1      p2 at = C0
2      p1 ot = C1
       p1 ot = C2
       p3 af = C1
       p3 af = C2
3      p2 at = C3
       p2 af = TRUE ⟨p3⟩

(e) Factorization without conjunctive-type predicate defines.

Figure 8.16: Various methods of predicate expression regeneration.

The description of predicate optimization in the previous sections has disregarded the means by which Boolean expressions are converted back into predicate defining instructions. This section examines the instruction set considerations that evolved in supporting an effective predicate synthesis system.

Implementation of two-level predicate synthesis is straightforward in the HPL PlayDoh predicate architecture. For example, in Figures 8.11 and 8.13(c), a simple sum-of-products expression is converted into a small set of predicate defines.

Synthesis of multi-level factored functions is not as simple as that of product-of-sums or sum-of-products expressions, but yields significant improvements in both performance and predicate define count. When an expression is factored out of one or more predicate expressions, its value is computed and stored in a predicate for later use. After factoring, expressions to be synthesized thus contain predicates as well as conditions. To illustrate the use of factoring, the example in Figure 8.16 is presented. In Figure 8.16(a), predicate p1 is a subexpression of p2. Factoring C1 + C2, or p1, out of p2 allows more sharing of predicate defines between predicate computations. As can be seen in Figure 8.16(b), C1 and C2 both become available in cycle 2, so this subexpression can be computed in cycle 2 using or-type predicate defines. The availability of this expression before the computation of p2 allows an efficient application of factorization. In cycle 3, the conjunction of the subexpression stored in p1 with the previous value of p2 and C3 is required. This expression is awkward to compute using the PlayDoh predicate define semantics because the logical combination of predicates is not directly supported. With the extension to the PlayDoh predicate define semantics, this expression can be computed with a single conjunctive-type predicate define. Figure 8.16(c) shows the final set of predicate defines used to compute the factored predicate expressions. The two expressions are computed using a total of two predicates and four predicate defines. The last predicate define conjoins p1 and C3 to the previous contents of p2 (C0) to finish the computation of the p2 expression.

Benefit of Architectural Extension. The primary use of the conjunctive-type predicate defines is to reduce the number of instructions required to compute factored expressions. This reduction is best illustrated when the generation of the predicate expressions is done without the conjunctive type. Figures 8.16(d) and 8.16(e) show two generation options that do not use the conjunctive type. In Figure 8.16(d), no factorization is performed and the direct sum-of-products expressions are computed. This approach requires a total of ten predicate defines, six more instructions than were required in Figure 8.16(c). Further, the two-level nature of the sum-of-products generation adds an extra level of dependence height. In Figure 8.16(e), factorization is performed, but the conjunctive type is not used. Here, a total of seven predicate defines, three extra instructions, is necessary. Of these, two predicate defines are needed to compute the complement of the factored expression. This is done by applying DeMorgan's theorem. Another method of complementing p1 could have been used, but it would have cost a cycle of latency. The third extra predicate define is used to nullify p2 if the complement of the factored predicate is TRUE. Note that the disjunctive-type predicate defines are analogously useful when product-of-sums expressions are used.
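The three instruction sequences of Figure 8.16(c), (d), and (e) can be checked against each other with a small simulator. The following Python sketch is not the thesis's tooling. It assumes simplified define semantics (stated in the comments), assumes that or-type destinations start at 0 and all other destinations start at 1, and executes the defines sequentially in schedule order. The tag `ct` is a name chosen here for the conjunctive type written ^t in the figure.

```python
from itertools import product

def run(defines, conds):
    """Execute (dest, kind, cond, guard) defines; return predicate values."""
    preds = {"T": True}
    for dest, kind, cond, guard in defines:
        c = True if cond == "TRUE" else conds[cond]
        g = preds[guard]
        if kind == "ot":      # or-type: set dest when guard and cond hold
            preds.setdefault(dest, False)
            if g and c:
                preds[dest] = True
        elif kind == "at":    # and-type: clear dest when guard holds, cond fails
            preds.setdefault(dest, True)
            if g and not c:
                preds[dest] = False
        elif kind == "af":    # and-type on the complemented condition
            preds.setdefault(dest, True)
            if g and c:
                preds[dest] = False
        elif kind == "ct":    # conjunctive: clear dest unless guard and cond hold
            preds.setdefault(dest, True)
            if not (g and c):
                preds[dest] = False
        else:
            raise ValueError(f"unknown define type: {kind}")
    return preds

FIG_C = [("p2", "ct", "C0", "T"), ("p1", "ot", "C1", "T"),
         ("p1", "ot", "C2", "T"), ("p2", "ct", "C3", "p1")]
FIG_D = [("p3", "at", "C0", "T"), ("p4", "at", "C0", "T"),
         ("p1", "ot", "C1", "T"), ("p1", "ot", "C2", "T"),
         ("p3", "at", "C1", "T"), ("p4", "at", "C2", "T"),
         ("p3", "at", "C3", "T"), ("p4", "at", "C3", "T"),
         ("p2", "ot", "TRUE", "p3"), ("p2", "ot", "TRUE", "p4")]
FIG_E = [("p2", "at", "C0", "T"), ("p1", "ot", "C1", "T"),
         ("p1", "ot", "C2", "T"), ("p3", "af", "C1", "T"),
         ("p3", "af", "C2", "T"), ("p2", "at", "C3", "T"),
         ("p2", "af", "TRUE", "p3")]

def matches_figure_a(defines):
    """Check p1 = C1 + C2 and p2 = C0C1C3 + C0C2C3 for all 16 inputs."""
    for values in product([False, True], repeat=4):
        conds = dict(zip(["C0", "C1", "C2", "C3"], values))
        preds = run(defines, conds)
        want_p1 = conds["C1"] or conds["C2"]
        want_p2 = conds["C0"] and conds["C3"] and want_p1
        if preds["p1"] != want_p1 or preds["p2"] != want_p2:
            return False
    return True
```

Under these assumptions all three sequences compute the expressions of Figure 8.16(a), using four, ten, and seven predicate defines respectively.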

8.4.6 Experimental Results

The effectiveness of the Boolean minimization techniques for generating predicated code is evaluated in this section. These techniques have been implemented within the IMPACT experimental compiler framework and applied to a set of benchmarks.

Processor Model and Benchmarks. The processor modeled is an 8-issue processor with

in-order execution and register interlocking. The processor has no limitation on the combination

of instructions that may be issued each cycle, except that only one branch may be executed per

cycle. The instruction latencies assumed match those of the HP PA-7100 microprocessor. The

instruction set contains a set of non-trapping versions of all potentially excepting instructions,

with the exception of branch and store instructions, to support aggressive speculative execution.

The instruction set also contains support for predicated execution as described in Section 8.1.

The execution time for each benchmark was obtained using the IMPACT emulation-driven simulator. Some dynamic effects such as branch mispredictions, cache misses, and TLB misses were not measured. This decision was made to ensure that the experimental results highlight the effects of the techniques being evaluated. Since the reformulation of the predicate decision logic does not affect the basic nature of memory access patterns and branch histories, any change in these dynamic effects between the original and optimized codes would be spurious in nature.

The benchmarks used in this experiment consist of 13 non-numeric programs: four of the SPECINT 92 benchmarks, 008.espresso, 022.li, 026.compress, 072.sc; six of the SPECINT 95 benchmarks, 099.go, 124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg; and three UNIX utilities, cccp, lex, wc.

Results. The first set of results presented compares the performance of a code set transformed with the described techniques to the performance of a baseline code set. The baseline code consists of the best code generated by the IMPACT compiler for a predicated architecture using hyperblock compilation techniques. The transformed code corresponds to the baseline hyperblock code after Boolean minimization techniques are used to restructure the predicate defines, and after the code is rescheduled. Performance is derived by computing the ratio of the execution cycle count for the baseline code to that of the transformed code. The performance is examined at two levels, first at the overall benchmark level and then at the benchmark function level.

[Plot: speedup (axis from 1.00 to 1.30, with one value labeled 1.37) for 008.espresso, 022.li, 026.compress, 072.sc, 099.go, 124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg, cccp, lex, and wc; two series: 8-issue and 8-issue, 256-preds.]

Figure 8.17: Speedup from minimization of program decision logic.

The overall benchmark speedups are presented in Figure 8.17. For each benchmark, two results are reported. The first is the benchmark speedup on the target architecture. The unweighted average speedup for all the benchmarks is 1.13. For some benchmarks, such as 022.li, 026.compress, 129.compress, and wc, the program decision height was significantly limiting performance throughout the most frequently executed portions of the code; when this height is reduced by our techniques, speedups of around 1.2 are achieved.
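The speedup metric and the unweighted average used here can be stated precisely in a few lines of Python. The function names and the cycle counts in the example are hypothetical and serve only to show how the unweighted average differs from a ratio of total cycle counts.

```python
def speedup(baseline_cycles, transformed_cycles):
    """Ratio of baseline to transformed execution cycle counts."""
    if transformed_cycles <= 0:
        raise ValueError("cycle counts must be positive")
    return baseline_cycles / transformed_cycles

def unweighted_average(speedups):
    """Arithmetic mean in which every benchmark counts equally."""
    return sum(speedups) / len(speedups)

# Hypothetical (baseline, transformed) cycle counts for two benchmarks.
runs = [(1200, 1000), (5000, 5000)]
per_benchmark = [speedup(b, t) for b, t in runs]
mean_speedup = unweighted_average(per_benchmark)
# A cycle-weighted figure would instead divide the summed cycle counts.
total_speedup = speedup(sum(b for b, _ in runs), sum(t for _, t in runs))
```

In this made-up case the unweighted average is 1.1, whereas the ratio of total cycles is lower because the longer benchmark shows no gain.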

The second result presented for each benchmark, labeled "8-issue, 256-preds," is the speedup on a hypothetical machine capable of issuing eight non-predicate-define instructions and up to 256 predicate defines per cycle. The significance of the second set of numbers is that they reflect only the dependence height of predicate defines, while eliminating their resource consumption characteristics. These results suggest a logical upper bound for gains possible with more effective factorization techniques. In most benchmarks, the optimizer produced a number of predicate defines that was appropriate for the schedule and machine model. However, in four benchmarks, 008.espresso, cccp, 126.gcc, and lex, the optimizer was unable to balance height reduction with resource consumption and performance was penalized. This effect was very dramatic in 008.espresso because it is very decision height limited. Unfortunately, the excessive optimization opportunity available in 008.espresso allowed the current minimization heuristic to be overly aggressive in reducing height. With more advanced factorization techniques, the number of predicate defines could be reduced in these instances, more closely approximating the "8-issue, 256-preds" results.

Overall, the full benchmark results are encouraging. In most cases, the benefit of our technique was limited solely by the bottleneck created by program computation height. During our experimental exploration, we observed that as optimizations which target computation height were improved, the decision logic became dominant and relative speedups improved. In particular, data and memory dependences seemed to hide much of the program decision height reduction in many important hyperblocks. As the various components of compiler technology mature, the overall effectiveness of Boolean minimization will improve.

                              Original    Two-Level Synthesis      Factored Synthesis
Benchmark, Function             #pdi      #pdi   S(∞)   S(8)      #pdi   S(∞)   S(8)
008.espresso, essen_parts         39      1293   1.29   0.39        49   1.24   1.16
022.li, xleval                    48       485   1.07   0.66        80   1.10   1.10
022.li, mark                      42        67   1.48   1.48        53   1.50   1.48
026.compress, compress            60       456   1.20   1.03       221   1.23   1.23
072.sc, update                   141       240   1.15   1.15       159   1.23   1.23
099.go, getefflibs                98      1083   1.06   0.98       204   1.07   1.07
124.m88ksim, execute              41        47   1.12   1.12        40   1.12   1.12
124.m88ksim, goexec              176       175   1.10   1.09       155   1.09   1.08
124.m88ksim, load_data            42        54   1.30   1.30        53   1.30   1.30
124.m88ksim, loadmem              84        88   1.13   1.13        84   1.13   1.13
126.gcc, invalidate               89       202   1.27   1.24       125   1.22   1.21
126.gcc, flow_analysis            64        92   1.77   1.69        58   1.86   1.86
126.gcc, canon_hash               89       149   1.88   1.20       116   1.90   1.74
129.compress, compress            63       154   1.21   1.21        98   1.26   1.26
130.li, mark                      55       148   1.15   1.14       101   1.19   1.19
132.ijpeg, forward_DCT            31        47   1.46   1.35        32   1.46   1.43
cccp, skip_if_group              157       208   1.23   1.05       190   1.32   1.24
lex, cgoto                       236       330   1.31   1.10       260   1.18   1.14
wc, main                          56        48   1.22   1.31        48   1.22   1.22

Table 8.4: Speedup and predicate define count for selected functions.

To better understand the effect program decision logic minimization has on complete programs, we measured the performance and code size characteristics of a number of selected functions. Table 8.4 examines the performance of one or more functions from each of the benchmarks. These functions were chosen based on two criteria: significant program execution time and potential for optimization (e.g., the control height was significant relative to the computation height). The table compares the effectiveness of two strategies for program logic transformation: two-level predicate synthesis and factorization. For each strategy, the static number of predicate define instructions, the performance gain on an 8-issue processor with unconstrained predicate define resources (∞), and the performance gain on the 8-issue processor are reported. In addition, the static number of predicate define instructions in the code before minimization is reported.

From the table, the two-level synthesis approach shows mixed results. For the unconstrained machine, the reduction in height translates directly into large speedups. However, the unconstrained performance does not always translate into the same performance gain on the 8-issue processor. This is most pronounced in 008.espresso, essen_parts, where the 1.29 speedup is sharply reduced to 0.39. The primary reason for this behavior is the large increase in the number of predicate define instructions. The predicate defines that are created oversaturate the processor resources and result in loss of performance. Correspondingly, when the number of predicate defines is not increased by a large amount, the unconstrained performance does indeed translate directly into performance on the 8-issue processor. Clearly, factored synthesis is necessary for successful optimization of program decision logic.

As shown in the table, the factored approach yields both larger and more consistent speedups. Both methods reduce the predicate computation height, but the factored approach dramatically reduces the number of predicate defines required for the optimization. The function 126.gcc, canon_hash provides a good example of this behavior. Both methods achieve good speedup for the unconstrained processor. However, the two-level synthesis approach requires 149 predicate defines to accomplish the improvement. For the 8-issue processor, most of the performance gain is lost due to this increase in instructions. The factored approach reduces the number of predicate defines to 116, increasing the 8-issue speedup to 1.74. The number of predicate defines is still more than the original 89. Note, however, that simply increasing the number of predicate defines from the original code is not necessarily viewed as a negative. Boolean minimization approaches do this systematically to improve performance by identifying condition subexpressions that can be computed early. This allows the final predicate to be made available as soon as possible after the final condition is ready. However, the factored approach is consistently more effective because it factors predicate expressions into multiple-level structures which are less demanding of processor resources than two-cycle evaluations. Another interesting result is that for some functions, such as update from 072.sc, the factored synthesis method outperforms the two-level method, even at infinite issue. This is due to the ability of the factorizer to generate expressions in one cycle rather than the two usually required by the two-level synthesis approach.

The final experiment examines the effectiveness of the new predicate types (conjunctive and disjunctive, described in Section 8.1) in the context of Boolean minimization and justifies the need for the proposed architectural extensions. Table 8.5 presents the effects of the new predicate define types on the speedup for an 8-issue processor, the dynamic predicate define count, and the static predicate define count. The conjunctive and disjunctive types allow certain important logical combinations of predicates and conditions to be expressed more efficiently. For all functions except 022.li, mark and 130.li, mark, the performance gained from the program decision logic optimization is diminished when the proposed predicate define types are not available. Further, in six of the nineteen functions, the performance improvement is converted into a performance loss. The most dramatic example of this is 126.gcc, flow_analysis, in which an 86% performance improvement (speedup of 1.86) becomes a 7% performance degradation (speedup of 0.93). The lack of the new predicate define types in the target architecture also causes a code size penalty. In general, the additional predicate types allow significant reductions in both the static and dynamic predicate define counts. In one case, 74% more predicate defines are required if the new types are not available. Six functions do not exhibit this penalty. In these functions, the majority of the predicate expressions are sums of single-term "products", making the conjunctive type unnecessary for instantiating these functions.
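The counts quoted in this discussion can be recomputed directly from the speedup columns of Table 8.5. The Python sketch below only transcribes those two columns (speedup on the 8-issue processor with and without the new predicate define types); the variable names are chosen here for illustration.

```python
# (function, speedup with new types, speedup without), from Table 8.5.
TABLE_8_5 = [
    ("008.espresso, essen_parts", 1.16, 0.96),
    ("022.li, xleval", 1.10, 1.08),
    ("022.li, mark", 1.48, 1.48),
    ("026.compress, compress", 1.23, 1.13),
    ("072.sc, update", 1.23, 0.98),
    ("099.go, getefflibs", 1.07, 1.06),
    ("124.m88ksim, execute", 1.12, 0.89),
    ("124.m88ksim, goexec", 1.08, 0.90),
    ("124.m88ksim, load_data", 1.30, 1.07),
    ("124.m88ksim, loadmem", 1.13, 1.02),
    ("126.gcc, invalidate", 1.14, 0.77),
    ("126.gcc, flow_analysis", 1.86, 0.93),
    ("126.gcc, canon_hash", 1.74, 1.60),
    ("129.compress, compress", 1.26, 1.10),
    ("130.li, mark", 1.19, 1.19),
    ("132.ijpeg, forward_DCT", 1.43, 1.33),
    ("cccp, skip_if_group", 1.24, 1.20),
    ("lex, cgoto", 1.14, 1.07),
    ("wc, main", 1.22, 1.16),
]

# Functions whose speedup is unaffected by removing the new types.
unchanged = [name for name, w, wo in TABLE_8_5 if w == wo]
# Functions for which the optimization becomes a slowdown.
losses = [name for name, w, wo in TABLE_8_5 if wo < 1.0]
# Function that loses the largest fraction of its speedup.
worst = min(TABLE_8_5, key=lambda row: row[2] / row[1])[0]
```

The lists contain two unchanged functions and six functions that turn into a loss, out of nineteen, and the largest relative drop occurs for 126.gcc, flow_analysis.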

8.5 Conclusion

This chapter presented the potential benefits of introducing predicated execution into embedded processors, in terms of both control-flow optimization and code size.

The proposed prefix-based predicated execution architecture framework has the potential to significantly enhance the effectiveness of introducing predicated execution into embedded microprocessors. For regions of non-predicated code, the prefix-based method offers better code density characteristics than traditional models of predication support. For predicated regions, the prefix-based method offers performance improvement over an architecture without predication support. It was illustrated that an optimizing compiler can enhance the prefix-based predication model by performing aggressive instruction merging and predicate promotion to reduce the number of predicated instructions by 30%. Overall, prefix-based predication achieves a 12% performance improvement for code created with superscalar optimization and reduces code size by 25%.

Also, a new method for optimizing programmatic control flow was presented. This approach provides a systematic methodology for reformulating program control flow for more efficient execution on ILP processors. Control expressed through branches and predicate defines is extracted and represented as a program decision logic network. Boolean minimization


                                 Speedup (8)        Pred. Def. Count Penalty w/o ^t/^f
Benchmark, Function              with    without    dynamic    static
008.espresso, essen_parts        1.16    0.96       17.2%      17.8%
022.li, xleval                   1.10    1.08       35.4%      35.0%
022.li, mark                     1.48    1.48       11.5%      11.3%
026.compress, compress           1.23    1.13       59.8%      60.2%
072.sc, update                   1.23    0.98        4.3%       5.0%
099.go, getefflibs               1.07    1.06       17.1%      21.1%
124.m88ksim, execute             1.12    0.89       16.9%      10.0%
124.m88ksim, goexec              1.08    0.90        6.3%       6.5%
124.m88ksim, load_data           1.30    1.07       15.3%      11.3%
124.m88ksim, loadmem             1.13    1.02       74.1%      14.3%
126.gcc, invalidate              1.14    0.77       30.3%      22.4%
126.gcc, flow_analysis           1.86    0.93        0.1%       0.0%
126.gcc, canon_hash              1.74    1.60       11.4%      10.5%
129.compress, compress           1.26    1.10       53.4%      35.7%
130.li, mark                     1.19    1.19       18.2%      17.8%
132.ijpeg, forward_DCT           1.43    1.33        0.0%       0.0%
cccp, skip_if_group              1.24    1.20       16.8%      14.2%
lex, cgoto                       1.14    1.07        4.7%      10.8%
wc, main                         1.22    1.16        4.2%       4.2%

Table 8.5: Effects of conjunctive-type predicate defines on speedup and instruction count.

techniques are applied to the network both to reduce dependence height and to simplify the

component expressions. Redundancy is controlled by employing a schedule-sensitive factoriza-

tion technique to identify intermediate logical combinations of conditions that can be shared.

After optimization, the network is reformulated into predicated code.

An extension to the HPL PlayDoh model of predication that allows more efficient computation of the predicate expressions produced by the minimization techniques, namely the conjunctive and disjunctive predicate assignment types, was also presented. Experimental results show that in blocks of predicated code with significant control height, the application of logic minimization techniques together with these architectural enhancements provides substantial performance benefit. Across the benchmarks studied, program decision logic minimization provided an average overall speedup of 1.13 for an 8-issue processor. The new predicate assignment types were also shown to significantly reduce the number of predicate define instructions required. As compiler technology progresses to make more extensive and effective use of predicated code, minimization of program decision logic is likely to become an increasingly important part of total program optimization.

Chapter 9

Conclusion

This thesis investigated the benefits of synergistic hardware-compiler ILP architectures for low-power processors. New solutions were proposed to integrate multiple-issue pipelines into mobile architectures, and a detailed analysis of the defined systems was carried out.

Chapter 2 gave a brief survey of the major ILP techniques that are used to enhance

performance. The main concepts of ILP were introduced and several architectures that exploit

ILP were described. Furthermore, the compiler support for such architectures was presented.

Chapter 3 strongly motivated the use of parallelism for the design of an energy-efficient microprocessor. First, it gave an introduction to power consumption in CMOS circuits. Then, several metrics and their meaning were described. Finally, the effect of parallelism on such metrics was explained.

Chapter 4 then gave an overview of the state of the art in low-power 32-bit mobile processors. It pointed out that, for embedded processors, ILP is generally exploited only through pipelining techniques and that, surprisingly, VLIW architectures have not yet been introduced in the mobile processor market, even though their inherent simplicity can offer low power consumption and improved performance compared to scalar architectures.

Chapter 5 described a high-level evaluation of the benefits of VLIW for low-power processors. It was demonstrated, through the use of high-level power consumption estimates, that the introduction of VLIW architectures into low-power embedded 8-bit or 16-bit microcontrollers yields a significant improvement of the energy efficiency during inner loop execution.

Motivated by these experimental results, Chapter 6 proposed a new VLIW architecture
called DEVIL that targets the low-power mobile processor market. DEVIL includes a new fetch
mechanism that explicitly encodes the parallelism within an instruction bundle and supports
a variable-length instruction mechanism. It was demonstrated that this mechanism allows savings
of up to 50% in code size compared to a standard VLIW fetch mechanism while keeping
performance unchanged. This is an important result since cost, a central concern in embedded
systems, depends directly on code size. Furthermore, this fetch mechanism allows a significant
reduction of the memory traffic (approximately 16%), proportionally decreasing the power
consumption required for instruction fetches. The effect of superscalar optimizations on
performance was also investigated. It was shown that superscalar optimizations are required
to achieve good performance levels, implying a large increase in code size (58% on average) due
to code duplication (e.g., tail duplication). The effects of code size expansion are minimized
through the compaction technique offered by the DEVIL architecture. Note that the compiler
was not tuned to minimize code size and that DEVIL's partial predication support was not
used, meaning that the code size can be reduced much further. In terms of performance, DEVIL
speeds up execution by a factor of 1.5 on average compared to a scalar processor.

This performance enhancement allows lower clock frequencies and power supply voltages, thus

reducing the circuit's power consumption.
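
The code size argument made for the fetch mechanism can be illustrated with a small sketch. It is not DEVIL's actual encoding: it assumes a hypothetical two-issue machine in which all operations have the same size, a conventional VLIW format that pads every bundle to two slots with NOPs, and a compact format that stores only the useful operations. The function names are illustrative.

```python
# Hypothetical model: a schedule is a list of bundles, each holding the
# operations issued in one cycle on a 2-issue machine (1 or 2 operations).

ISSUE_WIDTH = 2  # assumption: at most two operations per cycle


def fixed_vliw_size(schedule):
    """Slots stored when every bundle is padded to ISSUE_WIDTH with NOPs."""
    return len(schedule) * ISSUE_WIDTH


def compact_size(schedule):
    """Slots stored when only the useful operations are kept in memory."""
    return sum(len(bundle) for bundle in schedule)


def code_size_saving(schedule):
    """Fraction of the fixed-format size saved by the compact format."""
    fixed = fixed_vliw_size(schedule)
    return (fixed - compact_size(schedule)) / fixed


if __name__ == "__main__":
    schedule = [["add", "ld.32"], ["test.lt"], ["jt_nt"], ["sub", "st.32"]]
    print(f"saving: {code_size_saving(schedule):.0%}")
```

In this simplified model the saving is bounded by 50%, a bound reached only when no bundle contains two operations.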

In order to obtain accurate estimates of DEVIL's features in terms of complexity, circuit
speed, and power consumption, Chapter 7 described an implementation of the DEVIL processor.
Using this implementation, estimates of the circuit's complexity, speed, and power
consumption were computed, so as to complete the evaluation of the benefits of VLIW
architectures for low-power processors. In terms of circuit speed, DEVIL runs at 50 MHz, which is
quite slow for a 0.25 µm technology. This is due to the synthesis methodology, as well as the lack
of resources to optimize DEVIL's datapath. The complexity of DEVIL was estimated at
around 125,000 transistors, which makes DEVIL a simple circuit that should have a small
die area; this shows that the simplicity of VLIW architectures is well suited to embedded
systems. It was also shown that the dispatch unit introduced to handle the variable instruction
length increases the circuit complexity by only 4%, which is negligible compared to the
demonstrated benefits of such a mechanism. Based on the estimated features of the design
and on the overhead introduced by the VLIW architecture, it was shown that parallelism
improves the energy efficiency by about 38% on average. This gives DEVIL the
attractive ability to execute code at the same speed as a scalar processor while
consuming much less power.
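
The trade-off between speed and power can be made concrete with the usual first-order model of CMOS dynamic power, P ~ C * V^2 * f. The sketch below is only illustrative: the capacitance overhead and the assumption that the supply voltage can be lowered in proportion to the clock frequency are hypothetical inputs, not the estimates obtained in Chapter 7.

```python
def relative_power(speedup, cap_overhead=0.0, scale_voltage=False):
    """Dynamic power of a parallel design relative to a scalar one,
    at equal throughput, using the first-order model P ~ C * V^2 * f.

    speedup       -- factor by which parallelism reduces the cycle count
    cap_overhead  -- extra switched capacitance of the parallel design
                     (hypothetical fraction, e.g. 0.2 for +20%)
    scale_voltage -- if True, assume V can be lowered in proportion to f
                     (a simplification; real voltage scaling is more limited)
    """
    f = 1.0 / speedup                # clock needed for the same throughput
    v = f if scale_voltage else 1.0  # relative supply voltage
    c = 1.0 + cap_overhead           # relative switched capacitance
    return c * v * v * f


if __name__ == "__main__":
    # Speed-up of 1.5 with a hypothetical 20% capacitance overhead.
    print(relative_power(1.5, cap_overhead=0.2))
    print(relative_power(1.5, cap_overhead=0.2, scale_voltage=True))
```

The model only shows the direction of the effect: a lower clock frequency reduces power linearly, and any accompanying voltage reduction reduces it quadratically.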

These results clearly validate the benefits of VLIW architectures for low-power mobile
processors. The major drawback is the code size penalty caused by the use of high-level languages
and superscalar optimizations. Chapter 8 made a first step toward predicated execution for
embedded processors and proposed new solutions to optimize the code size and the control flow.

A prefix-based predicated execution architecture framework was proposed that has
the potential to significantly enhance the effectiveness of introducing predicated execution into
embedded microprocessors. Overall, prefix-based predication achieves a 12% performance
improvement for code created with superscalar optimizations and reduces code size by 25%. Also,
a new method for optimizing a program's control flow was presented. This approach provides
a systematic methodology for reformulating program control flow for more efficient execution
on ILP processors. Control expressed through branches and predicate defines is extracted and
represented as a program decision logic network. Boolean minimization techniques are applied
to the network both to reduce dependence height and to simplify the component expressions.
Redundancy is controlled by employing a schedule-sensitive factorization technique to identify
intermediate logical combinations of conditions that can be shared. After optimization, the
network is reformulated into predicated code.
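
One effect of the minimization, the reduction of dependence height, can be illustrated on a conjunction of branch conditions. The sketch below is not the algorithm presented in Chapter 8: it only compares, for a hypothetical predicate p = c1 AND c2 AND ... AND cn, the height of a sequential evaluation chain with that of a balanced tree, and checks that both orders compute the same value.

```python
from functools import reduce
from itertools import product


def chain_and(conds):
    """Evaluate c1 & c2 & ... sequentially; returns (value, height)."""
    value = reduce(lambda acc, c: acc and c, conds)
    return value, len(conds) - 1


def tree_and(conds):
    """Evaluate the same conjunction as a balanced tree; returns (value, height)."""
    if len(conds) == 1:
        return conds[0], 0
    mid = len(conds) // 2
    left_value, left_height = tree_and(conds[:mid])
    right_value, right_height = tree_and(conds[mid:])
    return left_value and right_value, 1 + max(left_height, right_height)


if __name__ == "__main__":
    # Both evaluation orders agree on every assignment of 8 conditions.
    for assignment in product([False, True], repeat=8):
        conds = list(assignment)
        assert chain_and(conds)[0] == tree_and(conds)[0]
    print(chain_and([True] * 8)[1], tree_and([True] * 8)[1])
```

On a machine wide enough to evaluate independent operations in parallel, the balanced form needs a number of dependent steps that grows logarithmically, rather than linearly, with the number of conditions.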

An extension to the HPL PlayDoh model of predication that allows more efficient
computation of the predicate expressions produced by the minimization techniques, namely the
conjunctive and disjunctive predicate assignment types, was also presented. Experimental
results show that, in blocks of predicated code with significant control height, the application of
logic minimization techniques together with these architectural enhancements provides a
substantial performance benefit. Across the benchmarks studied, program decision logic
minimization provided an average overall speed-up of 1.13 for an 8-issue processor. The new predicate
assignment types were also shown to significantly reduce the number of predicate define
instructions required. As compiler technology progresses to make more extensive and effective use of
predicated code, the minimization of program decision logic is likely to become an increasingly
important part of the total program optimization.

Future Work

During this thesis, a first prototype of a synergistic processor-compiler system was built. This
prototype made it possible to demonstrate the benefits of ILP architectures for low-power mobile
processors. However, the road is long before the fabrication of a commercial product can become
feasible.

DEVIL's design is currently a prototype and needs many more time-consuming design
optimizations. The current design estimates will help to direct power consumption and critical
path reduction optimizations. Such optimizations should result in lower power consumption
and higher circuit speed, giving DEVIL a very attractive energy-efficiency/performance
ratio. A place and route of the circuit should then be carried out to obtain accurate transistor-level
estimates.

The current implementation of DEVIL suffers from a low circuit speed of 50 MHz. Even
with further optimizations, the maximum speed will be limited by the 3-stage pipeline. An
interesting step would be to loosen this constraint by increasing DEVIL's pipeline depth from
3 to 5 stages, allowing the processor to reach much higher clock frequencies. However, the
increase in complexity in terms of branch prediction and operand bypassing should be carefully
considered. According to the estimates extracted from the DEVIL implementation, a 5-stage pipeline
would have good potential to improve the energy efficiency of DEVIL.

As stated at the beginning of this work, DEVIL is to be used in synergy with a compiler.
All the benefits of the DEVIL architecture rely on the quality of the code generated by the
compiler (remember that the parallelism is extracted at compile time). The current DEVIL
compiler is based on a new back-end for the IMPACT compiler, allowing it to generate
high-performance code. However, current optimizations incur a code size penalty that can affect the
total system cost, and therefore an effort should be made to optimize the IMPACT back-end
for DEVIL to generate smaller code while keeping the performance at the same level.

Currently, not all the features offered by DEVIL are supported by the back-end. For example,
DEVIL has a partial predication support that can lead to a significant code size reduction
with no performance loss. This feature can be used, for example, to avoid some tail duplications
during superblock formation. Also, the possibility of having operations that shift one of their
operands at no cost is not yet supported, but could also increase the code density.

DEVIL implements a very simple instruction set, allowing a fast and low-power pipeline.
However, the lack of complex instructions, such as multiply and accumulate or count leading
zeros, sometimes results in a severe performance degradation. As DEVIL uses a synthesis
methodology, an interesting approach would be to reserve some of the available opcodes for
instruction set specialization. The idea is to extend DEVIL to detect these definable instructions
and to send them to a dedicated datapath. The processor would then be resynthesized with the new
functionalities. This addition of an extra unit should be easy thanks to the modularity of the
VLIW architecture.

There are two ways to exploit this specialization mechanism. The first, and probably the
simplest one, is to have a library of coprocessor units that implement dedicated functions in
hardware, along with the compiler support to generate code using these new instructions. An
embedded system designer, knowing the requirements of the targeted application, could then
customize DEVIL and its compiler using the predefined library elements. Once the system
is tested and simulated, a new chip could be implemented with the adapted functionalities.

The second method is much more challenging and goes in the direction of software-hardware
codesign. The idea is to let the compiler decide how DEVIL's instruction set should be
customized. Once the decision is made, the compiler should generate the assembly code including
the custom instructions as well as the VHDL code that implements the custom functional unit.

In conclusion, ILP helps to improve the energy efficiency of low-power embedded processors.
The results obtained with our first prototype are very satisfactory and many improvements
are not only possible, but also fairly simple to introduce, motivating further investigations and
developments in this direction.

Appendix A

DEVIL's Instruction Set Summary

A.1 Function Definitions

<c>        The operation can be conditionally executed on T or not(T)
SHF(op)    The operand op can be shifted; all the shifter functionalities are available
Sext(op)   The operand op is sign extended
Zext(op)   The operand op is extended with zeros
Oext(op)   The operand op is extended with ones
Zend(op)   The operand op is ended with zeros
Oend(op)   The operand op is ended with ones
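
The extension functions can be sketched in a few lines of code. The sketch assumes 32-bit registers, and it interprets "ended with zeros/ones" as placing the operand in the upper bits of the word and filling the lower bits; both points are assumptions made for illustration, based on how Zend and Oend are used by the .h instructions of Table A.2.

```python
WORD = 32                      # assumption: 32-bit registers
MASK = (1 << WORD) - 1


def zext(op, bits):
    """Zero-extend the low `bits` bits of op to a full word."""
    return op & ((1 << bits) - 1)


def sext(op, bits):
    """Sign-extend the low `bits` bits of op to a full word."""
    value = op & ((1 << bits) - 1)
    if value >> (bits - 1):            # sign bit set: fill upper bits with ones
        value |= MASK & ~((1 << bits) - 1)
    return value


def oext(op, bits):
    """Extend op with ones: all upper bits are set."""
    return zext(op, bits) | (MASK & ~((1 << bits) - 1))


def zend(op, bits):
    """Place op in the upper bits; lower bits are zeros (assumed meaning)."""
    return (zext(op, bits) << (WORD - bits)) & MASK


def oend(op, bits):
    """Place op in the upper bits; lower bits are ones (assumed meaning)."""
    return zend(op, bits) | ((1 << (WORD - bits)) - 1)


if __name__ == "__main__":
    for fn in (zext, sext, oext, zend, oend):
        print(fn.__name__, hex(fn(0x8234, 16)))
```

Under this interpretation, an AND with Oext(imm:16) can only clear bits in the lower half of a register, and an AND with Oend(imm:16) can only clear bits in the upper half.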

A.2 Arithmetical Operations

15-bit version                    30-bit version                      Description

subi rsd, imm:5                   subi rd, rs, imm:16                 subtract immediate (unsigned)
  rsd = rsd - imm:5                 rd = rs - imm:16
addi rsd, imm:5                   addi rd, rs, imm:16                 add immediate (unsigned)
  rsd = rsd + imm:5                 rd = rs + imm:16
shli rsd, imm:5                   shli rd, rs, imm:5                  logical shift left
  rsd = rsd << imm:5                rd = rs << imm:5                  <c>
shri rsd, imm:5                   shri rd, rs, imm:5                  logical shift right
  rsd = rsd >> imm:5                rd = rs >> imm:5                  <c>
ashri rsd, imm:5                  ashri rd, rs, imm:5                 arithmetic shift right
  rsd = rsd >> imm:5                rd = rs >> imm:5                  <c> signed
sub rsd, rs                       sub rd, rs1, rs2                    subtract registers
  rsd = rsd - rs                    rd = SHF(rs1) - rs2               <c>
add rsd, rs                       add rd, rs1, rs2                    add registers
  rsd = rsd + rs                    rd = SHF(rs1) + rs2               <c>
subsp imm:8                       subsp imm:20                        subtract immediate value from SP
  SP = SP - imm:8                   SP = SP - imm:20                  unsigned
addsp imm:8                       addsp imm:20                        add immediate value to SP
  SP = SP + imm:8                   SP = SP + imm:20                  unsigned
shl rsd, rs                       shl rd, rs1, rs2                    logical shift left
  rsd = rsd << rs                   rd = rs1 << rs2                   <c>
shr rsd, rs                       shr rd, rs1, rs2                    logical shift right
  rsd = rsd >> rs                   rd = rs1 >> rs2                   <c>
ashr rsd, rs                      ashr rd, rs1, rs2                   arithmetic shift right
  rsd = rsd >> rs                   rd = rs1 >> rs2                   <c> signed
neg rd, rs                        neg rd, rs                          two's complement
  rd = -rs                          rd = -rs                          <c>
cast.8 rd, rs                     cast.8 rd, rs                       cast to 8-bit unsigned
  rd = Zext(rs.8)                   rd = Zext(rs.8)                   <c>
cast.16 rd, rs                    cast.16 rd, rs                      cast to 16-bit unsigned
  rd = Zext(rs.16)                  rd = Zext(rs.16)                  <c>
ext.8 rd, rs                      ext.8 rd, rs                        byte (8-bit) sign extension
  rd = Sext(rs.8)                   rd = Sext(rs.8)                   <c>
ext.16 rd, rs                     ext.16 rd, rs                       word (16-bit) sign extension
  rd = Sext(rs.16)                  rd = Sext(rs.16)                  <c>
nop                               nop                                 no operation

Table A.1: DEVIL's arithmetical instructions

A.3 Logical Operations


15-bit version                    30-bit version                      Description

-                                 ori.l rd, rs, imm:16                logical OR
                                    rd = rs | Zext(imm:16)
-                                 andi.l rd, rs, imm:16               logical AND
                                    rd = rs & Oext(imm:16)
-                                 xori.l rd, rs, imm:16               logical XOR
                                    rd = rs xor Zext(imm:16)
-                                 ori.h rd, rs, imm:16                logical OR
                                    rd = rs | Zend(imm:16)
-                                 andi.h rd, rs, imm:16               logical AND
                                    rd = rs & Oend(imm:16)
-                                 xori.h rd, rs, imm:16               logical XOR
                                    rd = rs xor Zend(imm:16)
or rsd, rs                        or rd, rs1, rs2                     logical OR
  rsd = rsd | rs                    rd = SHF(rs1) | rs2               <c>
xor rsd, rs                       xor rd, rs1, rs2                    logical XOR
  rsd = rsd xor rs                  rd = rs1 xor rs2
and rsd, rs                       and rd, rs1, rs2                    logical AND
  rsd = rsd & rs                    rd = SHF(rs1) & rs2               <c>
not rd, rs                        not rd, rs                          bit inversion
  rd = not(rs)                      rd = not(rs)                      <c>

Table A.2: DEVIL's logical instructions


A.4 Compare Operations


15-bit version                    30-bit version                      Description

testi.eq rs, imm:5                testi.eq rs, imm:20                 test if equal
  T = (rs == Sext(imm:5))           T = (rs == Sext(imm:20))          signed
testi.lt rs, imm:5                testi.lt rs, imm:20                 test if less
  T = (rs < Sext(imm:5))            T = (rs < Sext(imm:20))           signed
testi.le rs, imm:5                testi.le rs, imm:20                 test if less or equal
  T = (rs <= Sext(imm:5))           T = (rs <= Sext(imm:20))          signed
testi.sm rs, imm:5                testi.sm rs, imm:20                 test if smaller
  T = (rs < Zext(imm:5))            T = (rs < Zext(imm:20))           unsigned
testi.ss rs, imm:5                testi.ss rs, imm:20                 test if smaller or equal
  T = (rs <= Zext(imm:5))           T = (rs <= Zext(imm:20))          unsigned
btest rs, imm:5                   btest rs, imm:5                     bit test
  T = rs(bit(imm:5))                T = rs(bit(imm:5))
test.eq rs1, rs2                  test.eq rs1, rs2                    test if equal
  T = (rs1 == rs2)                  T = (SHF(rs1) == rs2)             <c>
test.lt rs1, rs2                  test.lt rs1, rs2                    test if less
  T = (rs1 < rs2)                   T = (SHF(rs1) < rs2)              <c> signed
test.le rs1, rs2                  test.le rs1, rs2                    test if less or equal
  T = (rs1 <= rs2)                  T = (SHF(rs1) <= rs2)             <c> signed
test.sm rs1, rs2                  test.sm rs1, rs2                    test if smaller
  T = (rs1 < rs2)                   T = (SHF(rs1) < rs2)              <c> unsigned
test.ss rs1, rs2                  test.ss rs1, rs2                    test if smaller or equal
  T = (rs1 <= rs2)                  T = (SHF(rs1) <= rs2)             <c> unsigned

Table A.3: DEVIL's compare instructions
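
The difference between the signed tests (lt, le) and the unsigned tests (sm, ss) only shows when the most significant bit of an operand is set. The following sketch illustrates this on raw register patterns; the 32-bit register width and the function names are assumptions made for the example.

```python
WORD = 32  # assumption: 32-bit registers holding raw bit patterns
MASK = (1 << WORD) - 1


def to_signed(value):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    value &= MASK
    return value - (1 << WORD) if value >> (WORD - 1) else value


def flag_lt(rs1, rs2):
    """T flag produced by test.lt: signed 'less than'."""
    return to_signed(rs1) < to_signed(rs2)


def flag_sm(rs1, rs2):
    """T flag produced by test.sm: unsigned 'smaller than'."""
    return (rs1 & MASK) < (rs2 & MASK)


if __name__ == "__main__":
    a, b = 0xFFFFFFFF, 0x00000001   # a is -1 when read as a signed value
    print(flag_lt(a, b), flag_sm(a, b))
```

For the pattern 0xFFFFFFFF compared with 1, the signed test sets T (since -1 < 1) while the unsigned test clears it.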

A.5 Move Operations


15-bit version                    30-bit version                      Description

mov rd, rs                        mov rd, rs                          move register
  rd = rs                           rd = rs
ldi rd, imm:6                     ldi rd, imm:20                      move immediate
  rd = Sext(imm:6)                  rd = Sext(imm:20)                 signed
cmovt rd, rs                      cmovt rd, rs                        conditional move
  rd = rs if T                      rd = rs if T
cmovnt rd, rs                     cmovnt rd, rs                       conditional move
  rd = rs if not(T)                 rd = rs if not(T)
movt rd                           movt rd                             move T flag to reg
  rd = Zext(T)                      rd = Zext(T)
movnt rd                          movnt rd                            move not(T) flag to reg
  rd = Zext(not(T))                 rd = Zext(not(T))
mov2mac mac, rs                   mov2mac rs                          move reg to macro register
  mac = rs                          mac = rs
movmac rd, mac                    movmac rd, mac                      move macro register to reg
  rd = mac                          rd = mac
ion                               ion                                 enable interrupts
ioff                              ioff                                disable interrupts

Table A.4: DEVIL's move instructions
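
The conditional moves are the basis of partial predication: a short if-then construct can be executed without a branch. The following sketch interprets a few instructions with the semantics listed in Tables A.3 and A.4 in order to compute the maximum of two registers. The interpreter, the register names and the tuple representation of instructions are illustrative assumptions and not part of the DEVIL tool chain; the shifter option and conditional execution are ignored, and register values are plain integers.

```python
def run(program, regs):
    """Execute a list of (opcode, operand, ...) tuples; returns the registers."""
    regs = dict(regs)
    t_flag = False
    for op, *args in program:
        if op == "test.lt":            # T = (rs1 < rs2), signed
            t_flag = regs[args[0]] < regs[args[1]]
        elif op == "mov":              # rd = rs
            regs[args[0]] = regs[args[1]]
        elif op == "cmovt":            # rd = rs if T
            if t_flag:
                regs[args[0]] = regs[args[1]]
        elif op == "cmovnt":           # rd = rs if not(T)
            if not t_flag:
                regs[args[0]] = regs[args[1]]
        else:
            raise ValueError(f"unsupported opcode: {op}")
    return regs


# r3 = max(r1, r2) without any branch
MAX_PROGRAM = [
    ("mov", "r3", "r1"),
    ("test.lt", "r1", "r2"),
    ("cmovt", "r3", "r2"),
]

if __name__ == "__main__":
    print(run(MAX_PROGRAM, {"r1": 3, "r2": 7, "r3": 0})["r3"])
```

The same computation written with a branch would need a conditional jump around the second move; the conditional move removes that jump at the cost of always executing three operations.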


A.6 Branch Operations


15-bit version                    30-bit version                      Description

jt_nt disp:10                     jt_nt disp:25                       jump if T, nullify next if taken
jt_nn disp:10                     jt_nn disp:25                       jump if T, nullify next if not taken
jnt_nt disp:10                    jnt_nt disp:25                      jump if not(T), nullify next if taken
jnt_nn disp:10                    jnt_nn disp:25                      jump if not(T), nullify next if not taken
jmp disp:10                       jmp disp:25                         unconditional jump
jsr disp:10                       jsr disp:25                         jump subroutine, save PC
jt_nt rs                          jt_nt rs                            jump if T, nullify next if taken
jt_nn rs                          jt_nn rs                            jump if T, nullify next if not taken
jnt_nt rs                         jnt_nt rs                           jump if not(T), nullify next if taken
jnt_nn rs                         jnt_nn rs                           jump if not(T), nullify next if not taken
jmp rs                            jmp rs                              unconditional jump
jsr rs                            jsr rs                              jump subroutine, save PC
ret                               ret                                 return from subroutine
reti                              reti                                return from interrupt

Table A.5: DEVIL's branch instructions

A.7 Data Memory Operations


15-bit version                    30-bit version                      Description

ld.8u rd, rs                      ld.8u rd, rs1, rs2                  load unsigned byte
  rd = Zext(mem.8[rs])              rd = Zext(mem.8[rs1+rs2])
ld.8 rd, rs                       ld.8 rd, rs1, rs2                   load signed byte
  rd = Sext(mem.8[rs])              rd = Sext(mem.8[rs1+rs2])
ld.16u rd, rs                     ld.16u rd, rs1, rs2                 load unsigned half word
  rd = Zext(mem.16[rs])             rd = Zext(mem.16[rs1+rs2])
ld.16 rd, rs                      ld.16 rd, rs1, rs2                  load signed half word
  rd = Sext(mem.16[rs])             rd = Sext(mem.16[rs1+rs2])
ld.32 rd, rs                      ld.32 rd, rs1, rs2                  load word
  rd = mem.32[rs]                   rd = mem.32[rs1+rs2]
st.8 rs1, rs2                     st.8 rs1, rs2, rs3                  store byte
  mem.8[rs1] = rs2                  mem.8[rs1+rs2] = rs3
st.16 rs1, rs2                    st.16 rs1, rs2, rs3                 store half word
  mem.16[rs1] = rs2                 mem.16[rs1+rs2] = rs3
st.32 rs1, rs2                    st.32 rs1, rs2, rs3                 store word
  mem.32[rs1] = rs2                 mem.32[rs1+rs2] = rs3
ld.8u rd, imm:5                   ld.8 rd, rs1, imm:16                load signed byte
  rd = Zext(mem.8[SP+imm:5])        rd = Sext(mem.8[rs1+imm:16])
ld.8 rd, imm:5                    ld.16u rd, rs1, imm:16              load unsigned half word
  rd = Sext(mem.8[SP+imm:5])        rd = Zext(mem.16[rs1+imm:16])
ld.16u rd, imm:5                  ld.16 rd, rs1, imm:16               load signed half word
  rd = Zext(mem.16[SP+2*imm:5])     rd = Sext(mem.16[rs1+2*imm:16])
ld.32 rd, imm:5                   ld.32 rd, rs1, imm:16               load word
  rd = mem.32[SP+4*imm:5]           rd = mem.32[rs1+4*imm:16]
st.8 rs1, rs2                     st.8 rs1, rs1, rs3                  store byte
  mem.8[SP+imm:5] = rs3             mem.8[rs1+imm:16] = rs3
st.16 rs1, rs2                    st.16 rs1, rs1, rs3                 store half word
  mem.16[SP+2*imm:5] = rs2          mem.16[rs1+2*imm:16] = rs3
st.32 rs1, rs2                    st.32 rs1, rs1, rs3                 store word
  mem.32[SP+4*imm:5] = rs2          mem.32[rs1+4*imm:16] = rs3

Table A.6: DEVIL's load/store instructions
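
The SP-relative forms of Table A.6 scale the 5-bit immediate by the access size. The sketch below computes the resulting effective address; it assumes that the immediate is unsigned, which the table does not state, and the function name is hypothetical.

```python
def sp_relative_address(sp, imm5, access_bytes):
    """Effective address of a 15-bit SP-relative load/store.

    The 5-bit immediate is scaled by the access size (1, 2 or 4 bytes),
    following the SP+imm:5, SP+2*imm:5 and SP+4*imm:5 forms of Table A.6.
    Assumption: the immediate is unsigned.
    """
    if not 0 <= imm5 < 32:
        raise ValueError("imm:5 must fit in 5 bits")
    if access_bytes not in (1, 2, 4):
        raise ValueError("access size must be 1, 2 or 4 bytes")
    return sp + access_bytes * imm5


if __name__ == "__main__":
    # Largest offset reachable by a 32-bit access under this assumption.
    print(sp_relative_address(0x1000, 31, 4) - 0x1000)
```

Under this assumption a word access reaches offsets from 0 to 124 bytes above SP, which is the kind of range needed to address local variables in a small stack frame with a short instruction.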



15-bit version                    30-bit version                      Description

-                                 ld.8 rd, label:15                   load signed byte
                                    rd = Sext(mem.8[label:15])
-                                 ld.16u rd, label:15                 load unsigned half word
                                    rd = Zext(mem.16[label:15])
-                                 ld.16 rd, label:15                  load signed half word
                                    rd = Sext(mem.16[label:15])
-                                 ld.32 rd, label:15                  load word
                                    rd = mem.32[label:15]
-                                 st.8 rs, label:15                   store byte
                                    mem.8[label:15] = rs
-                                 st.16 rs, label:15                  store half word
                                    mem.16[label:15] = rs
-                                 st.32 rs, label:15                  store word
                                    mem.32[label:15] = rs

Table A.7: DEVIL's load/store instructions (second part)

Jean-Michel Puiatti

Personal Data

Born July 29, 1971 in Geneva, Switzerland. Single.

Citizenships: Swiss, Spanish, Italian.

Work: Logic Systems Laboratory, Swiss Federal Institute of Technology,
IN-Ecublens, CH-1015 Lausanne, Switzerland.
Phone: +41-21-693-6630, Fax: +41-21-693-3705,
E-mail: Jean-Michel.Puiatti@epfl.ch, http://lslwww.epfl.ch/~puiatti

Home: Maisonneuve 12D, CH-1219 Châtelaine, Switzerland.
Phone: +41-22-796-2908

Education

1995-1999 Swiss Federal Institute of Technology, Lausanne, Switzerland.
Ph.D. Candidate, Computer Science.
Thesis title: Instruction-level parallelism for low-power processors.
In collaboration with the Centre Suisse d'Electronique et de
Microtechnique (CSEM SA).

1991-1995 Swiss Federal Institute of Technology, Lausanne, Switzerland.
Diploma in Computer Engineering.

1986-1991 Engineering High School, Geneva, Switzerland.
Graduated with honors in June 1991 in Electrical Engineering.

Work Experience

1995-present Swiss Federal Institute of Technology, Lausanne, Switzerland.
Logic Systems Laboratory, Computer Science Department.
Research and Teaching Assistant in digital system design and computer
architecture.

Apr. 98-Sep. 98 University of Illinois at Urbana-Champaign, USA.
IMPACT group, Center for Reliable and High-Performance Computing.
Implementation of compiler optimization techniques for parallel predicated
code.

Aug. 97-Oct. 97 Universitat Politècnica de Catalunya, Barcelona, Spain.
Computer Architecture Department (DAC).
Implementation of performance and energy consumption estimators for the
parallel execution of loops in a VLIW CoolRISC architecture.

Mar. 96-Jul. 96 Centro Nacional de Microelectrónica, Barcelona, Spain.
Study of a distributed autonomous sensor system.

Grants and proposals

1999 Benefits of EPIC architecture for multimedia applications.
Contributed to securing a three-year project funded by Hewlett Packard.

1996-1999 Instruction-level parallelism for low-power processors.
Obtained a three-year grant from the Centre Suisse d'Electronique et de
Microtechnique for research on high-performance low-power processors.

Languages

French - native
Spanish - excellent
English - good
Italian - good
German - basic

Hobbies

Soccer, squash, rock climbing, photography, classical and electric guitar.

Publications

D. A. Connors, J.-M. Puiatti, D. I. August, K. M. Crozier, and W. W.

Hwu. An Architecture Framework for Introducing Predicated Execution

into Embedded Processors, to appear in Proceedings of Euro-Par, Septem-

ber 1999.

D. I. August, J. W. Sias, J.-M. Puiatti, S. A. Mahlke, D. A. Connors, K.

M. Crozier, and W. W. Hwu. The Program Decision Logic Approach to

Predicated Execution, 26th International Symposium on Computer Archi-

tecture, May 1999.

G. Ritter, J.-M. Puiatti, E. Sanchez. Leonardo and discipulus simplex: An
Autonomous, Evolvable Six-Legged Walking Robot, Reconfigurable Architectures
Workshop (RAW'99), 13th International Parallel Processing Symposium
& 10th Symposium on Parallel and Distributed Processing, San Juan
(Puerto Rico), April 1999.

J.-M. Puiatti, C. Piguet, E. Sanchez, J. Llosa. Low-Power VLIW Proces-

sors: A High-Level Evaluation, 8th International Workshop on Power and

Timing Modeling, Optimization and Simulation (PATMOS'98), Lyngby

(Copenhagen-Denmark), October 1998.

J.-M. Puiatti, E. Sanchez, C. Piguet, J. Llosa, VLIW Architectures for Low-

Power Processors: A First Evaluation, 24th European Solid-State Circuits

Conference (ESSCIRC'98), The Hague (Netherlands), September 1998.

E. Mosanya, J.-M. Puiatti, E. Sanchez. Hardware Implementation of Generalized
Profile Search on the GENSTORM Machine, IEEE Symposium on
FPGAs for Custom Computing Machines (FCCM'98), Napa Valley (CA,
USA), April 1998.
