Instruction-Level Parallelism for Low-Power
Embedded Processors
THESIS No. 2110
Presented to the Department of Computer Science
ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE
for the degree of Docteur ès sciences techniques
by
Jean-Michel Puiatti
Computer Science Engineer
of the École Polytechnique Fédérale de Lausanne, Switzerland
presented to the jury:
Prof. Eduardo Sánchez, thesis director
Prof. Christian Piguet, co-examiner
Prof. Wen-mei Hwu, co-examiner
Prof. Alain Wegmann, co-examiner
Lausanne, EPFL
1999
Abstract
In recent years, the market for special-purpose devices designed for advanced applications
has grown at a tremendous rate. As a result, the demand for embedded microprocessors,
a necessary component of these devices, is stronger than ever. The nature of devices such as
Personal Digital Assistants (PDAs), mobile phones, printers, and networking equipment requires
that these embedded processors meet high performance levels while simultaneously satisfying
strong constraints on power consumption and cost.
Instruction-Level Parallelism (ILP) is one of the major forces increasing the performance
of high-end workstation processors. Such ILP architectures are highly complex and exhibit
a large amount of power dissipation. However, parallelism is also a well-known power-saving
technique that can be used to improve the energy efficiency of a system. ILP can thus be a
very attractive technique for embedded processors that require increased performance at a low
energy consumption.
This work focuses on the design of synergistic hardware-compiler ILP architectures, such
as EPIC or VLIW machines, for low-power embedded processors. Such synergism minimizes the
hardware overhead of multiple-issue pipelines, while maintaining the performance benefits of
ILP. Introducing parallelism into a processor drastically alters its architecture. To understand
and quantify how such modifications can reduce or nullify the expected benefits, and also to
assess where the tradeoffs should be made, a new EPIC-like low-power processor, DEVIL, is
proposed. Its implementation is the subject of a detailed experimental evaluation.
DEVIL includes a fetch mechanism that supports variable instruction lengths and allows
the compiler to explicitly encode parallelism within an instruction bundle. It will be shown that
this mechanism allows savings of 50% on average in the code size with respect to a standard
VLIW fetch mechanism while keeping performance unchanged.
DEVIL, with its 2-issue pipeline, achieves a speed-up of 1.5 on average compared to a
1-issue processor. This performance enhancement allows DEVIL to work at a lower voltage and
a lower clock frequency while keeping the same level of performance as a scalar processor. It
will be demonstrated that DEVIL can execute a task at the same speed as a scalar processor
while consuming approximately 38% less energy.
ILP architectures generally suffer from a large amount of code expansion. This negative
effect is reduced thanks to DEVIL's instruction fetch mechanism. However, DEVIL still suffers
from a code size penalty compared to a scalar processor. To counter this penalty, a
step is made towards predication techniques. It will be shown that full predication support
with an adequate instruction fetch mechanism makes it possible to generate parallel code that
is 12% faster and 25% smaller.
Abridged Version
The strong growth of the embedded systems market in recent years has created a significant
need for high-performance processors subject to severe constraints on power consumption
and cost. These embedded processors are found in mobile phones, personal digital
assistants (PDAs), printers, and computer networking equipment.
Instruction-level parallelism is one of the main techniques for increasing the performance
of workstation processors. Such circuits are, however, highly complex, and their power
dissipation is very high. Parallelism is also a technique for reducing the energy consumption
of a circuit. This dual property makes instruction-level parallelism very attractive for
embedded processors that require a high level of performance and low power consumption.
This thesis focuses on the design of low-energy parallel architectures that offer a synergy
between the compiler and the hardware, such as EPIC and VLIW processors. This
compiler-processor interaction moves much of the complexity of superscalar architectures
into the compiler while preserving the same level of performance. However, introducing
parallelism into a processor's datapath substantially alters its architecture. To understand
and quantify the repercussions of these modifications, we developed a new EPIC-type
processor, called DEVIL, whose implementation was the subject of a detailed analysis.
DEVIL includes a special fetch unit that makes it possible to encode the parallelism of an
instruction bundle and that supports variable-length instructions. This mechanism yields
code that is 50% more compact than that of a standard VLIW processor while maintaining
the same level of performance.
DEVIL can execute up to two instructions in parallel on each clock cycle. This feature
increases performance by an average factor of 1.5 relative to a scalar processor. This gain
compensates for the loss in execution speed caused by the low-frequency, low-voltage
operating modes required for low power consumption. We show that DEVIL executes tasks
at the same speed as a scalar processor while consuming 38% less energy.
Processors that exploit instruction-level parallelism generally suffer from an increase in
code size. This undesirable effect is limited by the fetch mechanism included in DEVIL.
DEVIL's code size nevertheless remains larger than that of a scalar processor. Predicated
execution (or conditional execution) is one solution to this problem. The results of our
work establish that, given predicated execution and an adequate fetch unit, it is possible
to generate code that is 12% faster and 25% smaller.
Acknowledgments
Throughout this work I was able to count on the support of many people. I would like to
say THANK YOU to them with all my heart.
First of all, I would like to express all my gratitude to my family, who have always been at
my side and who passed on to me their love, their madness, and their joie de vivre.
Charo, thank you for giving me all your love during these last four years and for never
hesitating to support me and stand by me in the most difficult moments.
This thesis would not have existed without the contribution of several people who allowed
me to work on this fascinating subject. Eduardo, my thesis director, thank you for
guiding me, for always listening to my problems, and for sharing with me your culture
and your passion, music. Daniel, thank you for welcoming me into your laboratory and for
giving me so much freedom. I especially thank the Centre Suisse d'Electronique et
de Microtechnique (CSEM), which funded this thesis project. Thanks to Christian and his whole
team, always available when I needed them. A particularly warm thank-you to
my old friend Flavio, who helped me a great deal. I also thank Michel Benard for his
advice and his generosity. Thanks to the thesis jury, composed of Professors J.-D. Nicoud, C.
Piguet, W.-M. Hwu, and A. Wegmann, for its suggestions.
I would like to thank my colleagues at the LSL, who showed great availability and
unfailing good humor. They created a working atmosphere that will be hard to find again.
Marlyse, the ray of sunshine of the LSL, thank you for all the help you gave me. André, alias
"Chico", my cheerful accomplice (quite a team!), thank you for all the favors you did
me, especially for all the small kindnesses you showed me during
the last months; they really lifted my spirits in rather difficult moments.
Fabio, known as "El flaco", thank you for your help and your friendship. Jean-Luc, O great god
of LaTeX, thank you for all your tips and for all the corrections you made to my documents.
Jacques-Olivier, "Jacô" to his close friends, your kindness and availability were a great
help to me. A very special thank-you for helping me bring this work to completion. A big
thank-you to Gianluca, who devoted a large part of his time to correcting my English and to
organizing evenings where the drinks flowed freely. Thanks also for passing on your family's
recipes. Dom, known as "Dominonski", we have known each other for quite a while now; thank
you for your friendship and your advice. Eméka, the merry fellow, thank you for your good
humor and your enthusiasm; may the PàF be with you. Andrés, thank you for lending me a hand
whenever I needed it, and always with a smile, too. Carlos-Andrés, "vos sos un fenomeno", I
thank you for sharing a little of your Latin American culture with me. Moshe, known as "lucky
luke", the one who writes faster than his shadow, thank you for your help and for correcting
my documents. Mathieu the philosopher, Enrico "le bô", Jacques known as "J.K.", André
"le hardeux", thank you for sharing all your knowledge.
I would also like to thank all the people I met at the LSL who helped make
this period unforgettable. Marco, Christian, Serge, Maxime, it was a pleasure to work
with you; thank you for your help. Thanks also to Georges and Peter of ACORT. Christof,
thank you for your contribution.
During these four years I had the opportunity to work with different groups. These
collaborations enriched me in every respect: culturally, scientifically, and emotionally.
Many thanks to all the people I met at the Centro Nacional de Microelectrónica in
Barcelona. That stay was an unforgettable human experience. Jordi, thank you for
allowing me to spend three months in your group. Lluis, thank you for everything you have
done for me; see you at the next party! Rosa, you are a delight; thank you for taking such
good care of me. Elena, thank you for taking me to work every day, even though it was so
early. And thanks to all the members of the group with whom I shared moments of madness!
Unforgettable!
Mateo, who from our first contact opened the doors of his group to me. The stay at
the DAC was a key moment in my thesis. Thank you for your help and your support whenever
I needed you. Josep and Eduard, thank you for your help, and thanks to the whole group of the
Departament de Arquitectura de Computadors for all the moments of joy and good
living we shared.
Dear Sabrina and Wen-mei, a big THANKS for your friendship, for allowing me to work
with your group and for all your help. You taught me a lot about life and did a lot for me.
Without your help, this work would probably not have been achieved. I would like to thank all the
IMPACT members for their help. A particular thank you to my friends David and Dan who
allowed me to work with them. Dan, thank you also for your help in the writing of this thesis.
John and Liesle, thank you for your friendship and for allowing me to share your toys and
candies.
I would also like to thank all my friends, who allowed me to have a most pleasant life.
Thanks to the many Clubmax members for proposing all those activities. Catherine and
Arnaud, thank you for your friendship and for correcting part of my thesis. Thanks to my
flatmates: Mari Carmen and Joaquin for introducing me to the rodaeta and for introducing me
to the woman of my life; Carmen and Dimitri for putting me up and for letting me know that I
could always count on them; Dani, Elvira, and Gonçalo for the evenings spent together;
Eduardo "Maestro", thank you for opening your home to me and for showing me the nightlife of
your city. Thanks to the whole GIGI team (EPFL league champions 1999) for the good matches
we played.
Contents
Abstract
Abridged Version
Acknowledgments
1 Introduction
2 Instruction-Level Parallelism
2.1 Performance Metric
2.2 Instruction-Level Parallelism: Concepts and Limitations
2.2.1 Data Dependences
2.2.2 Control Dependences
2.2.3 Resource Conflicts
2.3 Pipelining
2.3.1 Effect of Control Dependences in Pipelined Execution
2.3.2 Effect of Data Dependences in Pipelined Execution
2.3.3 Resource Conflict in Pipelined Execution
2.4 Superscalar Architectures
2.4.1 In-order Issue with In-order Completion
2.4.2 In-order Issue with Out-of-order Completion
2.4.3 Out-of-order Issue with Out-of-order Completion
2.4.4 Exception Recovery and Register Dataflow in Superscalar Processors
2.5 Very Long Instruction Word Architectures
2.6 Compiler Techniques to Extract ILP
2.6.1 Basic Block Scheduling
2.6.2 Superblock Scheduling
2.6.3 Predicated Execution
2.7 Conclusion
3 Power Consumption in CMOS Circuits
3.1 Sources of Power Dissipation in CMOS Circuit
3.1.1 Static Power Dissipation
3.1.2 Dynamic Power Dissipation
3.2 Metrics for Energy Efficiency
3.3 Parallelism for Energy Efficiency
3.4 Conclusion
4 Mobile and VLIW Processors: a State of the Art
4.1 The Advanced RISC Machine (ARM) Family
4.1.1 The ARM7 Generation
4.1.2 The StrongARM
4.1.3 The ARM Thumb Option
4.1.4 The ARM Piccolo Option
4.1.5 The ARM9 and the ARM10
4.2 The Motorola M·Core
4.3 The LSI TinyRisc
4.4 The Hitachi SuperH Family
4.5 VLIW Architectures
4.5.1 The Texas Instruments TMS320C6201
4.5.2 The Motorola-Lucent Star*Core
4.6 The Philips Trimedia
4.7 The HP/Intel IA-64
4.8 Comparison
4.9 Conclusion
5 Low-Power VLIW Processors: A High-Level Evaluation
5.1 Description of the Experiment
5.2 CoolRISC 816: A Low-power 8-bit Processor
5.2.1 The CoolRISC 816 Architectural Characteristics
5.2.2 The Performance of CoolRISC 816
5.2.3 The Energy Consumption of the CoolRISC 816
5.3 Compared Architectures
5.3.1 Scalar Architectures
5.3.2 VLIW Architectures
5.4 Consumption Model
5.4.1 Estimate of E_oper
5.4.2 Estimate of E_code and E_data
5.4.3 Estimate of E_conn and E_RF
5.5 Benchmarks
5.6 Results
5.7 Conclusion
6 The DEVIL Low-power Processor
6.1 Where Is The Complexity in VLIW Architectures?
6.1.1 Hardware Duplication
6.1.2 Code Memory
6.2 Definition of the DEVIL Processor
6.3 DEVIL's Registers
6.4 DEVIL's Instruction Set
6.4.1 Arithmetical Operations
6.4.2 Logical Operations
6.4.3 Compare Operations
6.4.4 Move Operations
6.4.5 Branch Operations
6.4.6 Data Memory Operations
6.5 The DEVIL Instruction Fetch Mechanism
6.6 DEVIL's Pipeline
6.6.1 Pipelined Execution for ALU Operations
6.6.2 Pipelined Execution for Memory Operations
6.6.3 Pipelined Execution for Branch Operations
6.6.4 DEVIL's Branch Prediction Mechanism
6.7 Evaluation of the DEVIL Architecture
6.7.1 Experimental Setup
6.7.2 Benchmarks
6.7.3 DEVIL's Performance
6.7.4 DEVIL's memory utilization
6.8 Comparison With Existing Mobile Processors
6.8.1 Instruction Set Comparison
6.8.2 Code Size Comparison
6.9 Conclusion
7 Implementation of the DEVIL Processor
7.1 Technology and Synthesis Methodology
7.2 Design Methodology
7.3 The DEVIL Latch-Based Pipeline
7.4 DEVIL Implementation
7.4.1 DEVIL's Datapath
7.4.2 Fetch and Dispatch Unit
7.4.3 Program Counter Datapath
7.4.4 Register File
7.4.5 Arithmetic and Logic Unit
7.4.6 Load/Store Unit
7.5 DEVIL Features
7.5.1 DEVIL's Circuit Speed
7.5.2 DEVIL's Circuit Complexity
7.5.3 DEVIL's Circuit Power Consumption
7.6 Comparison With Existing Processors
7.7 Conclusion
8 A Step Towards Predicated Execution
8.1 Architecture Support for Full Predicated Execution
8.2 Compiler Techniques for Reducing Predicated Code Size
8.2.1 Reduction of Number of Control Instructions
8.2.2 Predicate Promotion and Instruction Merging
8.2.3 Instruction Reduction for Advanced Code Transformation
8.3 Introducing Predication Support into Embedded Processors
8.3.1 Effect on Code Size of Full Predication Support
8.3.2 Predication Code Size and Execution Characteristics
8.3.3 Prefix-Based Predication
8.3.3.1 Architecture Model
8.3.3.2 Microarchitecture support
8.3.4 Experimental Evaluation
8.3.4.1 Methodology
8.3.4.2 Results and Analysis
8.4 Control flow optimization using predication
8.4.1 Previous Work
8.4.2 Limitations of PlayDoh
8.4.3 Overview of Compiler Techniques
8.4.4 Minimization of Program Decision Logic
8.4.5 Architecture Support for Synthesis
8.4.6 Experimental Results
8.5 Conclusion
9 Conclusion
A The DEVIL's Instruction Set Summary
A.1 Functions Definition
A.2 Arithmetical Operations
A.3 Logical Operations
A.4 Compare Operations
A.5 Move Operations
A.6 Branch Operations
A.7 Data Memory Operations
Bibliography
List of Figures
2.1 Different types of data dependences: (a) Flow dependence, (b) Anti-dependence, (c) Output dependence.
2.2 Register renaming suppresses anti and output dependences. (a) code before register renaming, (b) dependence graph before register renaming, (c) code after register renaming, (d) dependence graph after register renaming.
2.3 Arithmetic transformation for critical path reduction.
2.4 Control dependences: (a) C code, (b) corresponding control flow graph.
2.5 Instruction timing for a non-pipelined processor.
2.6 Instruction timing for a four-stage pipelined processor.
2.7 (a) Instructions executed in a 2-stage pipeline, (b) instructions executed in a 4-stage pipeline, (c) instructions executed in an 8-stage pipeline.
2.8 Illustration of the control dependences in a three-stage pipeline.
2.9 Pipeline stall due to a RAW data dependence.
2.10 Result bypassing avoids the pipeline stall due to one-cycle RAW data dependences.
2.11 Delay slot due to a one-cycle load latency.
2.12 Execution timing of a generic two-issue superscalar processor.
2.13 Block diagram of a generic four-unit superscalar processor.
2.14 Superscalar pipeline with in-order issue and in-order completion.
2.15 Superscalar pipeline with in-order issue and out-of-order completion.
2.16 Superscalar pipeline with out-of-order issue and out-of-order completion.
2.17 In-order, lookahead, and architectural state for an out-of-order issue superscalar processor.
2.18 Execution timing of a generic two-issue VLIW processor.
2.19 Block diagram of a generic four-unit VLIW processor.
2.20 Example of formation of VLIW instructions: (a) sequential code, (b) the corresponding dependence graph, (c) the corresponding VLIW code.
2.21 Control flow graph with basic blocks: (a) original C code, (b) corresponding assembly code, (c) corresponding control flow graph.
2.22 Superblock formation: (a) weighted flow graph, (b) trace formation, (c) tail duplication.
2.23 Loop enlarging optimizations: (a) original loop, (b) loop peeling, (c) loop unrolling.
2.24 Dependence removing: (a→b) accumulator variable expansion, (b→c) induction variable expansion.
2.25 (a) A simple if-then-else C code construct, (b) unpredicated code, (c) predicated code, and (d) optimized predicated code.
3.1 Static CMOS inverter: (a) gate, (b) transistors, and (c) switches representation.
3.2 Short-circuit current in a static CMOS inverter.
3.3 Charge and discharge of the load capacitance in a static CMOS inverter.
3.4 Relative circuit delay (left) and relative energy consumption (right) as a function of Vdd.
3.5 Energy distribution of several processor configurations executing the same task.
4.1 Comparison: (a) MIPS vs. Power; (b) MIPS vs. mW/MIPS.
5.1 Block diagram of the experimental framework.
5.2 Parallel execution of a loop using software pipelining.
5.3 Energy consumption distribution in the CoolRISC 816.
5.4 VLIW architecture: NOP elimination.
5.5 Speed-up comparison.
5.6 Energy comparison.
5.7 Energy-Delay Product comparison.
6.1 ROM code memory die area as a function of code size.
6.2 Power consumption of the ROM code memory as a function of code size.
6.3 Block diagram of the DEVIL architecture.
6.4 Instruction bundle formation in the DEVIL processor.
6.5 DEVIL's pipeline: ALU operations.
6.6 DEVIL's pipeline: memory operations.
6.7 DEVIL's pipeline: conditional branch operations.
6.8 DEVIL's branch prediction mechanism.
6.9 DEVIL's compile-time branch prediction benefits.
6.10 The IMPACT compiler framework.
6.11 DEVIL performance with and without superscalar optimizations compared to 1-issue and 4-issue architectures.
6.12 Effect of superscalar optimizations on code size.
6.13 Effect of superscalar optimizations on the number of accesses to the code memory.
6.14 Effect of NOP elimination on code size.
6.15 Effect of NOP elimination on the number of accesses to the code memory.
6.16 Effect of the variable instruction length mechanism on code size.
6.17 Effect of the variable instruction length mechanism on the number of accesses to the code memory.
6.18 Effect of the DEVIL instruction fetch mechanism on the code size.
6.19 Effect of the DEVIL instruction fetch mechanism on the number of accesses to the code memory.
6.20 Code size comparison between DEVIL and some other mobile processors.
7.1 A two-phase non-overlapping pipeline using latches.
7.2 DEVIL's pipeline implementation with non-overlapping clocks.
7.3 DEVIL datapath block diagram.
7.4 Fetch and dispatch datapath.
7.5 Program counter datapath.
7.6 Register file.
7.7 ALU datapath.
7.8 Data Memory Unit.
8.1 Predication example: (a) original, (b) optimized, and (c) predicated.
8.2 Merging example: (a) source code, (b) original, and (c) predicated. . . . . . . . 93
8.3 Loop optimization example: (a) original, (b) unrolled superblock, and (c) un-
rolled predicated. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
8.4 Relative number of predicated instructions. . . . . . . . . . . . . . . . . . . . . 95
8.5 Code expansion considering predication source operand. . . . . . . . . . . . . . 96
8.6 Code reductions due to predicated execution. . . . . . . . . . . . . . . . . . . . 97
8.7 Prefix-based predication decoding of normal and predicated instructions. . . . . 98
8.8 Performance of varying instruction cache size for prefix-based predicated
architecture relative to non-predicated architecture. . . . . . . . . . . . . . . . . . . 99
8.9 Code expansion of superscalar relative to traditional optimization. . . . . . . . 100
8.10 A portion of the inner loop of the UNIX utility wc. The control flow graph (a),
and the corresponding hyperblock formed after complete if-conversion (b). . . 104
8.11 The wc hyperblock after speculation but before logic minimization (a) and its
corresponding logic diagram (b). The hyperblock after logic minimization (c)
and its corresponding logic diagram (d). . . . . . . . . . . . . . . . . . . . . . . 105
8.12 Comparison of the static schedules for the wc hyperblock before and after logic
minimization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
8.13 Example: optimization of wc predicate network. . . . . . . . . . . . . . . . . . . 108
8.14 Pseudo-code for performing optimization of predicate expressions . . . . . . . . 109
8.15 Factorized predicate define optimization. . . . . . . . . . . . . . . . . . . . . . . 112
8.16 Various methods of predicate expression regeneration. . . . . . . . . . . . . . . 113
8.17 Speedup from minimization of program decision logic. . . . . . . . . . . . . . . 115
List of Tables
3.1 Summary of the benefits of parallelization and voltage down-scaling. . . . . . . 31
4.1 Mobile, Embedded, and ILP processor comparison. . . . . . . . . . . . . . . . . 39
5.1 Characteristics of CoolRISC's low-power ROM (Vdd=3V). . . . . . . . . . . . . 44
5.2 Relative utilization of the core, the code memory, and the data memory . . . . 45
6.1 Execution modes of DEVIL's instruction bundles. . . . . . . . . . . . . . . . . . 60
6.2 Benchmark list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.1 Transistor count breakdown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.2 Power consumption breakdown. . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.3 Summary of the benefits of ILP for low-power. . . . . . . . . . . . . . . . . . . 86
7.4 Mobile, Embedded, and ILP processor comparison. . . . . . . . . . . . . . . . . 86
8.1 Predicate definition truth table. . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
8.2 Instruction merging and predicate promotion characteristics. . . . . . . . . . . . 97
8.3 Extended predicate definition truth table. . . . . . . . . . . . . . . . . . . . . . 102
8.4 Speedup and predicate define count for selected functions. . . . . . . . . . . . . 116
8.5 Effects of conjunctive-type predicate defines on speedup and instruction count. 118
A.1 DEVIL's arithmetical instructions . . . . . . . . . . . . . . . . . . . . . . . . . . 124
A.2 DEVIL's logical instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
A.3 DEVIL's compare instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
A.4 DEVIL's move instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
A.5 DEVIL's branch instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
A.6 DEVIL's load/store instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
A.7 DEVIL's load/store instructions (second part) . . . . . . . . . . . . . . . . . . . 130
Chapter 1
Introduction
In recent years, the market for special-purpose devices designed to perform advanced applica-
tions has grown at a tremendous rate. As a result, the demand for embedded microprocessors,
a necessary component in these devices, is stronger than ever. The nature of devices such as Per-
sonal Digital Assistants (PDAs), mobile phones, printers, and networking equipment requires
that these embedded processors meet high performance levels while simultaneously satisfying
constraints on power consumption and cost.
To be competitive, manufacturers offer a wide range of products that meet these strong
design constraints and place a high priority on energy efficiency, a crucial feature of such
processors. Minimizing power consumption increases the autonomy of portable systems,
such as mobile phones, and increases the product's worth to the consumer. In addition,
energy efficiency has an effect on total system cost, which may be even more important in some
applications. Reducing the power dissipation in an integrated circuit reduces the price of the
packaging, of the power supply, and of the heat dissipation mechanism, and also increases the
chip's reliability.
In the design of processors, a trade-off is routinely made between performance, power
consumption, and cost. In fact, most techniques developed to enhance performance in high-end
systems increase the cost of the system and its power consumption. For example, instruction
caches, data caches, sophisticated branch predictors, hardware duplication, and dynamic in-
struction schedulers increase a circuit's complexity, which may imply a relatively large amount
of extra power dissipation.
Conversely, there exist some performance-enhancing hardware features that can also reduce
power consumption. Using such features in conjunction with clock frequency and voltage
down-scaling may lower the total energy required to complete a task while the overall
performance, although the clock frequency has been reduced, remains the same [19]. Such a
feature adds value to the product by increasing either performance or energy efficiency, or both.
Clearly, designers should embrace these techniques whenever possible.
Parallelism is one such technique. Currently, Instruction-Level Parallelism (ILP) is one of
the major forces increasing the performance of high-end workstation processors. The resulting
architectures are highly complex and exhibit a large amount of power dissipation. A known
example is the DEC Alpha 21264 which, with more than 15.2 million transistors, dissipates as
much as 70 W [2]. However, parallelism is also a well-known low-power technique that can be
used to improve the energy efficiency of a system [19].
Investigations into the overall energy efficiency of pipelined and superscalar architectures
in general-purpose processors demonstrated that superscalar techniques do not significantly
improve the energy efficiency of a processor [24]. This is due mostly to the overhead introduced
by the superscalar approach. However, parallelism and pipelining can still be employed to tune
the power consumption versus performance trade-off [21, 19]. The key is to employ parallelism
and pipelining while reducing the overhead found in superscalar architectures through the use
of advanced compiler techniques. With combined hardware and compiler techniques, much of
the work performed by standard superscalar processors can be moved from run time to compile
time. Specifically, explicitly encoding the parallelism found at compile time in the instructions
significantly reduces the overhead found in superscalar processors. Explicitly Parallel Instruction
Computing (EPIC) is the term used to describe architectures using this approach [25].
Currently, ILP has only been introduced into embedded processors through pipelined ar-
chitectures and, generally, such mobile processors do not include any multiple-issue mechanism.
There are a few exceptions. The Hitachi SH7750 [30], for example, is based on a 2-issue
superscalar architecture, but there are limitations on the available machine parallelism and the
processor exhibits a rather heavy power consumption of 1.6 W at 200 MHz at 1.8 V. Also, the
new generation of DSP architectures is based on multiple-issue Very Long Instruction Word
(VLIW) architectures, to exploit the large amount of instruction parallelism that can be found
in digital signal processing [74] [4].
The present thesis aims to address the lack of energy-efficient multiple-issue embedded
architectures. The benefits of ILP in low-power mobile processors are investigated in order to
determine whether it can improve the trade-off between performance and energy consumption.
More precisely, this work focuses on the design of synergistic hardware-compiler architectures
such as EPIC or VLIW machines. Such synergism makes it possible to minimize the hardware
overhead of multiple-issue pipelines while maintaining the performance benefits of ILP. However,
introducing parallelism into a processor drastically alters its architecture. It is necessary to understand
and quantify how such modifications can reduce or nullify the expected benefits and also to
assess where the trade-offs should be made. Accordingly, a new EPIC-like low-power processor,
DEVIL, is proposed. Its implementation is the subject of a detailed experimental evaluation.
The thesis is structured as follows. Chapter 2 provides an introduction to ILP techniques
and describes the concepts relevant to this work. An introduction to power consumption
in CMOS circuits is given in Chapter 3, along with an explanation of how parallelism can be used
to improve the energy efficiency of a digital circuit. Chapter 4 presents the state of the art in mobile
and VLIW processor design, highlighting the lack of energy-efficient multiple-issue architectures.
Chapter 5 provides a high-level evaluation of the benefits of VLIW architectures for
low-power processors. DEVIL, a new low-power VLIW architecture, is proposed in Chapter 6;
its detailed architectural evaluation follows. Chapter 7 presents DEVIL's VLSI implementation,
together with an analysis of DEVIL's features in terms of speed, power consumption, and
complexity. The benefits of predicated execution support for embedded processors are
analyzed in Chapter 8. Finally, Chapter 9 draws concluding remarks.
Chapter 2
Instruction-Level Parallelism
The performance of modern processors is becoming highly dependent on their ability
to execute multiple instructions per cycle. These processors extract performance from
programs by exploiting the characteristic of Instruction-Level Parallelism (ILP). ILP
is extracted either at compile time or at run time from a program composed of sequential
instructions. Thus, an important feature of ILP techniques is that, like circuit speed improvements,
they are generally transparent to users. Pipelined, superscalar, and Very Long Instruction Word
(VLIW) processors are examples of processor architectures that derive their benefit from ILP.
Superblock and hyperblock formation are examples of compilation techniques that expose the
parallelism that these processors can use.
First, this chapter briefly introduces the main performance metric in order to identify the
factors that contribute to a high level of performance. Then, it describes the most significant
concepts of ILP as well as their limitations. Furthermore, it gives an insight into both hardware
and software techniques that exploit ILP. Readers who are interested in ILP concepts can refer
to the extensive literature, such as [29] [58] [33] [53].
2.1 Performance Metric
The most common and reliable metric used for performance comparison between processors
is the execution time, Texec, needed to execute a given task. Texec depends on three different
parameters: the number of executed instructions, N, needed to execute the given task; the
number of instructions executed per clock cycle, IPC; and the processor clock frequency, f:
$$T_{exec} = \frac{N}{f \cdot IPC} \qquad (2.1)$$
For a given task, the processor with the lowest execution time is the best processor in
terms of performance. Generally, the comparison is established by computing the speed-up, S,
achieved by an architecture A compared to an architecture B:
$$S = \frac{T^{B}_{exec}}{T^{A}_{exec}} \qquad (2.2)$$
The ILP techniques that are described below aim to improve one or more of the parameters
N, f, and IPC to enhance processor performance. For example, pipelining increases the
processor throughput by boosting f and IPC; superscalar processors augment the number of
instructions executed per clock cycle; and VLIW machines reduce the number of instructions
required to execute a task. Each of these techniques has its advantages, disadvantages, and
limitations; the following sections describe these features.
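As a minimal sketch of how Equations (2.1) and (2.2) are applied, the following Python fragment evaluates both for two architectures. The function names and all numerical values are hypothetical and chosen only for illustration; they are not measurements from this work.

```python
def execution_time(n_instructions: float, ipc: float, freq_hz: float) -> float:
    """Equation (2.1): T_exec = N / (f * IPC), in seconds."""
    if ipc <= 0 or freq_hz <= 0:
        raise ValueError("IPC and frequency must be positive")
    return n_instructions / (freq_hz * ipc)


def speedup(t_exec_b: float, t_exec_a: float) -> float:
    """Equation (2.2): speed-up of architecture A relative to architecture B."""
    return t_exec_b / t_exec_a


# Hypothetical figures: A doubles both the clock frequency and the IPC of B.
t_b = execution_time(1_000_000, ipc=0.5, freq_hz=10e6)  # baseline architecture B
t_a = execution_time(1_000_000, ipc=1.0, freq_hz=20e6)  # architecture A
print(t_b, t_a, speedup(t_b, t_a))
```

Because N, f, and IPC enter the expression as a simple ratio, improvements in f and IPC multiply: under these assumed numbers, doubling each yields a speed-up of four for the same instruction count.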
2.2 Instruction-Level Parallelism: Concepts and Limitations
The traditional way to code a program is to express an algorithm in a sequential language such
as C. After compilation, this results in an ordering of assembly instructions that are executed
sequentially. Although the merit of this approach is its simplicity, such sequences result in a
relatively poor level of performance. To overcome this limitation, ILP techniques are used to
expose independent instructions in a sequential program. With adequate hardware support, the
execution of such independent instructions can be parallelized, reducing the program execution
time.
The performance improvement that is given by instruction-level parallelism strongly depends
on the ability to find independent instructions. Data dependences, control dependences,
and resource conflicts are the fundamental limitations that bound the amount of available
parallelism, and therefore the potential increase in performance. The next subsections describe
these dependences and the way to reduce their impact on performance.
2.2.1 Data Dependences
Data dependences occur between instructions that use the same operands, either registers or
memory. Data dependences are classified in three categories:
• Flow dependence, or Read After Write (RAW) dependence, happens when an instruction
i2 has a source operand that is the result of a previous instruction i1, forcing i1 to be
executed before i2. This is the only true dependence (see Figure 2.1(a)).
• Anti-dependence, or Write After Read (WAR) dependence, occurs in the opposite case,
when an instruction i4 defines its result in an operand that is a source of a previous
instruction i3. Consequently, i3 must read its operands before i4 writes its results (see
Figure 2.1(b)).
• Output dependence, or Write After Write (WAW) dependence, happens when two
instructions i5 and i6 write to the same destination operand. In this case i5 must be scheduled
in such a way that it writes its result before i6 (see Figure 2.1(c)).
(a) Flow dependence      (b) Anti-dependence      (c) Output dependence
i1: add r2, r3, 3        i3: add r3, r2, r1       i5: add r4, r1, 5
i2: mul r4, r2, r1       i4: mov r2, 10           i6: mov r4, 10
(time runs from top to bottom)
Figure 2.1: Different types of data dependences: (a) flow dependence, (b) anti-dependence,
(c) output dependence.
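The three categories can be checked mechanically by comparing the destination and source registers of two instructions. The following Python sketch does this for the instruction pairs of Figure 2.1; the `Instr` type and the `dependences` function are hypothetical helpers introduced here for illustration, and immediate operands are simply left out of the source lists.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    name: str
    dest: str
    srcs: Tuple[str, ...]


def dependences(first: Instr, second: Instr) -> List[str]:
    """Return the data dependences of `second` on the earlier instruction `first`."""
    found = []
    if first.dest in second.srcs:
        found.append("RAW")  # flow: second reads what first writes
    if second.dest in first.srcs:
        found.append("WAR")  # anti: second overwrites a source of first
    if first.dest == second.dest:
        found.append("WAW")  # output: both write the same destination
    return found


# Instruction pairs of Figure 2.1 (register operands only).
i1 = Instr("add", "r2", ("r3",))
i2 = Instr("mul", "r4", ("r2", "r1"))
i3 = Instr("add", "r3", ("r2", "r1"))
i4 = Instr("mov", "r2", ())
i5 = Instr("add", "r4", ("r1",))
i6 = Instr("mov", "r4", ())

for a, b in ((i1, i2), (i3, i4), (i5, i6)):
    print(a.name, b.name, dependences(a, b))
```

A pair of instructions may carry more than one dependence at once, which is why the function returns a list rather than a single label.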
These dependences limit code motion and optimizations at both compile time and run
time. This is why several techniques have been proposed to break such constraints. Register
renaming is the main technique that eliminates anti and output dependences. Figure 2.2 shows
how register renaming works. Before register renaming (Figure 2.2(a)(b)), the code may contain
artificial dependences due to register allocation, thus limiting the parallelism. For example, i1
and i3 cannot be executed in parallel because of the output dependence caused by the register
r3. When register renaming is applied (Figure 2.2(c)(d)), all the WAR and WAW dependences
are suppressed by giving the register a new name at each of its redefinitions. For
example, instruction i1 redefines register r3; therefore r3 is renamed r3a until the next
redefinition of r3 (i3).
(a) Code before register renaming:
i1: sub r3, r3, r5
i2: add r4, r3, 1
i3: add r3, r5, 1
i4: div r7, r3, r4
(b) Dependence graph before register renaming [graph not reproduced; its edges, on r3 and r4,
are labelled as flow, anti-, and output dependencies].
(c) Code after register renaming:
i1: sub r3a, r3, r5
i2: add r4, r3a, 1
i3: add r3b, r5, 1
i4: div r7, r3b, r4
(d) Dependence graph after register renaming [graph not reproduced; its edges are on r3a, r3b,
and r4].
Figure 2.2: Register renaming suppresses anti and output dependences. (a) code before register
renaming, (b) dependence graph before register renaming, (c) code after register renaming, (d)
dependence graph after register renaming.
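The renaming of Figure 2.2 can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the function `rename_registers`, the tuple encoding of instructions, and the letter-suffix naming scheme are choices made here, and the sketch ignores practical concerns such as the finite number of physical registers or values that must remain live after the sequence.

```python
import string


def rename_registers(code):
    """code: list of (op, dest, src, ...) tuples whose operands are strings.

    A destination gets a fresh name whenever the register was already read or
    written earlier in the sequence, since reusing it would create a WAR or WAW
    dependence.
    """
    current = {}   # architectural register -> name holding its latest value
    versions = {}  # architectural register -> number of fresh names used so far
    seen = set()   # registers read or written so far
    renamed = []
    for op, dest, *srcs in code:
        # Sources read the most recent name; immediates pass through unchanged.
        new_srcs = [current.get(s, s) for s in srcs]
        seen.update(s for s in srcs if s.startswith("r"))
        if dest in seen:
            suffix = string.ascii_lowercase[versions.get(dest, 0)]
            versions[dest] = versions.get(dest, 0) + 1
            current[dest] = dest + suffix
        else:
            current[dest] = dest
        seen.add(dest)
        renamed.append((op, current[dest], *new_srcs))
    return renamed


figure_2_2a = [
    ("sub", "r3", "r3", "r5"),
    ("add", "r4", "r3", "1"),
    ("add", "r3", "r5", "1"),
    ("div", "r7", "r3", "r4"),
]
renamed = rename_registers(figure_2_2a)
for instr in renamed:
    print(instr)
```

Applied to the code of Figure 2.2(a), the sketch yields the names r3a and r3b used in Figure 2.2(c); only the flow dependences through r3a, r3b, and r4 remain.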
Flow dependences cannot be eliminated with register renaming; they inherently define the
data flow of the program, and are much more difficult to eliminate. However, some optimizations
can modify the program data flow. For example, arithmetic properties can be used to re-express
the data flow of a sequence of instructions. Figure 2.3 illustrates how associativity can be used
to generate more parallel code by better balancing the dependence graph. Other techniques
that eliminate flow dependences are described in Subsection 2.6.2.
[Dependence trees for a = b + c + d + e]
a = ((b + c) + d) + e      3 clock cycles
a = (b + c) + (d + e)      2 clock cycles
Figure 2.3: Arithmetic transformation for critical path reduction.
All the previous examples show data dependences between register operands; however, there
can also be memory data dependences. The latter occur between accesses to the same data
memory location, and introduce a new problem known as memory disambiguation. Although
it is easy to detect dependences between accesses to the same variable, it is much harder to
know if accesses made through pointers are independent. Indeed, at compile time, it is not
always possible to know the memory location addressed by a pointer, and therefore some
memory dependences cannot be resolved. In order to maintain program correctness in this
case, the scheduler should assume that there is a data dependence, which may conservatively
limit the parallelism of the program.
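The gain obtained by re-associating a sum, as in Figure 2.3, can be quantified with a small Python sketch. It assumes that each addition takes one clock cycle and that enough adders are available to execute independent additions in parallel; both function names are hypothetical.

```python
def chain_depth(n_operands: int) -> int:
    """Critical path, in additions, of the sequential form ((b + c) + d) + ..."""
    return n_operands - 1


def balanced_depth(n_operands: int) -> int:
    """Critical path when associativity is used to balance the tree: ceil(log2(n))."""
    return (n_operands - 1).bit_length()


for n in (4, 8, 16):
    print(n, chain_depth(n), balanced_depth(n))
```

For the four operands of Figure 2.3 the sketch gives three cycles for the sequential form and two for the balanced form; the difference grows with the number of operands, since the chain is linear in n while the balanced tree is logarithmic.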
2.2.2 Control Dependences
Branch operations create another type of dependence, the control dependence, which occurs
between branch operations and the instructions ordered after the branch. Figure 2.4 shows the
control flow graph of an if-then-else structure, where the inc a, dec a, and jmp instructions are
control dependent on beq x, 0 because their execution depends on the outcome of the branch. The
main difference between control and data dependences is that the former are characterized by
run-time uncertainty, since the target of the branch is not known until the end of the execution
of the conditional branch. This is why exploiting ILP in the presence of branches has been the
subject of much research. The two most commonly used techniques are control speculation and
predication. Control speculation is most commonly performed in superscalar processors using
a combination of branch prediction and dynamic scheduling [32][59][81]. Control speculation
increases ILP by guessing the outcome of a branch and executing instructions along the pre-
dicted path. In this manner, control dependences are broken to execute instructions before the
branch outcome is determined. Given an instruction set that supports speculative operations,
control speculation can also be performed statically by an aggressive compile-time scheduler
which moves instructions across branches [7][40].
Predication has become a popular instruction set architecture feature for expressing pro-
gram control by conditionally executing instructions [31][54]. A compiler can employ if-conversion
to convert a sequence of code containing branches into an equivalent branch-free sequence of
conditionally executed instructions [6]. Predicated execution increases ILP by allowing the
compiler to schedule operations from multiple paths of control for simultaneous execution.
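To make if-conversion concrete, the following Python sketch models predicated execution of the if-then-else structure of Figure 2.4. It is a toy interpreter written for this illustration, not the predication mechanism of any particular instruction set: the `run_predicated` function, the register names p and q, and the tuple encoding are all assumptions.

```python
def run_predicated(program, regs):
    """Execute straight-line code; each entry is (guard, dest, fn).

    Every instruction is issued, but its result is discarded (the instruction
    is nullified) when its guard predicate is false.
    """
    for guard, dest, fn in program:
        if guard is None or regs[guard]:
            regs[dest] = fn(regs)
    return regs


# Branch-free version of: a = 0; if (x == 0) a++; else a--;
if_converted = [
    (None, "a", lambda r: 0),            # mov a, 0
    (None, "p", lambda r: r["x"] == 0),  # predicate define: p = (x == 0)
    (None, "q", lambda r: r["x"] != 0),  # complementary predicate: q = !p
    ("p", "a", lambda r: r["a"] + 1),    # (p) inc a
    ("q", "a", lambda r: r["a"] - 1),    # (q) dec a
]

print(run_predicated(if_converted, {"x": 0})["a"])
print(run_predicated(if_converted, {"x": 7})["a"])
```

Since the sequence contains no branch, the two guarded instructions are no longer control dependent on a branch outcome; they depend only on the predicate values, which a scheduler can treat like ordinary data dependences.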
(a) C code:
a = 0;
if (x == 0) {
    a++;
} else {
    a--;
}
(b) Control flow graph [not reproduced]: the block "mov a, 0; beq x, 0" has two successors,
selected by the branch outcome (taken or not taken), namely the block "inc a; jmp" and the
block "dec a".
Figure 2.4: Control dependences: (a) C code, (b) corresponding control flow graph.
2.2.3 Resource Conflicts
The number of available resources, such as functional units and register file ports, also limits
the level of parallelism. Resource conflicts occur between two instructions that
require the same hardware resource at the same time. For example, if a processor has only
one memory port, two independent memory load operations must be executed sequentially.
Hardware duplication, i.e., adding an extra memory port in the above example, removes
such a conflict. Obviously, a trade-off exists between circuit complexity and performance
enhancement.
2.3 Pipelining
The first generation of microprocessors generally executed and issued instructions in a purely
sequential way, and required several clock cycles to execute each instruction (see Figure 2.5).
The overall effect of the sequential execution was a very low instruction throughput.
[Timing diagram: instr. 1, instr. 2, and instr. 3 execute one after another; each instruction goes
through its Fetch, Dec, Exec, and W.B. steps before the next instruction starts.]
Figure 2.5: Instruction timing for a non-pipelined processor.
To reduce the performance penalty due to sequential execution, pipelining has been
introduced in processor architectures. Pipelining exploits instruction-level parallelism by
dividing instruction execution into independent stages (i.e., pipeline stages) and overlapping
their execution. Therefore, several instructions are executed in parallel, but are still issued
sequentially. Figure 2.6 shows the execution timing of a four-stage pipeline (Fetch, Decode,
Execute, and Write Back).
[Timing diagram, one column per clock cycle]
instr. 1    F   D   E   WB
instr. 2        F   D   E   WB
instr. 3            F   D   E   WB
F = Fetch, D = Decode, E = Execute, WB = Write Back
Figure 2.6: Instruction timing for a four-stage pipelined processor.
Ideally, increasing the number of pipeline stages increases the number of instructions that
are executed in parallel. This finer division reduces the processor's cycle time, and generally results
in a large performance improvement. Figure 2.7 shows how the throughput (i.e., performance)
increases with the number of pipeline stages, motivating the use of deep pipelines. However, the
benefits of pipelining are bounded by data and control dependences, and their effect becomes
stronger when the number of pipeline stages is increased. The next subsections describe how
dependences affect pipelined execution.
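The ideal case can be expressed with a short Python sketch. It assumes a pipeline with no stalls of any kind, and it takes the clock scaling from Figure 2.7 (f for two stages, 2f for four, 4f for eight); the function names and the chosen instruction count and base frequency are hypothetical.

```python
def ideal_cycles(n_instr: int, stages: int) -> int:
    """Cycles for n_instr instructions with no stalls: fill the pipeline once,
    then complete one instruction per cycle."""
    return stages + n_instr - 1


def ideal_time(n_instr: int, stages: int, f_two_stage_hz: float) -> float:
    """Execution time when the clock scales with depth as in Figure 2.7."""
    freq = f_two_stage_hz * stages / 2
    return ideal_cycles(n_instr, stages) / freq


for stages in (2, 4, 8):
    print(stages, ideal_cycles(1000, stages), ideal_time(1000, stages, 1e6))
```

In this idealized model each doubling of the depth nearly halves the execution time; the few extra cycles needed to fill the deeper pipeline are the only cost. Dependences add stall cycles to this count, and each stall wastes more issue opportunities in a deeper pipeline.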
2.3.1 Effect of Control Dependences in Pipelined Execution
Pipelined architectures generally need to fetch one instruction per clock cycle. When a
conditional branch is fetched, there is a delay before the direction of the branch (i.e., the address
of the next instruction) is resolved. Therefore, one or more instructions that are fetched after
the conditional branch instruction can come from the wrong path of execution. Figure 2.8
illustrates this phenomenon when there is no branch prediction, i.e., the branches are always
predicted as not taken. In cycle T2, when the conditional branch instruction is fetched, the
processor should compute the address of the next instruction. As the conditional branch, jt/jnt, is
not yet decoded and the comparison, test.eq, is not yet executed, the processor fetches the next
sequential instruction. During cycle T3 the result of the comparison is computed and the new
program address is updated with a one-cycle delay. If the conditional branch is not taken, the
pipeline can continue the execution normally. However, if the branch is taken, the instruction
fetched during cycle T3 comes from the wrong path, and therefore should be nullified.
A simple way to address this problem is to always execute a fixed number of instructions,
referred to as branch delay slots, immediately following all control operations. For example,
in Figure 2.8 this corresponds to always executing instruction Instr. 3. In this case the compiler
or programmer should put in the branch delay slot any instruction(s) that do not depend
on the branch outcome. When no such instruction is available, NOP operations must be inserted,
which results in a loss in pipeline performance. This approach is simple and works for scalar
processors where there are only one or two instructions in the delay slot. However, when this
technique is applied to wide-issue processors, the number of instructions in the branch delay
slots becomes too high. Consequently, it is impossible to find a sufficient number of instructions
that can be moved into the delay slots, resulting in a severe loss in performance and in an
increase in code size due to extra NOP insertion. In this case, another technique should be used
to reduce this penalty.
[Pipeline timing diagrams along a common time axis:
(a) 2-stage pipeline, clock freq. = f, up to 2 instructions executed in parallel;
(b) 4-stage pipeline, clock freq. = 2f, up to 4 instructions executed in parallel;
(c) 8-stage pipeline, clock freq. = 4f, up to 8 instructions executed in parallel.]
Figure 2.7: (a) Instructions executed in a 2-stage pipeline, (b) instructions executed in a 4-stage
pipeline, (c) instructions executed in an 8-stage pipeline.
Branch prediction enables the processor to speculate on the target of a branch instruction
before the true branch target has been resolved. For correct predictions, the speculated
operations are useful, and the processor's pipeline is not adversely affected. However, for
incorrect predictions, a costly performance penalty is incurred since the speculated instructions
must be removed from the pipeline and the processor pipeline restarted. There are several ways
to make a prediction. Fixed prediction always makes the same guess, either taken or not taken,
and is considered a one-outcome guess. True prediction has two possible outcomes and can
be static, if the prediction depends only on the code in question, or dynamic, if the prediction
depends on the execution history.
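The difference between a fixed guess and a history-based guess can be illustrated with the following Python sketch. The "last outcome" scheme used here is only one simple, assumed example of dynamic prediction, chosen for brevity; the function names and the branch trace are hypothetical.

```python
def mispredictions_fixed(outcomes, guess_taken=False):
    """Fixed prediction: the same guess for every execution of the branch."""
    return sum(1 for taken in outcomes if taken != guess_taken)


def mispredictions_last_outcome(outcomes, initial=False):
    """Dynamic prediction from execution history: predict that the branch
    behaves as it did the last time it was executed."""
    prediction, misses = initial, 0
    for taken in outcomes:
        if taken != prediction:
            misses += 1
        prediction = taken
    return misses


# Hypothetical loop-closing branch: taken nine times, then not taken at loop exit.
loop_branch = [True] * 9 + [False]

print(mispredictions_fixed(loop_branch))                    # always not taken
print(mispredictions_fixed(loop_branch, guess_taken=True))  # always taken
print(mispredictions_last_outcome(loop_branch))             # history based
```

On this trace the quality of a fixed guess depends entirely on whether it happens to match the dominant direction of the branch, whereas the history-based scheme adapts after its first wrong guess.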
2.3.2 Effect of Data Dependences in Pipelined Execution
Each stage of execution receives data from its previous stage and furnishes data to its next
stage. Such data flow introduces data dependences between instructions being executed in the
pipeline, limiting the processor's performance. For example, Figure 2.9 illustrates the case of
a RAW dependence in a four-stage pipeline. The ADD instruction adds 4 to R0 and stores the
result in R1. The next instruction, MUL, reuses R1 as a source operand. At the end of cycle
T3, the MUL instruction has to read its source operands from the register file; however, the result
of the addition has not yet been written, so the hardware must stall the pipeline for one cycle.
                  T1      T2      T3       T4      T5
test.eq           Fetch   Decode  Alu-WB
jt/jnt                    Fetch   New PC
Instr. 3                          Fetch    Decode  Alu-WB
If the conditional branch is taken, Instr. 3 is squashed.
Figure 2.8: Illustration of the control dependences in a three-stage pipeline.
                  T1   T2   T3   T4      T5   T6
ADD R1, R0, 4     F    D    E    WB
MUL R2, R1, 12         F    D    Stall   E    WB
The stall cycle waits for R1.
Figure 2.9: Pipeline stall due to a RAW data dependence.
Some pipeline stalls due to RAW dependences can be eliminated by a mechanism called
bypassing, which is illustrated in Figure 2.10. When a RAW dependence is detected by
the hardware, instead of reading the source operand from the register file, the result is directly
transmitted to the execution unit. In our example, the hardware detects the RAW dependence
during T3, and the result of the addition is bypassed to the multiplication at the end of
T3.
Bypassing can eliminate all the RAW dependences having a one-cycle distance, as in
Figure 2.10. However, when an instruction has a latency of several clock cycles, which is generally
the case for loads, it is impossible to eliminate the RAW dependences with bypassing alone. In this
case the pipeline should be stalled or some delay slots must be inserted. Figure 2.11 shows how
a delay slot can be used to avoid pipeline stalls. An instruction, Instr. 1, is inserted
between the two dependent instructions, LD.32 and MUL, with the condition that Instr. 1
must not use the destination of the load. In this way, the load latency is masked. However,
it is not always possible to move a useful instruction into the delay slot, and sometimes the
compiler or the programmer must add a NOP. Note that bypassing is still used to reduce
the data latency by one cycle.
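The stall behaviour described above can be reproduced with a small timing model in Python. It is a simplified sketch under stated assumptions: a single-issue in-order pipeline that fetches one instruction per cycle (F in cycle 1, D in cycle 2, execute from cycle 3), an execute latency of one cycle for ALU operations and two cycles (Addr, Mem) for a load, and a tuple encoding of instructions invented for this example.

```python
def execute_cycles(program, bypass):
    """program: list of (dest, srcs, latency), latency = number of execute cycles.

    Returns the cycle in which each instruction enters its execute stage.
    """
    ready = {}  # register -> first cycle in which a consumer may start executing
    starts = []
    prev = 2    # the first instruction executes in cycle 3 (F = 1, D = 2)
    for dest, srcs, latency in program:
        start = prev + 1  # in order: at best one cycle after the predecessor
        for s in srcs:
            start = max(start, ready.get(s, 0))
        starts.append(start)
        last_exec = start + latency - 1
        # With bypassing the result is forwarded at the end of the last execute
        # cycle; otherwise the consumer waits until write back has completed.
        ready[dest] = last_exec + 1 if bypass else last_exec + 2
        prev = start
    return starts


def stall_cycles(program, bypass):
    starts = execute_cycles(program, bypass)
    return sum(b - a - 1 for a, b in zip(starts, starts[1:]))


add_mul = [("R1", ("R0",), 1), ("R2", ("R1",), 1)]    # ADD then dependent MUL
load_mul = [("R1", ("R0",), 2), ("R2", ("R1",), 1)]   # LD.32 then dependent MUL
with_slot = [("R1", ("R0",), 2), ("R5", ("R6",), 1), ("R2", ("R1",), 1)]

print(stall_cycles(add_mul, bypass=False), stall_cycles(add_mul, bypass=True))
print(stall_cycles(load_mul, bypass=True), stall_cycles(with_slot, bypass=True))
```

Under these assumptions the model shows one stall cycle for the ADD/MUL pair without bypassing and none with it, one remaining stall for a load followed directly by its consumer, and no stall once an independent instruction fills the delay slot.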
                  T1   T2   T3   T4   T5
ADD R1, R0, 4     F    D    E    WB
MUL R2, R1, 12         F    D    E    WB
The result of ADD is bypassed to MUL.
Figure 2.10: Result bypassing avoids the pipeline stall due to one-cycle RAW data dependences.
[Timing diagram, one column per clock cycle]
LD.32 R1, (R0)    F    D    Addr  Mem  WB
Instr. 1               F    D     E    WB
MUL R2, R1, 12              F     D    E    WB
Instr. 1 should not use R1. The load value is bypassed.
Figure 2.11: Delay slot due to a one-cycle load latency.
2.3.3 Resource Conflict in Pipelined Execution
Figure 2.11 illustrates a common case of resource conflict occurring between two instructions.
As the load operation has a latency one clock cycle higher than other instructions, the load
and its following instruction perform their write back at the same time. If the processor has a limited
number of register write ports, this can result in a resource conflict. In the case of Figure 2.11,
this is solved by allowing two simultaneous write backs, i.e., by having two register write ports
instead of one. The other solution is to stall the pipeline when there is a resource conflict,
implying a loss in performance.
2.4 Superscalar Architectures
Pipelined machines use ILP in a horizontal way: they execute several instructions in parallel,
but instructions are still issued sequentially. Superscalar architectures enhance the basic pipeline
model by allowing the execution and issue of multiple instructions per clock cycle. Superscalar
processors fetch several instructions each clock cycle, analyze dependences between operations,
and dispatch the instructions to several functional units. All these operations are performed
by the processor's hardware, resulting in an overhead in circuit complexity, but preserving
compatibility with code generated for non-superscalar processors. Figure 2.12
shows the execution timing of a generic superscalar processor able to fetch, decode, and execute
up to two instructions in parallel.
[Timing diagram, one column per clock cycle]
Instr. 1    F   D   E   WB
Instr. 2    F   D   E   WB
Instr. 3        F   D   E   WB
Instr. 4        F   D   E   WB
Figure 2.12: Execution timing of a generic two-issue superscalar processor.
Figure 2.13 shows a block diagram of a generic four-unit superscalar processor. At each
clock cycle the processor fetches up to four instructions from the instruction cache, and transmits
them to the decoder. Also, during the fetch the processor predicts the address of the next block
of instructions. When the decoder receives the operations, it decodes the four instructions and
sends them to an instruction buffer called the Instruction Window or Reservation Station. Data
dependences are computed between all the operations stored in the Instruction Window.
Depending on the dependences and on resource availability, the processor sends the different
instructions to the functional units. Finally, the results are written into the register file through
a reorder buffer. The latter is used to support out-of-order execution, which will be described
in the next sections.
With the superscalar model of execution, the processor's performance depends not only
on the width of the pipeline (i.e., the number of functional units), but also strongly on the
processor's policy toward fetching, decoding, and executing instructions, called the
instruction-issue policy. The instruction-issue policy limits or enhances lookahead capability, and
therefore the ability to find independent instructions beyond the current point of execution.
The following sections describe and compare three different instruction-issue policies. The
comparison uses a single example, taken from [33], in which six instructions
are executed according to the different instruction-issue policies. The instruction sequence, from
i1 to i6, has the following constraints on parallelism:
• i1 requires two cycles to execute,
• i3 and i4 conflict for a functional unit (edge 1 in Figures 2.14, 2.15, and 2.16),
• i5 depends on the value produced by i4 (edge 2 in Figures 2.14, 2.15, and 2.16),
• i5 and i6 conflict for a functional unit (edge 3 in Figures 2.14, 2.15, and 2.16).
To help visualize the operation of the superscalar processor, the figures show the
processor pipeline stages horizontally and the clock cycles vertically.
2.4.1 In-order Issue with In-order Completion
The simplest instruction-issue policy is to issue instructions in the original program order
(in-order issue) and to write the results in the same order (in-order completion). To accomplish
this, instruction issue stalls when there is a resource conflict or when a functional unit has
a result latency greater than one cycle. Figure 2.14 illustrates the in-order issue with in-order
completion policy. During cycle 3 the pipeline stalls because i1 requires two cycles to execute,
and during cycle 5 and cycle 7 the stalls are caused by the resource conflicts between i3 and i4
and between i5 and i6. These stalls prevent the processor from fetching new instructions, and
therefore limit its lookahead capabilities.
In-order issue with in-order completion has an inherent simplicity; however, its achievable
performance is relatively low. This is why this scheduling policy is generally
not used, even in scalar processors.
Figure 2.13: Block diagram of a generic four-unit superscalar processor. (Blocks shown:
instruction memory, instruction cache, branch prediction, fetch, decoder, instruction window or
reservation station with data dependencies and scheduling, two ALUs, a load/store unit, a branch
unit, data cache, data memory, register file, and reorder buffer.)
2.4.2 In-order Issue with Out-of-order Completion
A first step to improve performance is to allow out-of-order completion. Thanks to this new
degree of freedom, the pipeline need not stall when a functional unit needs more than one
cycle to generate a result. Figure 2.15 illustrates how instruction i2 completes out of order,
allowing the execution of i1 and i3 to overlap during cycle 3.
For processors supporting in-order issue with out-of-order completion, the pipeline stalls
when there is a resource conflict or when an issued instruction depends on a result that is not
yet computed. Furthermore, output dependences must be taken into account, because two
instructions having the same destination register cannot be completed out of order.
Out-of-order completion yields higher performance than in-order completion but requires
more hardware. Dependences must be checked between each decoded instruction and all
instructions in all pipeline stages. Also, the hardware must ensure that the results are written
in the correct order, to keep the register file coherent.
Figure 2.14: Superscalar pipeline with in-order issue and in-order completion. (Pipeline diagram
with Decode, Execute, Writeback, and Cycle columns over cycles 1 to 8; edges 1, 2, and 3 mark
a resource conflict, a flow dependence, and a resource conflict.)
A new problem introduced by out-of-order completion is exception handling. Sometimes,
an instruction raises an exception. Once the exception routine has been completed, it
is necessary to restart the program execution so that it can continue as usual. The problem is
that the exception may have been detected when an instruction produced its result out of order.
Therefore, it is not possible to restart the program at the instruction following the excepting
instruction, because subsequent instructions have already completed, and doing so would cause
these instructions to be executed twice.
Figure 2.15: Superscalar pipeline with in-order issue and out-of-order completion. (Pipeline
diagram with Decode, Execute, Writeback, and Cycle columns over cycles 1 to 7; edges 1, 2, and 3
mark a resource conflict, a flow dependence, and a resource conflict.)
2.4.3 Out-of-order Issue with Out-of-order Completion
With in-order issue the processor's lookahead abilities are limited because the decoder stalls
when there is a resource conflict, a flow dependence, or an output dependence between
uncompleted instructions. Therefore, the processor is not able to look beyond the instructions
with the conflict or dependence, even though subsequent instructions might be independent.
To surmount this problem, an instruction buffer called the instruction window is inserted
between the decode and execute stages. The instruction window is used as a pool of instructions,
allowing the processor to fetch instructions until the instruction window is full. The
lookahead capability is then constrained only by the width of the instruction fetch and by the size
of the instruction window. Operations can be issued from the instruction window and can be
executed out of order. The only constraint is to ensure the correct program behavior.
Figure 2.16: Superscalar pipeline with out-of-order issue and out-of-order completion. (Pipeline
diagram with Decode, Window, Execute, Writeback, and Cycle columns over cycles 1 to 6; edges 1,
2, and 3 mark a resource conflict, a flow dependence, and a resource conflict.)
Figure 2.16 shows the operation of a superscalar pipeline with out-of-order issue. Note that
the instruction window is not an extra pipeline stage; it is simply a buffer where the decoder
can store instructions. By buffering instructions, the decoder is able to operate at its maximum
rate. This allows the processor to find more independent instructions. In our example, the
independent instruction i6 is issued out of order, concurrently with i4.
Compared to in-order issue with out-of-order completion, out-of-order issue has to
deal with one more type of dependence, the anti-dependence. Therefore, the processor has to
ensure that an instruction executed out of order does not prematurely modify a register.
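The three kinds of register dependence mentioned in this section can be illustrated with a small sketch. It assumes a simplified instruction format with one destination register and a tuple of source registers; the `Instr` type, the `classify` function, and the four sample instructions are hypothetical and are not taken from the example of [33].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str     # register written by the instruction
    srcs: tuple   # registers read by the instruction


def classify(first: Instr, second: Instr) -> set:
    """Return the dependences of `second` on `first` (`first` comes earlier in program order)."""
    kinds = set()
    if first.dest in second.srcs:
        kinds.add("flow")      # second reads what first writes
    if second.dest in first.srcs:
        kinds.add("anti")      # second overwrites what first still has to read
    if first.dest == second.dest:
        kinds.add("output")    # both write the same register
    return kinds


i1 = Instr("i1", "r3", ("r1", "r2"))   # r3 := r1 op r2
i2 = Instr("i2", "r4", ("r3",))        # r4 := op r3
i3 = Instr("i3", "r1", ("r5",))        # r1 := op r5
i4 = Instr("i4", "r3", ("r6",))        # r3 := op r6

print(classify(i1, i2))   # flow dependence on r3
print(classify(i1, i3))   # anti-dependence on r1
print(classify(i1, i4))   # output dependence on r3
```

With in-order issue and out-of-order completion only the flow and output cases have to be checked; the anti case becomes relevant once instructions can also be issued out of order.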
2.4.4 Exception Recovery and Register Dataflow in Superscalar Processors
Aggressive scheduling policies are required to increase the performance of superscalar processors.
However, techniques such as out-of-order issue or completion introduce new problems in terms
of exception handling and instruction dependence handling.
High instruction throughput is obtained in superscalar processors by fetching and issuing
operations under the assumption that branches are correctly predicted. Such techniques require
a recovery and restart mechanism to ensure correct execution when a branch is mispredicted
or when an instruction causes an exception. To handle such cases, the processor maintains an
execution history with the following states [33]:
• The in-order state, composed of the most recent assignments performed by the longest
continuous sequence of completed instructions.
• The lookahead state, composed of all the assignments, starting with the first uncompleted
instruction, to the end of the sequence.
• The architectural state, composed of the most recently completed and pending
assignments to each register, relative to the end of the known instruction sequence (i.e., fetched
instructions).
Figure 2.17 illustrates these three processor states. Note that instructions (2) and (6) were
deleted from, respectively, the in-order state and the architectural state because they are not
the most recent assignments in their corresponding states.
Figure 2.17: In-order, lookahead, and architectural state for an out-of-order issue superscalar
processor. (Diagram: an instruction sequence of eight register assignments, R3 := ...(1),
R7 := ...(2), R8 := ...(3), R7 := ...(4), R4 := ...(5), R3 := ...(6), R8 := ...(7), and
R3 := ...(8), shown together with the assignments that belong to each of the three states.)
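The three states can be derived mechanically from the definitions above. The following sketch uses the register sequence of Figure 2.17 and assumes, for illustration only, that the longest continuous sequence of completed instructions is the first four; the function name `processor_states` and the tuple encoding are hypothetical.

```python
def processor_states(sequence, completed_prefix):
    """Compute the in-order, lookahead, and architectural states.

    sequence: list of (register, instruction number) in program order.
    completed_prefix: length of the longest continuous sequence of completed
    instructions (an assumption supplied by the caller).
    """
    in_order = {}
    for reg, tag in sequence[:completed_prefix]:
        in_order[reg] = tag            # a later assignment replaces an earlier one
    lookahead = list(sequence[completed_prefix:])
    architectural = dict(in_order)     # start from the in-order state ...
    for reg, tag in lookahead:
        architectural[reg] = tag       # ... and overlay the pending assignments
    return in_order, lookahead, architectural


seq = [("R3", 1), ("R7", 2), ("R8", 3), ("R7", 4),
       ("R4", 5), ("R3", 6), ("R8", 7), ("R3", 8)]
in_order, lookahead, arch = processor_states(seq, completed_prefix=4)
print(in_order)    # assignment (2) to R7 is superseded by (4)
print(lookahead)   # instructions (5) to (8)
print(arch)        # assignment (6) to R3 is superseded by (8)
```

The last loop is the combination of in-order and lookahead state that a reorder buffer performs when it forms the architectural state.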
One classical approach to storing these different states is to add a reorder buffer [60] (see
Figure 2.13) to the processor. In this case, the register file contains the in-order state and the
reorder buffer stores the lookahead state. The architectural state is obtained by combining the
in-order and the lookahead states. Other variants of the recovery mechanism, such as a history
buffer, or a reorder buffer with a future file, can be found in [33].
The other problem introduced by the out-of-order policy is that anti- and output
dependences can limit the performance of the processor. As described in Subsection 2.2.1,
register renaming can eliminate these kinds of dependences, and can be implemented in
hardware. For example, processors that have a reorder buffer and use an associative lookup table
to form the architectural state provide a straightforward implementation of register renaming
[33].
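As a rough illustration of why renaming removes anti- and output dependences, the sketch below gives every register write a fresh name through a simple map table. This is a simplification chosen for clarity and is not the reorder-buffer lookup described in [33]; the function `rename`, the `p<n>` register names, and the sample code are hypothetical.

```python
from itertools import count


def rename(instrs, arch_regs):
    """Rename registers so that each write targets a fresh physical register.

    instrs: list of (dest, srcs) pairs in program order.
    arch_regs: architectural registers that hold a value before the first instruction.
    """
    fresh = count()
    mapping = {reg: f"p{next(fresh)}" for reg in arch_regs}   # initial mapping
    renamed = []
    for dest, srcs in instrs:
        new_srcs = tuple(mapping[s] for s in srcs)   # read the current names first
        mapping[dest] = f"p{next(fresh)}"            # then allocate a new name for the write
        renamed.append((mapping[dest], new_srcs))
    return renamed


code = [
    ("r3", ("r1", "r2")),   # r3 := r1 op r2
    ("r4", ("r3",)),        # r4 := op r3   (flow dependence on the first instruction)
    ("r1", ("r5",)),        # r1 := op r5   (anti-dependence with the first instruction)
    ("r3", ("r6",)),        # r3 := op r6   (output dependence with the first instruction)
]
for dest, srcs in rename(code, ["r1", "r2", "r3", "r4", "r5", "r6"]):
    print(dest, ":=", srcs)
```

After renaming, no two instructions share a destination and no instruction overwrites a register that an earlier one reads, while the flow dependence of the second instruction on the first is preserved.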
2.5 Very Long Instruction Word Architectures
Superscalar processors with out-of-order execution achieve higher performance than scalar
processors. However, the drawback of the superscalar technique is the increase in circuit complexity.
Indeed, dependence checking, the dispatch unit, the instruction window, the exception recovery
mechanism, the branch predictor, multi-ported register files, and the reorder buffer introduce a
substantial circuit overhead.
Very Long Instruction Word (VLIW) architectures are an alternative solution to exploit
ILP with a lower circuit overhead than superscalar processors. Like superscalar
processors, VLIW architectures issue and execute more than one operation per clock cycle. However,
scheduling and data dependence analysis are moved from the hardware level to the compiler
level, resulting in a significant decrease in circuit complexity. Figure 2.18 shows the execution
timing of a two-issue VLIW pipeline. The main difference from a superscalar pipeline
(Figure 2.12) is that VLIW architectures fetch only one instruction (a large one) per clock cycle,
which encodes parallel operations designated for the different functional units.
Figure 2.18: Execution timing of a generic two-issue VLIW processor. (Timing diagram: each
instruction passes through the F and D stages once, and its two operations pass through the E and
WB stages in parallel; instruction 2 follows one clock cycle after instruction 1.)
Figure 2.19 illustrates the block diagram of a generic four-issue VLIW processor. Compared
to the superscalar processor (Figure 2.13), there is no need for dependence analysis, an instruction
window, or a hardware scheduler. The VLIW processor has exactly the same behavior as a
scalar processor: it fetches an instruction, decodes it, and then executes it. However, the execution
can involve several functional units.
An example of the VLIW compiler tasks is shown in Figure 2.20 using the VLIW processor
of Figure 2.19. The dependence graph (Figure 2.20(b)) is computed from the sequential code
(Figure 2.20(a)), and shows how operations can be executed in parallel:
• At first, the two load operations can be executed in parallel,
• then, the shl and the add,
• and finally, the sub.
However, VLIW processors have a limited number of resources and their schedulers have
to take such constraints into account. Figure 2.20(c) shows how four large instructions are
formed from the original code sequence according to the dependence graph and the resource
constraints. For example, the two loads are scheduled in two different instructions because
there is only one load/store unit, and therefore the loads must be executed in sequence. Also,
when the scheduler is not able to find a sufficient number of independent operations, NOPs are
inserted explicitly, resulting in an increase in code size. Current VLIW processors, such as the
Texas Instruments TMS320C6201 [74] or the HP/Intel IA-64 [25], have special encoding mechanisms
that reduce the cost of inserting extra NOPs.
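The formation of VLIW instructions described above can be sketched as a greedy scheduler. The sketch makes several assumptions that are not stated in the text: every operation has a latency of one cycle, the four slots are two ALUs, one load/store unit, and one branch unit (as read off Figure 2.19), and operations are considered in program order. The names `ops`, `units`, and `schedule` are hypothetical, and slots are filled left to right rather than by unit position.

```python
# Operations of Figure 2.20(a): name -> (unit kind, operations it depends on)
ops = {
    "ld r0":  ("mem", []),
    "ld r1":  ("mem", []),
    "add r2": ("alu", ["ld r0"]),
    "shl r3": ("alu", ["ld r1"]),
    "sub r4": ("alu", ["add r2", "shl r3"]),
}
units = {"alu": 2, "mem": 1, "branch": 1}   # assumed slot mix of the four-unit machine


def schedule(ops, units):
    """Pack operations into VLIW instructions, padding unused slots with nops.

    Assumes the dependence graph is acyclic and every operation takes one cycle.
    """
    placed = {}    # operation -> index of the VLIW instruction that holds it
    words = []
    while len(placed) < len(ops):
        cycle = len(words)
        free = dict(units)
        word = []
        for name, (kind, deps) in ops.items():
            if name in placed:
                continue
            # An operation is ready once all its producers sit in earlier instructions.
            ready = all(d in placed and placed[d] < cycle for d in deps)
            if ready and free[kind] > 0:
                free[kind] -= 1
                placed[name] = cycle
                word.append(name)
        word += ["nop"] * (sum(units.values()) - len(word))
        words.append(word)
    return words


for word in schedule(ops, units):
    print(" | ".join(word))
```

Under these assumptions the five operations occupy four instructions, and the remaining eleven slots are explicit NOPs, which illustrates the code-size cost mentioned above.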
The performance of VLIW processors strongly depends on the capability of the compiler to extract
parallelism from a sequential program. Such compiler techniques also play an important role
for superscalar architectures, by breaking instruction dependences, and therefore giving the
processors more opportunities to find parallelism between instructions. The next section
gives a brief description of these ILP compiler techniques.
2.6 Compiler Techniques to Extract ILP
This section introduces some major compiler techniques to generate code for ILP
architectures. The main goal of such techniques is to break the barriers introduced by the instruction
dependences. Although there is an abundant amount of research in ILP, this section only gives
an introduction to the concepts relevant to this work.
Figure 2.19: Block diagram of a generic four-unit VLIW processor. (Blocks shown: instruction
memory, instruction cache, branch prediction, fetch of the very large instruction, decoder, two
ALUs, a load/store unit, a branch unit, register file, data cache, and data memory.)
2.6.1 Basic Block Scheduling
In the traditional sequential representation of programs, the code is composed of basic blocks (BB).
A basic block is a sequence of instructions that does not contain a branch (except for the last
operation) or a branch target (except for the first operation), and has the property that if one
instruction of the BB is executed, all other instructions are also executed. Figure 2.21 shows
how a program can be divided into basic blocks and represented by a control flow graph.
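The definition above leads directly to the usual way of finding basic blocks: mark the first instruction, every branch target, and every instruction that follows a branch as the start of a block. The sketch below applies this to a small instruction list shaped like the assembly code of Figure 2.21; the tuple encoding `(label, text, branch_target)` and the function name `basic_blocks` are hypothetical.

```python
# Each entry is (label or None, instruction text, branch target or None).
program = [
    (None,  "A",         None),
    ("_L1", "B",         None),
    (None,  "br C, _L2", "_L2"),
    (None,  "D",         None),
    (None,  "E",         None),
    (None,  "jmp _L3",   "_L3"),
    ("_L2", "F",         None),
    ("_L3", "G",         None),
    (None,  "H",         None),
    (None,  "br I, _L1", "_L1"),
]


def basic_blocks(program):
    """Split a linear instruction list into basic blocks."""
    targets = {target for _, _, target in program if target is not None}
    leaders = {0}                                  # the first instruction starts a block
    for idx, (label, _, target) in enumerate(program):
        if label in targets:
            leaders.add(idx)                       # a branch target starts a block
        if target is not None and idx + 1 < len(program):
            leaders.add(idx + 1)                   # so does the instruction after a branch
    starts = sorted(leaders)
    ends = starts[1:] + [len(program)]
    return [[text for _, text, _ in program[s:e]] for s, e in zip(starts, ends)]


for block in basic_blocks(program):
    print(block)
```

Each resulting block contains at most one branch, as its last instruction, and at most one branch target, as its first instruction.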
Figure 2.20: Example of formation of VLIW instructions: (a) sequential code, (b) the
corresponding dependence graph, (c) the corresponding VLIW code. (Panel (a) lists the sequential
code: (1) ld r0, label_x; (2) ld r1, label_y; (3) add r2, r0, 1; (4) shl r3, r1, 1;
(5) sub r4, r2, r3. Panel (c) packs these five operations and eleven nops into four VLIW
instructions.)
Figure 2.21: Control flow graph with basic blocks: (a) original C code, (b) corresponding
assembly code, (c) corresponding control flow graph. (Panel (a) shows the C code
A; do { B; if (C) { D; E; } else { F; } G; H; } while (I).)
Basic block scheduling limits the compiler scope to a single basic block when
parallelizing instructions. This is a very simple algorithm and the performance
improvement is generally limited. Indeed, basic blocks contain only a few instructions, limiting
the opportunities to find independent instructions. To overcome this limitation, other techniques
that enlarge the compiler scope have been proposed, such as trace scheduling [23], superblock
scheduling [45], and hyperblock scheduling [44].
2.6.2 Superblock Scheduling
Superblock scheduling, like trace scheduling, focuses on applying global optimization in favor
of the most frequently executed path. Trace scheduling divides functions into a set of traces that
represent the frequently used paths. These traces may contain several conditional branches
that go out of the trace (side exits) and several branch targets in the middle of the trace (side
entrances). Instructions are scheduled within each trace ignoring these control-flow transitions.
After scheduling, bookkeeping is required to ensure the correct execution of the off-trace code.
The major disadvantage of this technique is the increase in compiler complexity due to
bookkeeping.
Superblock scheduling is derived from trace scheduling and aims to reduce the compiler's
complexity while offering an effective technique to extract ILP from a program. A superblock is
a trace with no side entrance. Figure 2.22 shows how superblocks are formed from the original
weighted flow graph (2.22(a)). From this graph, the most frequently executed trace is formed
(2.22(b)). Finally, the side entrance is eliminated using tail duplication [20] (2.22(c)).
Figure 2.22: Superblock formation: (a) weighted flow graph, (b) trace formation, (c) tail
duplication. (Flow graphs of the code A; B; brn C, _L2; D; E; jmp _L3; F; G; H; brn I, _L1 with
branch weights of 90% and 10%; panel (b) marks the side entrance.)
Before superblock scheduling is performed, ILP optimizations are applied to enlarge the
compiler scope and to remove dependences. The enlarging optimizations are:
• Branch Target Expansion: branch target expansion expands the likely taken control
transfer that ends the superblock. The target superblock is copied and appended to the end
of the original superblock.
• Loop Peeling: superblock loop peeling is applied to loops that iterate, according to
profiling information, only a few times. The loop body is replaced by straight-line code
consisting of the expected number of iterations. The original body of the loop is moved to
the end, to handle the case when the loop should be executed more times than expected
(see Figure 2.23(b)).
• Loop Unrolling: superblock loop unrolling is applied to loops that tend to iterate many
times. To unroll a loop N times, N-1 copies of the superblock are appended to the original
superblock (see Figure 2.23(c)).
Figure 2.23: Loop enlarging optimizations: (a) original loop, (b) loop peeling, (c) loop unrolling.
Once the superblocks are enlarged, some optimizations are applied to eliminate
dependences between instructions. Some of these superblock dependence-removing optimizations
are:
• Register renaming: eliminates artificial dependences such as anti- and output dependences
(see Section 2.2.1).
• Accumulator variable expansion: an accumulator variable accumulates a sum or a product
at each iteration of a loop. Anti-, output, and flow dependences between instructions which
accumulate a total are eliminated by replacing each definition of the accumulator variable
with a new accumulator variable (see Figure 2.24(a→b), variable s).
• Induction variable expansion: induction variables are used within loops to index through
loop iterations and through regular data structures such as arrays. Due to the dependence
on the induction variable computation, ILP is typically limited when loops are unrolled.
Induction variable expansion eliminates redefinition of induction variables by creating a new
variable for each definition of the induction variable, thereby eliminating all anti-, output,
and flow dependences among the induction variable definitions (see Figure 2.24(b→c),
variable i).
Figure 2.24: Dependence removing: (a→b) accumulator variable expansion, (b→c) induction
variable expansion. (Three versions of a loop, unrolled twice, that sums the elements a[i] into
s and then computes m = s/i: (a) original code, (b) code with the accumulators s1 and s2,
(c) code with the induction variables i1 and i2.)
Note that induction and accumulator variable expansion add extra instructions outside of
the loop body.
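The effect of the two expansions can be illustrated with a small sketch in the spirit of Figure 2.24. It is not a transcription of the figure: it uses a bounds test on the array length instead of the side exits, and the function names `mean_rolled` and `mean_expanded` are hypothetical. Both functions assume a non-empty list.

```python
def mean_rolled(a):
    """Original loop: one accumulator s and one induction variable i."""
    s, i = 0, 0
    while i < len(a):
        s += a[i]      # every iteration depends on the previous value of s and of i
        i += 1
    return s / i


def mean_expanded(a):
    """Loop unrolled twice with expanded accumulators (s1, s2) and induction variables (i1, i2)."""
    n = len(a)
    s1 = s2 = 0
    i1, i2 = 0, 1
    while i2 < n:
        s1 += a[i1]    # the two halves of the body no longer depend on each other
        i1 += 2
        s2 += a[i2]
        i2 += 2
    if i1 < n:         # leftover element when n is odd
        s1 += a[i1]
    s = s1 + s2        # extra instruction outside the loop body
    return s / n


data = [1, 2, 3, 4, 5]
print(mean_rolled(data), mean_expanded(data))
```

The final addition `s = s1 + s2` is an instance of the extra instructions outside the loop body noted above. With floating-point data, summing in a different order can change the rounding of the result, so the two versions are only guaranteed to agree exactly for exact arithmetic such as integers.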
After the ILP optimizations are applied, superblock scheduling is performed according to the
dependences and the resource availability. The scheduler can move instructions above a preceding
conditional branch within a superblock using a technique called speculation. Instruction
speculation breaks some of the control dependences that are in the superblock, resulting in an increase
in ILP. However, there are restrictions that limit speculation. If I is the speculated instruction
and B is the conditional branch above which I is moved, these restrictions are:
• Restriction 1: the result of I must not be used before it is redefined when B is taken.
• Restriction 2: I must never cause an exception that may terminate program execution
when branch B is taken.
Restriction 2 is probably the most important constraint: exceptions caused by speculative
instructions which would not have been executed in the original program must be ignored.
Several forms of hardware support have been proposed to handle speculation of potentially
trapping instructions such as loads, stores, or divides. The restricted percolation model includes
no support for disregarding the exceptions generated by the speculative instructions. Therefore,
the compiler cannot move instructions that can potentially cause an exception above a branch.
The main limitation of the restricted percolation model is the inability to move potential
trap-causing instructions with long latency, such as load operations, above branches. To overcome
this limitation, the general percolation model eliminates Restriction 2 by providing a non-trapping
version of the instructions that can cause exceptions. The non-trapping version is used when the
instruction is speculated. For programs in which detection of exceptions is important, sentinel
scheduling [41] makes it possible, with additional hardware and compiler support, to handle
exceptions generated by speculated instructions.
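The role of the non-trapping instruction in the general percolation model can be sketched in software. The example below is a loose analogy, not a hardware model: memory is a dictionary, a trap is a Python exception, and the non-trapping load returns an arbitrary value (here 0) for an invalid address. All names (`MEMORY`, `Trap`, `load`, `load_nt`, `original`, `speculated`) are hypothetical.

```python
MEMORY = {0x10: 42}    # toy memory with a single valid address


class Trap(Exception):
    """Stands in for a hardware exception raised by a load."""


def load(addr):
    """Trapping load: raises on an invalid address."""
    if addr not in MEMORY:
        raise Trap(hex(addr))
    return MEMORY[addr]


def load_nt(addr):
    """Non-trapping load: yields an arbitrary value (0) instead of raising."""
    return MEMORY.get(addr, 0)


def original(addr, valid):
    """The load is executed only on the path where the branch allows it."""
    if valid:
        return load(addr) + 1
    return -1


def speculated(addr, valid):
    """The load is hoisted above the branch, using its non-trapping version."""
    r = load_nt(addr)          # executed on both paths, must never trap
    if valid:
        return r + 1
    return -1


print(original(0x10, True), speculated(0x10, True))
print(original(0x99, False), speculated(0x99, False))
```

When `valid` is true but the address is invalid, `original` raises `Trap` whereas `speculated` silently returns a wrong value. This lost exception is the case that matters for programs in which detection of exceptions is important.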
2.6.3 Predicated Execution
Conditional branch instructions introduce control dependences (see Subsection 2.2.2) that are
recognized as a major impediment to exploiting ILP. Branch prediction and instruction
speculation are techniques that reduce the effects of control dependences; however, conditional branches
can still result in severe performance penalties due to mispredicted branches.
Predicated execution allows conditional execution of instructions based upon a computed
condition and may be supported by several different architectural models [43]. Each model must
support a method of expressing the condition and a method for the condition to affect
instruction execution. Full predication supports this using new instruction set and microarchitecture
extensions.
The full predication model consists of four components: a predicate register file for holding
1-bit predicate values, an additional source operand for each instruction to specify a predicate
for instruction execution, a conditional-execution stage to nullify instructions, and a set of
predicate defining instructions for generating conditions. The values in the predicate register
file are associated with each instruction through the use of an additional source operand, or
predicate operand. This operand specifies which predicate register will determine whether the
instruction should execute. A predicate register value of 1, or true, indicates that the instruction
is executed; a value of 0, or false, indicates that the instruction is suppressed. An unconditional
instruction is designated by a predicate register that is always true. Architectural support
for predicated execution can be found in the HPL PlayDoh Architecture Specification [22].
Figure 2.25: (a) A simple if-then-else C code construct, (b) unpredicated code, (c) predicated
code, and (d) optimized predicated code. (Panels (b) to (d) show the code as control flow graphs;
in (c) and (d) the instructions are guarded by the predicates p1 = (A == B) and p2 = (A != B),
written <p1> and <p2>.)
Predication support allows the compiler to use an if-conversion algorithm to convert
conditional branches into predicate defining instructions, and instructions along alternative paths
of each branch into predicated instructions [48]. Figure 2.25 demonstrates the limitation of the
traditional control flow graph when applied to predicated code. A simple if-then-else construct
is shown in Figure 2.25(a). The code generated for this segment without predication is shown
in Figure 2.25(b). Here the control flow graph clearly shows that one and only one side of the
if-statement may execute. The predicated code control flow graph is shown in Figure 2.25(c).
In this case all the code falls into one basic block because there is no possibility of branching
until the end of the set of instructions.
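If-conversion can be illustrated by writing the same computation once with a branch and once with predicated assignments. The sketch below borrows the variable names and the predicates p1 and p2 from Figure 2.25, but the code itself is a hypothetical, shortened example and not a transcription of the figure; a conditional expression stands in for an instruction that is nullified when its predicate is false.

```python
def branching(a, b, x):
    """Unpredicated version: exactly one side of the if-statement executes."""
    b = b + 1
    if a == b:
        x = x + 1
    else:
        a = a + 1
    d = a + x
    return a, b, x, d


def predicated(a, b, x):
    """If-converted version: a single basic block, each instruction guarded by a predicate."""
    b = b + 1
    p1 = (a == b)                # predicate defining instructions
    p2 = (a != b)
    x = x + 1 if p1 else x       # X = X + 1 <p1>
    a = a + 1 if p2 else a       # A = A + 1 <p2>
    d = a + x                    # unconditional instruction
    return a, b, x, d


print(branching(3, 2, 10), predicated(3, 2, 10))
print(branching(3, 5, 10), predicated(3, 5, 10))
```

Both guarded instructions are always fetched and issued; only the one whose predicate is true changes the state, which is why the predicated version contains no branch.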
The most notable change that predication makes to the instruction set encoding format is
the addition of a predicate source operand to every instruction. The predicate operand
increases the instruction size and has significant effects on overall program code size. One
model [51] proposes a new set of predicate guarding instructions that would reduce this drawback
of existing methods of specifying predicated execution through the use of predicate mask-setting
instructions. Although the mechanism is useful in reducing the predicate operand overhead, the
general mechanism constrains several aspects of predicated execution and dramatically alters
the instruction issue logic of microprocessors.
There are two major benefits associated with applying if-conversion. First, a compiler can
eliminate problematic branches from the program. In doing so, all the overhead associated with
these branches is removed, including misprediction penalties, penalties for redirecting sequential
instruction fetch, and branch resource contention. Second, predication facilitates increased ILP
and speedup by allowing separate control flow paths to be executed simultaneously.
2.7 Conclusion
This chapter gives background on instruction-level parallelism. First, the main concepts of
ILP have been introduced. Second, several architectures that exploit ILP have been described.
Finally, the required compiler support for such architectures has been presented. This chapter
represents only a brief survey of some of the major ILP techniques: it describes the main notions
required to understand the rest of this work.
Instruction-level parallelism is mainly used to increase processor performance; however,
parallelism can also be used to increase the energy efficiency of a system. The following chapter
describes how parallelism can be used in a low-power context.
Chapter 3
Power Consumption in CMOS Circuits
CMOS design exhibits a good trade-off between circuit area and power consumption.
This is why the majority of current processors are implemented in CMOS technology.
However, as circuits become more and more complex, there is a steady increase in
power consumption, making power consumption the new major constraint of circuit design.
This chapter gives a general background on power consumption in CMOS circuits.
First, the different sources of power consumption and their relative contributions are described.
Second, several metrics are introduced and their meaning is explained. Finally, it is explained
how parallelism can be used to improve the energy efficiency of an architecture.
3.1 Sources of Power Dissipation in CMOS Circuits
The sources of power consumption of a CMOS circuit can be classified as: the static power
dissipation, which is related to the logic state of the circuit and is due to the leakage currents
and other static currents; and the dynamic power dissipation, which is caused by the switching
activity of the circuit and is due to the short-circuit currents and to the charge and discharge
of the load capacitance.
These sources of power consumption are described in the following subsections through the
example of a static CMOS inverter. Figure 3.1 shows different representations of such an inverter.
Figure 3.1: Static CMOS inverter: (a) gate, (b) transistors, and (c) switches representation.
3.1.1 Static Power Dissipation
Ideally, CMOS circuits have no static power dissipation because there is no direct path from Vdd
to Gnd. However, CMOS transistors do not behave as perfect switches, and generate leakage
currents that can arise from reverse bias diode currents and sub-threshold effects. These effects
are primarily determined by fabrication technology considerations.
Another source of static dissipation can appear when circuit styles that deviate from standard
CMOS design are used. For example, pseudo NMOS logic can be useful in register file
design due to its efficient area usage. Indeed, a pseudo NMOS circuit does not require a P-transistor
network and saves half the transistors required for logic computation compared to CMOS
logic. The main drawback of such a technique is that, depending on the output value, there is
a direct path from Vdd to Gnd. Therefore, a trade-off between area and power consumption
should be made.
Static power dissipation represents less than 10% of the total power dissipation [19], and
therefore it is not the main target for power consumption reduction. However,
current microprocessors tend to have a very low power supply voltage, resulting in a low
transistor threshold voltage, Vt. This reduction increases the static currents, and
consequently the static dissipation contribution can be much more significant, especially during
sleep modes.
3.1.2 Dynamic Power Dissipation
Dynamic power dissipation is the main source of power consumption in CMOS circuits (around
90% [19]). Dynamic dissipation comes from the switching activity of the circuits and has two
main components, caused by the short-circuit currents and by the charge and discharge of the
load capacitance.
Short-circuit power dissipation: Since NMOS and PMOS transistors do not behave
as perfect switches and do not switch at exactly the same time, there is a direct path from
Vdd to Gnd during a change of state. For example, during the transition of a CMOS inverter
(Figure 3.1), both transistors are ON for a small amount of time. This phenomenon is illustrated
in Figure 3.2. The contribution of the short-circuit current strongly depends on the time that both
P and N transistors are ON, and therefore depends on the signal slope. Generally this mode of
power dissipation is 10-60% of the total power dissipation [80]; however, with a careful design
it can be kept below 15% [21].
Figure 3.2: Short-circuit current in a static CMOS inverter. (Plots of the input voltage and of
the short-circuit current against time; the current flows while both the P-transistor and the
N-transistor are ON.)
Charge/Discharge capacitance power dissipation: The power consumption due
to the charge and discharge of the load capacitance dominates the total power dissipation.
Figure 3.3 shows an example of such a situation for a static CMOS inverter. When the input
changes from 1 to 0 (Figure 3.3(a)) the load capacitance (Cload) is charged through the PMOS
transistor by a charging current (Icharge). The power supply has to deliver the required energy
to charge the capacitance:
$$E = C_{load} \cdot V_{dd}^{2} \tag{3.1}$$
Half of this energy is dissipated by the PMOS transistor and the other half is stored in the
Cload capacitance. Then, when there is a transition of the input from 0 to 1, the energy stored
in Cload is dissipated through the NMOS transistor by a discharge current, Idischarge. In this
case the power supply does not need to furnish any additional energy. Therefore, the average
energy of a transition is:
$$E_{avg} = \frac{1}{2} \cdot C_{load} \cdot V_{dd}^{2} \tag{3.2}$$
Figure 3.3: Charge and discharge of the load capacitance in a static CMOS inverter. (Panel (a):
the input goes from 1 to 0, the output from 0 to 1, and Cload is charged by Icharge; panel (b):
the input goes from 0 to 1, the output from 1 to 0, and Cload is discharged by Idischarge.)
Considering that the system works at frequency f and that the output has an activity
α, corresponding to the average number of times that Cload is charged or discharged per clock
cycle, then the resulting power consumption is the average dynamic power dissipation:
P_{avg} = \frac{1}{2} \cdot C_{load} \cdot \alpha \cdot f \cdot V_{dd}^{2} = \frac{1}{2} \cdot C_{sw} \cdot f \cdot V_{dd}^{2}    (3.3)
where Csw = Cload · α is the switched capacitance.
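As a rough numerical illustration of Equation 3.3, the following Python sketch computes the average dynamic power from the load capacitance, the activity, the frequency, and the supply voltage. The function name and all numerical values (10 pF, activity 0.5, 100 MHz, 2.5 V) are assumptions chosen only for illustration; they are not measurements from this work.

```python
def dynamic_power(c_load, activity, freq, vdd):
    """Average dynamic power (Eq. 3.3): 0.5 * Cload * activity * f * Vdd^2, in watts."""
    c_sw = c_load * activity          # switched capacitance, Csw = Cload * activity
    return 0.5 * c_sw * freq * vdd ** 2


# Hypothetical values, chosen only for illustration.
p_ref = dynamic_power(c_load=10e-12, activity=0.5, freq=100e6, vdd=2.5)
p_low = dynamic_power(c_load=10e-12, activity=0.5, freq=100e6, vdd=1.25)

print(f"P at 2.5 V:  {p_ref * 1e3:.3f} mW")
print(f"P at 1.25 V: {p_low * 1e3:.3f} mW (ratio {p_low / p_ref:.2f})")
```

Halving Vdd at a constant frequency divides the dynamic power by four, which is the quadratic dependence that the rest of this chapter relies on.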
As the charge/discharge capacitance power dissipation is known as the main source of
power dissipation [80][21], the following sections focus on this part of the power consumption.
3.2 Metrics for Energy Efficiency
There are many different ways to measure power or energy efficiency, and they lead to different
results when comparing systems. This section introduces these metrics
and explains their meaning.
Power Dissipation, P: Power dissipation, measured in watts, is probably the most
straightforward way to measure the power efficiency of a circuit. This metric is useful for
packaging considerations, power supply dimensioning, cooling requirements, signal noise, and
reliability of the system. However, power dissipation depends on the clock frequency: a chip
running at a higher frequency improves its performance, but also increases its power dissipation.
Therefore, this metric is not useful for comparing the energy efficiency of chips, because a circuit
does not become more energy efficient simply by changing its clock frequency.
Energy Consumption E, or Power-Delay Product: As power dissipation represents
the rate at which energy is consumed, energy consumption, measured in joules, is an
alternative metric. This metric is useful when a system has to work at a fixed throughput
[16]. As shown in Section 3.3, the power supply voltage and the clock frequency are
correlated parameters, and can be scaled to meet the time requirements of the application. In
this case, the energy consumption of the system can be used for comparisons, because an
architecture A consuming less energy than an architecture B to execute a task in a fixed amount
of time also dissipates less power.
The energy consumption of a microprocessor is often measured in µW/MHz, which represents
the energy consumed per clock cycle. However, such a metric can be misleading when the compared
processors have different instruction sets or architectures, because the number of instructions
and the number of clock cycles needed to execute each instruction can be very different. Millions of
Instructions Per Second (MIPS) can be used to normalize the energy and compare processors
that have the same instruction set. The corresponding unit is µW/MIPS or, more commonly
found, its inverse MIPS/µW.
MIPS are not suitable for comparing the performance of processors with different
instruction set architectures, because the number of instructions needed to execute a task can be
very different. Therefore, another performance metric should be used to normalize the energy:
for example SPEC numbers, resulting in µW/SPEC or its inverse SPEC/µW. SPEC numbers
correspond to the time needed to execute the SPEC benchmark suite. For other benchmarks,
the metric can be specified as the power-delay product E = Pavg · Texec, where Texec is the time
needed to execute a task, and Pavg is the average power dissipation during the execution of this
task.
Energy-Delay Product, EDP: When a task needs to be executed at maximum
speed, the power-delay product (or energy) becomes a misleading metric for comparing
microprocessors. Indeed, two processors may need the same energy to execute a task, while having
a different energy consumption distribution, meaning that one of them can be n times faster
while dissipating n times more power. The latter is better in terms of energy efficiency when
maximum performance is required. This is why the energy-delay product, EDP = E · Texec, was
proposed in [24]. A commonly found equivalent metric, which corresponds to the inverse of the
EDP, is MIPS²/µW or SPEC²/µW.
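The difference between the power-delay product and the energy-delay product can be made concrete with a small Python sketch. The two processors below are hypothetical: B is assumed to be n = 2 times faster than A while dissipating twice as much power, which is the situation described above. The function names are illustrative only.

```python
def energy(p_avg, t_exec):
    """Power-delay product: E = Pavg * Texec (joules)."""
    return p_avg * t_exec


def energy_delay_product(p_avg, t_exec):
    """EDP = E * Texec (joule-seconds)."""
    return energy(p_avg, t_exec) * t_exec


# Two hypothetical processors running the same task.
# B is 2 times faster than A while dissipating 2 times more power.
a = {"p_avg": 0.5, "t_exec": 2.0}   # watts, seconds
b = {"p_avg": 1.0, "t_exec": 1.0}

e_a, e_b = energy(**a), energy(**b)
edp_a, edp_b = energy_delay_product(**a), energy_delay_product(**b)

print(f"A: E = {e_a} J, EDP = {edp_a} J*s")
print(f"B: E = {e_b} J, EDP = {edp_b} J*s")
```

Both processors need the same energy, so the power-delay product cannot separate them, whereas the EDP of B is half that of A and reflects its higher speed.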
3.3 Parallelism for Energy Efficiency
Since the main source of power dissipation is quadratically related to the power supply voltage
Vdd, an often employed power reduction technique is to scale down Vdd. However,
signal delays in CMOS circuits also depend on Vdd. Equation 3.4 shows that a reduction in Vdd
will cause a decrease in the working frequency, which in turn will degrade the performance of
the overall system:
T_{delay} = K \cdot \frac{V_{dd}}{(V_{dd} - V_{t})^{\alpha}}    (3.4)
where Tdelay is the circuit delay, K is a technology and circuit implementation dependent
constant, Vt is the threshold voltage, and α is equal to two for micron-scale technologies and decreases
as the technology becomes submicron (α is around 1.5 for a 0.25 µm technology).
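As a short worked step, dividing Equation 3.4 at a supply voltage Vdd by the same expression at a reference voltage eliminates the constant K and gives the relative delay. The symbol V_{ref} is introduced here only for illustration; it plays the role of the 2.5 volt reference used for the curves of Figure 3.4.

```latex
\frac{T_{delay}(V_{dd})}{T_{delay}(V_{ref})}
  = \frac{V_{dd}}{V_{ref}} \cdot
    \left( \frac{V_{ref} - V_{t}}{V_{dd} - V_{t}} \right)^{\alpha}
```

For α ≥ 1 and Vt < Vdd < Vref this ratio is greater than one, and it grows rapidly as Vdd approaches Vt, which is why the supply voltage cannot be lowered arbitrarily without a large loss in speed.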
Parallelism [19] is one technique that can compensate for the loss of performance due to
reduced clock speed. Indeed, parallelism enables a system to work at a lower frequency while
having the same performance as the equivalent non-parallel system running at a higher
frequency.
Figure 3.4 shows the result of a Spice simulation in terms of circuit delay and power
consumption of a simple circuit implemented in a 0.25 µm TSMC CMOS technology. The circuit
delay and energy consumption are reported for different values of Vdd and are relative to a
reference voltage of Vdd = 2.5 volts. These graphs show that if parallelism can compensate for a
loss in performance of a factor of two, the voltage can be scaled down from 2.5 volts to around
1.4 volts, resulting in a power saving of around 70% if the overhead due to the parallelization
is neglected.
Figure 3.4: Relative circuit delay (left) and relative energy consumption (right) as a function of
Vdd.
(Both plots: x-axis, power supply voltage Vdd from 1 to 2.5 volts; y-axis, value relative to Vref = 2.5 volts.)
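The order of magnitude of this result can be checked with the first-order delay model of Equation 3.4 instead of a Spice simulation. The Python sketch below searches for the supply voltage at which the circuit becomes twice as slow as at 2.5 volts. The threshold voltage Vt = 0.5 V is an assumption (it is not given in the text), α = 1.5 is the value quoted for a 0.25 µm technology, and the function names are illustrative.

```python
def relative_delay(vdd, vref=2.5, vt=0.5, alpha=1.5):
    """Delay at vdd divided by delay at vref, from Eq. 3.4 (K cancels out)."""
    return (vdd / vref) * ((vref - vt) / (vdd - vt)) ** alpha


def voltage_for_slowdown(slowdown, vref=2.5, vt=0.5, alpha=1.5, tol=1e-6):
    """Bisection search for the Vdd whose relative delay equals `slowdown`."""
    lo, hi = vt + 1e-3, vref          # delay decreases monotonically on (vt, vref]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if relative_delay(mid, vref, vt, alpha) > slowdown:
            lo = mid                  # too slow: raise the voltage
        else:
            hi = mid                  # fast enough: try a lower voltage
    return 0.5 * (lo + hi)


v_low = voltage_for_slowdown(2.0)
energy_ratio = (v_low / 2.5) ** 2     # switching energy scales with Vdd^2 (Eq. 3.1)

print(f"Vdd for a 2x slower circuit: {v_low:.2f} V")
print(f"Relative energy: {energy_ratio:.2f}")
```

Under these assumed parameters the model gives a voltage somewhat below the 1.4 volts read from the simulated curves and an energy reduction of roughly 70%; the exact figures depend on the assumed Vt and α.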
To help understand this concept, Figure 3.5 shows a qualitative example of how
parallelism in conjunction with voltage and clock frequency down-scaling can improve the energy
efficiency of a processor while keeping performance at the same level. The energy
distribution of several processor configurations is represented to show how this kind of
optimization works. The different configurations are:
• Configuration P1+: a one-issue processor with Vdd = V+ and f = f+ (Figure 3.5(a))
• Configuration P1-: a one-issue processor with Vdd = V- and f = f- (Figure 3.5(b))
• Configuration P2+: a two-issue processor with Vdd = V+ and f = f+ (Figure 3.5(c))
• Configuration P2-: a two-issue processor with Vdd = V- and f = f- (Figure 3.5(d))
where V- < V+ and f- < f+.
Figure 3.5: Energy distribution of several processor configurations executing the same task.
(Four panels of power dissipation versus execution time: (a) P1+, (b) P1-, (c) P2+, (d) P2-. Moving from P1 to P2 adds parallelism; moving from (f+, V+) to (f-, V-) applies voltage and frequency down-scaling. Each panel marks its power dissipation, execution time, and energy, e.g. Pa, Ta, and Ea for panel (a).)
P1+ and P2+, if no circuit overhead is associated with the parallel architecture, expend the
same energy when executing the same task, but the energy distribution is very different. P2+
dissipates twice as much power as P1+, but P1+ requires twice as much time to execute
the selected task (Figure 3.5 (a) and (c)), resulting in a better EDP for P2+ (faster with
the same energy). When voltage scaling is applied to the processors to reduce their energy
consumption, their clock frequency must also decrease. For P1 (Figure 3.5(a → b)) this results in
a decrease in both energy consumption and power dissipation, and in a loss of performance.
Consequently, to a first approximation there is no gain in EDP (less energy, but also less
performance). Exactly the same phenomenon occurs for P2 (Figure 3.5(c → d)).
However, when parallelism is used in conjunction with frequency and voltage down-scaling
(Figure 3.5(a → d)), one can observe that P2- is much more energy efficient than P1+. Indeed,
P2- with a lower frequency and a lower power supply voltage has the same performance level
as P1+ (Ta = Td), while consuming less energy than P1+ (Ed < Ea), and also dissipating less
power (Pd < Pa).
Table 3.1 qualitatively summarizes how parallelism and voltage scaling affect the
time of execution, the power dissipation, the energy consumption, and the energy-delay product.
This comparison is relative to P1+. The signs + and - indicate respectively an improvement or
a degradation of the compared parameter.
Configuration   Time of Execution   Power Dissipation   Energy Consumption   Energy-Delay Product
P1-             increased (-)       decreased (+)       decreased (+)        equal
P2+             decreased (+)       increased (-)       equal                decreased (+)
P2-             equal               decreased (+)       decreased (+)        decreased (+)
Table 3.1: Summary of the benefits of parallelization and voltage down-scaling.
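The qualitative entries of Table 3.1 can be reproduced with a minimal numerical sketch. Everything specific in it is an assumption made for illustration: ideal speedup with the issue width, a switched capacitance proportional to the issue width (no parallelization overhead), the operating points (2.5 V, 200 MHz) and (1.4 V, 100 MHz), and the names Config and evaluate.

```python
from dataclasses import dataclass


@dataclass
class Config:
    name: str
    issue_width: int   # instructions issued per cycle (ideal parallelism, no overhead)
    vdd: float         # volts
    freq: float        # Hz


def evaluate(cfg, instructions=1e9, c_sw_per_issue=1e-9):
    """Return (time, power, energy, EDP) for a task of `instructions` operations.

    Assumptions: ideal speedup with issue width, switched capacitance
    proportional to issue width, and P = 0.5 * Csw * f * Vdd^2 (Eq. 3.3).
    """
    time = instructions / (cfg.issue_width * cfg.freq)
    power = 0.5 * c_sw_per_issue * cfg.issue_width * cfg.freq * cfg.vdd ** 2
    energy = power * time
    return time, power, energy, energy * time


# Hypothetical operating points: halving f is assumed to allow Vdd = 1.4 V.
V_HI, F_HI = 2.5, 200e6
V_LO, F_LO = 1.4, 100e6
configs = [
    Config("P1+", 1, V_HI, F_HI),
    Config("P1-", 1, V_LO, F_LO),
    Config("P2+", 2, V_HI, F_HI),
    Config("P2-", 2, V_LO, F_LO),
]
results = {c.name: evaluate(c) for c in configs}

ref = results["P1+"]
for name, values in results.items():
    t, p, e, edp = (v / r for v, r in zip(values, ref))
    print(f"{name}: time x{t:.2f}  power x{p:.2f}  energy x{e:.2f}  EDP x{edp:.2f}")
```

With these assumed numbers, P2- matches the execution time of P1+ at lower power and lower energy, and P2+ halves the execution time at equal energy. The EDP of P1- does not stay exactly constant in this simple model; how close it remains to that of P1+ depends on the assumed operating points, so the "equal" entry should be read as a first approximation.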
The above explanations do not take into account the circuit overhead introduced by the
use of parallel execution. An increase in complexity can dramatically reduce the benefits of
hardware duplication in terms of energy efficiency. In low-power microprocessor design, it has
been demonstrated that pipelining is an effective way to improve a processor's energy efficiency
[24], because of its inherent simplicity. Similarly, it was shown in [24] that the overhead of
superscalar general-purpose architectures limits a processor's energy efficiency. However, some
studies have suggested that EPIC and VLIW architectures, due to their hardware simplicity,
can execute a task with the same energy as an analogous scalar architecture [15][52].
3.4 Conclusion
This chapter gave an introduction to power consumption in CMOS circuits. First, the
different sources of power dissipation were described, and it was shown that the switching
activity accounts for around 90% of the total power dissipation. Second, different metrics
were introduced and their meaning was explained in order to understand in which context
they should be used. Finally, it was explained through one Spice simulation and one qualitative
example how parallelism and voltage down-scaling can improve the energy efficiency of a circuit.
These examples strongly motivate the use of parallelism for the design of an energy-efficient
microprocessor. However, at this point the power consumption added by the circuit overhead
introduced by the parallelization of the architecture was neglected. These negative effects are
investigated in the following chapters, but first, the next chapter describes the state of the art
in low-power and ILP processor design.
Chapter 4
Mobile and VLIW Processors:
a State of the Art
The embedded processor market offers a wide range of products that meet different
requirements of performance, cost, and power consumption. This chapter gives an overview of
some of these embedded processors with a main focus on low-power mobile processors. Also,
since parallelism relates to both performance and energy efficiency, several VLIW architectures,
such as DSP or high-performance processors, are described. The main goal of this chapter is
to point out the main characteristics and trade-offs of mobile processor designs, as well as the
lack of ILP exploitation in such processors.
4.1 The Advanced RISC Machine (ARM) Family
Exhibiting various desirable features, such as low power consumption, tiny core size, and several
flexible modular options, the ARM [71] architecture has become one of the most popular
products for ASIC design. With a 32-bit load/store architecture and a fixed-length 32-bit instruction
word, the ARM architecture follows the RISC principles. ARM is built around a scalar pipeline
that allows most instructions to be executed in one cycle, with the exception of memory
and branch operations.
From a programming point of view, the ARM family offers 16 32-bit integer registers
and an instruction set that has some original aspects. First, every instruction can be
conditionally executed depending on the value of four condition codes, which reduces the code size in
conditional-branch-intensive code. Second, all arithmetic and logic operations can intrinsically
shift or rotate one of their source operands.
4.1.1 The ARM7 Generation
Currently, ARM7 [71] is the low-end product of the ARM family. Based on a 3-stage scalar
pipeline (fetch, decode, execute), it exhibits a small die area and a low power consumption.
Such features make ARM7 the perfect processor for low-power, low-cost applications. However,
this simple architecture, and particularly the short pipeline, results in a slow clock speed,
and therefore in a poor level of performance. As an example of an ARM7 implementation, the
ARM710 from VLSI, with an 8K unified cache, an MMU, and implemented in a 0.8 µm technology,
runs at 25 MHz at 3.3 V, delivers 30 MIPS, and consumes 120 mW [1].
To overcome this performance limitation, ARM's next generation introduced a longer
pipeline. The ARM8 has a conventional 5-stage pipeline which allows clock speeds of over
100 MHz in a 0.35 µm technology. ARM8's other major contributions to greater performance
were static branch prediction and a double-speed cache that made transfers on both rising
and falling clock edges. Unfortunately, the double-speed cache of the ARM8 generated a new
problem. To efficiently fetch instructions at 100 MHz, a custom physical layout of the processor
instruction fetch is required. Such design practices go against ARM's premise of providing
easy-to-integrate portable CPU cores. This is why the ARM8 is no longer on the ARM roadmap.
4.1.2 The StrongARM
The StrongARM comes from a collaboration between ARM and Digital. The StrongARM
SA-110 [65] is a 32-bit embedded processor which exhibits a very desirable balance of performance
and power consumption. It is composed of a 5-stage scalar pipeline and is implemented in
a 0.35 µm technology. These parameters, in conjunction with a power-efficient design of the
processor core and cache memories, allow the SA-110 to achieve a high level of performance while
keeping the power consumption at a low level. The SA-110 consumes around 500 milliwatts at
1.65 volts with a frequency of 160 MHz, and its level of performance is 185 Dhrystone MIPS.
A new version of the StrongARM SA-110 was implemented by Intel, reaching better
performance and a lower power dissipation. Intel's SA-110 runs at 233 MHz at 2 V, delivers 268 MIPS,
and consumes only 360 mW including the two 16K cache memories¹ [1]. These features make
the SA-110 one of the best low-power processors on the market.
4.1.3 The ARM Thumb Option
Code size has strong repercussions on system cost, power consumption, and instruction cache
performance, which make it an important architectural issue. For this reason ARM designed
the Thumb architecture [63]. The design addresses the code size issue by introducing
a 16/32-bit variable instruction width (ARM has a 32-bit fixed instruction width). Thumb
can switch between two modes of execution: one where it executes 16-bit instructions that
map the most frequently executed ARM instructions, and another where it executes
32-bit instructions corresponding to the ARM instruction set. This variable instruction length
mechanism results in a 25% to 35% code size reduction.
To support this new mode of execution, ARM introduces a second decoder in parallel
with the original one. A program-visible bit directs incoming instructions toward the ARM
instruction decoder or the Thumb instruction decoder. The mode bit can be changed through
a new branch-and-exchange instruction, implying that 16-bit and 32-bit instructions cannot
be randomly mixed.
4.1.4 The ARM Piccolo Option
The Piccolo [66] option adds DSP capabilities to the ARM by adding a DSP core to the ARM
architecture. Piccolo and ARM have separate registers and separate instruction memory and
communicate through a kind of reorder buffer. With this option the ARM core controls the
chip and fetches the operands from memory, while Piccolo concentrates on the signal processing
computations. One of the major drawbacks of this approach is that when Piccolo
is running, a lot of the bandwidth of the ARM is lost in feeding Piccolo. Another problem
with Piccolo is that it has no X and Y memory banks like traditional DSPs. Piccolo's
operands are supplied by its register file, and it suffers from a low operand bandwidth in the case of
data-intensive algorithms.
¹ The cache memories account for around 30% of the total power consumption.
4.1.5 The ARM9 and the ARM10
The ARM9 [69] is the bridge between the ARM7 and the StrongARM. ARM9, as compared to
the ARM7, extends its pipeline to five stages, allowing it to run at 150 MHz. In addition, ARM9
splits ARM7's unified internal bus and cache in two, giving the new core a Harvard architecture.
ARM9 does not include branch prediction (ARM8 does) and has a branch penalty of three
cycles whenever a branch is taken.
Implemented in VLSI's 0.35 µm process, the ARM940T (the final T means that the
Thumb option is included) has a core die area of 4 mm², runs at 150 MHz, and consumes 675 mW
at 3 V (including caches and MMU). Such features represent a significant improvement over
the ARM7; however, the StrongARM still remains better in terms of performance and power
consumption.
This is why the ARM10 [26] pushes the ARM instruction set to a new performance level.
ARM10 has the same basic 5-stage pipeline as the ARM9, but it was reoptimized, allowing it
to reach 300 MHz in a 0.25 µm technology within a 1 W power budget. ARM10 implements
a simple static branch prediction technique (backward taken, forward not taken). As in the
ARM9, there is a 3-cycle misprediction penalty; however, mispredictions occur less frequently.
ARM10 allows several units to work in parallel; however, it can issue only one instruction per
cycle. The ARM10's core can also be paired with a floating-point unit.
4.2 The Motorola M·Core
Motorola is well known in the embedded market for its 68xxx family; however, this family is
aging, and Motorola introduced the new M·Core family [70] to compete in the burgeoning
market for portable hand-held devices. The M·Core family is designed to be a low-power 32-bit
architecture [57], and has 16 32-bit general-purpose registers. In addition, it has an alternate
register file composed of 16 other registers that can be used for interrupt handlers or other
time-critical routines.
M·Core addresses the code size issue by using a 16-bit fixed-length instruction set. As a
consequence of this limitation, all register-to-register operations are destructive, with the result replacing
one of the source operands. Also, the immediate values are generally limited to 5-7 bits. The branch
displacement is coded as an 11-bit value, which covers 98% of branches [61]. This 16-bit coding
approach yields a very attractive code density: code is 50% smaller than ARM7 code and 11% smaller
than Thumb code [70].
The MMC2001 [73] is one of the first implementations of the M·Core family and it is
intended as an industrial controller. This microcontroller is based on an M·Core core
and integrates on the die 256K of ROM, 32K of SRAM, and several modules such as pulse-width
modulation (PWM), UARTs, and a serial peripheral interface (SPI). Implemented in a 0.35 µm
technology and with a 2 V power supply voltage, M·Core runs at 34 MHz, delivers 31 MIPS,
and consumes 80 mW [46].
The M300 [72] M·Core generation makes a step toward a higher level of performance by
including a better branch handling technique (forward not taken, backward taken) and optional
single-precision floating-point support. Implemented in a 0.25 µm technology, this new generation
runs at 100 MHz, with a power supply voltage of 2 V.
4.3 The LSI TinyRisc
TinyRisc [68] is similar to ARM's Thumb option, but for the MIPS instruction set. The
TinyRisc includes two decoders to handle 16-bit and 32-bit instructions having different opcodes.
As with Thumb, 16-bit and 32-bit instructions cannot be mixed, and a jump or call instruction
must be used to change mode. The 16-bit mode has several limitations: (1) only 8 of the
32 registers are available; (2) most of the register-to-register operations are destructive; and (3)
immediate values are coded with one byte. In order to avoid a size limitation on indirect branch
offsets, branch instructions automatically concatenate the next 16-bit instruction word, resulting
in a 26-bit branch offset as in the 32-bit instruction set. Additionally, an EXTEND instruction
can be used to expand some of the immediate fields of the 16-bit instructions, eliminating some
of the switching between 16-bit and 32-bit modes.
TinyRisc exhibits the same code size as the ARM7 with the Thumb option. However,
there is a cost in terms of performance because the extra logic added in the first pipeline stage
to support the 16-bit instruction set lengthens the critical path, thus leading to a reduction of
the clock frequency from 80 to 70 MHz. Note that the operating frequency is still significantly
higher than the conventional 40 MHz of the ARM7 with the Thumb option.
LSI implemented the TinyRisc TR4102 in a 0.25 µm fabrication technology; it runs at
80 MHz, consumes 0.5 mW/MHz at 1.8 V, and has a die area smaller than 1.5 mm².
4.4 The Hitachi SuperH Family
The SuperH [71] family became very popular when Sega chose a first-generation Hitachi SuperH
for its Genesis game console.
The SuperH family passed through several generations from the SH-1 to the current SH-4.
Again, the code size issue is addressed through a 16-bit fixed-length instruction set. As there
is no possible extension of the instruction word, there are some limitations: the size of the
immediate value is limited to 8 bits and the register-to-register operations are destructive.
The SH-1 generation with its 16-bit external bus, low clock speed, on-chip ROM, peripheral
functions, and lack of cache, is the lowest-performance device of the family and can be classified
as a microcontroller. The SH-2 generation introduces only minor changes: a wider 32-bit
external data bus, a better multiplication unit and a 4K unified cache. The most popular
processor of this family is the SH7604 that was used in the Sega Genesis 32X. The SH-3 family
makes a step toward a higher level of performance and targets applications such as PDAs. As
compared to the previous SH-2 generation there are no major architectural changes; however,
the chips run approximately four times faster and include an MMU and larger unified caches.
SH-3 chips such as the SH7708 can be found in several Windows CE units. The SH7708
typically dissipates 700 mW at 3.3 V, 100 MHz, with a level of performance of 100 Dhrystone
MIPS [64].
The latest SH-4 generation with the SH7750 [67] makes a substantial architectural change,
supporting two-way superscalar execution and adding acceleration for floating-point 3D
geometric processing. There are some restrictions on parallelizing instructions: for example, the
SH-4 cannot dispatch two similar operations (ADD with ADD, float with float, etc.), and it
cannot mix certain multicycle instructions with others. However, the chip can mix integer and
floating-point operations with no conflict. The chip is implemented in a 0.25 µm technology,
has a 5-stage pipeline, runs at 200 MHz, and consumes around 1.6 W when it is powered at
1.8 V [30]. It delivers 300 MIPS.
4.5 VLIW Architectures
Currently, VLIW architectures are not commonly found in the processor market. In
high-performance workstation processors, VLIW architectures are beginning to appear with the
future HP/Intel IA-64 [25] [27] and the Transmeta x86/VLIW [10]. These designs are not yet
available on the market: for example, Merced, the first-generation IA-64, is expected in the
year 2000. In contrast, in the low-power embedded processor market VLIW machines have not
yet been introduced. For the moment, only Fujitsu is planning to design a low-power VLIW
processor [28]. The only domains where VLIW architectures can be found are multimedia
and DSP processor systems. Examples include the Texas Instruments TMS320C6201 DSP [74], the
Motorola/Lucent StarCore [78], and the Philips Trimedia [17]. This section gives a brief overview
of these VLIW processors.
4.5.1 The Texas Instruments TMS320C6201
The TMS320C6201 [74] is a VLIW-like DSP processor. It runs at 200 MHz and can issue up
to eight instructions per clock cycle. The core has an eight-way multi-issue 11-stage pipeline
that is divided into two clusters of four units. Each cluster contains a 40-bit integer ALU,
a 40-bit shifter, a 16-bit multiplier and a 32-bit adder. The register file is composed of 32
general-purpose 32-bit registers, divided into two banks of 16 registers, one bank for each
cluster.
The instruction fetch mechanism includes a NOP elimination technique that reduces the
penalty due to the explicit NOP insertion required in conventional VLIW architectures. The
processor fetches a 256-bit meta-instruction, which is composed of eight 32-bit instructions. The
least significant bit of each instruction is used to form execution packets among the eight fetched
instructions. An execution packet defines a group of instructions that can be executed in parallel.
The next meta-instruction is fetched once all eight instructions contained in the current
meta-instruction have been sent to a functional unit. One important feature is that all instructions
can be conditionally executed based on the status of five condition registers.
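The grouping of a meta-instruction into execution packets can be sketched in a few lines of Python. The text only states that the least significant bit of each instruction is used; the exact convention assumed below (a set bit chains the next instruction into the same packet, a cleared bit closes the packet) and the instruction words themselves are assumptions made for illustration.

```python
def split_execute_packets(fetch_packet):
    """Group the 32-bit words of one meta-instruction into execution packets.

    Assumed convention: if the least significant bit of a word is 1, the
    next word belongs to the same execution packet; if it is 0, the
    current word closes the packet.
    """
    packets, current = [], []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:        # bit clear: packet ends here
            packets.append(current)
            current = []
    if current:                  # tolerate a chain left open at the end
        packets.append(current)
    return packets


# Made-up instruction words; only bit 0 matters for this sketch.
fetch = [0x10000001, 0x20000001, 0x30000000,   # three instructions in parallel
         0x40000000,                           # one instruction alone
         0x50000001, 0x60000001, 0x70000001, 0x80000000]  # four in parallel

packets = split_execute_packets(fetch)
print([len(p) for p in packets])
```

With such an encoding a purely sequential code region costs eight single-instruction packets per meta-instruction and no explicit NOPs, which is the code size benefit described above.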
Even with the NOP elimination technique that reduces the code size penalty of the VLIW
architecture, the TMS320C6201 has significant code expansion due to its deep pipeline, lack
of branch prediction, and fixed-length 32-bit instructions. The fast 11-stage pipeline causes
complex operations to have different latencies, making the programming task much more difficult
since there are several delay slots to fill. For example, the 'C6201 has no branch prediction,
therefore all taken branches introduce a 5-cycle penalty which corresponds to a 40-instruction
(5 cycles times 8 instructions per cycle) branch delay slot. The number of delay slots in the
TMS320C6201 is unconventionally large and it is very difficult to find a sufficient number of
instructions to fill them.
The TMS320C6201 exhibits a high power dissipation. In a 0.25 µm technology it consumes,
including cache accesses, 4.65 W at 2.5 V, 200 MHz [62].
4.5.2 The Motorola-Lucent Star*Core
The Star*Core [78] is considered a new generation of VLIW DSP processor. It targets a wide
range of applications by offering a scalable high-performance low-power VLIW DSP architecture.
The Star*Core uses 16-bit instructions and introduces optional instruction prefixes that
enable the full power of a 32-bit instruction set. This variable instruction length mechanism
has a code density that is much better than that of conventional DSPs, and comparable to those of
M·Core and ARM7 with the Thumb option [78].
The Star*Core SC140 [4][3] is the first core of the SC100 family. It is implemented in
Motorola's HIP6 0.13 µm process, and it delivers up to 1.2 billion MAC (multiply-accumulate)
operations per second or 3000 MIPS. It has a total of 16 functional units, including MAC units,
ALUs, bit-field units, and address computation units. Also, it has different sizes for the datapaths:
16 bits for the data, 32 bits for the addresses, and 40 bits for the accumulators. Star*Core's
pipeline is composed of only five stages. With a power supply voltage of 1.5 volts, it runs at
300 MHz and consumes 0.1 mA/MIPS.
4.6 The Philips Trimedia
The Philips Trimedia processor [17] is a 32-bit VLIW multimedia machine that targets digital
TV and full-speed DVD decoding. The Trimedia architecture provides 128 general-purpose
registers and 25 execution units, including constant generators, several ALUs, DSP execution
units, integer multipliers, integer shifters, branch units, load-store units, and floating-point
units. DSP units have instructions with special functions such as a single-cycle 8-bit motion
estimation that works on 32-bit operands. Trimedia also supports conditional execution. In
addition to the VLIW architecture, there is compression hardware that avoids wasting
memory space and bandwidth on NOPs. The TM-1000 is built in a 0.35 µm technology, runs
at 100 MHz, and consumes around 4 W (typ) at 3.3 V [49].
4.7 The HP/Intel IA-64
Intel and Hewlett-Packard are working together to design the new generation of high-performance
workstation processors, the IA-64 [25]. Merced will be the first chip of this family, and it
is based on a VLIW-like architecture called EPIC (Explicitly Parallel Instruction Computing).
Merced has a 64-bit datapath, and is expected to have 128 integer registers and 128
floating-point registers. Furthermore, Merced is a fully predicated execution architecture and has
strong support for speculative execution.
To avoid NOP insertions, IA-64 groups operations in 128-bit bundles that contain three
instructions and one template. The template is used to explicitly describe the available
parallelism between instructions within a bundle.
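The bundle organization can be illustrated with a small packing and unpacking sketch in Python. The field widths used here (a 5-bit template in the low-order bits followed by three 41-bit instruction slots, which adds up to 128 bits) come from the published IA-64 architecture rather than from the description above, and the field values are made up for the example.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41          # 5 + 3 * 41 = 128


def pack_bundle(template, slots):
    """Pack a template and three instruction slots into a 128-bit integer."""
    assert 0 <= template < (1 << TEMPLATE_BITS) and len(slots) == 3
    bundle = template
    for i, slot in enumerate(slots):
        assert 0 <= slot < (1 << SLOT_BITS)
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Return (template, [slot0, slot1, slot2]) from a 128-bit bundle."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
             for i in range(3)]
    return template, slots


# Made-up field values, used only to exercise the packing.
bundle = pack_bundle(0x10, [0x123456789, 0x0ABCDEF01, 0x1FFFFFFFFFF])
template, slots = unpack_bundle(bundle)
print(hex(template), [hex(s) for s in slots], bundle.bit_length())
```

Because the template travels with the three instructions, the hardware can read the parallelism information directly instead of rediscovering it at run time, and the compiler does not need to pad with NOPs to fill a fixed-width issue word.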
The Merced chip should be released in the year 2000, and it will be built in a 0.18 µm
technology. The chip is expected to run at around 800 MHz, and to have a performance advantage
of 20-30% over a RISC-like architecture. In terms of power consumption, a Merced module
containing 4M of full-speed cache is estimated to dissipate more than 70 W [27].
4.8 Comparison
Table 4.1 gives a summary of the features of each of the processors that were described in this
chapter. The upper part of the table is composed of mobile processors, and the lower part is
composed of VLIW processors. The MIPS figures are Dhrystone MIPS for the mobile
processors, and peak MIPS numbers for the VLIW machines (numbers in italic). Furthermore,
numbers beginning with a '?' are estimates.
This table shows that defining the best processor is a hard task because of the many
design parameters. Considering the trade-off between performance and power dissipation of
the mobile processors, the TR4102, the StrongARM, and the SH7750 dominate² all the other
mobile processors (see Figure 4.1(a)). The TR4102 exhibits a very low power consumption,
and it has the best MIPS/Watt rating, meaning that the TR4102 is the processor that requires
the lowest energy to execute the Dhrystone benchmark. However, in terms of performance the
StrongARM and the SH7750 are much better.
² A processor dominates another processor when both its performance and its power dissipation are better.
No.  Model      Vendor       Techno.  Vdd    Freq.    Power   MIPS  MIPS/W  MIPS²/mW
1    StrongARM  Intel        0.35 µm  2 V    230 MHz  360 mW  268   744     200
2    ARM710     VLSI         0.8 µm   3.3 V  25 MHz   120 mW  30    250     8
3    ARM940T    VLSI         0.35 µm  3.3 V  150 MHz  675 mW  ?160  ?237    ?38
4    MMC2001    Motorola     0.35 µm  2 V    34 MHz   80 mW   31    387     12
5    TR4102     LSI          0.25 µm  1.8 V  80 MHz   40 mW   ?90   ?2250   ?203
6    SH7708     Hitachi      0.5 µm   3.3 V  25 MHz   95 mW   25    263     7
7    SH7750     Hitachi      0.25 µm  1.8 V  200 MHz  1.6 W   300   188     56
8    'C6201     TI           0.25 µm  2.5 V  200 MHz  4.6 W   1600  348     557
9    SC140      Mot./Lucent  0.13 µm  1.5 V  300 MHz  500 mW  3000  6000    18000
10   TM1000     Philips      0.35 µm  3.3 V  100 MHz  4 W     2500  625     1563
11   Merced     HP, Intel    0.18 µm  ?      800 MHz  ?70 W   6400  91      585
Table 4.1: Mobile, Embedded, and ILP processor comparison.
Figure 4.1: Comparison: (a) MIPS vs. Power; (b) MIPS vs. mW/MIPS.
(Scatter plots of the mobile processors, labeled 1 to 7: (a) plots power [mW] against MIPS, (b) plots mW/MIPS against MIPS.)
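The dominance relation used in this comparison can be checked mechanically. The following Python sketch applies it to the power and MIPS columns of the mobile processors in Table 4.1, including the values marked as estimates; the function names are illustrative.

```python
# (name, power in mW, Dhrystone MIPS) for the mobile processors of Table 4.1.
mobile = [
    ("StrongARM", 360, 268),
    ("ARM710", 120, 30),
    ("ARM940T", 675, 160),
    ("MMC2001", 80, 31),
    ("TR4102", 40, 90),
    ("SH7708", 95, 25),
    ("SH7750", 1600, 300),
]


def dominates(a, b):
    """True if processor a has both lower power and higher MIPS than b."""
    return a[1] < b[1] and a[2] > b[2]


def non_dominated(processors):
    """Processors that no other processor dominates."""
    return [p[0] for p in processors
            if not any(dominates(q, p) for q in processors)]


front = non_dominated(mobile)
print(front)
```

Only the StrongARM, the TR4102, and the SH7750 remain: every other mobile processor in the table is beaten on both power and performance by at least one of them.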
For the energy-versus-performance trade-o� the same three processors are dominant (see
Figure 4.1(b)). The TR4102 and the StrongARM exhibits roughly the same MIPS2/W number,
which is equivalent to the energy-delay product. The SH7750 has a signi�cantly smaller number
due to its high power dissipation. The only factor that makes the SH7750 a dominant processor,
is its high level of performance, that might be needed in some time critical applications.
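The dominance relation defined in the footnote can be checked mechanically. The following is a minimal sketch, assuming only the MIPS and power figures of the seven mobile processors in Table 4.1 (several of which are themselves estimates); the function names are illustrative and not part of the original study.

```python
# (MIPS, power in mW) for the mobile processors of Table 4.1.
# Entries marked '?' in the table are estimates.
PROCESSORS = {
    "StrongARM": (268, 360),
    "ARM710": (30, 120),
    "ARM940T": (160, 675),
    "MMC2001": (31, 80),
    "TR4102": (90, 40),
    "SH7708": (25, 95),
    "SH7750": (300, 1600),
}


def dominates(a, b):
    """True if a has both higher performance and lower power than b."""
    return a[0] > b[0] and a[1] < b[1]


def non_dominated(processors):
    """Names of the processors that no other processor dominates."""
    return {
        name
        for name, point in processors.items()
        if not any(dominates(other, point) for other in processors.values())
    }


def mips2_per_mw(mips, power_mw):
    """MIPS^2/mW, the figure of merit in the last column of Table 4.1."""
    return mips ** 2 / power_mw


front = non_dominated(PROCESSORS)
```

With these figures, `front` contains the StrongARM, the TR4102, and the SH7750, which agrees with the three dominant processors named in the text.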
The power dissipation of the VLIW architectures is much higher than that of the mobile
processors, with the exception of the StarCore, which has a very low power consumption of
500 mW while delivering a peak performance of up to 3000 MIPS. These numbers are difficult
to compare with those of the mobile processors, because these architectures are dedicated to
very different types of applications. Nevertheless, the StarCore exhibits a very attractive trade-off
between performance and power dissipation.
4.9 Conclusion
This chapter provided an overview of the best low-power 32-bit mobile processors that can be
found on the market. Their architecture, design, instruction set, performance, power
consumption, and code size were described. Generally, for embedded processors, ILP is only
exploited using pipelining techniques. There are a few exceptions. For example, the ARM10
allows several units to work in parallel; however, only one instruction can be issued per cycle.
Also, the SH7750 has introduced a superscalar pipeline; however, it places several restrictions
on the instructions that can be issued in parallel, and although it reaches a very good level of
performance, it consumes much more power than the other mobile processors. Surprisingly, VLIW
architectures have not yet been introduced in the mobile processor market, even though their
inherent simplicity can offer low power consumption and improved performance relative to scalar
architectures. Current VLIW architectures are mostly found in DSP multimedia processors, which
exhibit a high instruction throughput and can have a very low power consumption, like the
Motorola/Lucent StarCore.
The previous chapters showed that parallelism can be used either to increase performance
or to reduce power dissipation. Both aspects of parallelism are very useful for low-power
mobile processors. The rest of this work investigates the trade-offs between energy consumption
and performance in VLIW machines, and how VLIW architectures can be introduced into a
low-power architecture. The next chapter gives a high-level evaluation of the benefits of VLIW
architectures for low-power processors.
Chapter 5
Low-Power VLIW Processors:
A High-Level Evaluation
Previous chapters described how parallelism can either speed up execution or, when it is
used in conjunction with clock frequency and voltage down-scaling, reduce the total energy
consumed to complete a task with no loss of performance. Clearly, designers should embrace
techniques with this favorable characteristic whenever possible.
However, Chapter 4 showed that although pipelining is commonly integrated into low-power
embedded architectures, superscalar or VLIW architectures are generally not introduced
into embedded processors. Investigations into the overall energy efficiency of pipelined and
superscalar architectures when used in general-purpose processors demonstrated that superscalar
execution does not significantly affect the energy efficiency of such processors [24]. This
is mostly due to the hardware overhead introduced by the superscalar architecture. The key to
solving this problem is to exploit parallelism and use pipelining while reducing the overhead found
in superscalar architectures through the use of advanced compiler techniques. With combined
hardware and compiler techniques, much of the work performed by traditional superscalar processors
can be moved from run time to compile time. New developments in the VLIW field, such
as the HP/Intel IA-64 architecture [25], the TI 'C6201 processors [74], and the future
Fujitsu FR-V architecture [28], give a strong motivation for the use of VLIW architectures in
low-power processors.
This chapter gives a first high-level quantitative evaluation of the benefits of VLIW
architectures for energy-efficient processors. To do so, several implementations of scalar
and VLIW architectures are compared in terms of both performance and energy consumption.
The remainder of this chapter describes the experiments that have been carried out for this
evaluation.
5.1 Description of the Experiment
These experiments compare several scalar and VLIW architectures in order to determine whether
VLIW architectures can improve processor energy efficiency. To do so, several VLIW architectures
are derived from the existing CoolRISC family [50], and a comparison is made in terms of
performance and energy consumption. This comparison is made through high-level estimates
of the energy consumption and through the performance achieved on local pieces of code.
As most of the execution time of a program is spent in inner loops, the performance achieved
and the energy consumed in inner loops can be considered representative of the execution of
the entire program. For example, on an HP-PA 7100 processor, 78% [38] of the execution time
of the Perfect Club Benchmark Suite [11] is spent in inner loops. So, this experiment focuses
on the execution of a benchmark suite composed of the inner loops of several programs. This
simplification leads to a loss of precision compared to real code execution on real circuits; however,
this evaluation provides an initial validation for using VLIW architectures in low-power
processors before designing a complete framework composed of a compiler and a circuit.
To run this experiment, the functionalities of a framework developed at the Universitat
Politècnica de Catalunya (DAC, Barcelona, Spain) were extended in order to generate the
code for our different architectures. Figure 5.1 shows the block diagram of this enhanced
framework. The inner loops are extracted from the benchmarks and optimized using the
ICTINEO tool [9], which extracts the inner loops of a FORTRAN program and provides an
optimized dependence graph for each inner loop. ICTINEO performs several optimizations
in order to eliminate unnecessary dependences (which limit instruction-level parallelism) and
unnecessary instructions. The elimination of the unnecessary dependences is performed by keeping
high-level information about the data dependences: for example, in the case of an access to an element
of an array, ICTINEO keeps the index of the element, which allows the memory dependences to
be identified. Indeed, if this information were not kept at the assembly instruction level, it would
be impossible to know whether the indirect memory accesses to the various elements of the array
are independent. The elimination of the unnecessary instructions is achieved using common
subexpression elimination and invariant extraction. The first method eliminates those groups of
instructions that produce the same result. The second extracts from the loop those expressions
that compute a result which does not depend on the iteration. These optimizations reduce the
number of instructions to be executed.
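The two instruction-eliminating optimizations can be illustrated on a small loop. This is only a hand-written sketch of the effect of such transformations, not ICTINEO's implementation or output; the loop body and the function names are invented for the illustration.

```python
def loop_naive(a, b, k):
    """Inner loop as written: recomputes k * k + 1 and a[i] + b[i] twice each."""
    out = []
    for i in range(len(a)):
        x = (a[i] + b[i]) * (k * k + 1)
        y = (a[i] + b[i]) - (k * k + 1)
        out.append(x + y)
    return out


def loop_optimized(a, b, k):
    """Same loop after invariant extraction and common subexpression elimination."""
    scale = k * k + 1  # loop invariant: computed once, outside the loop
    out = []
    for i in range(len(a)):
        s = a[i] + b[i]  # common subexpression: computed once per iteration
        out.append(s * scale + (s - scale))
    return out
```

Both versions return the same values, but the second executes fewer operations per iteration, which is the property the text relies on.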
[Block diagram with the following blocks: Benchmark; Inner Loop Extraction and Loop Optimizations (ICTINEO); Code Generator (CoolRISC); Machine Description; Scheduling (SMS: Swing Modulo Scheduling); Performance and Energy Consumption.]
Figure 5.1: Block diagram of the experimental framework.
After that, the code and its corresponding dependence graph are generated for the CoolRISC
8-bit and 16-bit instruction sets. Then, using the dependence graph and a machine
description, software pipelining is used to schedule the operations, because it is the most
effective compilation technique for loop parallelization. The software pipelining technique used is
Swing Modulo Scheduling (SMS) [39]. SMS tries to produce the maximum performance and,
in addition, includes heuristics to decrease the high register requirements of software pipelined
loops [36]. When the number of registers required is higher than the number available, spill
code [18] (i.e., instructions which temporarily save the contents of some registers into the data
memory) has to be introduced, increasing energy consumption. When required, spill code was
added to software pipelined loops using the heuristics described in [37].
Figure 5.2: Parallel execution of a loop using software pipelining.
Figure 5.2 shows the principle of software pipelining. In the sequential execution of a loop
each iteration and each instruction are executed sequentially. Software pipelining rearranges
the instructions, according to the dependencies and architectural constraints, in order to obtain
a loop divided in SC stages (three in our example) which can be executed in parallel. Every
stage is executed in II (Initiation Interval) cycles, and multiple instructions can be executed in
parallel.
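The benefit of overlapping stages can be estimated with a simple cycle count. The sketch below assumes an idealized model in which a sequential iteration takes SC x II cycles and the pipelined loop starts a new iteration every II cycles; the function names and the example values (100 iterations, II = 2) are hypothetical, and only SC = 3 comes from the example in the text.

```python
def sequential_cycles(n_iter, sc, ii):
    """Cycles when each iteration runs all SC stages before the next one starts."""
    return n_iter * sc * ii


def pipelined_cycles(n_iter, sc, ii):
    """Cycles when a new iteration starts every II cycles (fill and drain included)."""
    return (n_iter + sc - 1) * ii


def active_stages(slot, n_iter, sc):
    """(iteration, stage) pairs that overlap during the given II-cycle slot."""
    return [(i, slot - i) for i in range(n_iter) if 0 <= slot - i < sc]


n_iter, sc, ii = 100, 3, 2
speedup = sequential_cycles(n_iter, sc, ii) / pipelined_cycles(n_iter, sc, ii)
```

For these values the sequential loop takes 600 cycles and the pipelined one 204, so the speed-up approaches SC as the iteration count grows. In slot 2, `active_stages(2, 5, 3)` shows iteration 0 in its last stage while iteration 2 is in its first.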
The following sections describe the compared architectures and the consumption model, and finally
present and discuss the results.
5.2 CoolRISC 816: A Low-power 8-bit Processor
The CoolRISC 816 [50] has been developed by the Centre Suisse d'Electronique et de Microtechnique
(CSEM, Neuchâtel, Switzerland). As this thesis aims at extending the features of the CoolRISC
family, the CoolRISC 816 is the baseline processor for this experiment.
The following subsections present the CoolRISC 816, focusing on its main architectural
features, its performance limitations, and the distribution of its energy consumption.
5.2.1 The CoolRISC 816 Architectural Characteristics
The CoolRISC 816 is designed to be an ultra-low-power embedded 8-bit microcontroller, and
has the following characteristics (core only):
• Harvard architecture: separate code and data memories
• Three-stage non-blocking pipeline (IPC = 1.0)
• Sixteen 8-bit registers
• 22-bit wide instructions
• A maximum of 64k x 22 bits of ROM code memory
• A maximum of 64k x 8 bits of RAM data memory
• 8b x 8b parallel-parallel multiplier
• Clock frequency of up to 18 MHz
• Typical consumption of 105 µW/MHz at 3 volts
• 19,000 transistors
• 0.5 µm three-metal-layer CMOS technology (Mietec)
• 0.8 mm² area
The CoolRISC instruction set contains low-power instructions such as FREQ and HALT,
which respectively reduce the microcontroller's clock frequency and stop all activity
in the processor. The CoolRISC's addressing modes include direct addressing, indirect
addressing with offset, and pre-decrementation or post-incrementation. The ALU's operands
may be registers, immediate values, or memory data. The ALU's result is always stored in a
register, which can be different from the operand registers.
These features allow the CoolRISC 816 to obtain an ultra-low power dissipation while achieving
a good level of performance compared to the other 8-bit microcontrollers that can be found on
the market.
5.2.2 The Performance of CoolRISC 816
The CoolRISC 816 has a non-blocking pipeline which allows it to execute an instruction every
cycle without adding extra delay due to pipeline stalls. From a performance point of view, the
CoolRISC architecture's primary limitation is its clock frequency: the maximum clock frequency
of the CoolRISC 816 core is 18 MHz. Generally, the maximum working frequency is limited
by the access time of the code memory, and the CoolRISC sacrifices access time for low power.
Table 5.1 shows the energy consumption and the access time of the code memory used by the
CoolRISC 816 for a power supply voltage of 3 V. As the access time of the code memory must
be at most one quarter of the clock period, at 18 MHz the required memory access time is about 15 ns. This
means that for code memories with a size greater than 4k words the maximum clock frequency
is imposed by the memory access time.
ROM Size    Energy (typ) [µW/MHz]    Access Time [ns]
256 x 22    75                       5
4k x 22     205                      20
16k x 22    375                      40
Table 5.1: Characteristics of CoolRISC's low-power ROM (Vdd = 3 V).
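The relation between access time and clock frequency can be made explicit. The sketch below assumes only the quarter-period rule stated in the text and the access times of Table 5.1; the helper names are hypothetical.

```python
# Access times of the low-power ROM, taken from Table 5.1 (in ns).
ROM_ACCESS_NS = {"256 x 22": 5.0, "4k x 22": 20.0, "16k x 22": 40.0}

# Fraction of the clock period available for a code memory access (from the text).
ACCESS_FRACTION = 0.25


def max_clock_mhz(access_ns, fraction=ACCESS_FRACTION):
    """Highest clock frequency (MHz) for which access_ns fits in the allowed fraction."""
    return fraction * 1000.0 / access_ns


def required_access_ns(clock_mhz, fraction=ACCESS_FRACTION):
    """Longest access time (ns) compatible with the given clock frequency."""
    return fraction * 1000.0 / clock_mhz


limits = {size: max_clock_mhz(t) for size, t in ROM_ACCESS_NS.items()}
```

Under this rule the three ROM sizes would allow 50, 12.5, and 6.25 MHz respectively, and an 18 MHz clock would need an access time of about 13.9 ns, which is of the same order as the 15 ns quoted above.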
5.2.3 The Energy Consumption of the CoolRISC 816
The energy consumption of the CoolRISC 816 can be divided into three parts: the
core, the data memory, and the code memory. Figure 5.3 shows the typical distribution of
the energy consumption when the CoolRISC is executing a program. This data was obtained by
executing a set of programs and extracting the relative utilization of the core, data memory,
and code memory. The set of programs consisted of a quicksort, a stringsort, an FFT, and
a sine/cosine computation. The average resource utilization is presented in Table 5.2.
[Stacked bar chart; the values recovered from the figure are listed below.]
Memory configuration (code, data)    Core    Code memory    Data memory
256x22, 128x8                        49%     35%            16%
4kx22, 2kx8                          26%     50%            24%
16kx22, 8kx8                         16%     56%            29%
Figure 5.3: Energy consumption distribution in the CoolRISC 816.
Table 5.2: Relative utilization of the core, the code memory, and the data memory
Core           100%, the core is used all the time
Code memory    100%, one instruction is fetched at each cycle
Data memory    40% of instructions access the data memory
Figure 5.3 shows that the energy consumed by the processor core corresponds to less than
50% of the total energy consumption and that the major sources of energy consumption are the
memories.
5.3 Compared Architectures
This section introduces the different architectures that are compared in this experiment. All
of these architectures are based on the CoolRISC architecture and use the same low-power
memories that have been used with the CoolRISC 816. These memories have no sense
amplifiers; consequently, static energy consumption can be neglected. Therefore, there
is no additional penalty due to the width of the instruction words.
The evaluated architectures are divided into two groups: the 8-bit and 16-bit scalar
architectures, and the 8-bit and 16-bit VLIW architectures.
5.3.1 Scalar Architectures
The compared scalar architectures are the 8-bit and 16-bit members of the CoolRISC family.
The CoolRISC 816 (C8) is the baseline processor of this experiment and is described in
Section 5.2. The CoolRISC 1616 (C16) processor is a 16-bit version of the CoolRISC 816, the
only difference being that all the data are 16 bits wide.
5.3.2 VLIW Architectures
VLIW architectures may suffer from an increase in code size due to explicit NOP insertion. To solve
this problem, the new generation of VLIW processors, such as the TI 'C6201 [74] and the HP/Intel
IA-64 [25], use special encoding techniques which eliminate the extra NOP instructions.
Figure 5.4 illustrates this technique. Each VLIW instruction encodes several operations (four in
our example), which may be dependent or independent. An additional field is added to specify
the group of operations that will be executed in parallel. The unit number field specifies
which unit must execute the operation, and the separator bit between two operations within
an instruction is set to '0' if the two can be executed in parallel and to '1' if they must be executed
sequentially. The hardware costs of the NOP elimination are the extra bits added to the code
memory (3 bits per operation in our example) and the crossbar needed to send the operations
to their corresponding units. However, this technique prevents the increase in code size (and
therefore in consumption) due to the extra NOP insertion. For example, in our experiment
a VLIW processor with four units has a speed-up of about 2. This means that, without NOP
elimination, about 50% of the operation slots would hold extra NOPs. Therefore, a VLIW architecture
with extra NOP elimination reduces the code size by about a factor of two compared to a VLIW
processor without extra NOP elimination. This NOP elimination technique is used in all the
compared architectures.
Figure 5.4: VLIW architecture: NOP elimination.
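The separator-bit mechanism can be sketched as a small decoder. The sketch assumes that each operation carries a unit number and one separator bit, with '0' meaning that the next operation may issue in the same cycle and '1' meaning that it must wait, as described in the text; the tuple layout, operation names, and function names are hypothetical and do not reproduce the actual encoding.

```python
from typing import List, Tuple

Op = Tuple[int, str, int]  # (unit number, operation name, separator bit)


def issue_groups(ops: List[Op]) -> List[List[Tuple[int, str]]]:
    """Split a stream of operations into groups that are issued in the same cycle."""
    groups, current = [], []
    for unit, name, separator in ops:
        current.append((unit, name))
        if separator == 1:  # the next operation must execute sequentially
            groups.append(current)
            current = []
    if current:  # flush a trailing group that has no closing separator
        groups.append(current)
    return groups


def slots_with_explicit_nops(groups, n_units=4):
    """Code size, in operation slots, if every cycle stored one slot per unit."""
    return len(groups) * n_units


ops = [
    (0, "ADD", 0), (2, "LOAD", 1),               # cycle 1: two operations
    (1, "SUB", 1),                               # cycle 2: one operation
    (0, "MUL", 0), (1, "ADD", 0), (3, "BRA", 1)  # cycle 3: three operations
]
groups = issue_groups(ops)
```

Here six operations are issued in three cycles on four units, an average of two per cycle. A fixed-width encoding would need 12 slots, half of them NOPs, which matches the factor-of-two code size argument above.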
Heterogeneous VLIW architectures. Heterogeneous VLIW architectures are the
most common among existing VLIW architectures. The term heterogeneous indicates that the
units are different, which in turn means that an operation must be dispatched to a unit capable
of executing it. The compared architectures are the following:
• V8E1: 8-bit VLIW, 1 Branch unit, 2 ALUs, and 1 Load/Store unit;
• V8E2: 8-bit VLIW, 1 Branch unit, 2 ALUs, and 2 Load/Store units;
• V16E1: 16-bit VLIW, 1 Branch unit, 2 ALUs, and 1 Load/Store unit;
• V16E2: 16-bit VLIW, 1 Branch unit, 2 ALUs, and 2 Load/Store units.
Homogeneous VLIW architectures. Homogeneous VLIW architectures are VLIW
architectures composed of several units which are able to execute any kind of operation. We
compare the following architectures:
• V8H1: 8-bit VLIW, four homogeneous units with one memory access at a time;
• V8H2: 8-bit VLIW, four homogeneous units with two memory accesses at a time;
• V16H1: 16-bit VLIW, four homogeneous units with one memory access at a time;
• V16H2: 16-bit VLIW, four homogeneous units with two memory accesses at a time.
The NOP elimination technique described above is used in all of these VLIW architectures.
Nevertheless, when the units are homogeneous there is no need for a crossbar and a unit
number field to dispatch the operations to their corresponding units. An operation is always
executed by the same unit, according to its position in the VLIW instruction.
5.4 Consumption Model
The consumption model is based on the utilization of resources. The energy needed to execute
a task is computed by adding the energy consumed by the different resources:
$E_{oper}$    Energy needed by the processor core for executing an operation;
$E_{code}$    Energy needed for an access to the code memory;
$E_{data}$    Energy needed for an access to the data memory;
$E_{conn}$    Energy consumed in the interconnection (e.g., crossbar);
$E_{RF}$      Extra energy consumption (overhead) due to the increase in the number of register file
              access ports.
After executing a loop, it is possible to know the number of accesses to the various
resources, $N_{resource\_name}$, and therefore to compute an estimate of the energy consumption:
\[
E_T = N_{oper} \cdot E_{oper} + N_{code} \cdot E_{code} + N_{data} \cdot E_{data} + N_{conn} \cdot E_{conn} + N_{RF} \cdot E_{RF} \tag{5.1}
\]
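Equation (5.1) is a weighted sum that is straightforward to evaluate once the access counts are known. The sketch below assumes invented per-access energies in arbitrary units and invented access counts; none of these numbers come from the thesis, and the dictionary keys are merely labels for the five terms of the equation.

```python
RESOURCES = ("oper", "code", "data", "conn", "rf")


def total_energy(counts, energies):
    """Evaluate E_T as the sum over resources of N_resource * E_resource."""
    return sum(counts[r] * energies[r] for r in RESOURCES)


# Hypothetical per-access energies (arbitrary units), for illustration only.
energies = {"oper": 100, "code": 70, "data": 80, "conn": 10, "rf": 5}

# Hypothetical counts for a scalar run: one fetch per operation, 40% data
# accesses, and no crossbar or extra register file ports.
scalar_counts = {"oper": 1000, "code": 1000, "data": 400, "conn": 0, "rf": 0}

e_scalar = total_energy(scalar_counts, energies)
```

For a VLIW configuration, the same function would be called with fewer code memory accesses and non-zero `conn` and `rf` counts, which is how the model captures the redistribution of energy between the core and the code memory.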
5.4.1 Estimate of $E_{oper}$
As the CoolRISC 816 is our reference processor, we base the energy consumption estimates on
the energy consumption characteristics of the C8 processor, which are extracted from the real
implementation of the processor.
Because the compared VLIW architectures use the NOP elimination technique, their
instructions contain predecoded bits that indicate which units must work. Therefore, it is possible
to halt the signal activity of all unused units. As a consequence, the units which do not execute
an operation do not consume any energy.
For the heterogeneous VLIW architectures, the energy needed to execute an operation,
$E_{oper}$, is estimated as the energy consumption of the operational part of the C8 or C16 processor
(pipeline, decoder, register file accesses, ALU operation).
For the homogeneous VLIW architectures, the energy needed to execute an operation, $E_{oper}$,
is estimated as the energy needed to execute a scalar instruction in the C8 or C16.
This high-level modeling of the energy consumed during the execution of an instruction
implies a certain loss of precision. However, one factor limits the impact of the estimation
error: as described in Subsection 5.2.3, the energy consumed in the processor core represents
only a small part (about 30% to 50%) of the total energy consumption.
5.4.2 Estimate of $E_{code}$ and $E_{data}$
The energy needed to execute a memory (code or data) access is estimated through a
statistical energy consumption model of the memory architecture. This model takes into account
the type of memory (RAM or ROM), its size (in words), its geometry (number of rows and
columns), the width of the word, and the power supply voltage. The technological parameters
are extracted from a 0.5 µm CMOS process. In our experiment we use the typical value of the
energy consumption per memory access.
5.4.3 Estimate of $E_{conn}$ and $E_{RF}$
The extra energy consumption due to the interconnection is estimated using a statistical model
of the energy consumption of the crossbar and of the circuit overhead due to the additional
access ports of the register file.
5.5 Benchmarks
For the experimental evaluation, we used a set of 25 integer loops. These loops are divided into
three groups. The first includes five integer loops which operate on 8-bit data: FIR filter,
vector-matrix multiplication, vector-vector multiplication (dot product), vector-vector addition, and function
integration. The second consists of the same five integer loops operating on 16-bit data. Finally,
the third group is composed of fifteen 16-bit integer loops from the Perfect Club Benchmark Suite [11].
5.6 Results
In this section we compare the performance and energy consumption of the architectures
described in Section 5.3. The same power supply voltage (Vdd = 3 V) and clock frequency
(imposed by the access time of the code memory) are used for all the compared processors, and
the experiment is repeated for several memory configurations.
In Figure 5.5 we compare the performance, in terms of speed-up, of the different processors
with respect to the C8 processor.
Figure 5.6 shows the ratio between the energy consumption of the different processors and
that of the C8 while executing our benchmark. It also illustrates how the energy consumption is
distributed among the core, the code memory, the data memory, and the interconnection.
Figure 5.7 shows the ratio between the energy-delay product achieved by the different
processors and by the C8 processor while executing our benchmark.
From these three figures we can observe the advantage of the transition (1) from an 8-bit
to a 16-bit architecture, and (2) from a scalar to a VLIW architecture.
[Bar chart of the speed-up relative to the C8 for the C8, V8E1, V8E2, V8H1, V8H2, C16, V16E1, V16E2, V16H1, and V16H2 processors; the plotted values range from 1.0 to 7.3.]
Figure 5.5: Speed-up comparison.
[Stacked bar chart of the energy consumption relative to the C8, split into core, code, data, and interconnection, for the ten processors (C8, V8E1, V8E2, V8H1, V8H2, C16, V16E1, V16E2, V16H1, V16H2) and three memory configurations: code 256 instructions and data 128 words; code 4k instructions and data 2k words; code 16k instructions and data 8k words.]
Figure 5.6: Energy comparison.
[Bar chart of the energy-delay product relative to the C8 for the same ten processors and the same three memory configurations (code 256 instructions and data 128 words; code 4k instructions and data 2k words; code 16k instructions and data 8k words); the plotted values range from 0.09 to 1.00.]
Figure 5.7: Energy-Delay Product comparison.
The transition from an 8-bit to a 16-bit architecture yields a major improvement in the
energy-delay product (approximately a factor of four). This is a consequence of the smaller
number of instructions required to execute the benchmark, which contains a majority of 16-bit
data. Performance increases by a factor of about 2.4 while energy consumption decreases by a
factor of about 1.7. This result shows the importance of having an architecture able to
efficiently process the data of the application.
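The factor of four follows directly from the two figures just quoted, since the energy-delay product is the product of energy and execution time. A minimal check, using only the speed-up and energy reduction stated in the text (the function name is illustrative):

```python
def edp_improvement(speedup, energy_reduction):
    """Factor by which the energy-delay product shrinks.

    The delay is divided by `speedup` and the energy by `energy_reduction`,
    so their product is divided by speedup * energy_reduction.
    """
    return speedup * energy_reduction


factor = edp_improvement(2.4, 1.7)  # C8 to C16 transition
relative_edp = 1.0 / factor         # energy-delay product relative to the C8
```

This gives a factor of about 4.1, that is, a relative energy-delay product of roughly 0.25.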
The transition from a scalar to a VLIW architecture significantly improves the energy-delay
product (by a factor varying between 2.0 and 2.8). Indeed, VLIW architectures achieve better
performance while consuming approximately the same energy as scalar architectures. This
observation is explained by a redistribution of the energy consumption: the increase in the energy
consumption of the VLIW core is compensated by a decrease in the energy consumption of the
code memory. The increase is due to the circuit overhead introduced by the interconnections
($E_{RF}$ and $E_{conn}$). The decreased consumption of the code memory, on the other hand, has
two explanations. First, the memories employed do not have sense amplifiers, and therefore do not
consume static energy (i.e., there is no penalty for the larger instruction words). Second, a
VLIW processor requires less energy to fetch an operation than a scalar architecture. In fact,
as the energy consumed by the line decoder is independent of the width of the word, the energy
needed to fetch four instructions simultaneously is less than four times the energy
needed to fetch one instruction.
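The second argument can be illustrated with a two-term model of a memory access. The sketch assumes that the line decoder costs a fixed energy per access, as stated in the text, and that the remaining energy grows linearly with the number of operations read; the linear term and both numerical values are assumptions made only for this illustration.

```python
# Hypothetical energies in arbitrary units, not measured values.
E_DECODER = 40  # line decoder energy, independent of the word width
E_PER_OP = 35   # energy attributed to reading the bits of one operation


def fetch_energy(n_ops, e_decoder=E_DECODER, e_per_op=E_PER_OP):
    """Energy of one code memory access that reads n_ops operations at once."""
    return e_decoder + n_ops * e_per_op


wide_fetch = fetch_energy(4)         # one access to a four-operation word
narrow_fetches = 4 * fetch_energy(1)  # four accesses to a one-operation word
```

With these values the wide fetch costs 180 units against 300 for four narrow fetches, because the decoder energy is paid once instead of four times.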
The main difference between homogeneous and heterogeneous VLIW architectures is in
terms of performance. The former reach a higher level of performance due to their higher
machine parallelism; however, the downside is the higher core complexity, which entails a higher
energy consumption. Therefore, if the ILP is insufficient, the homogeneous and heterogeneous
VLIW architectures have a similar energy-delay product, since there is no significant difference in
terms of speed-up. On the other hand, if sufficient ILP can be extracted, then the homogeneous
architecture attains a higher level of performance (as is the case for our 8-bit processors),
ultimately eclipsing the heterogeneous one with respect to the energy-delay product.
5.7 Conclusion
In this chapter we have shown that an adaptation of high-performance architectures, such as
the VLIW architecture, to low-power embedded 8-bit or 16-bit microcontrollers using low-power
memories yields a significant improvement in the energy-delay product compared to a scalar
processor. This improvement, by a factor varying between two and three, is obtained through
a redistribution of the energy consumption, which enables a higher level of performance while
keeping the energy consumption at the same level. We have also shown the importance of using
a processor adapted to the size of the data in order to minimize the number of instructions
executed, which leads to a decrease in the energy consumption and in the execution time.
Our results are based on loop parallelization and on a high-level energy consumption
model, which allow us to validate the use of VLIW architectures for high-performance
low-power processors and to identify which VLIW architecture provides the best results. The next
step will be to develop a complete VLIW compiler and a prototype of such a low-power VLIW
processor in order to extend these results to the real world.
Chapter 6
The DEVIL Low-power Processor
The previous chapter validated the use of VLIW architectures for low-power processors
using high-level estimates. The next step in the design flow is to define and implement a
low-power VLIW processor with its compiler in order to obtain more accurate results about the
system features.
This chapter describes the instruction set architecture of DEVIL, our low-power VLIW
architecture. First, the design trade-offs of VLIW architectures are revisited in order to highlight the important
points that have to be taken into account when defining a new VLIW architecture. Second,
the DEVIL architecture is described and the design decisions are motivated. Third, the DEVIL
processor is evaluated in terms of performance and memory utilization. Finally, a comparison
with existing instruction set architectures is made.
6.1 Where Is The Complexity in VLIW Architectures?
Introducing a multiple-issue pipeline into a processor adds complexity to the architecture. This
section revisits the architectural changes that are necessary to introduce a VLIW-like pipeline
into an architecture in order to better understand what the trade-offs are and where new
solutions should be found.
6.1.1 Hardware Duplication
The most obvious increase in complexity is probably the hardware duplication (i.e., unit repli-
cation) needed to execute more than one instruction per cycle. Hardware duplication results in
an increase in the circuit die area, implying a higher circuit cost and potentially a higher power
consumption.
Although increasing the number of functional units (FUs) of a superscalar pipeline raises
the number of instructions that can be executed in parallel, there are other factors that limit
the achievable parallelism:
• The number of registers of the architecture versus the register pressure,
• The number of register ports,
• The type of instructions,
• The ILP available in the application (dependencies).
Therefore, the amount of hardware duplication should be adapted to these constraints.
Number of FUs versus number of registers. Executing more than one instruction
per cycle increases the register requirements. When the number of live registers is greater than
the number of available registers, spill code has to be inserted to temporarily save and restore
registers to and from data memory. As the number of accesses to the data memory that can be
made in parallel is generally bounded (usually no more than two), this extra code is likely to
result in a degradation of system performance.
Number of FUs versus number of register file ports. Adding extra functional
units also means that the units must exchange data via the register file and the data memory,
resulting in an increase in the number of access ports to such data storage elements. However,
increasing the number of access ports directly affects the complexity, access time, and power
consumption of the register file (or of the memory). This explains why the number of access ports
is generally limited, bounding the machine parallelism.
Number of FUs versus types of FU. The type of instructions in a program (e.g.,
branches, ALU operations) can also limit the efficiency of the hardware duplication. Indeed,
although it is simple to parallelize computationally intensive code, it is much more difficult to
parallelize conditional branches, and generally processors can issue only one branch per cycle.
As applications contain a significant number of branches (around 20% of the total number of
instructions [29]), this severely limits the machine parallelism. The same problem occurs with
memory operations.
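The limit imposed by a scarce unit type can be expressed as a simple throughput bound. The sketch assumes that a fraction of all instructions belongs to one class and that only a fixed number of units can execute that class each cycle, so the sustained issue rate cannot exceed the number of units divided by that fraction. The 20% branch ratio is the figure quoted above; the 40% memory ratio is the CoolRISC figure from Table 5.2, reused here only as an example; the function name is hypothetical.

```python
def ipc_bound(issue_width, class_fraction, units_for_class=1):
    """Upper bound on instructions per cycle imposed by one instruction class.

    If class_fraction of all instructions need one of units_for_class units,
    at most units_for_class / class_fraction instructions complete per cycle.
    """
    return min(issue_width, units_for_class / class_fraction)


branch_limit = ipc_bound(issue_width=8, class_fraction=0.20)  # one branch unit
memory_limit = ipc_bound(issue_width=4, class_fraction=0.40)  # one load/store unit
```

Under these assumptions a single branch unit caps an eight-issue machine at 5 instructions per cycle, and a single load/store unit caps a four-issue machine at 2.5, before any dependence between instructions is even considered.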
Number of FUs versus available ILP. The machine parallelism has to be adapted
to the parallelism that can be extracted from the targeted applications.
The choice of the amount of machine parallelism should take into account all the above
factors in order to obtain the best trade-off between the performance improvement and the hardware
overhead.
6.1.2 Code Memory
Chapter 5 showed the importance of code memory utilization in terms of power consumption
and performance. VLIW architectures, by their nature, strongly modify the interface between
the processor and the code memory. As a result, VLIW machines can incur in a big penalty
in terms of code size and memory bandwidth, directly a�ecting the circuit die area, the energy
consumption, the cost, and the instruction cache performance.
Originally, VLIW processors encoded in their instruction words the operations that each
functional unit should execute at the same time, resulting in the insertion of explicit no-operation
instructions (NOPs) for unused functional units. These NOP insertions result in an increase in
code size, in memory bandwidth, and in the energy consumption of the code memory. Another
factor that affects the code memory utilization is the need for superscalar optimizations to
extract more parallelism from programs. Such optimizations generally imply a large amount of
code duplication (e.g., loop unrolling, tail duplication), resulting in a non-negligible increase in
code size.
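As a rough illustration of the NOP overhead described above, the following Python sketch counts
the operation slots that a classic fixed-format VLIW stores for a given schedule. The schedule,
the operation names, and the 2-slot width are hypothetical values chosen for illustration, not
measurements from this work.

```python
def vliw_slots(schedule, issue_width):
    """Return (stored_slots, useful_ops) for a fixed-format VLIW.

    schedule: list of cycles, each a list of operations issued together.
    Every cycle occupies issue_width slots; unused slots hold explicit NOPs.
    """
    useful = sum(len(cycle) for cycle in schedule)
    stored = issue_width * len(schedule)
    return stored, useful


# Hypothetical 4-cycle schedule on a 2-issue machine.
schedule = [["add", "ld"], ["sub"], ["st"], ["cmp", "br"]]
stored, useful = vliw_slots(schedule, issue_width=2)
nop_fraction = (stored - useful) / stored  # share of the code made of NOPs
```

In this toy schedule a quarter of the stored slots are NOPs; the share grows as the issue width
increases or as the available parallelism drops.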
Increase in code size: Code size directly affects the code memory die area, resulting
in an increase of the system cost. Figure 6.1 shows the die area as a function of the memory
size for an ultra-low-power embedded memory developed by the CSEM.
Power consumption is also correlated with the memory code size. Figure 6.2 illustrates
the relation between code size and power consumption.
Increase in memory width: Accessing a wider memory implies a greater energy
consumption. Figure 6.2 shows the increase in the energy required to access a 64-bit wide
memory compared to a 32-bit one as a function of the memory size.
Figure 6.1: ROM code memory die area as a function of code size.
Increase in the number of accesses to the code memory: The energy consumption
of the code memory depends linearly (in the case of a static design) on the number of accesses
to the memory, which highlights the importance of reducing the traffic between the processor
core and the memories.
Chapters 4 and 5 described several solutions that reduce these negative effects
by including additional scheduling information in the instruction word. This mechanism (e.g.,
bundle formation in IA-64) requires extra hardware in order to dispatch the
instructions according to the encoded scheduling information. This approach trades off the
inherent simplicity of a VLIW's fetch mechanism for a reduction of the instruction memory
overhead, while keeping the parallelism detection much simpler than the dynamic instruction
schedulers found in superscalar processors.
6.2 Definition of the DEVIL Processor
DEVIL is a 32-bit VLIW machine that contains two ALUs, one branch unit, and one load/store
unit, implemented with a 3-stage pipeline. DEVIL can issue up to two instructions per cycle
with the restriction that neither two branch operations nor two load/store operations can be
parallelized together. Furthermore, DEVIL proposes a new encoding mechanism that combines
a NOP elimination technique (i.e., encodes scheduling information) with a variable instruction
length mechanism. Figure 6.3 depicts the block diagram of the DEVIL processor.
DEVIL targets the 32-bit mobile processors market and aims to be used as an ASIC
core. This imposes strong constraints in terms of system cost, circuit die area, and power
consumption.
56 The DEVIL Low-power Processor
Figure 6.2: Power consumption of the ROM code memory as a function of code size.
6.3 DEVIL's Registers
In order to support parallel execution, the register file generally requires a greater number of
registers. Scott Mahlke et al. [42] have shown that 16 registers are sufficient to exploit ILP in
multiple-issue machines with no performance loss due to the register pressure. DEVIL contains
16 32-bit general purpose registers, like the majority of the current scalar mobile processors.
DEVIL also has some dedicated registers, called macro-registers. DEVIL's available registers are:
• r0-r15: 32-bit general purpose registers,
• sp: 32-bit stack pointer (=r15),
• pc: 32-bit program counter,
• retaddr: 32-bit return address, used to save the pc during a jump to subroutine instruction,
and also to restore the pc when a return from subroutine instruction is executed.
• retaddri: 32-bit return from interrupt address, used to save the pc while handling an
interrupt, and also to restore the pc when a return from interrupt instruction is
executed.
• sr: status register that contains the comparison flag T and the current interrupt level.
6.4 DEVIL's Instruction Set
[Figure 6.3 blocks: FETCH, DECODE, and ALU/MEM/WB stages; code memory (address and data
buses); dispatcher and decoders issuing ALU1, ALU2, load/store, and branch operations; register
file of 16 32-bit registers; left functional unit (LU) and right functional unit (RU), each with a
shifter and an ALU; macro-register unit with PC and T; data memory unit with MAR; data memory;
interrupt controller.]
Figure 6.3: Block diagram of the DEVIL architecture.
DEVIL's instruction set is based on a standard RISC instruction set, meaning that memory
operands can only be accessed using load/store instructions. Choosing a RISC-like approach
allows a simpler and faster pipeline, and simplifies the introduction of a superscalar pipeline,
at the cost of a smaller code density.
In order to avoid this major drawback, DEVIL introduces a variable instruction length
mechanism similar to the one found in the ARM Thumb extension or in the TinyRISC. DEVIL
instructions use either a 15-bit format (short instruction) or a 30-bit format (large instruction).
Large instructions can encode large immediate values and offer the possibility to specify a
destination register different from the source. In short instructions the immediate value size is
limited and, for operations requiring two sources and one destination, the destination must be
the same as one of the sources.
The following subsections describe the features of DEVIL's instruction set. Appendix A
contains more detailed information about DEVIL's instructions.
6.4.1 Arithmetical Operations
DEVIL supports only simple 32-bit integer operations and does not include multiplication
and division instructions. The destination operand is always one of the 16 general purpose
registers, and the source operands can be either registers or immediate values. As a general
rule, short instructions use 5-bit immediate values and can only specify two operands, while
large instructions allow 16-bit immediate values and three operands. Furthermore, some large
operations can shift one source operand and be conditionally executed depending on the T flag
with no overhead. Table A.1 describes DEVIL's ALU operations.
6.4.2 Logical Operations
Logical operations are described in Table A.2, and can be classified into two categories: (1)
logical operations between one register and one immediate value; (2) logical operations between
registers.
The logical operations with immediate values are only available as large instructions. The
immediate values are 16 bits wide, which implies that such operations work on half-words (the
other half remaining unchanged). An instruction extension .l or .h indicates whether the
operation applies to the two least significant or the two most significant bytes, respectively.
These operations are particularly useful for bit field manipulations.
Logical operations between registers can be specified as either short or large operations.
In large operations three operands can be specified, instead of two for short operations. Large
instructions can also shift one source operand and be conditionally executed depending on the
T flag with no overhead.
6.4.3 Compare Operations
DEVIL's instruction set contains only five of the ten standard integer comparison operations.
The remaining five conditions are obtained by using the inverse of the comparison flag (T).
For example, conditional branch instructions can jump either if the comparison is true or if
it is false, allowing all kinds of conditional jumps (see subsection 6.4.5). This method reduces
the number of comparison operations from 20 to 10, and is also used in the Motorola M·Core
family.
The result of a comparison operation is always stored in the T macro register. The
comparison can be made between registers or between an immediate value and a register. Short
operations support 5-bit immediate values that can be signed or unsigned depending on the
type of comparison. Large operations support up to 20-bit immediate values. Furthermore, in
the 30-bit format, comparisons between registers can shift one operand and be conditionally
executed.
In addition to these comparison instructions, there is also a bit test operation that copies
the tested bit into the T flag. Table A.3 summarizes DEVIL's comparison instructions.
6.4.4 Move Operations
Table A.4 summarizes the set of move operations available in the DEVIL architecture. DEVIL
has a standard mov operation. It is also possible to load a 6-bit or a 20-bit (depending
on the instruction size) signed immediate value into a register using the ldi instruction. If a
register needs to be loaded with an immediate value larger than 20 bits, the most significant
part can be loaded with an ori.h operation (see subsection 6.4.2).
There is also a set of move instructions that allows data to be exchanged between the
register file and the macro register file, for example to allow the return address register
to be saved.
The conditional move operations add partial-predication support to DEVIL, which can be
used to reduce the penalty due to branches and also, in some cases, to avoid tail duplication
during superblock formation.
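The ldi / ori.h sequence mentioned above can be modelled in a few lines of Python. This is only a
sketch: it assumes that the large ldi sign-extends its 20-bit immediate to 32 bits and that ori.h
ORs a 16-bit immediate into the two most significant bytes, as the text suggests. The function
names are hypothetical and do not come from Appendix A.

```python
MASK32 = 0xFFFFFFFF


def ldi(imm20):
    """Assumed model of the large ldi: signed 20-bit immediate, sign-extended."""
    assert -(1 << 19) <= imm20 < (1 << 19)
    return imm20 & MASK32


def ori_h(reg, imm16):
    """Assumed model of ori.h: OR a 16-bit immediate into the upper half-word."""
    assert 0 <= imm16 <= 0xFFFF
    return (reg | (imm16 << 16)) & MASK32


def load_const32(value):
    """Build an arbitrary 32-bit constant with ldi followed by ori.h."""
    low = value & 0xFFFF            # non-negative, so ldi leaves the upper half cleared
    high = (value >> 16) & 0xFFFF   # OR-ed into the two most significant bytes
    return ori_h(ldi(low), high)
```

Loading only the low 16 bits with ldi keeps the immediate positive, so sign extension cannot set
upper bits that the subsequent OR would be unable to clear.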
6.4.5 Branch Operations
Table A.5 describes the branch instructions of the DEVIL processor.
Branch targets can be specified as a displacement relative to the Program Counter or as
the contents of a register. The displacement is a 10-bit signed value for the short instruction
format or a 25-bit signed value for the large instruction format.
The outcome of conditional branches depends on the value of the flag T, and the branch
can be taken either when T is set or when T is cleared. This double state sensitivity is necessary
because DEVIL's compare operations only implement one half of all possible comparisons.
Furthermore, conditional branches specify whether or not instructions in the delay slot
have to be nullified depending on the outcome of the branch. Thanks to this nullify mechanism,
compiler static branch predictions can be done at a negligible hardware cost. Subsection 6.6.4
provides more information about the use, the efficiency, and the negative effects of this static
branch prediction mechanism.
6.4.6 Data Memory Operations
DEVIL allows memory to be accessed only via load/store instructions. Table A.6 shows the
supported load/store operations, and their corresponding addressing modes.
Short operations allow only simple addressing modes: (1) the register indirect mode, which
uses the content of a register to address the data memory, and (2) the stack pointer relative
mode, which is used to address the elements that are in the stack frame. The memory location
is computed by adding a 5-bit displacement to the stack pointer. This mode is often used for
spill/fill code.
Large operations extend the available addressing modes. The register indirect mode becomes
a register indirect plus register offset mode, and the stack pointer relative mode becomes a
register plus displacement mode, where the displacement field is 16 bits wide. Furthermore, a
new mode is introduced that allows a data memory location to be addressed directly with a label.
6.5 The DEVIL Instruction Fetch Mechanism
DEVIL's instruction fetch mechanism is designed to deliver the high level of performance of
a 2-issue processor while keeping code size and memory bandwidth to a minimum. To do so,
DEVIL supports a variable instruction length mechanism in conjunction with a NOP elimination
technique.
DEVIL fetches a 64-bit instruction bundle that is divided into five different parts:
• tag: A 4-bit instruction tag that encodes instruction width information and instruction
scheduling.
• s0: 15 bits that can contain either a 15-bit instruction or the most significant half of a
30-bit instruction.
• s1: 15 bits that can contain either a 15-bit instruction or the most significant half of a
30-bit instruction.
• s2: 15 bits that can contain either a 15-bit instruction or the least significant half of the
second 30-bit instruction in the bundle.
• s3: 15 bits that can contain either a 15-bit instruction or the least significant half of the
first 30-bit instruction in the bundle.
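The bundle layout listed above can be sketched as a small field extractor. The bit positions are
an assumption: the text gives the field widths (4 + 4 x 15 = 64 bits) and Figure 6.4 lists the
fields in the order tag, s0, s1, s2, s3, but the exact placement within the 64-bit word is not
stated, so the tag is placed here in the most significant bits. The function names are hypothetical.

```python
def split_bundle(bundle):
    """Split a 64-bit bundle into (tag, [s0, s1, s2, s3]).

    Assumed layout, most significant bit first:
    tag[4] s0[15] s1[15] s2[15] s3[15].
    """
    assert 0 <= bundle < (1 << 64)
    tag = bundle >> 60
    fields = [(bundle >> shift) & 0x7FFF for shift in (45, 30, 15, 0)]
    return tag, fields


def large_instruction(msb_half, lsb_half):
    """Concatenate two 15-bit halves into one 30-bit large instruction."""
    return (msb_half << 15) | lsb_half


# Example bundle with tag 1011 and recognizable field values.
example = (0b1011 << 60) | (1 << 45) | (2 << 30) | (3 << 15) | 4
tag, (s0, s1, s2, s3) = split_bundle(example)
```

With tag 1011, the large instruction of the bundle would be rebuilt as large_instruction(s1, s3).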
This subdivision makes it possible to encode in the instruction bundle a mix of short (15 bits)
and large (30 bits) instructions that can be executed at different cycles. The 4-bit tag encodes the
size of the different instructions as well as their scheduling information. Table 6.1 summarizes the
different execution modes of a DEVIL instruction bundle. For example, when the fetch unit
decodes Tag = 1011, it sends at time 0 a short instruction composed of the 15 bits of s0 to slot
0, then sends at time 1 one large instruction made of the concatenation of s1 and s3 to slot
0 in parallel with a short instruction composed of s2 to slot 1.
Tag    Slot 0            Slot 1            Time
0000   s0 + s3 (large)   s1 + s2 (large)   0
0001   s0 (short)        nop               0
       s1 + s3 (large)   nop               1
       s2 (short)        nop               2
0010   s0 (short)        s1 + s3 (large)   0
       s2 (short)        nop               1
0011   s0 (short)        nop               0
       s1 (short)        nop               1
       s2 + s3 (large)   nop               2
0100   s0 (short)        nop               0
       s1 (short)        nop               1
       s2 (short)        nop               2
       s3 (short)        nop               3
0101   s0 (short)        s1 (short)        0
       s2 (short)        nop               1
       s3 (short)        nop               2
0110   s0 + s3 (large)   nop               0
       s1 + s2 (large)   nop               1
0111   s0 + s3 (large)   nop               0
       s1 (short)        s2 (short)        1
1000   s0 + s3 (large)   nop               0
       s1 (short)        nop               1
       s2 (short)        nop               2
1001   s0 (short)        nop               0
       s1 (short)        s2 (short)        1
       s3 (short)        nop               2
1010   s0 (short)        nop               0
       s1 (short)        s2 + s3 (large)   1
1011   s0 (short)        nop               0
       s1 + s3 (large)   s2 (short)        1
1100   s0 (short)        nop               0
       s1 (short)        nop               1
       s2 + s3 (large)   nop               2
1101   s0 (short)        nop               0
       s1 (short)        nop               1
       s2 (short)        s3 (short)        2
1110   s0 (short)        s1 (short)        0
       s2 (short)        s3 (short)        1
1111   s0 + s3 (large)   s1 (short)        0
       s2 (short)        nop               1
Table 6.1: Execution modes of DEVIL's instruction bundles.
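Table 6.1 can be written down directly as data. The Python sketch below is an illustration, not
the dispatcher's actual implementation: it transcribes the table as printed above (including the
identical rows for tags 0011 and 1100) and checks that every tag consumes each of the four 15-bit
fields exactly once. The names SCHEDULES, issue_cycles, and uses_all_fields are hypothetical.

```python
# For each tag: one (slot 0, slot 1) pair per issue cycle.
# "s0+s3" is a large instruction, "s1" a short one, None an implicit nop.
SCHEDULES = {
    0b0000: [("s0+s3", "s1+s2")],
    0b0001: [("s0", None), ("s1+s3", None), ("s2", None)],
    0b0010: [("s0", "s1+s3"), ("s2", None)],
    0b0011: [("s0", None), ("s1", None), ("s2+s3", None)],
    0b0100: [("s0", None), ("s1", None), ("s2", None), ("s3", None)],
    0b0101: [("s0", "s1"), ("s2", None), ("s3", None)],
    0b0110: [("s0+s3", None), ("s1+s2", None)],
    0b0111: [("s0+s3", None), ("s1", "s2")],
    0b1000: [("s0+s3", None), ("s1", None), ("s2", None)],
    0b1001: [("s0", None), ("s1", "s2"), ("s3", None)],
    0b1010: [("s0", None), ("s1", "s2+s3")],
    0b1011: [("s0", None), ("s1+s3", "s2")],
    0b1100: [("s0", None), ("s1", None), ("s2+s3", None)],
    0b1101: [("s0", None), ("s1", None), ("s2", "s3")],
    0b1110: [("s0", "s1"), ("s2", "s3")],
    0b1111: [("s0+s3", "s1"), ("s2", None)],
}


def issue_cycles(tag):
    """Number of cycles needed to issue all operations of a bundle."""
    return len(SCHEDULES[tag])


def uses_all_fields(tag):
    """Check that the tag consumes each of s0..s3 exactly once."""
    used = []
    for slot0, slot1 in SCHEDULES[tag]:
        for op in (slot0, slot1):
            if op is not None:
                used.extend(op.split("+"))
    return sorted(used) == ["s0", "s1", "s2", "s3"]
```

Because the nops are implied by the tag rather than stored, a bundle always carries 60 bits of
useful instruction fields regardless of how its operations are spread over cycles.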
Figure 6.4 shows how bundles can be formed from scheduled assembly code. This example
shows an interesting case that illustrates how alignment problems can be solved thanks to the
fact that short operations represent a subset of large operations. Another interesting fact is that
the next bundle is fetched only once all the operations of the current bundle are issued, meaning
that sometimes bundles must be filled with NOPs so as to bundle together the operations
that are scheduled at the same time. When code size is more important than performance, this
constraint can be removed at the cost of decreased performance.
[Figure 6.4 shows scheduled code (slot 0 and slot 1, cycles 0 to 5, with L = large operation and
S = short operation) packed into bundles laid out as tag, s0, s1, s2, s3. Two annotations mark
the cases "Promoted to large operation to fill bundle" and "Insertion of a large nop operation to
fill bundle".]
Figure 6.4: Instruction bundle formation in the DEVIL processor.
6.6 DEVIL's Pipeline
DEVIL's architecture is based on a simple 3-stage pipeline. This choice was made to avoid the
extra logic and buses that are required to bypass operands in deeper pipelines. Indeed, in a
VLIW architecture several units can provide a result at each clock cycle and consequently the
bypass logic should be duplicated. In the case of the DEVIL architecture, three results can be
written at the same time, requiring three bypass subsystems. Using a 3-stage pipeline avoids
this circuit overhead.
Furthermore, deepening the pipeline in a superscalar datapath increases the number of
penalty cycles for mispredicted branches, implying the need for a more efficient (and thus more
complex) branch prediction mechanism. DEVIL has a simple branch prediction mechanism
(see 6.6.4) that sometimes requires code duplication. Increasing the delay slot size would result
in greater code expansion, probably forcing the addition of a Branch Target Buffer (BTB).
The remainder of this section describes the pipelined execution of the different types of
instructions.
6.6.1 Pipelined Execution for ALU Operations
The execution of an ALU operation is decomposed into three stages: Fetch, Decode, ALU-WB.
The instruction fetch occurs at cycle T1. Cycle T2 is used to decode the instruction and to read
the operands in the register file. In the last cycle, T3, the ALU operation is executed and the
result is written into the register file at the end of the second half of T3.
          T1      T2      T3      T4      T5
Instr.1   Fetch   Decode  ALU-WB
Instr.2           Fetch   Decode  ALU-WB
Instr.3                   Fetch   Decode  ALU-WB
Figure 6.5: DEVIL's pipeline: ALU operations.
6.6.2 Pipelined Execution for Memory Operations
Figure 6.6 shows the 4-cycle pipeline execution of a memory operation. Memory operations
require one cycle more than ALU operations because of the address computation. During cycle
T1 the instruction is fetched. In cycle T2 the instruction is decoded and the register file is
accessed. Cycle T3 is used to compute the address of the memory access. The memory access
and the writeback (in case of a load operation) are made in cycle T4.
          T1      T2      T3      T4
Mem op    Fetch   Decode  Addr    Mem-WB
Instr.1           Fetch   Decode  ALU-WB
Figure 6.6: DEVIL's pipeline: memory operations.
This extra cycle of latency adds a load delay slot, meaning that instructions (e.g., Instr.1)
that immediately follow a load operation cannot use the destination register of the load as a
source operand. When no suitable instruction can be moved into the delay slot, a NOP
should be inserted. Furthermore, the writeback of a load operation is made at the same time
as that of the instructions scheduled in the next cycle, resulting in a potential resource conflict
or the need to add an extra register file write port. This latter solution has been used in the DEVIL
implementation.
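The load delay slot rule can be illustrated with a minimal Python sketch. It only covers the
fallback case described above (inserting a NOP when the next operation reads the loaded register)
on a single-issue sequence; a real scheduler would first try to move an independent instruction
into the slot. The Op record and its fields are hypothetical and are not DEVIL's actual
representation.

```python
from collections import namedtuple

# Hypothetical operation record: mnemonic, destination, sources, load marker.
Op = namedtuple("Op", "name dest srcs is_load")
NOP = Op("nop", None, (), False)


def fill_load_delay_slots(ops):
    """Insert a NOP after a load whose result is read by the very next op."""
    out = []
    for op in ops:
        prev = out[-1] if out else None
        if prev is not None and prev.is_load and prev.dest in op.srcs:
            out.append(NOP)  # the loaded value is not available yet
        out.append(op)
    return out


code = [
    Op("ld", "r1", ("r2",), True),
    Op("add", "r3", ("r1", "r4"), False),  # reads r1 right after the load
]
fixed = fill_load_delay_slots(code)
```

Here fixed contains ld, nop, add; had the add read only r4, the sequence would be left unchanged.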
6.6.3 Pipelined Execution for Branch Operations
Branch operations have a three-cycle execution time. In the first stage the branch instruction
is fetched. Then, during decoding, the next PC is computed, allowing the correct instruction
fetch to be executed in the following cycle. Therefore, there is a one-cycle branch delay slot,
implying a branch misprediction penalty of one cycle. A last phase is used to save the PC in
the retaddr macro register when a jump to subroutine is executed.
               T1      T2      T3      T4       T5      T6
test.cc        Fetch   Decode  ALU-WB
jt/jnt_nn/nt           Fetch   Dec-PC  Save PC
Instr.3                        Fetch   Decode   ALU-WB
Instr.4                                Fetch    Decode  ALU-WB
Figure 6.7: DEVIL's pipeline: conditional branch operations.
Figure 6.7 shows the pipelined execution of a conditional branch instruction. At cycle T1,
a comparison instruction is fetched; the result of the comparison is computed during
cycle T3. The conditional branch instruction is fetched at cycle T2, and during cycle T3 the
branch operation is decoded and the new PC is computed according to the comparison's result
(computed in parallel). Furthermore, during this phase of execution the processor decides
whether or not it should nullify the instruction fetched during T3. This decision is made
according to the comparison's result and the branch prediction information (_nt = nullify
taken, _nn = nullify not taken). Finally, during cycle T4 the PC can be saved into the retaddr
macro register if needed.
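The nullification decision described above reduces to two small predicates, sketched here in
Python. The reading of the mnemonics (jt jumps when T is set, jnt when T is cleared) is an
assumption based on the names shown in Figure 6.7, and the function names are hypothetical.

```python
def is_taken(t_flag, jump_if_true):
    """Branch outcome: jt (jump_if_true=True) is assumed to jump when T is set,
    jnt (jump_if_true=False) when T is cleared."""
    return t_flag == jump_if_true


def nullify_delay_slot(taken, mode):
    """Decide whether the delay-slot instruction is nullified.

    mode is "nt" (nullify when taken) or "nn" (nullify when not taken).
    """
    if mode == "nt":
        return taken
    if mode == "nn":
        return not taken
    raise ValueError("unknown nullify mode: " + mode)
```

For a loop-closing branch encoded as jt_nn, the delay-slot instruction executes on every taken
iteration and is discarded only when the loop exits.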
6.6.4 DEVIL's Branch Prediction Mechanism
DEVIL offers a simple mechanism for static branch prediction, allowing the conditional
execution of instructions in the branch delay slot. The conditional branch instruction format
allows the compiler to specify whether the operations in the delay slot should be nullified when
the branch is taken or when the branch is not taken. Figure 6.8 shows how the compiler can do
branch prediction using this mechanism and profiling information.
Figure 6.9 shows the benefits of DEVIL's compile-time branch predictor in terms of
performance. Although this branch prediction technique has a negligible hardware overhead, the
major drawback is that, when the branch is predicted taken, the compiler must duplicate code
into the delay slot, resulting in code expansion.
This code expansion is due to the delay slot inserted by the branch operations. This delay
slot can be avoided if a Branch Target Buffer (BTB) is added to the branch unit. The BTB
is an associative cache that stores the addresses of the branches and their predicted outcome.
The first time a branch operation is fetched, the branch address will not match any entry of
the BTB, and the next sequential instructions are fetched. Once the address destinations are
known, the compile-time predicted address is stored in the BTB with the corresponding branch
address. The next time the branch is fetched, the BTB will match the branch address with one
of its entries and will return the predicted next address, avoiding the delay slot.
[Figure 6.8 shows two versions of the same loop. In the first, the branch jt_nn beginning is
followed by a duplicated shl r2, r0, #2 in the delay slot, nullified when the branch is not taken.
In the second, the branch jt_nt beginning is followed by rts, nullified when the branch is taken.]
Figure 6.8: DEVIL's branch prediction mechanism.
Figure 6.9: DEVIL's compile-time branch prediction benefits.
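The BTB behaviour described above can be sketched as a lookup table keyed by branch address.
This Python model is a simplification: it uses an unbounded dictionary, whereas a real BTB has a
fixed number of entries and a replacement policy, and the class, method names, and the 8-byte
fetch size are hypothetical.

```python
class BranchTargetBuffer:
    """Toy BTB: maps a branch address to its predicted next address."""

    def __init__(self):
        self.entries = {}

    def lookup(self, branch_addr):
        """Return the predicted next address, or None on a miss."""
        return self.entries.get(branch_addr)

    def record(self, branch_addr, predicted_next_addr):
        """Store the compile-time predicted address for this branch."""
        self.entries[branch_addr] = predicted_next_addr


def next_fetch_address(btb, pc, fetch_size):
    """Fetch sequentially on a miss, from the predicted address on a hit."""
    predicted = btb.lookup(pc)
    return predicted if predicted is not None else pc + fetch_size


btb = BranchTargetBuffer()
first = next_fetch_address(btb, 0x100, 8)   # miss: sequential fetch
btb.record(0x100, 0x40)                     # destination now known
second = next_fetch_address(btb, 0x100, 8)  # hit: predicted address
```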
6.7 Evaluation of the DEVIL Architecture
This section contains an evaluation of the DEVIL processor in terms of both performance and
code size utilization.
6.7.1 Experimental Setup
The IMPACT framework [7] was used to obtain accurate estimates of the processor's
performance, code size and memory utilization. IMPACT is a compiler framework developed at
the University of Illinois at Urbana-Champaign to study the new generation of ILP compilers.
Figure 6.10 shows the block diagram of the IMPACT framework. There are five main parts:
(1) the front-end, (2) the machine-independent optimizer, (3) the back-end, (4) the machine
description, and (5) the emulator/simulator.
[Figure 6.10 flow: C program -> front-end (Pcode with profiling, memory disambiguation, and
inlining; Hcode; HtoL) -> Lcode -> machine-independent optimizer (standard optimizations,
superscalar optimizations, superblock formation, hyperblock formation, profiling) -> back-end,
driven by the machine description (Phase 1: map Lcode to machine instructions; Phase 2: machine
dependent optimizations, register allocation, instruction scheduling; Phase 3: assembly code
generation) -> assembly code. The emulator and simulator produce statistics.]
Figure 6.10: The IMPACT compiler framework.
The front-end translates a program written in C into an intermediate representation called
Lcode. The C program is first converted into Pcode, another intermediate format, that is used
to do a first profiling, memory disambiguation, array analysis, and code inlining. Once these
steps are done, the Pcode is converted to Hcode and finally to Lcode via the HtoL converter.
Lcode is the internal representation used in the machine-independent optimizer and it
looks like an extended RISC-like instruction set. The machine-independent optimizer includes
all the standard compiler optimizations plus a large set of superscalar optimizations such as
superblock formation, hyperblock formation (i.e., predication), loop unrolling, etc. Lcode can
also be profiled using the Lcode emulator (Lemulate). At this level, profiling plays an important
role because the majority of the optimizations use profiling information.
Once the Lcode is optimized, the back-end converts Lcode into a machine dependent
assembly language using a three-phase algorithm. The first phase maps Lcode to assembly
instructions that are compatible with the targeted machine. The second phase consists of the
register allocator, the scheduler, and the machine dependent code optimizer. Finally, the third
phase generates the assembly file.
The machine description describes the targeted architecture in terms of instruction operands,
instruction latencies, resource utilization, and pipeline execution. Such information is required
in particular for code scheduling.
Benchmarks     Description                                                                   Benchmark Suite
008.espresso   Generates and optimizes Programmable Logic Arrays                             SpecINT92
023.eqntott    Translates a logical representation of a Boolean equation to a truth table
052.alvinn     Trains a neural network using back propagation
129.compress   Compresses and decompresses file in memory                                    SpecINT95
130.li         LISP interpreter
132.ijpeg      Graphic compression and decompression
decode         CCITT G.711, G.721 and G.723 voice compressions decoder                       Mediabench
encode         CCITT G.711, G.721 and G.723 voice compressions encoder
gsmencode      GSM 06.10 provisional standard for full-rate speech transcoding
mpeg2dec       Video MPEG-2 decoder
mpeg2enc       Video MPEG-2 encoder
rawcaudio      ADPCM speech compression algorithm
rawdaudio      ADPCM speech decompression algorithm
dhrystone      Dhrystone v2.1                                                                Dhrystone
fib            Compute Fibonacci numbers
fir            FIR filter
wc             Word Count UNIX utility
Table 6.2: Benchmark list
The emulator and the simulator are used for code profiling and to extract statistics such
as performance and memory utilization. The emulator can probe the code for profiling or to
generate an execution trace that can be sent to the trace-driven simulator.
All these IMPACT modules are widely parameterizable via the use of parameter files,
allowing the different compiler functions to be enabled/disabled as well as the different
parameters to be fine-tuned.
The IMPACT framework was enhanced to generate code for the DEVIL instruction set,
so as to evaluate the impact of the different architectural choices (the parts that were modified
are shaded in Figure 6.10). At the front-end level, the HtoL converter has been modified to
generate library function calls for the unsupported operations such as floating point operations
or integer division. Also, a new back-end has been built to generate optimized code for DEVIL,
including several machine descriptions. Furthermore, the Lcode emulator has been modified to
emulate DEVIL's code.
All the results presented in this chapter are derived from information obtained through
dynamic emulation of the code. Note also that the conditional execution modes and free-shifting
operand possibilities were not used, meaning that the compiler can potentially generate better
code in terms of both performance and memory utilization.
6.7.2 Benchmarks
All the results presented in this chapter were obtained on a selection of programs extracted
mainly from the SpecINT92, SpecINT95, and Mediabench [35] benchmark suites. These
different types of benchmarks were chosen as representative of a wide range of applications.
Table 6.2 briefly describes this selection of benchmarks. The Mediabench benchmarks represent
multimedia programs that can be found in embedded systems (e.g., GSM encoder), while the
SpecINT suites represent non-numerical applications. Some other smaller applications were
added, such as a FIR filter, used in several embedded programs. This variety of applications
makes it possible to assess the sensitivity of the results to the type of program.
6.7.3 DEVIL's Performance
Figure 6.11 shows the performance of the DEVIL processor without superscalar optimization
(DEVIL O) and with superscalar optimization (DEVIL S), as well as the performance of a
4-issue processor that executes DEVIL's instruction set and that has one branch unit, one
load/store unit, and two ALUs. Superscalar optimizations were used to generate the code for
the 4-issue processor. All results are relative to the best performance that can be achieved with
a single-issue processor that executes the DEVIL instruction set.
Figure 6.11: DEVIL performance with and without superscalar optimizations compared to
1-issue and 4-issue architectures.
Effect of superscalar optimizations: Figure 6.11 shows the importance of superscalar
optimization in extracting parallelism from programs. On average, DEVIL with superscalar
optimization is 30% faster than without superscalar optimization, meaning that superscalar
optimizations must be used to achieve a significant speed-up. However, as will be shown later,
such optimizations have a negative effect on code size.
DEVIL's speed-up: The multiple-issue pipeline introduced in the DEVIL architecture
increases the performance by 29% to 78% (50% on average) with respect to a scalar machine
using the same instruction set. This speed-up will allow a significant voltage and clock frequency
reduction. These benefits are investigated in Chapter 7.
Effect of the limitation of the number of issued operations per cycle: The
DEVIL instruction fetch mechanism limits the number of issued instructions per cycle to two,
even though DEVIL contains four units. This choice was mainly motivated by code compaction
issues. Figure 6.11 shows that reducing the number of issued instructions from four to two
reduces the performance by 5% on average.
It is interesting to note that there is no significant change in the results between the
different benchmark suites, meaning that the results are not sensitive to the type of application.
The performance comparison graph (Figure 6.11) illustrates the need to use superscalar
optimizations in order to extract a good level of performance.
Figure 6.12: Effect of superscalar optimizations on code size.
Figure 6.13: Effect of superscalar optimizations on the number of accesses to the code memory.
Figures 6.12 and 6.13 depict how the memory utilization is affected by this kind of
optimization. The increase in code size due to superscalar optimization is 58% on average,
the worst case being 052.alvinn, which exhibits an increase in code size by a factor of 3.4. On
average, superscalar optimizations do not significantly modify the number of accesses to the
code memory.
6.7.4 DEVIL's memory utilization
The previous subsection showed that the use of a VLIW machine implies a penalty in terms of
memory utilization. In order to reduce this negative effect, DEVIL's architecture offers an
instruction fetch mechanism that includes NOP elimination and variable instruction length
support. This subsection quantifies the effect of such techniques on memory utilization. Note
that an important feature of these mechanisms is that they do not affect performance.
Figures 6.14 and 6.15 show the benefits, in terms of code size and number of memory accesses,
of the NOP elimination technique. To obtain these measures, the variable instruction length
mechanism was disabled so that only the large operations of DEVIL's instruction set were
available. The targeted architecture is a 2-issue machine that fetches a 64-bit instruction word
including two of DEVIL's large operations and a tag that encodes the scheduling information
necessary for the NOP elimination. The figures are relative to the same 2-issue machine without
NOP elimination support.
Figure 6.14 shows that the NOP elimination mechanism reduces the code size by 27% on
average. Figure 6.15 shows that the number of accesses to the code memory is decreased by 20%
on average. These results show the importance of eliminating unnecessary NOP instructions.
Figure 6.14: Effect of NOP elimination on code size.
The same kind of experiment was run to quantify the efficiency of the variable instruction
length mechanism. To obtain these results, a 2-issue machine that includes the NOP elimination
technique but can only execute DEVIL's large operations is compared to the DEVIL processor,
which includes both NOP elimination and the variable-length mechanism.
Figure 6.15: Effect of NOP elimination on the number of accesses to the code memory.
Figures 6.16 and 6.17 summarize the results that were obtained and show that the variable
instruction length mechanism allows a saving of 20% to 30% (26% on average) of the code size.
Furthermore, the number of accesses to the code memory is reduced by 20%.
To summarize the efficiency of the DEVIL instruction fetch mechanism, Figures 6.18 and 6.19
show the memory utilization of the DEVIL processor as compared to a 2-issue machine with no
NOP elimination and with the ability to execute only DEVIL's large operations. All the results
presented here are relative to the code size of a scalar architecture that executes only DEVIL's
large operations.
These results show that DEVIL's code size is on average 22% smaller than that of a scalar
processor that executes DEVIL's large operations. Compared to the standard 2-issue VLIW,
the DEVIL instruction fetch mechanism saves 47% of the code size on average.
Figure 6.19 shows the number of accesses to the code memory. Because an access count does not
reflect the bus width, these numbers should be weighted, knowing that the bus of the scalar
processor is 32 bits wide and the buses of the VLIW architectures are 64 bits wide. The
comparison between the two VLIW architectures shows that DEVIL's fetch mechanism reduces the
number of accesses by 36% on average. As compared to the scalar processor, DEVIL's number of
accesses to the code memory decreases by 50% to 75%, but DEVIL's instruction width is twice
that of the scalar processor. Therefore, DEVIL achieves an average reduction of 16% in terms
of number of accessed bytes.
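The weighting by bus width amounts to simple arithmetic, sketched below in Python. The function
names are hypothetical, and the only inputs are the figures quoted above: a bus twice as wide
(64 bits against 32 bits) and a reduction in the number of accesses.

```python
def relative_bytes(access_reduction, width_ratio=2.0):
    """Bytes fetched relative to the scalar machine (1.0 means no change).

    access_reduction: fraction by which the number of accesses drops (0.5 = 50%).
    width_ratio: VLIW bus width divided by scalar bus width (64 / 32 = 2).
    """
    return (1.0 - access_reduction) * width_ratio


def required_access_reduction(byte_reduction, width_ratio=2.0):
    """Access reduction needed to obtain a given reduction in fetched bytes."""
    return 1.0 - (1.0 - byte_reduction) / width_ratio


break_even = relative_bytes(0.50)               # half the accesses, twice the width
best_case = relative_bytes(0.75)                # a quarter of the accesses
implied = required_access_reduction(0.16)       # accesses saved for 16% fewer bytes
```

Under these assumptions, a 50% reduction in accesses only breaks even in bytes, a 75% reduction
halves the fetched bytes, and the reported 16% reduction in bytes corresponds to roughly 58%
fewer accesses.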
Figure 6.16: Effect of the variable instruction length mechanism on code size.
Figure 6.17: Effect of the variable instruction length mechanism on the number of accesses to
the code memory.
Figure 6.18: Effect of the DEVIL instruction fetch mechanism on the code size.
Figure 6.19: Effect of the DEVIL instruction fetch mechanism on the number of accesses to the
code memory.
6.8 Comparison With Existing Mobile Processors
In this section we compare DEVIL with existing mobile processors in terms of instruction
set and code size. Such a comparison is of course difficult, as DEVIL is currently only a
prototype and can be optimized much further. Furthermore, the tools (i.e., the compiler) used
for generating code are not the same for each processor, meaning that the quality of the
generated code depends not only on the architectural features but also on the quality of the
tools. Therefore, the goal of this section is not to determine once and for all whether DEVIL
is better than other processors, since the comparison is not fair at this stage. However, this
analysis provides insight into how DEVIL's features can be situated with respect to current
processors. This comparison also highlights the original points of the DEVIL architecture.
6.8.1 Instruction Set Comparison
Compared to the state of the art in mobile processors (see Chapter 4), DEVIL offers several
new features. First of all, DEVIL bundles explicitly encoded parallel operations, while current
mobile processors offer only a sequential instruction representation. Even in the SH-4
architecture, instructions are still sequential and the parallelization is performed by a
hardware scheduler, implying a large hardware overhead. Second, DEVIL has a variable operation
length encoding that supports 15-bit and 30-bit instruction lengths. This technique is similar
to the ARM Thumb or TinyRisc operation encoding, but there are several differences. Most
notably, DEVIL's performance is not decreased compared to a processor with a fixed-length
instruction set, whereas Thumb and TinyRisc reduce the code size at the cost of degraded
performance. This is mainly because DEVIL allows short and large operations to be mixed with no
restrictions, while with Thumb or TinyRisc the processor has to choose between executing large
or short instructions, with a special branch operation controlling the mode of execution. Also,
DEVIL's short instructions can access all of the 16 registers, while in the short execution
mode Thumb and TinyRisc allow only a subset of the registers to be accessed.
Existing and future VLIW architectures also offer an instruction fetch mechanism that
compacts the VLIW instruction word. The HP/Intel IA-64 includes a bundle formation mechanism
that explicitly encodes parallelism into the instruction bundle, eliminating the NOP
insertion required in the original VLIW architectures. The TMS320C6201 has a similar mechanism.
Both solutions are implemented for a fixed instruction length. DEVIL extends this concept by
introducing a variable instruction length encoding within a bundle. The future StarCore DSP
will use an approach similar to DEVIL's, with 16-bit instructions in conjunction with
instruction prefixing (which extends the instruction length), and with parallelism encoded
within an instruction packet. No precise information is available to date on how this fetch
mechanism works, but this industrial development supports our design decision.
6.8.2 Code Size Comparison
Figure 6.20 shows a comparison of the relative average code density of several processors for
the benchmarks described in section 6.7.2. The results presented for the market processors
were generated using the GNU gcc compiler with level 3 optimization (-O3). The results for
DEVIL are generated with the IMPACT compiler. The results correspond to the average code
expansion of the processors compared with the code generated for the DEVIL architecture
without superscalar optimization (i.e. devil (O)). The devil (S) measurement shows the code
size when superscalar optimizations were used.
The DEVIL variable length instruction set allows the compiler to generate quite compact
code. Indeed, the code size of DEVIL, when the compiler does not apply superscalar
optimizations, is around 18% better than the ARM7, 30% better than the i386, and 10% better than
the SH. However, DEVIL's code is about 20% to 25% larger than the code of Thumb and M-core.
The larger code can be explained by the bundle filling that is required to group instructions
into a single bundle. These results show that the DEVIL instruction set is well designed in
terms of code density. It should be noted that the IMPACT compiler was not optimized for minimum
code size and that, for the moment, the conditional operations and free-shifting operands are
not used. Therefore, future compiler development may further improve these results. Another
important point is that DEVIL lacks the move multiple operation (see footnote 1), which can
potentially save code size when applied to spill and fill code insertion.
Figure 6.20: Code size comparison between DEVIL and some other mobile processors.
When the compiler applies superscalar optimizations, there is a 58% increase in code size.
This result illustrates the cost of using a VLIW-like architecture. Note that this increase in
code size due to superscalar optimizations can be reduced, at the cost of decreased performance.
6.9 Conclusion
This chapter defined a new VLIW architecture called DEVIL, targeted at the mobile processor
market. The architectural decisions were motivated and evaluated with an enhanced IMPACT
compiler. DEVIL offers an instruction fetch mechanism that explicitly encodes the
parallelism within an instruction bundle and supports variable instruction lengths. It was
shown that this mechanism saves close to 50% of the code size (47% on average) with respect to
a standard VLIW processor, with no impact on performance. A significant reduction of the number
of accesses to the code memory was also observed.
In terms of performance, DEVIL speeds up the execution by a factor of 1.5 on average as
compared to a scalar processor. This performance enhancement allows lower frequencies and
power supply voltages to be used, reducing the circuit's power consumption.
Footnote 1: This operation allows several registers to be loaded from or stored to the stack frame in a single instruction.
A comparison was made between DEVIL and current mobile processors, in order to roughly
determine where the DEVIL features can be situated. DEVIL offers an instruction set that
allows a good code density, while offering a parallel operation representation. However, when
superscalar optimizations are used, there is a large code size penalty. The effects of code size
expansion are minimized thanks to the compaction technique offered by the DEVIL architecture.
The next chapter describes the VLSI implementation of the DEVIL processor, allowing a
good estimation of its features in terms of complexity, circuit speed, and power consumption.
Chapter 7
Implementation of the DEVIL Processor
The DEVIL processor was defined in the previous chapter and evaluated at
the architectural level. It was shown that the introduction of parallelism speeds up
execution by a factor of 1.5 on average as compared to the scalar architecture. This
speed-up can be used to compensate for the loss of performance due to a low-power execution
mode (i.e., low clock frequency and low power supply voltage). However, the hardware cost of
introducing a multiple-issue pipeline has not yet been evaluated. This potential increase in
complexity could nullify the benefits of parallelism.
This chapter describes the implementation of the DEVIL processor in order to estimate
design features such as complexity, circuit speed, and circuit power consumption. The DEVIL
processor was implemented using a hardware description language and synthesized with a
low-power technology. The following sections give details on the design methodology, the DEVIL
implementation, and the DEVIL features.
7.1 Technology and Synthesis Methodology
The DEVIL processor was implemented using the VHDL hardware description language and
was synthesized using the Synopsys 1998.08 tool. The synthesis targeted the CSL 4.1 low-power
library developed by XEMICS (see footnote 1), which is characterized for circuit delay and
power consumption estimation at 1.6 volts slow-slow (worst case) and is mapped on a TSMC
0.25 µm technology.
Synthesis-based methodologies have several advantages: reduced design time, fewer
resource requirements, quick migration to different technologies, and the possibility to market
the design as intellectual property (IP), which is the current trend in the market. However,
J. Scott et al. [57] showed that when moving from a custom to a synthesized adder, transistor
count increased by 60%, area increased by 175%, and power consumption increased by 40%.
To counter these effects, the VHDL description of DEVIL was written at a low level, close to
a structural description. Nevertheless, it should be noted that a full custom design can be
optimized much more thoroughly.
Footnote 1: http://www.xemics.ch
7.2 Design Methodology
DEVIL targets the low-power mobile processor market and aims to be used as an ASIC core.
This implies a simple and fast synthesis methodology, and the ability to operate at different
power supply voltages so that power consumption requirements can be met by lowering the
voltage.
One of the most sensitive elements in a microprocessor design is generally the system-wide
clock tree, which must meet strong timing constraints in order to avoid clock skew problems. This
phenomenon gets worse with deep submicron technologies, generally requiring an optimization
of the clock tree by hand and huge clock line buffers that increase power consumption. For
example, in the M-core [57], clock power represents around 36% of the total power dissipation.
These design constraints are directly opposed to the goals of the DEVIL project. To address
these issues, the DEVIL implementation is based on a non-overlapping dual-phase system clock
used in conjunction with latches and aggressive clock gating. DEVIL does not contain any
flip-flop elements.
A dual-phase, non-overlapping system clock is the most robust scheme available to avoid
system-wide clock skew problems. It is always possible to find a clock frequency for which the
design works correctly [77], even at different power supply voltages. In a conventional
single-clock flip-flop system, a design working at 3 volts may not work at 2 volts because of
clock skew problems, even if the clock frequency is scaled down. Therefore, dual-phase,
non-overlapping clock systems offer a great advantage for IP, where the core must be
synthesizable for different applications and power supply voltages. They also reduce the need
for the huge clock line drivers used to meet clock timing requirements in conventional designs.
Gating clocks is an efficient way to save power: every unnecessary, power-consuming
signal transition can be prevented. This approach is quite efficient for large buses in the chip's
datapath and is particularly well suited to VLIW architectures, which include duplicated
datapaths with significant idle times. Furthermore, gated latch techniques can be easily
integrated in a dual-phase, non-overlapping system.
Figure 7.1 illustrates the DEVIL system clock based on dual-phase, non-overlapping clocks.
Figure 7.1(a) shows how the two non-overlapping clocks CLK1 and CLK2 are built from a clock
signal (Fast CLK) that has twice the frequency of the original pipeline clock (Orig. CLK).
When a clock skew problem appears, it can be solved by decreasing the frequency of Orig. CLK.
This results in an increase of the non-overlapping time T_NOT, i.e., of the clock skew tolerance.
With this clocking scheme, each stage of the DEVIL 3-stage pipeline is divided into two
substages that are separated by latches. These latches are synchronized on CLK2, while the
inputs of each stage are synchronized on CLK1. The maximum critical path of each substage now
lies between one quarter of the Orig. CLK cycle (T_MIN) and three quarters of the Orig. CLK
cycle (T_MAX), depending on whether the substage can borrow time from its neighbors (see
Figure 7.1(b)). This can result in a better balance of the pipeline timing.
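As a rough illustration of the timing window described above, the following Python sketch
computes the two bounds for a given Orig. CLK frequency. The function name is hypothetical, and
the 50 MHz value used in the example is the estimated clock frequency reported in Section 7.5.1.

```python
def substage_bounds_ns(orig_clk_hz):
    """Return (T_MIN, T_MAX) in nanoseconds for one pipeline substage.

    T_MIN is one quarter and T_MAX three quarters of the Orig. CLK period,
    as stated in the text; time borrowing decides where a substage falls.
    """
    period_ns = 1e9 / orig_clk_hz
    return 0.25 * period_ns, 0.75 * period_ns


t_min, t_max = substage_bounds_ns(50e6)  # 20 ns period at 50 MHz
```

At 50 MHz this gives a window of 5 ns to 15 ns per substage. Lowering the frequency widens
both bounds proportionally.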
Figure 7.1(b) shows the implementation of a pipeline stage. The inputs of the first pipeline
substage (the Latch1 output) are guaranteed to have settled by the time CLK1 goes low. The
outputs of that block must have settled by the time CLK2 goes low for the proper values to be
stored in Latch2. When Latch2 is open (note that when Latch2 is open, Latch1 is always
closed), the second substage begins its computation. Figure 7.1(b) also shows the clock gating
implementation, where a given pipeline stage can be controlled by the previous stage. In the
DEVIL pipeline, for example, a dirty bit is used to indicate whether the instructions in the
pipeline are valid. When an instruction is not valid (pipeline bubble), this dirty bit directly
gates the clocks of the next pipeline stage.
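The effect of gating a latch with a validity bit can be modeled in a few lines of Python. This
is a behavioral sketch only, not the VHDL of DEVIL: the function name, the 8-bit values, and the
assumption that a bubble carries all-zero data are hypothetical.

```python
def count_latch_toggles(stream, gated):
    """Count output bit transitions of a latch fed one (valid, value) pair per cycle.

    With gating, the latch clock is suppressed on invalid cycles (bubbles),
    so the latch keeps its previous value and nothing toggles downstream.
    """
    held = 0
    toggles = 0
    for valid, value in stream:
        if gated and not valid:
            continue  # clock gated off: the stored value is unchanged
        toggles += bin(held ^ value).count("1")
        held = value
    return toggles


# Two valid operations separated by a bubble that carries zeros.
stream = [(True, 0xFF), (False, 0x00), (True, 0xFF)]
without_gating = count_latch_toggles(stream, gated=False)
with_gating = count_latch_toggles(stream, gated=True)
```

In this toy stream the ungated latch toggles 24 bits, while the gated latch toggles only 8,
because the bubble no longer propagates through the stage.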
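[Figure 7.1 diagram: (a) waveforms of Orig CLK, Fast CLK, CLK1, and CLK2 over the substages
Fetch0, Fetch1, Dec0, Dec1, Exec0, Exec1, and WB, with the non-overlapping time marked; (b) a
pipeline stage built from Latch1, Latch2, and Latch3, clocked by CLK1, CLK2, and CLK1 and gated
by gate1, gate2, and gate3, with logic blocks and control logic between the latches and the
Tmin and Tmax paths marked.]
Figure 7.1: A two-phase non-overlapping pipeline using latches.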
Designing a latch-based architecture is quite unconventional, and the design has to be
carefully conceived from the bottom up as a dual-clock latch-based design. It is not recommended
to simply transform a register-based design into a latch-based design by replacing each register
with two latches.
7.3 The DEVIL Latch-Based Pipeline
DEVIL's pipeline is mapped to the dual-clock structure by dividing each of its pipeline
stages into two functional parts. Figure 7.2 shows the execution of several instructions in the
DEVIL dual-phase pipeline.
The first group of operations (1 and 2) illustrates the execution of two consecutive ALU
operations. The writeback of the result of operation 1 takes place at the same time as
operation 2 requires its source operands. In some cases this requires the data to be bypassed.
However, as DEVIL is based on a latch implementation, the bypass is made directly through the
latch elements of the register file, avoiding any kind of bypass logic.
Operations 3 to 6 illustrate the execution of a conditional branch instruction. Instruction 3
computes a comparison and stores the result in the T flag. The conditional branch is
fetched right after operation 3. During the DEC0 phase of operation 4, the conditional branch
is detected and the information is sent to the DEC1 stage (PC1). The DEC1 stage then sets the
PC according to the branch outcome defined by the value of the T flag.
The remaining operations illustrate a pipeline stall due to a data memory access that
requires a one-cycle wait state (operation 7).
[Figure 7.2 diagram: pipeline occupancy (PC0/PC1, FETCH0/FETCH1, DEC0/DEC1, ALU0/ALU1 or
MEM0 to MEM3, WB) of operations 1 to 10 against Orig CLK, CLK1, and CLK2. Annotated groups: two
consecutive ALU operations (1, 2); ALU operation (3) followed by a conditional branch (4);
memory access (7) and ALU operations (8) to (10). Annotated signals: Bypass, gcDISP, DISP,
PCsel, KILL, Bypass of T, WB stall.]
Figure 7.2: DEVIL's pipeline implementation with non-overlapping clocks.
7.4 DEVIL Implementation
7.4.1 DEVIL's Datapath
Figure 7.3 shows the datapath of the DEVIL processor. DEVIL has a Harvard architecture
(i.e., separate data and code memories). The interface with the data memory is 32 bits wide,
while the interface with the code memory is 64 bits wide. The instruction fetch from memory
occurs during the first phase of the Fetch stage (F0), and the 64-bit instruction bundle is
stored in an instruction register (IR) on CLK2. During the second phase of the Fetch stage (F1),
a state machine uses the tag information to control two instruction dispatchers that send
operations to the four functional units. The instruction dispatchers work in parallel, and each
can send operations to a subset of the functional units. One dispatcher sends operations to the
branch unit and ALU1, while the other sends operations to ALU2 and the load/store unit. This
subdivision simplifies the dispatcher implementation without restricting parallelism. Note that
the dispatch is made to four separate functional unit datapaths so that the datapaths that are
not used can be gated efficiently.
The decode stage (D0, D1) consists of four decoders that decode operations and read the
register operands. The register read occurs in the second phase to avoid spurious
power-consuming reads. The register file has four read ports. The branch decoder has the
particularity of also executing the branch operations.
The execution stage (E0 to E3) is composed of four units that can generate a result. These
results are forwarded to the register file via a writeback unit. DEVIL issues two instructions
per cycle. To support this operation throughput, the register file has three write ports.
The third write port is required because load operations have a latency one cycle greater than
that of the other operations, meaning that three instructions can complete at the same time. The
destination storage elements of the register file are sensitive to CLK1.
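The need for a third write port can be checked with a small counting model in Python. The
sketch assumes a hypothetical schedule format (tuples of operation kinds per cycle) and only
encodes what the text states: two operations issue per cycle, and a load writes back one cycle
later than the other operations.

```python
from collections import Counter


def max_writebacks(bundles, load_extra_latency=1):
    """Return the largest number of register writes that land in the same cycle.

    bundles: one tuple of operation kinds per issue cycle, e.g. ("load", "alu").
    The constant pipeline depth is omitted since it shifts all writes equally.
    """
    writes = Counter()
    for cycle, bundle in enumerate(bundles):
        for kind in bundle:
            if kind in ("store", "branch"):
                continue  # assumed not to write a register
            delay = load_extra_latency if kind == "load" else 0
            writes[cycle + delay] += 1
    return max(writes.values(), default=0)


# A load issued with an ALU operation, followed by two ALU operations.
worst_case = max_writebacks([("load", "alu"), ("alu", "alu")])
alu_only = max_writebacks([("alu", "alu"), ("alu", "alu")])
```

In this model the load from the first cycle completes together with both ALU operations of the
second cycle, so three results arrive at once, whereas ALU-only code never needs more than two
write ports.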
[Figure 7.3 diagram: the 64-bit code memory feeds the fetch and dispatch stage (F0, F1), which
drives Slot 0 and Slot 1; the decode stage (D0-D1) contains the branch, ALU1, ALU2, and
load/store decoders and the register file; the execute stage (E0-E3) contains the branch, ALU1,
ALU2, and load/store execution units, followed by write back; the load/store unit connects to
the 32-bit data memory; an interruption handling block is also shown.]
Figure 7.3: DEVIL datapath block diagram.
7.4.2 Fetch and Dispatch Unit
Figure 7.4 shows the block diagram of the DEVIL dispatch unit, which corresponds to the second
substage of the Fetch pipeline stage (F1). DEVIL fetches 64-bit instruction bundles composed
of five parts (tag, s0, s1, s2, s3). The four-bit tag is used by a simple 4-state finite state
machine (FSM) that controls four groups of multiplexers. The four states of the FSM correspond
to the maximum bundle execution time of four cycles. The four groups of multiplexers are
subdivided into two slots (Slot 0 and Slot 1). Each of these slots can send a 15-bit or a 30-bit
operation toward a subset of two of the four functional units.
[Figure 7.4 diagram: the 64-bit bundle consists of a 4-bit tag and four 15-bit fields s0 to s3.
The tag drives the instruction sequencer; the two slots (Slot 0 and Slot 1) feed Branch + ALU1
and ALU2 + Load/Store, each receiving a 15-bit instruction and a 15-bit extension.]
Figure 7.4: Fetch and dispatch datapath.
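The bundle format described above (a 4-bit tag plus four 15-bit fields, 64 bits in total) can
be sketched in Python as follows. The bit positions are an assumption made for illustration:
the tag is placed in the most significant bits and s3 in the least significant ones, which the
text does not specify. The function names are hypothetical, and the meaning of the tag values
is deliberately left out.

```python
FIELD_BITS = 15
FIELD_MASK = (1 << FIELD_BITS) - 1
TAG_BITS = 4


def join_bundle(tag, fields):
    """Assemble a 64-bit bundle from a 4-bit tag and four 15-bit fields s0..s3."""
    assert 0 <= tag < (1 << TAG_BITS) and len(fields) == 4
    word = tag
    for field in fields:
        assert 0 <= field <= FIELD_MASK
        word = (word << FIELD_BITS) | field
    return word


def split_bundle(word):
    """Split a 64-bit bundle into (tag, [s0, s1, s2, s3]) under the assumed layout."""
    assert 0 <= word < (1 << 64)
    tag = word >> (4 * FIELD_BITS)
    fields = [(word >> (FIELD_BITS * (3 - i))) & FIELD_MASK for i in range(4)]
    return tag, fields


bundle = join_bundle(0b1010, [1, 2, 3, 0x7FFF])
tag, fields = split_bundle(bundle)
```

A 30-bit operation would occupy two adjacent 15-bit fields (an instruction and its extension),
so a bundle can hold between two and four operations.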
7.4.3 Program Counter Datapath
Figure 7.5 shows the datapath that computes the program counter, i.e., the core of the branch
functional unit. The first substage of the program counter datapath computes the candidate
values (PC+1 and PC+displacement), while the second substage selects the new PC among, for
example, the two possible outcomes of a conditional branch operation. The main particularity of
the program counter datapath is the duplication of the PC, which is required to gate the
transitions of the 32-bit adder that computes the PC-relative addresses.
[Figure 7.5 diagram (codeMemAddrUnit.vhd): signals shown include pcDispSmall[9:0] and
pcDispLarge[14:0] with their latches and sign extension (signExtPCDisp), the 32-bit pcAdder
computing PC = PC + 1 + Disp, the incrementer incrPC, the selection inputs retAddr[31:0],
retAddrI[31:0], interruptAddr[31:0], and registerAddr[31:0] controlled by selPCSource[2:0], the
duplicated registers programCounter1 and programCounter2, the output codeMemAddr[31:0] with
64-bit memory access alignment, and the clock gating signals (gc...) on clk1 and clk2.]
Figure 7.5: Program counter datapath.
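The two-step computation of the program counter can be sketched in Python. This is a
behavioral approximation under stated assumptions: the 10-bit and 15-bit displacement widths
and the relation PC = PC + 1 + Disp are read from the labels of Figure 7.5, while the function
names and the 32-bit wrap-around are hypothetical choices for the example.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def next_pc(pc, disp, disp_bits, taken):
    """First substage: compute both candidates. Second substage: select one."""
    incremented = (pc + 1) & MASK32
    target = (pc + 1 + sign_extend(disp, disp_bits)) & MASK32
    return target if taken else incremented


fall_through = next_pc(100, 0x3FF, 10, taken=False)  # small displacement of -1
loop_back = next_pc(100, 0x3FF, 10, taken=True)
forward = next_pc(0, 4, 15, taken=True)              # large displacement of +4
```

Computing both candidates before the branch outcome is known is what allows the second
substage to reduce to a simple selection.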
7.4.4 Register File
The register file (Figure 7.6) contains sixteen 32-bit registers implemented with latches.
Register r15 contains the stack pointer (SP), but it can also be used as a general-purpose
register. The register file has three input ports and four output ports. Thanks to the latch
implementation, bypassing comes for free (see operations 1 and 2 in Figure 7.2).
7.4.5 Arithmetic and Logic Unit
Figure 7.7 illustrates the datapath of the ALU. The first ALU substage integrates a barrel
shifter that can shift one operand by up to 32 positions in either direction, with or without
sign extension. Furthermore, a logic operation module filters the operands in order to implement
the different ALU functionalities. The second substage always executes an addition of the
two operands as modified in the previous substage. For example, a subtraction is performed
by inverting operand B in the logic block and forcing the input carry of the adder to one
(A - B = A + NOT(B) + 1).
In parallel to the ALU, there is a flow-through unit that executes move operations
at a low power cost.
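The subtraction identity used by the ALU can be verified with a short Python sketch on 32-bit
values. The function names are hypothetical; the logic block is reduced here to the inversion
of operand B, and the adder is the only arithmetic element, as in the second substage.

```python
MASK32 = 0xFFFFFFFF


def adder(a, b, carry_in=0):
    """32-bit adder of the second ALU substage (result wraps modulo 2**32)."""
    return (a + b + carry_in) & MASK32


def alu_sub(a, b):
    """A - B computed as A + NOT(B) + 1: invert B, then add with carry-in set."""
    inverted_b = ~b & MASK32  # done by the logic block in the first substage
    return adder(a, inverted_b, carry_in=1)


difference = alu_sub(5, 3)
borrow = alu_sub(0, 1)  # wraps around to the two's complement encoding of -1
```

Because only the operand preparation changes between operations, the same adder serves both
addition and subtraction.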
7.4.6 Load/Store Unit
The Load/Store execution unit (see Figure 7.8) is implemented on two pipeline stages,
corresponding to four substages. In the first substage the data memory address is computed by an
adder. The remaining substages are used to perform the memory access. The last substage also
includes a cast and alignment mechanism to support accesses of different sizes.
[Figure 7.6 diagram (regFile.vhd): registers reg0 to reg15 clocked on clk1; write inputs
regIn0[31:0] to regIn2[31:0] with selDest0[3:0] to selDest2[3:0] and enWR0 to enWR2
(decGCRegFile); read selects selSrc0[3:0] to selSrc3[3:0] (decSourceRegFile) driving
muxOutRegFile to produce regSrc0[31:0] to regSrc3[31:0].]
Figure 7.6: Register file.
[Figure 7.7 diagram (funcUnit.vhd, control_funcUnit.vhd, funcUnitDatapath.vhd): in substage
ALU0, the operands aluOpA[31:0] and aluOpB[31:0] are latched on clk1 and pass through the
barrel shifter (shl/shr/ashr) and the logic unit (and/or/xor/not); in substage ALU1, they are
latched on clk2 and fed to the adder (aluadd) and flag logic (aluFlag), producing aluOut[31:0]
and aluControlOut[5:0]. Each latch has its own clock gating signal (gcAlu...).]
Figure 7.7: ALU datapath.
[Figure 7.8 diagram (datamemunit.vhd, datamemunitdatapath.vhd, control_datamemunit.vhd):
substages MEM0 to MEM3. MEM0 latches dmBaseAddr, dmDisplAddr, dmNoDisplAddr, and dmDataIn on
clk1 and adds base and displacement; the following substages drive the data memory bus
(dmbAddress, dmbDataOut, dmbRndWr, dmbSelect, dmbWait, dmbDataIn) and return dmDataOut. The
latches are gated (gcMem...) and held by the stall and n_stall signals.]
Figure 7.8: Data Memory Unit
7.5 DEVIL Features
This section details the features of the current implementation of the DEVIL processor. The
reported numbers are estimates computed using the Synopsys 1998.08 tool.
7.5.1 DEVIL's Circuit Speed
The current implementation of the DEVIL processor runs at an estimated 50 MHz at 1.6 volts,
giving DEVIL an estimated performance level of 75 Dhrystone v2.1 MIPS. The critical
path is in the ALU datapath.
This is a low circuit speed for a 0.25 µm technology, which can be explained
by several factors. First, as stated earlier in this chapter, the synthesized VHDL
implementation implies a loss in circuit speed; a full custom design would reach higher circuit
speeds. Furthermore, this first implementation of DEVIL is based on a 3-stage pipeline, which
would have to be aggressively optimized to reach high clock frequencies. These optimizations
are time-consuming and have strong resource requirements that are beyond the scope of this
work.
7.5.2 DEVIL's Circuit Complexity
The circuit complexity of the DEVIL processor is approximately 125'000
transistors. The cost of duplicating hardware is acceptable since DEVIL's transistor count
is at the low end of the current mobile processor transistor budget. For example, the
StrongARM has 2.1 million transistors, including the cache [71].
Module                         Transistor count breakdown
Fetch unit                      4%
Decoder ALU1                    8%
Decoder ALU2                    8%
Decoder Load/Store              6%
Branch unit (incl. decoder)     8%
ALU1                            9%
ALU2                            9%
Load/Store unit                 8%
Register File                  38%
Writeback unit                  2%
Table 7.1: Transistor count breakdown.
Table 7.1 shows the breakdown of the transistor count. The support for the variable
length instruction bundle represents only approximately 4% of the total circuit complexity.
This increase in complexity can be considered negligible considering the benefits in terms of
code size and memory traffic that this system confers.
7.5.3 DEVIL's Circuit Power Consumption
The power consumption of DEVIL is estimated at 60 mW for a power supply voltage of 1.6 V
when running at 50 MHz. Table 7.2 shows the power consumption breakdown of the DEVIL
processor. The main source of power consumption is the register file, with around 20% of the
total power dissipation.
Module                         Relative power consumption
Fetch unit                      8%
Decoder ALU1                   11%
Decoder ALU2                   11%
Decoder Load/Store              7%
Branch unit (incl. decoder)     9%
ALU1                           12%
ALU2                           12%
Load/Store unit                 6%
Register File                  21%
Writeback unit                  3%
Table 7.2: Power consumption breakdown.
Thanks to the implementation of DEVIL and the extraction of its design features, it is
possible to estimate the extra energy consumption caused by the introduction of the
multiple-issue pipeline and of the dispatch mechanism. The two main sources of extra power
consumption are the dispatch unit and the register file. The cost of introducing the VLIW
architecture, in terms of additional energy consumption, is estimated at 30%. Table 7.3
summarizes the benefits of DEVIL compared to a 1-issue processor that executes DEVIL's
instruction set. These numbers are based on the average speed-up of 1.5 achieved by the DEVIL
processor (see Chapter 6). The parallelism allows DEVIL to run at 50 MHz at 1.6 volts while
reaching the same performance as the 1-issue processor powered at 2.2 volts and running at
75 MHz. DEVIL therefore requires less energy than the 1-issue machine to execute a given task
in the same amount of time. The gain is around 38% (for the average speed-up of 1.5). These
results validate the advantage of VLIW architectures in terms of energy efficiency.
Processor   Vdd     Frequency   MIPS   Power    MIPS/W   MIPS^2/mW
DEVIL       1.6 V   50 MHz      75     60 mW    1250     94
1-issue     1.6 V   50 MHz      50     31 mW    1613     81
1-issue     2.2 V   75 MHz      75     98 mW     765     57
Table 7.3: Summary of the benefits of ILP for low-power.
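The derived columns of Table 7.3 and the energy figures quoted in the text can be recomputed
from the MIPS and power columns alone. The Python sketch below does so; the function names are
hypothetical, and the only inputs are the values of the table.

```python
def mips_per_watt(mips, power_mw):
    """Performance per unit of power, in MIPS/W."""
    return mips * 1000 / power_mw


def mips2_per_mw(mips, power_mw):
    """Metric that weights performance more heavily, in MIPS^2/mW."""
    return mips ** 2 / power_mw


def energy_per_instruction_nj(mips, power_mw):
    """mW divided by MIPS gives nanojoules per instruction."""
    return power_mw / mips


def energy_saving(power_mw, reference_mw):
    """At equal MIPS the run time is equal, so energy scales with power."""
    return 1 - power_mw / reference_mw


devil = (75, 60)          # MIPS, mW at 1.6 V and 50 MHz
scalar_slow = (50, 31)    # 1-issue at 1.6 V and 50 MHz
scalar_fast = (75, 98)    # 1-issue at 2.2 V and 75 MHz

saving = energy_saving(devil[1], scalar_fast[1])
overhead = (energy_per_instruction_nj(*devil)
            / energy_per_instruction_nj(*scalar_slow)) - 1
```

With these inputs the saving at equal performance comes out between 38% and 39%, and the
energy per instruction of DEVIL at 1.6 V is about 29% higher than that of the 1-issue machine
at the same voltage, in line with the estimated 30% cost of the VLIW architecture.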
7.6 Comparison With Existing Processors
Table 7.4 summarizes the features of the DEVIL processor compared to existing low-power
processors available on the market today. As stated in the previous chapter, this comparison
only indicates where the DEVIL features are situated, considering that DEVIL's design can
be optimized much further. DEVIL's estimated features attain good MIPS/W and MIPS^2/mW
values, which suggests that, if optimized, the DEVIL architecture can offer very attractive
features. At the moment, the major limitation is the clock frequency.
Model       Vendor     Techno.    Vdd     Freq.     Power    MIPS   MIPS/W   MIPS^2/mW
ARM710      VLSI       0.8 µm     3.3 V    25 MHz   120 mW    30      250        8
SH7708      Hitachi    0.5 µm     3.3 V    25 MHz    95 mW    25      263        7
StrongARM   Intel      0.35 µm    2 V     230 MHz   360 mW   268      744      200
ARM940T     VLSI       0.35 µm    3.3 V   150 MHz   675 mW  ?160     ?237      ?38
MMC2001     Motorola   0.35 µm    2 V      34 MHz    80 mW    31      387       12
TR4102      LSI        0.25 µm    1.8 V    80 MHz    40 mW   ?90    ?2250     ?203
SH7750      Hitachi    0.25 µm    1.8 V   200 MHz   1.6 W    300      188       56
DEVIL       -          0.25 µm    1.6 V    50 MHz    60 mW    76     1266       96
Table 7.4: Mobile, Embedded, and ILP processor comparison.
7.7 Conclusion
This chapter described the VHDL implementation of the DEVIL processor. Thanks to this
implementation, estimates of the circuit complexity, circuit speed, and circuit power
consumption were computed, allowing an evaluation of the benefits of VLIW architectures for
low-power processors.
In terms of circuit speed, DEVIL runs at 50 MHz, which is quite slow for a 0.25 µm technology.
This is due to the synthesis-based methodology, as well as to the lack of resources to optimize
the DEVIL datapath.
The complexity of DEVIL was estimated to be around 125'000 transistors, categorizing
DEVIL as a simple circuit that should have a small die area. Furthermore, it was shown that
the dispatch unit introduced to handle the variable instruction length increases the circuit
complexity by only 4%, which is negligible considering the benefits of such a mechanism.
Also, it was shown that ILP improves energy efficiency by around 38% on average. This
gives DEVIL the attractive possibility of executing code at the same speed as a scalar
processor while consuming less power.
This chapter justified the use of VLIW architectures in low-power processors.
The next step will be to optimize DEVIL's datapath according to the feedback from this first
prototype and to build the first chip in order to obtain the exact circuit features.
Chapter 8
A Step Towards Predicated Execution
Introducing instruction-level parallelism into processors requires strong compiler support.
High-Level Languages (HLLs) are generally used to reduce the product time to market.
Unfortunately, the use of compilers and HLLs can have severe repercussions on the quality of
code compared to the traditional method of hand-coding programs. First, compiler technology
affects the instruction memory utilization and code size. Although classic code optimizations
decrease the number of executed instructions, superscalar optimization, inline expansion, loop
unrolling, and superblock formation often increase the execution performance at the cost of
increasing the overall code size (see subsection 6.7.4). Second, although HLLs express
algorithms in systematic ways that are good for maintenance and debugging purposes, the machine
can potentially be limited in performance due to the extremely sequential control flow of such
programs. Such problems can seriously impact the processor's performance and cost (i.e., code
size), which are critical in embedded systems.
As the use of HLLs becomes inescapable in embedded systems, new compilation techniques
and hardware support should be used to overcome the barriers imposed by HLLs. Predication has
several features in terms of control flow representation, performance, and code size that make
it very attractive for both embedded and high-performance systems. To take advantage of these
features, new compiler and architectural support is required. Full predication
support has been introduced in the new generation of high-performance processors such as the
HP/Intel IA-64 architecture [25]. For embedded architectures, predicated execution is
generally supported via the use of a conditional move instruction. This partial predication
support reduces the benefits of predication as compared to full predication support [43].
However, although full predication leads to better code quality, it requires significant
changes in the instruction set architecture, namely the addition of a new source operand for
each instruction.
This chapter investigates how full predication support can be introduced into embedded
architectures while meeting their strong constraints. Section 8.1 introduces the predicate define
instructions, one of the most important components of a predicate architecture. Section 8.2
gives an overview of the benefits of predication in terms of code size (i.e., system cost).
Section 8.3 proposes a new way to introduce predicated execution support in embedded processors.
Section 8.4 addresses the control flow optimization problem and presents a general compiler
framework that uses predication to optimize the control flow of a program. Note that this
framework is valuable for both embedded and high-performance processors. Finally, Section 8.5
concludes.
8.1 Architecture Support for Full Predicated Execution
Predicated execution (see Section 2.6.3), the central architectural feature examined in this
chapter, is a mechanism that facilitates the conditional execution of individual instructions [54].
Predicates are registers that store a single bit value, representing either TRUE or FALSE.
Each instruction is associated with a particular predicate, known as its guard predicate, that
determines its execution. In the case when an instruction's guard predicate is TRUE, it executes
normally. Conversely, when an instruction's guard predicate is FALSE, it is nullified.
The most important component of a predicate architecture is the instruction set support
for computing predicates, that is, the predicate define instructions. Predicate defines are
inserted by the compiler to generate values that control conditional execution. The PlayDoh
predicate define instruction set [22] provides the baseline for this work and is summarized below.
             PlayDoh types
pSRC  Comp   UT  UF  OT  OF  AT  AF
 0     0     0   0   -   -   -   -
 0     1     0   0   -   -   -   -
 1     0     0   1   -   1   0   -
 1     1     1   0   1   -   -   0
Table 8.1: Predicate definition truth table.
PlayDoh is a parameterized Explicitly Parallel Instruction Computing (EPIC) architecture
intended to support public research on ILP architectures and compilation. PlayDoh predicate
define instructions generate two Boolean values (pD0 and pD1) using a comparison of two
source operands (src0 and src1) and a source predicate (pSRC). A PlayDoh predicate define
instruction has the form:
pD0_type0, pD1_type1 = (src0 cond src1) <pSRC>.
The instruction is interpreted as follows: pD0 and pD1 are the destination predicate registers;
type0 and type1 are the predicate types of each destination; src0 cond src1 is the comparison,
where cond can be equal (==), not equal (!=), greater than (>), etc.; pSRC is the source
predicate register. The value assigned to each destination is dependent on the predicate type.
PlayDoh defines three predicate types: unconditional (UT or UF), wired-or (OT or OF), and
wired-and (AT or AF). Each type can be in either normal mode or complement mode, as
distinguished by the T or F appended to the type specifier (U, O, or A). Complement mode
differs from normal mode only in that the condition evaluation is treated in the opposite logical
sense.
For each destination predicate register, a predicate define instruction can either deposit
a 1, deposit a 0, or leave the contents unchanged. The predicate type specifies a function of
the source predicate and the result of the comparison that is applied to derive the resultant
predicate. Table 8.1 shows the deposit rules for each of the PlayDoh predicate types in both
normal and complement modes. Each entry corresponds to the result assigned to the destination
predicate. Note that a "-" means that the destination is left unchanged.
As shown in the table, the unconditional types are always assigned a value. For the
UT-type, the value corresponds to the logical conjunction of the source predicate and the
comparison result. Conversely, the or-type and the and-type each only assign a value in one
circumstance. The OT-type conditionally writes a 1 if both its source predicate and comparison
result are TRUE. The or-type can be used to efficiently compute the disjunction of multiple
compare conditions by accumulating terms into an initially cleared predicate register. Since the
operations computing terms conditionally write the same value, they can execute in any order
or even in parallel. Similarly, the and-type can be used to compute the conjunction of multiple
compare conditions by accumulating terms into an initially set predicate register.
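The deposit rules of Table 8.1 can be sketched in a few lines of Python. This is only an illustrative model: the names deposit and pred_define and the dictionary used as a predicate register file are hypothetical and are not part of PlayDoh or of the tools used in this work.

```python
from typing import Dict, Iterable, Optional, Tuple

PREDICATE_TYPES = {"UT", "UF", "OT", "OF", "AT", "AF"}


def deposit(ptype: str, psrc: bool, comp: bool) -> Optional[bool]:
    """Return the bit deposited into one destination of a predicate define,
    or None when the destination is left unchanged (the "-" of Table 8.1)."""
    if ptype not in PREDICATE_TYPES:
        raise ValueError(f"unknown predicate type: {ptype}")
    kind, mode = ptype[0], ptype[1]
    # Complement mode (F) treats the comparison in the opposite logical sense.
    cond = bool(comp) if mode == "T" else not comp
    if kind == "U":                       # unconditional: always writes
        return bool(psrc) and cond
    if kind == "O":                       # wired-or: writes a 1 or nothing
        return True if (psrc and cond) else None
    return False if (psrc and not cond) else None   # wired-and: writes a 0 or nothing


def pred_define(preds: Dict[str, bool],
                dests: Iterable[Tuple[str, str]],
                comp: bool,
                psrc: bool = True) -> None:
    """Apply one predicate define to the (hypothetical) register file preds.
    dests holds (register name, predicate type) pairs, e.g. ("p1", "UT")."""
    for name, ptype in dests:
        value = deposit(ptype, psrc, comp)
        if value is not None:
            preds[name] = value


# Disjunction (r1 == -1) or (r2 == 1), accumulated with the or-type.
r1, r2 = 0, 1
preds = {"p3": False}                     # or-type target starts cleared
pred_define(preds, [("p3", "OT")], r1 == -1)
pred_define(preds, [("p3", "OT")], r2 == 1)
print(preds["p3"])                        # True
```

Because both or-type defines can only write a 1, swapping the two pred_define calls leaves p3 unchanged, which is the order independence described above.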
8.2 Compiler Techniques for Reducing Predicated Code Size
One very attractive feature of predication is that it reduces the code size penalty
introduced by ILP optimizations and by the traditional conditional branch representation of the
control flow, while enabling a better level of performance. In order to understand how
predication can be used to reduce code size, this section presents some examples extracted from
the MediaBench suite [35]. The compilation techniques used in these examples to exploit
predicated execution are based on hyperblock formation [44].
8.2.1 Reduction of Number of Control Instructions
Predicated execution offers a fundamentally different method of expressing program execution
to the architecture. By design, instructions are guarded with predicates rather than by directing
the instruction execution stream to a particular path. The first benchmark example illustrates
the way predicated execution support in the ISA can reduce the number of control instructions
in a program. Figure 8.1(a) shows a control flow graph of the code for the function reflect1
from the benchmark expic.
The instruction sequence contains 13 basic blocks with a total of 18 instructions, 8 of which
are branches. There are four conditional branches, with only two unique branch conditions, B1
and B2, defined by the source code. The control overhead in the instruction sequence is 8/18 =
44%. The same code after optimization is shown in Figure 8.1(b). The inefficiencies of the code
of Figure 8.1(a) are reduced by performing branch outcome propagation and tail duplication
from the first instance of branch B1 to the other occurrences. The optimized code contains 19
instructions, six of which are branches. The control instruction overhead is reduced to 6/19 =
32%, but at the cost of increasing the overall code size.
The instruction sequence with predicated execution is shown in Figure 8.1(c). The
instruction count reduction for the predicated code comes from eliminating the unconditional jump
instructions required to represent the control flow of the program to the architecture. As a
result, only two predicate defining instructions are used to control the sequence of execution.
The total number of instructions is 10, and the control instruction overhead is only 2/10 = 20%.
With predicated execution, the number of control instructions is reduced by 75%, from 8 to 2.
For the whole expic benchmark, similar reductions in control instructions are observed.
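The overhead figures quoted for the three versions of the example can be checked with a short calculation. The instruction counts below are taken from the text; the helper name control_overhead is only illustrative.

```python
def control_overhead(total: int, control: int) -> float:
    """Fraction of the instructions that are control instructions."""
    return control / total


# (total instructions, control instructions) for each version of the example.
versions = {"original": (18, 8), "optimized": (19, 6), "predicated": (10, 2)}

for name, (total, control) in versions.items():
    print(f"{name:10s} {control}/{total} = {control_overhead(total, control):.0%}")

# Relative reduction in control instructions from the original to the predicated code.
reduction = (8 - 2) / 8
print(f"control instructions reduced by {reduction:.0%}")
```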
8.2.2 Predicate Promotion and Instruction Merging
Predicate promotion refers to speculation performed by removing the predicate from a pred-
icated instruction [44]. Promotion results in the instruction being unconditionally executed,
essentially reducing the number of predicated instructions. Predicate instruction merging is a
form of promotion that allows identical instructions on complementary or intersecting predi-
cate conditions to be combined. Instruction merging thereby removes one instruction copy, and
promotes the remaining instruction to an earlier predicate condition.
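Instruction merging on complementary predicates can be sketched as follows. This is a simplified illustration, not the optimization implemented in the compiler: the Instr record, the complements map, and the register-only dependence check are assumptions made for the example, and memory dependences are not modeled.

```python
from dataclasses import dataclass, replace
from typing import Dict, FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    op: str
    dest: str
    srcs: Tuple[str, ...]
    guard: Optional[str] = None          # None: unconditionally executed


def _conflicts(between: List[Instr], ins: Instr) -> bool:
    """True if an instruction in between touches the registers used by ins."""
    return any(o.dest in ins.srcs or o.dest == ins.dest or ins.dest in o.srcs
               for o in between)


def merge_complementary(code: List[Instr],
                        complements: Dict[FrozenSet, Optional[str]]) -> List[Instr]:
    """Merge identical instructions guarded by complementary predicates.

    complements maps a pair {p, q} written by the same predicate define
    (p as UT, q as UF) to the source predicate of that define; the merged
    copy is promoted to that earlier predicate (None means always TRUE)."""
    out: List[Instr] = []
    for ins in code:
        for i, prev in enumerate(out):
            same_body = (prev.op, prev.dest, prev.srcs) == (ins.op, ins.dest, ins.srcs)
            pair = frozenset({prev.guard, ins.guard})
            if not same_body or prev.guard == ins.guard or pair not in complements:
                continue
            if _conflicts(out[i + 1:], ins):
                continue
            out[i] = replace(prev, guard=complements[pair])   # keep one promoted copy
            break
        else:
            out.append(ins)
    return out


# p1 and p2 are assumed to be the UT and UF results of one unguarded define.
complements = {frozenset({"p1", "p2"}): None}
code = [
    Instr("ld", "r2", ("FrontAlpha",), "p1"),
    Instr("ld", "r3", ("NewStat",), "p1"),
    Instr("ld", "r2", ("BackAlpha",), "p2"),
    Instr("ld", "r3", ("NewStat",), "p2"),
]
merged = merge_complementary(code, complements)
for ins in merged:
    print(ins)
```

In this example the two loads of NewStat collapse into a single unguarded load, while the loads that differ in their source operand stay predicated.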
[Figure 8.1: control flow graphs and instruction sequences of the example code, in three panels (a), (b), and (c).]
Figure 8.1: Predication example: (a) original, (b) optimized, and (c) predicated.
Figure 8.2(a) shows the source code for part of a switch statement in the function gl_DrawBuffer
from the benchmark mesa, an application using the OpenGL graphics library. There are several
aspects of the code that allow predicated execution to reduce the instruction count. First,
several case values activate the same program statements in the switch construct. Second, the
different groupings of case values have statements in common. In fact, the only difference across
the three switch groupings is the source operand of the second statement, which selects either
the FrontAlpha, BackAlpha, or NULL value. The instruction and control flow of the switch
are illustrated in Figure 8.2(b). The traditional way of executing the switch statement is by
executing several sequential branch instructions, illustrated by the sequence of B instructions.
Other case values are not shown to keep the example concise; however, it is important to note
that subexpression elimination of the common instruction sequences is not possible for all case
values of the switch construct.
With predicated instruction support, the compiler is able to if-convert all the instructions
of the portion of the switch statement illustrated. After if-conversion, merging and predicate
promotion optimizations can be applied to predicated instructions. Figure 8.2(c) illustrates
the predicated code after optimization. The instructions that are common to both paths were
merged and are unconditionally executed. Only the dark-shaded instructions require predicate
operands. The final predicated code of Figure 8.2(c) illustrates the effectiveness of instruction
merging and predicate promotion. Since many of the instructions in the three switch
groupings are identical, they can be merged together into a single copy. Only
the individual, non-shared instructions illustrated by the dark shading are predicated. The
predicate defining instructions indicated with a P perform this function. Also, additional
predicate and jump instructions are used in the second and third rows of the predicated code to
direct execution to the other case value statements. Although the number of control instructions
is slightly reduced, the real code size reduction comes from the sharing of instructions from
different control paths while preserving performance. Overall, instruction merging causes a
significant reduction in the total number of instructions.
[Figure 8.2: three panels showing (a) the C source of the switch statement, (b) the original instruction and control flow of Blocks 1 to 3 with branches B1 to B6, and (c) the predicated code with predicate defines P1 to P6.]
Figure 8.2: Merging example: (a) source code, (b) original, and (c) predicated.
8.2.3 Instruction Reduction for Advanced Code Transformation
Predication's ability to reduce the number of instructions can also limit the code growth
caused by high-performance optimizations. Consider the loop example of function extend_image
of expic in Figure 8.3(a). The loop is dominated by conditional branches in blocks A, B, C, and
D, while the only computation of the loop is in block E. The conditional branches of blocks
A and C are loop invariant, but program variant. Without predication, the only way to take
advantage of this invariance is loop versioning. However, versioning
several instances of the loop causes a significant amount of code growth. The highlighted
path of Figure 8.3 indicates the frequently taken path of the loop, ACDEF. Superblock ILP
and unrolling compilation techniques are applied to construct a superblock of the frequently
taken path that is loop-unrolled twice. The resulting control flow representation is shown in
Figure 8.3(b). Several code blocks are tail duplicated, leading to code expansion.
With full predicate support, the compiler is able to perform several optimizations that
reduce the code size of the loop. First, instead of tail duplicating the code to form a superblock,
a hyperblock is constructed by if-converting all of the basic blocks of the loop. All conditional
branches of the original loop are replaced by predicate defining instructions. The corresponding
predicate defining instructions of the loop invariant, program variant branches of blocks A and
C are removed from the body of the loop and are placed in the header of the loop. The
conditions computed by these instructions are kept in predicate registers for the duration of
the loop. It is unnecessary to replicate the predicate define instructions in each iteration since
their results are loop invariant. This is one fundamental advantage of using predication to
convert control flow dependences into data dependences.
[Figure 8.3 panels: (a) Basic blocks, (b) Superblock with loop unrolling, (c) Hyperblock.]
Figure 8.3: Loop optimization example: (a) original, (b) unrolled superblock, and (c) unrolled
predicated.
8.3 Introducing Predication Support into Embedded Processors
The previous section illustrated how predication can be used to reduce code size. However,
predicated execution requires several changes to existing ISAs, which can affect program code size.
Indeed, there is a major tradeoff in the design of the instruction set, namely the addition of a
predicate source operand for all instructions. This section proposes a new framework for
introducing predication into embedded processors. The first part of this section presents the effect
of the ISA modification, due to full predicated execution support, on program code size. The
second part of this section proposes a new instruction issue mechanism that reduces the impact
of the ISA modification on code size by supporting predicated and non-predicated versions of
instructions.
8.3.1 Effect on Code Size of Full Predication Support
Although the performance increase of full predicated execution is significant, it comes at the
cost of adding a predicate source operand to every instruction. In the full predication model, all
instructions have a predicate source operand, even those which are not conditionally executed.
Figure 8.4 illustrates the percentage of static instructions with conditional predicates relative to
the overall number of instructions. The percentage of conditional instructions averages around
40% of the total instructions, meaning that a large portion of instructions do not require a
predicate operand. Since the percentage of unconditional instructions is significant, the unnecessary
increase in instruction format size can dramatically impact embedded system designs.
[Figure 8.4: bar chart of Predicated Instructions (0% to 70%) for the benchmarks expic, g721, ghostscript, gsm, jpeg, mesa, mpeg, pegwit, pgp, rasta, rawaudio, wc, cmp, grep, lex, qsort, yacc, and compress, and for the AVERAGE.]
Figure 8.4: Relative number of predicated instructions.
Figure 8.5 shows the code size expansion attributed to the predicate operand for three
distinct models on the same predicated benchmarks. First, Zero Size shows the code size
for predication when the predicate representation has zero cost. Next, Predicate Only shows
the effect when the instruction size growth of the predicate operand is attributed to only the
conditional instructions. Finally, Full Size shows the effect when the operand is added to every
static instruction, as in an architecture supporting full predication. All of the predicated code
sizes are compared to a base architecture without predication support. Note that compilation
for predication alone has some effect on code size. The size of the predicate operand was
evaluated assuming a 24-bit base instruction format and a 5-bit predicate operand field.
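The three accounting models can be written down as a small calculation. The sketch below is a simplified reading of the models under stated assumptions: the function relative_code_size and the instruction counts in the example are hypothetical, only the 24-bit and 5-bit widths come from the text, and the measured results of Figure 8.5 come from compiled benchmarks rather than from this formula.

```python
BASE_BITS = 24   # base instruction format (from the text)
PRED_BITS = 5    # predicate operand field (from the text)


def relative_code_size(n_base: int, n_pred: int,
                       frac_predicated: float, model: str) -> float:
    """Code size of predicated code relative to the non-predicated base.

    n_base          static instructions of the base (non-predicated) code
    n_pred          static instructions of the predicated code
    frac_predicated fraction of n_pred that carries a conditional predicate
    model           "zero", "predicate_only", or "full"
    """
    if model == "zero":                # predicate operand costs nothing
        bits_per_instr = float(BASE_BITS)
    elif model == "predicate_only":    # only conditional instructions grow
        bits_per_instr = BASE_BITS + PRED_BITS * frac_predicated
    elif model == "full":              # every instruction grows
        bits_per_instr = float(BASE_BITS + PRED_BITS)
    else:
        raise ValueError(f"unknown model: {model}")
    return (n_pred * bits_per_instr) / (n_base * BASE_BITS)


# Hypothetical program: 1000 instructions without predication, 950 with
# predication, 30% of which remain conditionally executed.
for model in ("zero", "predicate_only", "full"):
    print(model, round(relative_code_size(1000, 950, 0.30, model), 3))
```

Under these assumptions the Predicate Only size stays close to the base size, whereas the Full Size model pays the 5-bit operand on every instruction.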
Figure 8.5 indicates that predicated execution increases program code size by an average
of 23%, and often by as much as 30%. The results of the Zero Size model of code size evaluation
indicate that for a large number of programs, predicated code effectively has fewer instructions
and a reduced code size. An interesting pattern is observed in Figure 8.5 for the Predicate Only
model. As a general rule, the code size for this model is significantly smaller than the Full
Size code size, and averages near the base non-predicated code size. The difference between
predicated and non-predicated results occurs because predication has a fundamental ability to
remove numerous control instructions and because compiler support of predicated execution can
perform optimizations that allow the code to share instructions that are on different execution
conditions. For example, the instruction D = A + X in Figure 2.25(d) does not require a
predicate operand since the compiler guarantees that it unconditionally executes in the block.
[Figure 8.5: bar chart of Relative Code Size (0.80 to 1.25) per benchmark and for the AVERAGE, with three series: Zero Size, Predicate Only, and Full Size.]
Figure 8.5: Code expansion considering predication source operand.
This optimization shows an important ability of the compiler to reduce the use of predication
in code, thereby increasing the disadvantage that traditional predicated architectures have by
requiring conditional and unconditional instructions to include a predicate operand. Further
details on the compiler's ability to affect code size and the percentage of predicated instructions
are presented in the next subsection.
8.3.2 Predication Code Size and Execution Characteristics
This subsection presents the static and dynamic characteristics of code size using predicated
execution and the effects of predicate optimization on code size. Figure 8.6 indicates that
predication reduces the total number of instructions for traditionally optimized code by 6.3%.
A significant portion of the instructions eliminated were control instructions, which were reduced
by 13%, where control instructions include predicate defining instructions and any traditional
branch instructions. Other characteristics include a 7% reduction in the number of dynamically
executed instructions in general code, and a 31% reduction in the number of dynamic instructions
for code with superscalar optimizations.
Table 8.2 summarizes the amount of predicate optimization that the compiler is able to
perform on the hyperblocks. The optimizations are broken down into two categories,
instruction merging and predicate promotion. For the instruction merging category, an average of
8% of the static predicated instructions can be merged. The additional code reduction
attributed to merging is shown in the next column, and indicates an additional reduction in overall
code size. The percentage of predicated instructions that are promoted to unconditionally
executed instructions is shown in the next column. These numbers indicate that, on average, 28% of
the originally predicated instructions of a hyperblock may be promoted with compiler
optimization. Since both merging and promotion can affect the same operation, the exact occurrence
[Figure 8.6: bar chart of Code Size Reduction (0% to 30%) per benchmark and for the AVERAGE, with three series: Overall, Control, and Operation.]
Figure 8.6: Code reductions due to predicated execution.
             Code Merging                          Hyperblock     Predicate-Optimization
Benchmark    Merging %  Reduction %  Promotion %  Static Pred %  Static Pred %
expic             6.03         1.45        37.59          22.68          14.92
g721              1.60         0.85        43.31          52.64          29.77
ghostscript       0.26         1.01        32.68          41.31          24.79
gsm               3.21         1.80        51.44          44.78          23.28
jpeg             29.97         1.96        39.38          53.88          34.78
mesa              8.62         3.55        37.49          37.96          22.07
mpeg              5.05         2.40        34.52          46.03          26.13
pegwit            3.72         0.75        15.08          18.60          14.95
pgp               2.48         1.52        14.12          60.12          49.32
rasta             3.38         1.75        17.48          50.60          39.94
rawaudio          2.17         0.61        26.09          27.71          21.21
wc               16.92         7.91        10.77          43.33          40.29
cmp              22.12        11.57        16.81          46.89          37.04
grep             10.52         6.89        14.43          60.85          50.74
lex              11.87         5.44        14.97          43.15          32.75
qsort             8.00         1.61        48.00          20.49          11.90
yacc              7.30         2.48        26.26          32.83          21.11
compress          5.85         1.82        30.70          30.37          14.60
average           8.28         3.08        28.40          40.79          28.31
of which optimization has occurred is difficult to collect within the compiler's infrastructure.
Nevertheless, the results of Table 8.2 indicate that relevant amounts of both optimizations
affect the percentage of predicated code. The final two columns give the percentage of static
predicated instructions relative to total program instructions, listed for the original hyperblocks
and the predicate-optimized hyperblocks. Although some instances of optimization on
predicated code lead to increases in the number of predicated instructions, the results of Table 8.2
show that in general predicate optimizations significantly reduce the percentage of predicated
instructions.
The most important characteristic in the results of Table 8.2 is that only 28% of the static
instructions remain predicated after aggressive compiler optimization. This indicates that a large
number of instructions execute unconditionally and do not require a predicate operand. Thus,
under full predication, the memory system of an embedded microprocessor is sacrificed for the
potential performance
gains. This analysis strongly supports the utility of an architecture framework which takes
advantage of predication's performance benefits while only adding size to those instructions
which are actually predicated.
8.3.3 Prefix-Based Predication
The previous subsections showed how a compiler using predication can reduce the program code
size, which is valuable for embedded systems. However, this benefit can be diminished or
lost if the modification of the ISA implies an increase in code size. Using the fact that after
optimization only a small percentage of instructions are predicated, this section details the
addition of predication to a 24-bit instruction word for embedded processors.
8.3.3.1 Architecture Model
Prefix-based predication uses opcode prefixing to add sufficient instruction bits to indicate that
a predicate operand exists for instructions which the compiler has designated to conditionally
execute. As illustrated in the previous section, a significant amount of code size can be saved
when only the predicated instructions incur the predicate operand overhead. Figure 8.7
illustrates the base 24-bit instruction format that includes an operation code, a destination register
index, and two source operands (potentially register indexes or immediate data).
[Figure 8.7: bytes fetched from the I-cache pass through a length decoder and steering stage and then through the decode stage. Formats shown: normal instruction (OP-CODE, DEST, SRC1, SRC0), predicated instruction (PREFIX, OP-CODE, DEST, SRC1, SRC0, PRED), and predicate defining instruction (OP-CODE, P_DEST, PRED, SRC1, SRC0).]
Figure 8.7: Prefix-based predication decoding of normal and predicated instructions.
Figure 8.7 illustrates how a prefix opcode of the 24-bit instruction can designate that an
additional byte containing supplementary instruction information follows. The complete
32-bit instruction can then be decoded into a 26-bit instruction with a 6-bit operation code, a 5-bit
predicate register index, a destination register index, and two source operands. The prefix opcode
is then discarded. In this example architecture, the 5-bit predicate index can be used to access
a 32-entry predicate register file. New predicate defining instructions for expressing predicate
conditions are also added using the prefixing mechanism.
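The decoding of mixed 24-bit and 32-bit instructions can be sketched as follows. Only the 24-bit and 32-bit lengths, the 6-bit operation code, and the 5-bit predicate index come from the text. Everything else is an assumption made for illustration: the prefix value 0b111111, its position in the upper six bits of the first byte, the 5-bit register fields, the 9-bit opcode of the base format, the field order, and the big-endian byte order.

```python
from dataclasses import dataclass
from typing import List, Optional

PREFIX = 0b111111               # hypothetical 6-bit prefix opcode value
REG_BITS = 5                    # assumed width of each register index field
REG_MASK = (1 << REG_BITS) - 1


@dataclass
class Decoded:
    opcode: int
    dest: int
    src1: int
    src0: int
    pred: Optional[int]         # None for a normal (non-predicated) instruction
    length: int                 # instruction length in bytes


def instruction_length(first_byte: int) -> int:
    """First decode stage: detect the prefix and return the length in bytes."""
    return 4 if (first_byte >> 2) == PREFIX else 3


def decode_one(code: bytes, pc: int) -> Decoded:
    """Second decode stage: extract the fields of the instruction at pc."""
    length = instruction_length(code[pc])
    if pc + length > len(code):
        raise ValueError("truncated instruction")
    word = int.from_bytes(code[pc:pc + length], "big")
    if length == 3:             # 24 bits: 9-bit opcode (assumed), dest, src1, src0
        return Decoded(word >> 15, (word >> 10) & REG_MASK,
                       (word >> 5) & REG_MASK, word & REG_MASK, None, 3)
    body = word & ((1 << 26) - 1)   # discard the 6-bit prefix opcode
    return Decoded(body >> 20, (body >> 15) & REG_MASK, (body >> 10) & REG_MASK,
                   (body >> 5) & REG_MASK, body & REG_MASK, 4)


def decode_stream(code: bytes) -> List[Decoded]:
    """Decode a byte stream of mixed 3-byte and 4-byte instructions."""
    out, pc = [], 0
    while pc < len(code):
        ins = decode_one(code, pc)
        out.append(ins)
        pc += ins.length
    return out


def encode_normal(opcode: int, dest: int, src1: int, src0: int) -> bytes:
    assert (opcode >> 3) != PREFIX, "opcode would be mistaken for the prefix"
    return ((opcode << 15) | (dest << 10) | (src1 << 5) | src0).to_bytes(3, "big")


def encode_predicated(opcode: int, dest: int, src1: int, src0: int, pred: int) -> bytes:
    word = (PREFIX << 26) | (opcode << 20) | (dest << 15) | (src1 << 10) | (src0 << 5) | pred
    return word.to_bytes(4, "big")


stream = (encode_normal(0x12, 1, 2, 3)
          + encode_predicated(0x05, 4, 5, 6, 7)
          + encode_normal(0x01, 8, 9, 10))
decoded = decode_stream(stream)
for ins in decoded:
    print(ins)
```

In this sketch, two normal instructions and one predicated instruction occupy 10 bytes, whereas a format that widened every instruction to 32 bits would need 12 bytes for the same three instructions.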
8.3.3.2 Microarchitecture support
The primary microarchitecture component affecting prefix-based predication is the instruction
decode methodology. Most prefix architecture designs integrate an additional instruction
decode stage in the original pipeline design. In this model, the first stage is used to determine
instruction lengths (prefix detection) and steer the instructions to the second stage where the
actual instruction decoding is performed. Figure 8.7 illustrates this process. The multiple
pipelined decode method is successful for several reasons. First, the design places the focus on
resources other than instruction memory. A second reason for using an additional decode stage
is that the number of branch instructions executed in a predicated architecture is significantly
reduced, resulting in the number of mispredictions also being reduced. This limits the negative
effect that adding more pipeline stages before branch resolution has on the misprediction penalty.
The branch prediction accuracy for predicated architectures is about 7% higher than the branch
prediction accuracy for traditional architectures.
8.3.4 Experimental Evaluation
8.3.4.1 Methodology
The IMPACT compiler and emulation-driven simulator were enhanced to support the proposed
architecture framework. The base architecture modeled uses a 5-stage pipeline that can issue
six operations per cycle in order (up to the limit of the available functional units: four integer
ALUs, two memory ports, two floating point ALUs, and one branch unit). The instruction
latencies used match the HP PA-7100 microprocessor (integer operations have 1-cycle latency,
and load operations have 2-cycle latency). The processor contains 32 integer and 32 floating
point registers. To support prefix-based predication, 32 predicate registers and an additional
decoding stage were modeled. The memory system simulated was either perfect or used a 2K,
4K, or 8K direct-mapped instruction cache and an 8K direct-mapped, blocking data cache,
both with 64-byte blocks and a miss penalty of 12 cycles. A static branch prediction strategy
was employed.
8.3.4.2 Results and Analysis
[Figure 8.8: bar chart of Performance (axis from 1.0 to 1.7; off-scale bars labelled 2.14, 2.02, and 1.86) per benchmark and for the AVERAGE, with three series: 2K, 4K, and 8K.]
Figure 8.8: Performance of varying instruction cache size for prefix-based predicated
architecture relative to non-predicated architecture.
Figure 8.8 shows the results of varying the instruction cache size for the non-predicated and
prefix-based predicated architectures. Substantial performance improvement is established at
small cache sizes; however, for larger increases in instruction cache size, the relative
performance improvements of the base architecture are larger, and the relative performance saturates.
This indicates that the base model is more dependent on instruction cache resources than the
prefix-based predicated architecture. The results of cache simulations show that prefix-based
predication has an average 7% higher hit rate for 2K instruction caches and 2.5% for 8K caches
compared to the non-predicated model. Experiments also indicate that prefix-based
predication has an average 10% higher speedup over traditional predicated architectures for small
instruction cache models.
[Figure 8.9: bar chart of Relative Code Size (0 to 6) per benchmark and for the AVERAGE, with three series: Non-predicated, Full Predication, and Prefix Predication.]
Figure 8.9: Code expansion of superscalar relative to traditional optimization.
The relative performance of superscalar (superblock formation, loop unrolling)
optimization for prefix-based predicated and non-predicated architectures is an average 63% better than
general levels of optimization for the simulation of a perfect memory system. For superscalar
optimization, the average speedup of the predicated architecture is only 12% more than the
non-predicated architecture. The performance of the superscalar optimization indicates that
the performance gains of predicated execution do not greatly exceed the non-predicated
version. However, the corresponding code size of the predicated code for high performance code
is significantly reduced. Figure 8.9 shows the code expansion of the superscalar optimization
for the non-predicated, full-predicated, and prefix-based predicated architectures. Clearly the
12% performance improvement is substantial since the improvement requires a significantly
smaller code size. The full predicated architecture has an average 11% smaller code size and
the prefix-based predicated architecture has an average 25% smaller size.
8.4 Control flow optimization using predication
The previous section described a way to introduce full predicated execution support into
embedded processors. Such support gives new opportunities to generate more optimized code,
especially in the control flow domain.
One fundamental limitation of most branch handling techniques is that they do not
significantly alter the program's control flow logic. As the compiler translates high-level
language control constructs into assembly-level branches, it does not alter the basic control
structure. Instead, techniques focused on exposing and increasing ILP within a fixed control
structure are applied. With control speculation, this is obvious. Control dependences are
removed to enable the motion of instructions above branches. The branches themselves are not
altered. Likewise, when predication is applied by the process of if-conversion, branches are
transformed into predicate computations and control dependent instructions are rendered
conditional by the addition of guarding predicates. This process converts control flow and
control dependences into data flow and data dependences, but preserves the original program's
control structure.
Restricting a compiler to use the program's unaltered control structure is undesirable for
several reasons. First, a high-level language such as C or C++ represents program control flow
in an extremely sequential manner through the use of nested if-then-else statements, switch
statements, and loop constructs. Each control construct is fully evaluated before proceeding to
the next. This sequential computation often defines the program critical paths that constrain
the available ILP. Second, programmers represent control flow for understandability or for ease
of debugging rather than for efficient execution on the target architecture. As a result,
software often contains redundant control constructs that are difficult to detect with
traditional compiler techniques. These may involve evaluating the same conditions multiple
times or evaluating conditions that partially overlap. An effective ILP compiler should be
capable of transforming the program control structure to eliminate these problems.
The ability to restructure code aggressively is a critical feature of an effective ILP
compiler. The most obvious situation where aggressive transformation is regularly applied is on
arithmetic expressions. Compilers often completely restructure the programmer's arithmetic
computations into more parallel forms using a variety of transformations. These include
expression re-association, tree height reduction [34], and blocked back substitution [55].
Although ILP compilers may aggressively restructure computation, they typically preserve the
program's original control structure. This conservative approach can seriously limit the level
of efficiency as well as the level of ILP achieved in branch-intensive programs.
Motivated by the potential of aggressive techniques for transforming arithmetic expressions,
this section introduces a new approach to optimizing program control flow. The goal of this
work is to develop a systematic methodology for reformulating program control flow for more
efficient execution on an ILP processor. Control expressed in branches and predicate define
instructions is first extracted and represented as a program decision logic network. Then, a
new, more efficient network is synthesized with the goals of reducing dependence height and
redundancy. To accomplish the desired optimization and synthesis, the program decision logic
network is modeled as a Boolean equation. Boolean minimization techniques are then applied to
simplify and optimize the equation. Finally, the optimized network is re-expressed in the form
of predicated assembly code. One unique feature of this approach is that all branches and
predicates within a segment of code are treated jointly in a systematic manner.
This section focuses on compiler techniques and architecture support for effective
optimization of programmatic control flow. In particular, the aspects of the HPL PlayDoh
predicate define instructions [22] that are the most useful for this purpose are highlighted.
During the process of developing this compiler support for programmatic logic optimization, a
new class of predicate define instructions was designed to extend the PlayDoh architecture to
support the optimizer more effectively. The key idea behind this extension is presented, and
its effectiveness is shown through simulation of compiled codes that use it. These experiments
show that programmatic logic optimization indeed results in substantial performance
improvements in functions where control flow is the major impediment to exploiting ILP.
8.4.1 Previous Work
Previous research in the area of control flow optimization can be classified into three major
categories: branch elimination, branch reordering, and control height reduction. Branch
elimination techniques identify and remove those branches whose direction is known at
compile-time. The simplest form of branch elimination is loop unrolling, in which instances of
backedge branches are removed by replicating the body of the loop. More sophisticated
techniques examine program control flow and data flow simultaneously to identify correlations
among branches [12][47]. When a correlation is detected, a branch direction is determinable by
the compiler along one or more paths, and the branch can be eliminated. In [47], an algorithm
is developed to identify correlations and to perform the necessary code replication to remove
branches within a local scope. This approach is generalized and extended to the program-level
scope in [12]. The second category of control flow optimization work is branch reordering. In
this work, the order in which branches are evaluated is changed to reduce the average depth
traversed through a network of branches [79].
The final category of control flow optimization research focuses on the reduction of control
dependence height. This work attempts to collapse the sequential evaluation of linear chains of
branches in order to reduce the height of program critical paths [56]. In an approach analogous
to a carry lookahead adder, a lookahead branch is used to calculate the taken condition of a
series of branches in a parallel form. Subsequent operations dependent on any of the branches
in the series need only to wait for the lookahead branch to complete. The control dependence
height of the branch series is thus reduced to that of a single branch. The mechanisms
introduced herein also serve to reduce control dependence height. This work, however,
introduces an approach to minimization and re-expression of control flow networks that is far
more general than those proposed in previous work.
8.4.2 Limitations of PlayDoh
Section 8.1 describes the predicate define instructions of PlayDoh. However, our new strategy
for the generation of predicated code identifies several limitations of the PlayDoh instruction
set. These limitations are described and our proposed extensions to the PlayDoh predicate
define instruction set are presented in this subsection.
                    PlayDoh types             New types
pSRC  Comp    UT  UF  OT  OF  AT  AF     ∨T  ∨F  ∧T  ∧F
  0     0      0   0   -   -   -   -      -   1   0   0
  0     1      0   0   -   -   -   -      1   -   0   0
  1     0      0   1   -   1   0   -      1   1   0   -
  1     1      1   0   1   -   -   0      1   1   -   0
Table 8.3: Extended predicate definition truth table. A dash denotes that the destination
predicate is left unchanged.
The major limitation of the PlayDoh predicate types is that logical operations can only
be performed efficiently amongst compare conditions. There is no convenient way to perform
arbitrary logical operations on predicate register values. While these operations could be
accomplished using the PlayDoh predicate types, they often require either a large number of
operations or a long sequential chain of operations, or both.
With traditional approaches to generating predicated code, these limitations are not serious,
as there is little need to support logical operations amongst predicates. The Boolean
minimization strategy described in the next subsection, however, makes extensive use of logical
operations on arbitrary sets of both predicates and conditions. In this approach, intermediate
predicates are calculated that contain logical subexpressions of the final predicate expressions
to facilitate reuse of terms or partial terms. The intermediate predicates are then logically
combined with other intermediate predicates or other compare conditions to generate the final
predicate values. Without efficient support for these logical combinations, gains of the Boolean
minimization approach are diluted or lost.
Predicate Define Extensions. Two new predicate types are introduced to facilitate
generating efficient code using our minimization techniques. These are referred to as
disjunctive-type (∨T or ∨F) and conjunctive-type (∧T or ∧F). Table 8.3 (right-hand portion)
shows the deposit rules for the new predicate types. The ∧T-type define clears the destination
predicate to 0 if either the source predicate is FALSE or the comparison result is FALSE.
Otherwise, the destination is left unchanged. Note that this behavior differs from that of the
and-type predicate define, in that the and-type define leaves the destination unaltered when
the source predicate evaluates to FALSE. The conjunctive-type thus enables the compiler to form
the logical conjunction of an arbitrary set of conditions and predicates easily and
efficiently.
The disjunctive-type behavior is analogous to that of the conjunctive-type. With the ∨T-type
define, the destination predicate is set to 1 if either the source predicate is TRUE or the
comparison result is TRUE (FALSE for ∨F). Otherwise, the destination is left unchanged. The
disjunctive-type is thus used to compute the disjunction of an arbitrary set of predicates and
compare conditions into a single predicate.
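The deposit rules of Table 8.3 can be summarized in a few lines of code. The following is a
minimal sketch, not part of the PlayDoh specification or of the IMPACT compiler: the function
name deposit, the type codes DT/DF (disjunctive) and CT/CF (conjunctive), and the use of None
for "destination unchanged" are assumptions made for illustration only.

```python
# Sketch of the deposit rules in Table 8.3 (names and encoding are illustrative).
TYPES = ["UT", "UF", "OT", "OF", "AT", "AF", "DT", "DF", "CT", "CF"]

def deposit(ptype, psrc, comp, old=None):
    """Return the new value of the destination predicate.

    ptype: two-letter code; first letter is the kind (U, O, A, D = disjunctive,
           C = conjunctive), second letter the polarity (T or F).
    psrc:  value of the source (guard) predicate, 0 or 1.
    comp:  result of the comparison, 0 or 1.
    old:   previous destination value, returned when the rule leaves it unchanged.
    """
    kind, polarity = ptype[0], ptype[1]
    want = bool(comp) if polarity == "T" else not comp  # comparison after polarity
    src = bool(psrc)
    if kind == "U":                      # unconditional: always written
        return int(src and want)
    if kind == "O":                      # or-type: can only set to 1
        return 1 if (src and want) else old
    if kind == "A":                      # and-type: can only clear, and only if guard is TRUE
        return 0 if (src and not want) else old
    if kind == "D":                      # disjunctive: set if guard OR condition
        return 1 if (src or want) else old
    if kind == "C":                      # conjunctive: clear if guard FALSE or condition FALSE
        return 0 if (not src or not want) else old
    raise ValueError(ptype)

def table_row(psrc, comp):
    """One row of Table 8.3, with None standing for the dash."""
    return [deposit(t, psrc, comp) for t in TYPES]
```

Calling table_row for the four (pSRC, Comp) combinations reproduces the four rows of the
table, which makes the difference between the and-type and the conjunctive-type visible in the
rows where pSRC is 0.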
8.4.3 Overview of Compiler Techniques
This subsection presents a conceptual overview of the program decision logic minimization
process, starting with the conversion of code to the predicated representation for subsequent
optimization. In order to simplify the extraction and manipulation of control expressions, the
compiler applies if-conversion and reformulation of non-branch control constructs to transform
all programmatic control flow into the predicated representation. In the IMPACT compiler,
this conversion is fully performed within acyclic code regions formed using hyperblock formation
heuristics [44]. To a great extent, the ability of our control logic optimization techniques
to improve performance depends on the scope of these regions, as only the control structure
transformed into the predicate domain is available for subsequent optimization. In order to
promote effective hyperblock formation, aggressive function inlining is performed.
An example extracted from the UNIX utility wc illustrates the application and benefit
of the described techniques. Figure 8.10 shows the code segment before and after complete
if-conversion. As shown in Figure 8.10(a), the code before if-conversion consists of basic blocks
and conditional branches (shown in bold) which direct the flow of control through the basic
blocks. As shown in Figure 8.10(b), the code after if-conversion consists of only a single block
of sequential instructions, a hyperblock [43]. The conditional branches have been replaced
with predicate define instructions (shown in bold) and the predicate registers defined have
been placed as source operands on all guarded instructions in accordance with their execution
conditions.
[Figure 8.10, panel (a): control flow graph of the loop. The basic blocks are connected by the
conditional branches 32 >= r4, r4 >= 127, r2 == 0, r4 != 10, r4 != 32 and r4 != 9, each with a
T and an F edge; the graph is not reproduced here.]
Panel (b):
Loop:
   1  p4 = 0
   2  p5 = 0
   3  r24 = MEM[r3]
   4  r23 = r24 + 1
   5  MEM[r3] = r23
   6  r4 = MEM[r24]
   7  p4_ot, p1_uf = (32 >= r4)
   8  p4_ot, p2_uf = (r4 >= 127)     <p1>
   9  p3_ut = (r2 == 0)              <p2>
  10  p5_of, p6_uf = (r4 != 10)      <p4>
  11  p7_ut = (r4 != 10)             <p4>
  12  r27 = MEM[r72]                 <p3>
  13  r26 = r27 + 1                  <p3>
  14  MEM[r72] = r26                 <p3>
  15  r2 = r2 + 1                    <p3>
  16  r62 = MEM[r71]                 <p6>
  17  r61 = r62 + 1                  <p6>
  18  MEM[71] = r61                  <p6>
  19  p5_of, p8_ut = (r4 != 32)      <p7>
  20  p5_of = (r4 != 9)              <p8>
  21  r2 = 0                         <p5>
  22  Jump Loop
Figure 8.10: A portion of the inner loop of the UNIX utility wc. The control flow graph (a),
and the corresponding hyperblock formed after complete if-conversion (b).
After if-conversion, control speculation is performed to increase opportunities for
optimization. Control speculation is a means of breaking a control dependence by allowing an
instruction to execute more frequently than is necessary. In a predicated representation, this
is performed in predicate promotion, the process by which predicate flow dependences are broken
and instructions are made to execute speculatively by changing an instruction's guard predicate
to another predicate, whose expression subsumes that of the original [44]. When instructions
are aggressively promoted, some predicates may no longer be utilized as guards on computation.
When a predicate is no longer necessary, the program decision logic is simplified. Figure
8.11(a) shows the wc hyperblock segment after predicate promotion. Comparison with Figure
8.10(b) shows that four instructions (12, 13, 16, and 17) have had their predicates promoted to
the TRUE predicate, denoted in the figure as the absence of a source predicate. However, no
predicates were rendered completely unused by this process.
Next, the program decision logic network is constructed. Since predicates can only assume
Boolean values, predicates and predicate defines can be viewed as a combinational logic circuit.
To derive the Boolean function from a hyperblock, the compiler needs only to examine the
predicate define instructions. Consider instructions 7 and 8 in Figure 8.11(a), in which the
expression for p1 can be written as $p_1 = \overline{C_0}$ and p2 can be written as
$p_2 = p_1\overline{C_1}$, where $C_0$ is the condition $(32 \geq r4)$ and $C_1$ is the
condition $(r4 \geq 127)$. The expression for p2, in terms of conditions, is
$p_2 = \overline{C_0}\,\overline{C_1}$. In the course of this complete back substitution,
expressions based on condition variables are formulated for all predicate define instructions.
The composition of all these expressions is the program decision logic network. This network
can be modeled as a logic circuit that represents all the decisions made in the program. The
logic circuit has conditions as its input and the predicates which control computation as its
output. The multiple-output Boolean logic circuit for the wc code segment is shown in Figure
8.11(b).
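The back substitution described above can be illustrated with a short sketch. It is not the
IMPACT implementation (which works on a BDD); it is a minimal illustration, assuming that a
predicate expression is stored as a set of products and that each product is a frozenset of
(condition name, polarity) literals. The names back_substitute, and_literal and WC_DEFINES are
hypothetical, and only the u-type and o-type defines used in the wc example are handled.

```python
# Sketch: back substitution of predicate defines into sum-of-products expressions.
# A literal is (condition_name, polarity); a product is a frozenset of literals;
# an expression is a set of products. The empty product is the constant TRUE.
TRUE = {frozenset()}

def and_literal(sop, literal):
    """AND every product of sop with one literal, dropping contradictory products."""
    name, polarity = literal
    return {p | {literal} for p in sop if (name, not polarity) not in p}

def back_substitute(defines):
    """defines: list of (destinations, condition_name, guard_or_None) in program order.
    Each destination is (predicate_name, type) with type in {"ut", "uf", "ot", "of"}."""
    preds = {}
    for dests, cond, guard in defines:
        src = TRUE if guard is None else preds[guard]
        for name, ptype in dests:
            term = and_literal(src, (cond, ptype.endswith("t")))
            if ptype[0] == "u":          # unconditional: p = guard AND condition
                preds[name] = term
            else:                        # or-type: p = p OR (guard AND condition)
                preds[name] = preds.get(name, set()) | term
    return preds

# Predicate defines 7-11, 19 and 20 of the wc hyperblock (Figure 8.11(a)).
WC_DEFINES = [
    ([("p4", "ot"), ("p1", "uf")], "C0", None),   # (32 >= r4)
    ([("p4", "ot"), ("p2", "uf")], "C1", "p1"),   # (r4 >= 127)
    ([("p3", "ut")], "C2", "p2"),                 # (r2 == 0)
    ([("p5", "of"), ("p6", "uf")], "C3", "p4"),   # (r4 != 10)
    ([("p7", "ut")], "C3", "p4"),                 # (r4 != 10)
    ([("p5", "of"), ("p8", "ut")], "C4", "p7"),   # (r4 != 32)
    ([("p5", "of")], "C5", "p8"),                 # (r4 != 9)
]

WC_PREDS = back_substitute(WC_DEFINES)
```

Under these assumptions, WC_PREDS["p2"] contains the single product with C0 and C1 both
complemented, matching the expression derived in the text, and the essential predicates p3, p5
and p6 are expressed purely in terms of conditions.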
Panel (a):
Loop:
   1  p4 = 0
   2  p5 = 0
   3  r24 = MEM[r3]
   4  r23 = r24 + 1
   5  MEM[r3] = r23
   6  r4 = MEM[r24]
   7  p4_ot, p1_uf = (32 >= r4)
   8  p4_ot, p2_uf = (r4 >= 127)     <p1>
   9  p3_ut = (r2 == 0)              <p2>
  10  p5_of, p6_uf = (r4 != 10)      <p4>
  11  p7_ut = (r4 != 10)             <p4>
  12  r27 = MEM[r72]
  13  r26 = r27 + 1
  14  MEM[r72] = r26                 <p3>
  15  r2 = r2 + 1                    <p3>
  16  r62 = MEM[r71]
  17  r61 = r62 + 1
  18  MEM[71] = r61                  <p6>
  19  p5_of, p8_ut = (r4 != 32)      <p7>
  20  p5_of = (r4 != 9)              <p8>
  21  r2 = 0                         <p5>
  22  Jump Loop
[Panel (b): logic diagram with inputs (32 >= r4) (C0), (r4 >= 127) (C1), (r2 == 0) (C2),
(r4 != 10) (C3), (r4 != 32) (C4), (r4 != 9) (C5), intermediate predicates p1, p2, p4, p7, p8
and outputs p3, p5, p6; not reproduced here.]
Panel (c):
Loop:
   1  p3 = 1
   2  p5 = 0
   3  r24 = MEM[r73]
   4  r23 = r24 + 1
   5  MEM[r73] = r23
   6  r4 = MEM[r24]
   7  p3_af = (32 >= r4)
   8  p3_af = (r4 >= 127)
   9  p3_at = (r2 == 0)
  10  p5_of, p6_uf = (r4 != 10)
  12  r27 = MEM[r72]
  13  r26 = r27 + 1
  14  MEM[r72] = r26                 <p3>
  15  r2 = r2 + 1                    <p3>
  16  r62 = MEM[r71]
  17  r61 = r62 + 1
  18  MEM[71] = r61                  <p6>
  19  p5_of = (r4 != 32)
  20  p5_of = (r4 != 9)
  21  r2 = 0                         <p5>
  22  Jump Loop
[Panel (d): logic diagram with the same six condition inputs and outputs p3, p5, p6; not
reproduced here.]
Figure 8.11: The wc hyperblock after speculation but before logic minimization (a) and its
corresponding logic diagram (b). The hyperblock after logic minimization (c) and its
corresponding logic diagram (d).
Once the logic circuit has been derived, many CAD techniques can be employed to simplify
the program decision logic network. In the IMPACT compiler, the derived Boolean function
is represented with a Binary Decision Diagram (BDD) [5]. The BDD algorithms used are described
in [13]. The predicate BDD contains the relationship among predicates as defined by the
network of predicate define operations. The predicate BDD is used throughout the compiler as
a database for queries made by optimizations when operating on predicated code. For example,
one common query is to determine if one instruction executes only when another instruction
has executed. This query is equivalent to the dominance relationship in the control flow
domain. Here, the BDD is queried to determine if the predicate expression of one instruction
subsumes the predicate expression of another. Queries to the BDD are made in IMPACT by
the optimizer, the scheduler, and dataflow analysis.
For the purposes of decision logic minimization, the BDD provides a simple method by
which expressions describing the hyperblock logic can be derived. The only expressions
requested from the BDD are those expressions describing the essential predicates. Essential
predicates are those predicates that guard real computation instructions (any instruction that
is not a predicate define). In Figure 8.11(a), the essential predicates are p3, p5, and p6.
Predicates p1, p2, p4, p7, and p8 are non-essential predicates as they are used only as
intermediates in evaluation of the essential predicates.
The BDD maintains a canonical representation of the decision logic functions, from which
a Boolean sum-of-products expression can be produced for any represented function. Note that
the expression thus generated reflects the canonical nature of the BDD's internal representation,
and is usually not optimal for expressions with multiple product terms. Therefore, it is necessary
to optimize the derived expression before attempting to synthesize a predicate defining structure.
The expressions describing the evaluation of the essential predicates are optimized using
techniques which eliminate redundant terms in the function and which re-express the Boolean
function in a more parallel form. The resulting expression is reformulated back into predicate
define instructions in the hyperblock. Section 8.4.4 presents the details of the Boolean logic
optimizers and reformulators studied in this work. These optimizers and reformulators must
balance the reduction of dependence height with the number of predicate defines that can
be accommodated in the code schedule. This involves making an accurate estimate of how
much time is available for computation of control functions based on the availability times of
conditions and when predicates need to be consumed. These and other considerations make
the design of an optimizer and a reformulator nontrivial.
Figures 8.11(c) and 8.11(d) show the reformulated hyperblock and corresponding logic circuit
after the minimization process is complete. The number of logic gates in the circuit
implementation is reduced from ten to three. In addition, the six-level gate network in Figure
8.11(b) is reduced to a single-level gate network in Figure 8.11(d). All non-essential
predicates were also eliminated as part of this process. An example optimization performed on
the logic circuit takes the form $C_0 + C_1\overline{C_0} \rightarrow C_0 + C_1$. An
application of this optimization occurs between instructions 7 and 8 when computing p4.
The values of variables in the decision logic network are supplied by evaluating conditions
on predicate define instructions. It is important to recognize that these variables are not
necessarily independent, and that knowledge of the relationships between these variables can
allow for significant further optimization of the predicate define structure. Consider the
computation of p6 in Figure 8.11(a). Instruction 10 computes p6_uf = C3 <p4>. Logically, this
leads to the expression $p_6 = \overline{C_3}(C_0 + C_1)$, where $C_0 = (32 \geq r4)$,
$C_1 = (r4 \geq 127)$, and $C_3 = (r4 \neq 10)$. Here, since $\overline{C_3}$ implies $C_0$
and excludes $C_1$, the expression for p6 can be simplified to $p_6 = \overline{C_3}$. In our
approach, the relationships between conditions are represented in a BDD, termed the condition
BDD, which can be queried to determine if logical implications exist between conditions and, if
so, what they are. The current implementation of this mechanism identifies "families" of
integer register-constant comparisons which are based on the same definition of a given
register. Then, within each family, a number line is created and divided into disjoint
segments from which the set of register values yielding a "TRUE" evaluation for any member
condition can be composed by union. Finally, the relationships between the comparisons are
described in BDD form using a finite domain technique [14]. Various elements of the optimizer
query this BDD to determine the inherent relationships between conditions, which are the
decision network's input variables.
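The number-line idea can be sketched without a BDD. The following is only an illustration
under stated assumptions: every condition compares the same integer register against a
constant, each disjoint segment of the number line is represented by one sample value (the
constants themselves and their immediate neighbours), and a condition is stored as the set of
samples on which it is TRUE. The names truth_sets, implies, excludes and R4_FAMILY are
hypothetical; the finite domain BDD encoding of [14] is not reproduced.

```python
import operator

# Sketch: relationships within one "family" of register-constant comparisons.
OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def representatives(conditions):
    """One sample value per segment of the integer number line.
    A comparison against a constant k can only change its truth value at k,
    so k - 1, k and k + 1 for every constant cover all segments."""
    consts = {k for _, k in conditions.values()}
    return sorted({v for k in consts for v in (k - 1, k, k + 1)})

def truth_sets(conditions):
    """Map each condition name to the set of samples where it evaluates to TRUE."""
    reps = representatives(conditions)
    return {name: frozenset(v for v in reps if OPS[op](v, k))
            for name, (op, k) in conditions.items()}

def complement(truth, conditions):
    """Samples on which a condition is FALSE."""
    return frozenset(representatives(conditions)) - truth

def implies(a, b):
    """a implies b when every segment satisfying a also satisfies b."""
    return a <= b

def excludes(a, b):
    """a excludes b when no segment satisfies both."""
    return not (a & b)

# The r4 family of the wc example, written with the register on the left:
# (32 >= r4) becomes r4 <= 32.
R4_FAMILY = {"C0": ("<=", 32), "C1": (">=", 127), "C3": ("!=", 10),
             "C4": ("!=", 32), "C5": ("!=", 9)}
R4_SETS = truth_sets(R4_FAMILY)
NOT_C3 = complement(R4_SETS["C3"], R4_FAMILY)   # segments where r4 == 10
```

With these definitions, implies(NOT_C3, R4_SETS["C0"]) and excludes(NOT_C3, R4_SETS["C1"])
both hold, which is exactly the pair of facts used above to simplify the expression for p6.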
Cycle Instructions issued
0 op1 op2 op3 op12 op16
1 op4 op6 op13 op17
2 op5 op7
3 op8
4 op9 op10 op11
5 op14 op15 op18 op19
6 op20
7 op21 op22
(a) Schedule for the hyperblock in Figure 8.11(a).
Cycle Instructions issued
0 op1 op2 op3 op12 op16
1 op4 op6 op9 op13 op17
2 op5 op7 op8 op10 op19 op20
3 op14 op15 op18 op21 op22
(b) Schedule for the hyperblock in Figure 8.11(c).
Figure 8.12: Comparison of the static schedules for the wc hyperblock before and after logic
minimization.
The overall effectiveness of the program decision logic minimization process on the wc
example is best shown by comparing the schedules of the code before and after optimization. For
illustration purposes, a six-issue processor with no restrictions on the combination of
instructions that may simultaneously be issued is assumed. Furthermore, all instructions are
assumed to have a latency of one cycle. Figure 8.12 presents the schedules for the example
hyperblock before and after optimization. The instructions in bold correspond to the predicate
defines in each hyperblock. The schedule for the pre-optimization hyperblock (Figure 8.12(a))
is relatively sparse due to the sequentiality of the predicate defines. The overall schedule
length is eight cycles. The schedule after logic minimization is reduced by a factor of two.
The chain of predicate define instructions in the original hyperblock is replaced by a
parallel, more efficient computation in the optimized hyperblock. The reformulated hyperblock
requires only a single level of predicate defines to compute the essential predicates as
opposed to the five-level network used in the original code, yielding a significant increase in
performance.
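The two schedule lengths can be reproduced with a simple greedy list scheduler. The sketch
below is an illustration only: it assumes unit latency and a six-issue machine as in the text,
and the dependence lists BEFORE and AFTER are our own reading of the flow dependences in
Figures 8.11(a) and 8.11(c), not data taken from the compiler. The final jump (instruction 22)
is left out and is assumed to issue in the last cycle.

```python
# Sketch: greedy list scheduling with unit latency on a six-issue machine.
def list_schedule(deps, width=6):
    """deps maps each instruction number to the instructions it depends on.
    Returns a dict mapping each instruction to its issue cycle."""
    cycle_of, cycle = {}, 0
    remaining = sorted(deps)
    while remaining:
        # An instruction is ready once all its producers issued in an earlier cycle.
        ready = [op for op in remaining
                 if all(d in cycle_of and cycle_of[d] < cycle for d in deps[op])]
        for op in ready[:width]:
            cycle_of[op] = cycle
        remaining = [op for op in remaining if op not in cycle_of]
        cycle += 1
    return cycle_of

def schedule_length(deps, width=6):
    return max(list_schedule(deps, width).values()) + 1

# Dependences shared by both hyperblocks (loads, adds, stores on r3/r24, r72, r71).
COMMON = {3: [], 4: [3], 5: [4], 6: [3], 12: [], 13: [12], 16: [], 17: [16]}

# Figure 8.11(a): the predicate defines form a sequential chain.
BEFORE = {**COMMON, 1: [], 2: [], 7: [1, 6], 8: [6, 7], 9: [8],
          10: [2, 6, 8], 11: [6, 8], 14: [9, 13], 15: [9], 18: [10, 17],
          19: [2, 6, 11], 20: [2, 6, 19], 21: [10, 19, 20]}

# Figure 8.11(c): every predicate define depends only on r4 (or r2) and an initialization.
AFTER = {**COMMON, 1: [], 2: [], 7: [1, 6], 8: [1, 6], 9: [1],
         10: [2, 6], 14: [7, 8, 9, 13], 15: [7, 8, 9], 18: [10, 17],
         19: [2, 6], 20: [2, 6], 21: [10, 19, 20]}
```

Under these assumed dependences, schedule_length(BEFORE) is eight cycles and
schedule_length(AFTER) is four, and the issue cycles agree with Figure 8.12. Anti- and output
dependences are ignored in this sketch, which is acceptable here because they do not lie on the
critical path of either hyperblock.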
Once the decision component has been optimized and reformulated back into the predicated
representation, further compiler transformations need to be performed. For machines without
real predication support, complete reverse if-conversion must be performed [76]. For machines
which support predication, partial reverse if-conversion can be employed to create the proper
balance of control flow and predication for the target architecture [8].
8.4.4 Minimization of Program Decision Logic
The previous subsection provided an overview of the process of program control height
minimization through the optimization of the predicate define network. This section describes
in detail the mechanisms by which the predicate define optimizer generates new predicate define
instructions to evaluate more efficiently the program's essential predicate functions. The
discussion in this section assumes that the program's decision logic has been represented by
the predicate BDD and the condition BDD, and that Sum-Of-Products (SOP) expressions for the
essential predicates have been extracted as described in the previous section. Once the
program decision logic has been extracted, program control is optimized and re-expressed in four
steps. First, sum-of-products expressions are formed to represent predicate functions in terms
of program conditions. These expressions are then optimized using condition analysis and
traditional Boolean logic minimization techniques. The resulting optimized expressions are then
optionally factorized based on condition availability times and resource constraints. Finally,
program control is re-expressed in predicate define instructions, either in a two-level network
or in a multi-level network, depending on whether or not factorization was performed.
The generation of an efficient predicate define network begins with the extraction and
subsequent optimization of the sums-of-products for the predicate functions. Figure 8.13(b)
shows the expressions extracted for the essential predicates in the wc example, as well as the
conditions to which the variables in the expressions correspond. Figure 8.13(a) shows the
original predicate define network for reference. Since the control expressions are completely
represented by the predicate BDD in terms of conditions, the non-essential predicates are
eliminated from consideration. This process maps the predicate define structure, in this case
five stages of predicate define instructions, into a sum-of-products which can be synthesized
into a two-cycle sequence of predicate define instructions. However, this expression can
exhibit a large number of redundant and constant-FALSE products, and must be refined before use
in define regeneration. From Figure 8.13(b), two-level regeneration of the unoptimized
expressions of the wc example would require thirteen predicate defines in the first level and
six in the second, far more than the seven required in the initial network.
p4_ot, p1_uf = (32 >= r4)
p4_ot, p2_uf = (r4 >= 127)     <p1>
p3_ut = (r2 == 0)              <p2>
p5_of, p6_uf = (r4 != 10)      <p4>
p7_ut = (r4 != 10)             <p4>
p5_of, p8_ut = (r4 != 32)      <p7>
p5_of = (r4 != 9)              <p8>
(a) Original predicate define structure.
C0  (32 >= r4)
C1  (r4 >= 127)
C2  (r2 == 0)
C3  (r4 != 10)
C4  (r4 != 32)
C5  (r4 != 9)
$p_3 = \overline{C_0}\,\overline{C_1}\,C_2$
$p_6 = C_0\overline{C_3} + \overline{C_0}C_1\overline{C_3}$
$p_5 = C_0\overline{C_3} + \overline{C_0}C_1\overline{C_3} + C_0C_3\overline{C_4} + \overline{C_0}C_1C_3\overline{C_4} + C_0C_3C_4\overline{C_5} + \overline{C_0}C_1C_3C_4\overline{C_5}$
(b) Conditions and original predicate expressions.
$p_3 = \overline{C_0}\,\overline{C_1}\,C_2$
$p_6 = \overline{C_3}$
$p_5 = \overline{C_3} + \overline{C_4} + \overline{C_5}$
(c) Optimized predicate expressions.
...
p3_af = (32 >= r4)
p3_af = (r4 >= 127)
p3_at = (r2 == 0)
p5_of, p6_uf = (r4 != 10)
p5_of = (r4 != 32)
p5_of = (r4 != 9)
...
(d) Optimized predicate define structure.
Figure 8.13: Example: optimization of wc predicate network.
Simplify_funcs(func_list)
    simplified_func_list = Empty_list();
    FOREACH func IN func_list DO
        reduced_func = Reduce_using_condition_BDD(func);
        simplified_func = Minimize_SOP(reduced_func);
        List_append(simplified_func_list, simplified_func);
    RETURN simplified_func_list;

Minimize_SOP(func)
    product_list = func.product_list;
    new_product_list = product_list;
    WHILE NOT List_empty(new_product_list) DO
        new_insertion_list = Empty_list();
        FOREACH product_x IN new_product_list DO
            FOREACH product_y IN product_list DO
                consensus = Consensus(product_x, product_y);
                IF consensus THEN
                    List_insert_last(new_insertion_list, consensus);
        product_list = List_append(product_list, new_insertion_list);
        new_product_list = new_insertion_list;
        product_list = Eliminate_subsumed_products(product_list);
    product_list = Select_covering_subset(product_list);
    RETURN product_list;

Factorize(func_list, sched)
    factor_list = Empty_list();
    FOREACH func_x IN func_list DO
        FOREACH func_y IN func_list BEFORE func_x DO
            IF Factor_simplifies(func_y, func_x) THEN
                IF Resource_constrained(func_x.id) THEN
                    IF NOT (List_member(func_y, factor_list)) THEN
                        List_insert_last(factor_list, func_y);
                    func_x = Factor_SOP(func_x, func_y);
    FOR cycle = sched.min_cycle TO sched.max_cycle DO
        FOREACH func IN func_list DO
            FOREACH product IN func DO
                ready_prod = Ready_product(product, cycle);
                match_prod = Match_term(ready_prod, factor_list);
                IF match_prod THEN
                    ready_factor = match_prod;
                ELSE
                    ready_factor = ready_prod;
                    ready_factor.id = Unique_token();
                    List_insert(factor_list, ready_factor);
                Factor_term(product, ready_factor);
    List_insert_last(factor_list, func_list);
    Factor_common_disjoint_subexpr(factor_list, func_list);
    RETURN func_list, factor_list;

Factor_common_disjoint_subexpr(factor_list, func_list)
    FOREACH func IN func_list DO
        product_factor_list = Extract_ready_products(func);
        fact_func = Find_factor(product_factor_list, func);
        IF fact_func THEN
            match_fact = Match_factor(fact_func, factor_list);
            IF NOT (match_fact) THEN
                fact_func.id = Unique_token();
                List_insert(factor_list, fact_func);
                match_fact = fact_func;
            Factor_term(func, match_fact);
Figure 8.14: Pseudo-code for performing optimization of predicate expressions
Optimization of predicate expressions. Predicate expressions are optimized in two
steps, as indicated in Figure 8.14 in the description of Simplify_funcs. First, expressions are
reduced using condition BDD information. Conditions which imply or exclude each other (e.g.
(r1 < 4) implies (r1 < 5) and excludes (r1 >= 7)) can cause predicate expressions to contain
redundant or constant-FALSE products, as well as redundant literals in useful products.
These extraneous features are removed in this phase. One such case from the benchmark wc
was examined in Section 8.4.3.
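The effect of this reduction step on the wc example can be checked by brute force. The sketch
below does not use a condition BDD: it assumes, purely for illustration, that r4 holds a value
between 0 and 255, and it tests implications by enumerating that range. The names
reduce_product, reduce_sop, DOMAIN, P5 and P6 are hypothetical; the literal encoding
(condition name, polarity) is ours.

```python
import operator

# Sketch: removing constant-FALSE products and redundant literals by enumeration.
OPS = {"<=": operator.le, ">=": operator.ge, "!=": operator.ne}

# Conditions on r4 from Figure 8.13(b), written with the register on the left.
CONDS = {"C0": ("<=", 32), "C1": (">=", 127), "C3": ("!=", 10),
         "C4": ("!=", 32), "C5": ("!=", 9)}
DOMAIN = range(256)  # assumed value range of r4 (an illustration, not from the source)

def holds(literal, value):
    name, polarity = literal
    op, const = CONDS[name]
    return OPS[op](value, const) == polarity

def sat_set(product):
    """Values of r4 for which every literal of the product is TRUE."""
    return frozenset(v for v in DOMAIN if all(holds(l, v) for l in product))

def reduce_product(product):
    """Return None for a constant-FALSE product, else the product without
    literals that are implied by the remaining ones."""
    target = sat_set(product)
    if not target:
        return None
    kept = set(product)
    for literal in sorted(product):
        trial = kept - {literal}
        if sat_set(trial) == target:     # literal adds no information
            kept = trial
    return frozenset(kept)

def reduce_sop(sop):
    reduced = {reduce_product(p) for p in sop}
    reduced.discard(None)
    return reduced

T, F = True, False
P6 = {frozenset({("C0", T), ("C3", F)}),
      frozenset({("C0", F), ("C1", T), ("C3", F)})}
P5 = P6 | {frozenset({("C0", T), ("C3", T), ("C4", F)}),
           frozenset({("C0", F), ("C1", T), ("C3", T), ("C4", F)}),
           frozenset({("C0", T), ("C3", T), ("C4", T), ("C5", F)}),
           frozenset({("C0", F), ("C1", T), ("C3", T), ("C4", T), ("C5", F)})}
```

Applied to the unoptimized expressions of Figure 8.13(b), reduce_sop leaves p6 with the single
literal "C3 complemented" and p5 with the three single-literal products of Figure 8.13(c).
The greedy literal removal is order dependent in general; a production implementation would
query the condition BDD instead of enumerating values.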
Once redundant and constant-FALSE products and literals have been removed from the
predicate expressions, the iterative-consensus method is applied to produce a complete sum, and
then to select a subset of prime implicants for a simplified two-level logic implementation [75].
Pseudo-code for this algorithm is shown in Figure 8.14 (Minimize_SOP). The heart of this
iterative algorithm is the consensus-taking routine, which applies the Boolean theorem
$x + \overline{x}y \rightarrow x + y$. After each pass through the product list, products
subsumed (covered) by other products are removed. The iterative-consensus algorithm generates a
complete sum for the input expression. Non-essential products can then be removed to generate a
minimal covering sum.
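A direct (exponential) form of the iterative-consensus step can be sketched as follows. This
is a minimal illustration and not the heuristic used in the compiler: products are frozensets
of (variable, polarity) literals, the function names are hypothetical, and the final selection
of a covering subset of prime implicants is deliberately left out.

```python
from itertools import combinations

# Sketch: iterative consensus producing the complete sum (all prime implicants).
def consensus(p, q):
    """Consensus of two products: defined when exactly one variable appears
    with opposite polarity in p and q; that variable is dropped."""
    opposed = [var for (var, pol) in p if (var, not pol) in q]
    if len(opposed) != 1:
        return None
    return frozenset(lit for lit in p | q if lit[0] != opposed[0])

def absorb(products):
    """Remove every product subsumed by another one (a strict superset of its literals)."""
    return {p for p in products if not any(q < p for q in products)}

def complete_sum(products):
    """Repeat consensus and absorption until no new product survives."""
    current = absorb(set(products))
    while True:
        found = set()
        for p, q in combinations(current, 2):
            c = consensus(p, q)
            if c is not None:
                found.add(c)
        merged = absorb(current | found)
        if merged == current:
            return current
        current = merged

# p4 of the wc example: C0 + (not C0) C1
P4 = {frozenset({("C0", True)}), frozenset({("C0", False), ("C1", True)})}
```

For P4, the consensus of the two products is the single literal C1, which then absorbs the
second product, so complete_sum(P4) returns the two products C0 and C1. This is the
simplification quoted above for the computation of p4.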
In this application, the Boolean predicate expressions can be composed of a large number
of variables and products (more than thirty in some instances), rendering a direct
implementation of the iterative-consensus algorithm, which is exponential, intolerably slow.
For this reason, when operating on large functions we apply a heuristic approximation to the
iterative-consensus method. This heuristic dramatically decreases the number of intermediate
products, and therefore renders the compile time reasonable. Furthermore, using this heuristic,
the selection of the minimal sum-of-products expression (covering subset), also ordinarily an
expensive procedure, is reduced to a linear form.
The cost of this heuristic is that the result could be suboptimal, which could cause the
generation of expressions with more predicate define instructions than necessary. Depending
on the order in which the comparisons are made, the heuristic may eliminate some products
that are necessary to generate other simpler products. To minimize this problem the heuristic
includes a manipulation which sorts the products in order to reduce the likelihood of a
non-optimal solution.
Figure 8.13(c) shows the expressions to which the essential predicates of the wc example
are reduced in the logic optimization phase. These expressions are both less complex and more
parallel than the original functions.
Two-level predicate synthesis. Following optimization of the predicate expressions, the
control logic can be synthesized most intuitively as a two-level predicate define network which
directly evaluates the minimized sum-of-products expression. In this approach, two levels of
predicate define instructions are used for each predicate. The first level consists of and-type
predicate defines of the form p_i at = C_i ⟨T⟩, where one predicate p_i is defined for each product
term in the predicate expression, and T is the TRUE predicate, which always has the value
1. The second level consists of or-type predicate defines of the form p_j ot = (cond_T) ⟨p_i⟩,
where there is one such predicate define for each product (p_i) and cond_T is an invariant TRUE
condition (e.g. (0 == 0)). Thus, a predicate expression having L literals and M products
consumes M + 1 predicates and performs L + M predicate assignments. Continuing the wc
example in Figure 8.13(d), note that the two special cases of two-level predicate synthesis
occur, in which the computation of functions containing a single product and functions that
are disjunctions of single-literal products can be performed in a single cycle. Note also that
predicates which have products in common can share intermediate predicates, allowing for
some savings through reuse. In most cases, however, two-level synthesis generates an enormous
number of predicate define instructions, since redundancy between products is not reduced.
Furthermore, since the evaluation of such a predicate define network usually takes at least
two cycles after the last condition becomes available (one for the and-level and one for the
or-level), the result may also be suboptimal in latency, even when scheduled for infinite issue.
Results demonstrating both these phenomena are presented in Section 8.4.6. Clearly, a more
sophisticated technique is required.
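The counting argument above (M + 1 predicates and L + M predicate assignments) can be illustrated with a small sketch. The function name two_level_synthesis, the textual form of the emitted defines, and the naming of the intermediate predicates are hypothetical choices made here for illustration; the single-cycle special cases and the sharing of common products mentioned in the text are deliberately not modeled.

```python
def two_level_synthesis(target, products):
    """Emit a two-level define network for a sum-of-products expression.

    target   -- name of the predicate to compute (e.g. "p2")
    products -- list of products, each a list of condition names
    Returns (defines, predicates_used).
    """
    and_level, or_level = [], []
    for i, product in enumerate(products):
        intermediate = f"{target}_and{i}"  # hypothetical naming scheme
        # First level: one and-type define per literal, guarded by TRUE.
        for cond in product:
            and_level.append(f"{intermediate} at = {cond} <T>")
        # Second level: one or-type define per product, guarded by the
        # intermediate predicate, with an invariant TRUE condition.
        or_level.append(f"{target} ot = (0 == 0) <{intermediate}>")
    predicates_used = len(products) + 1  # M intermediates plus the target
    return and_level + or_level, predicates_used

# Usage: C0C1C3 + C0C2C3 has L = 6 literals and M = 2 products.
defines, predicates_used = two_level_synthesis(
    "p2", [["C0", "C1", "C3"], ["C0", "C2", "C3"]]
)
```

For this input the sketch emits L + M = 8 defines and uses M + 1 = 3 predicates, which shows how quickly the define count grows when products are not shared.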
Factorization. In the example of the previous section, the code sample from wc exhibited
a large ratio of control height to computation height, and the computation was nearly completely
dependent on the outcome of the decision mechanisms. Thus, it was important to compress the
height of the entire decision structure as much as possible, as any reduction in the decision height
improved performance. Furthermore, since the predicate conditions were strongly related, the
resulting predicate define structure actually reduced the predicate and predicate define count.
In many other situations, however, predicates are based on more independent conditions and
the number of predicate define instructions required to generate a two-level network may be
quite large. Factorization seeks to use the code's computation or datapath height to hide some
portions of the decision latency which are not on the critical path. Thus, the optimizer is free to
focus on reducing implementation size rather than delay when implementing these non-critical
sections, saving valuable predicate registers and instruction issue resources.
The factored generation method determines how much factoring can be performed at no
cost. The availability times of conditions and the time at which predicate values are needed by
the computation component drive the factorizer. If parallel computation height, rather than
predicate define height, is the critical path through the code segment, then it is beneficial to
perform factorization instead of full expression flattening.
To measure the availability times of conditions and the time at which predicate values are
needed, a special version of the code is scheduled. This version of the code has all the predicate
dependences between predicate defines removed. For each condition, a predicate destination
is added for each predicate whose function depends on that condition. In the resultant code,
predicate define instructions are placed as early in the schedule as their condition availability
will allow. Also, all uses of a predicate are placed as early as possible, but after all the conditions
which may be needed to compute it. By extracting the issue time of these predicate defines
and predicate uses, the amount of time the new predicate network has to compute predicates
without performance penalty is ascertained. This information is then used together with the
previously extracted predicate expressions in later stages of optimization.
With factorization, the goal is to form intermediate predicates as the conditions to compute
them become available, and then to reuse these intermediate predicates in the computation of
the essential predicates. This activity factors the optimized sum-of-products expression or
its products so that the resulting define structure may take more cycles, but can reuse more
intermediate predicates, thus saving predicate defines and predicate registers.
In certain cases, when resource utilization is very high and predicate functions are very
complex, factorization becomes critical for performance. In some cases, generation of code which
would optimally generate the predicate results on an infinitely wide machine could actually
degrade performance in a real machine due to excessive width. In these situations, an additional
factorization preprocessing stage is applied, in which predicates are selectively factored on
subexpressions available in essential predicates generated earlier in the original code. This
activity, shown in lines 2 through 8 of Factorize in Figure 8.14, has the effect of moderating the
restructuring of control in cases where reordering of the predicate expressions would generate
a define network too wide for the target architecture.
Figure 8.15 shows an example extracted from the function cofactor of the 008.espresso
benchmark. The minimal sum-of-products is computed for each of the final predicates, as
shown in Figure 8.15(a). Next, with the help of condition availability and predicate use times
from Figure 8.15(a) and 8.15(b), all useful predicates are factorized, and common expressions are
shared. Figure 8.15(c) shows the result of this method. This factoring results in the reduction
of the number of predicate define instructions from 37 to 13. Furthermore, the useful predicates
(p1 and p2) are available a single cycle after the last condition is evaluated, sooner than would
be possible using a two-level synthesis of the predicate expressions, two cycles after the last
condition evaluation.

Pred   Expression                                Use cycle
p1     C0C2C4C5 + C0C2C3C5 + C0C1C5              6
p2     C0C2C4C5C6 + C0C2C3C5C6 + C0C1C5C6        7
(a) Optimized predicate expressions.

Condition      C0  C1  C2  C3  C4  C5  C6
Available at   1   1   2   3   4   5   6
(b) Condition availability.

Time   Predicate expression
1      p3 ut = C0
       p4 at = C0
       p4 at = C1
2      p5 ut = C2
       p6 ut = C2 ⟨p3⟩
3      p7 ut = C3 ⟨p6⟩
4      p8 ut = C4 ⟨p6⟩
5      p1 of = C5 ⟨p7⟩
       p1 of = C5 ⟨p8⟩
       p1 of = C5 ⟨p4⟩
6      p2 ut = C6 ⟨p1⟩
(c) Factoring with schedule time information.

Figure 8.15: Factorized predicate define optimization.

In the direct sum-of-products conversion, the computation of p1 and p2 begins at
cycle 5 and cycle 6 respectively, at the availability time of their latest conditions; results are available two
cycles later. With the factorization method, however, predicates p1 and p2 can be evaluated
in a single cycle after the availability of C5 and C6. Thus, in some cases, the factorization
method is able to reduce predicate latency by one cycle compared to the result of the direct
sum-of-products conversion.
8.4.5 Architecture Support for Synthesis
Pred   Expression            Use cycle
p1     C1 + C2               3
p2     C0C1C3 + C0C2C3       4
(a) Optimized predicate expressions.

Condition      C0  C1  C2  C3
Available at   1   2   2   3
(b) Condition availability.

Time   Predicate expression
1      p2 ^t = C0
2      p1 ot = C1
       p1 ot = C2
3      p2 ^t = C3 ⟨p1⟩
(c) Factorization with conjunctive-type predicate defines.

Time   Predicate expression
1      p3 at = C0
       p4 at = C0
2      p1 ot = C1
       p1 ot = C2
       p3 at = C1
       p4 at = C2
3      p3 at = C3
       p4 at = C3
4      p2 ot = TRUE ⟨p3⟩
       p2 ot = TRUE ⟨p4⟩
(d) No factorization.

Time   Predicate expression
1      p2 at = C0
2      p1 ot = C1
       p1 ot = C2
       p3 af = C1
       p3 af = C2
3      p2 at = C3
       p2 af = TRUE ⟨p3⟩
(e) Factorization without conjunctive-type predicate defines.

Figure 8.16: Various methods of predicate expression regeneration.

The description of predicate optimization in the previous sections disregarded the means by
which Boolean expressions are converted back into predicate defining instructions. This section
examines the instruction set considerations that evolved in supporting an effective predicate
synthesis system.
Implementation of two-level predicate synthesis is straightforward in the HPL PlayDoh
predicate architecture. For example, in Figures 8.11 and 8.13(c), a simple sum-of-products
expression is converted into a small set of predicate defines.
Synthesis of multi-level factored functions is not as simple as that of product-of-sums or
sum-of-products expressions, but yields significant improvements in both performance and predicate
define count. When an expression is factored out of one or more predicate expressions, its value
is computed and stored in a predicate for later use. After factoring, expressions to be synthesized
thus contain predicates as well as conditions. To illustrate the use of factoring, the example in
Figure 8.16 is presented. In Figure 8.16(a), predicate p1 is a subexpression of p2. Factoring
C1 + C2, or p1, out of p2 allows more sharing of predicate defines between predicate computations.
As can be seen in Figure 8.16(b), this subexpression can be computed in cycle 2 using or-type
predicate defines. The availability of this expression before the computation of p2 allows an
efficient application of factorization. In cycle 3, the conjunction of the subexpression stored in
p1 with the previous value of p2 and C3 is required. This expression is awkward to compute
using the PlayDoh predicate define semantics because the logical combination of predicates is
not directly supported. With the extension to the PlayDoh predicate define semantics, this
expression can be computed with a single conjunctive-type predicate define. Figure 8.16(c)
shows the final set of predicate defines used to compute the factored predicate expressions.
The two expressions are computed using a total of two predicates and four predicate defines.
The last predicate define conjoins p1 and C3 to the previous contents of p2 (C0) to finish the
computation of the p2 expression.
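The equivalence between the factored define sequence of Figure 8.16(c) and the sum-of-products expressions of Figure 8.16(a) can be checked exhaustively. The sketch below takes the expressions as printed in the figure and makes two simplifying assumptions that are not stated in the text: an or-type destination starts at FALSE and a conjunctive-type destination starts at TRUE, and a conjunctive-type define ANDs its condition and guarding predicate into the destination. The function names are hypothetical.

```python
from itertools import product

def factored(c0, c1, c2, c3):
    """Evaluate the four defines of Figure 8.16(c) in schedule order."""
    p1, p2 = False, True      # assumed initial values (see lead-in)
    p2 = p2 and c0            # time 1: p2 ^t = C0
    p1 = p1 or c1             # time 2: p1 ot = C1
    p1 = p1 or c2             # time 2: p1 ot = C2
    p2 = p2 and c3 and p1     # time 3: p2 ^t = C3 <p1>
    return p1, p2

def reference(c0, c1, c2, c3):
    """Sum-of-products expressions as printed in Figure 8.16(a)."""
    p1 = c1 or c2
    p2 = (c0 and c1 and c3) or (c0 and c2 and c3)
    return p1, p2

def equivalent():
    """Compare both forms over all 16 assignments of C0..C3."""
    return all(
        factored(*values) == reference(*values)
        for values in product([False, True], repeat=4)
    )
```

Under these assumptions the check passes for all sixteen assignments, since C0·C3·(C1 + C2) expands to C0C1C3 + C0C2C3.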
Benefit of Architectural Extension. The primary use of the conjunctive-type predicate
defines is to reduce the number of instructions required to compute factored expressions. This
reduction is best illustrated when the generation of the predicate expressions is done without
the conjunctive type. Figures 8.16(d) and 8.16(e) show two generation options that do not use
the conjunctive type. In Figure 8.16(d), no factorization is performed and the direct
sum-of-products expressions are computed. This approach requires a total of ten predicate defines,
six more instructions than were required in Figure 8.16(c). Further, the two-level nature of
the sum-of-products generation adds an extra level of dependence height. In Figure 8.16(e),
factorization is performed, but the conjunctive type is not used. Here, a total of seven predicate
defines, three extra instructions, is necessary. Of these, two predicate defines are needed to
compute the complement of the factored expression. This is done by applying De Morgan's
theorem. Another method of complementing p1 could have been used, but it would have cost a
cycle of latency. The third extra predicate define is used to nullify p2 if the complement of the
factored predicate is TRUE. Note that the disjunctive-type predicate defines are analogously
useful when product-of-sums expressions are used.
8.4.6 Experimental Results
The effectiveness of the Boolean minimization techniques for generating predicated code is
evaluated in this section. These techniques have been implemented within the IMPACT
experimental compiler framework and applied to a set of benchmarks.
Processor Model and Benchmarks. The processor modeled is an 8-issue processor with
in-order execution and register interlocking. The processor has no limitation on the combination
of instructions that may be issued each cycle, except that only one branch may be executed per
cycle. The instruction latencies assumed match those of the HP PA-7100 microprocessor. The
instruction set contains a set of non-trapping versions of all potentially excepting instructions,
with the exception of branch and store instructions, to support aggressive speculative execution.
The instruction set also contains support for predicated execution as described in Section 8.1.
The execution time for each benchmark was obtained using the IMPACT emulation-driven
simulator. Some dynamic effects such as branch mispredictions, cache misses, and TLB misses
were not measured. This decision was made to ensure that the experimental results highlight
the effects of the techniques being evaluated. Since the reformulation of the predicate decision
logic does not affect the basic nature of memory access patterns and branch histories, any
change in these dynamic effects between the original and optimized codes would be spurious in
nature.
The benchmarks used in this experiment consist of 13 non-numeric programs: four of the
SPECINT 92 benchmarks, 008.espresso, 022.li, 026.compress, 072.sc; six of the SPECINT 95
benchmarks, 099.go, 124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg; and three UNIX
utilities, cccp, lex, wc.
Results. The first set of results presented compares the performance of a code set
transformed with the described techniques to the performance of a baseline code set. The baseline
code consists of the best code generated by the IMPACT compiler for a predicated architecture
using hyperblock compilation techniques. The transformed code corresponds to the baseline
hyperblock code after Boolean minimization techniques are used to restructure the predicate
defines, and after the code is rescheduled. Performance is derived by computing the ratio of the
execution cycle count for the baseline code to that of the transformed code. The performance is
examined at two levels, first at the overall benchmark level and then at the benchmark function
level.

[Bar chart not reproduced. Vertical axis: speedup, with ticks from 1.00 to 1.30; a label of 1.37
also appears in the chart. Horizontal axis: 008.espresso, 022.li, 026.compress, 072.sc, 099.go,
124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg, cccp, lex, wc. Series: 8-issue; 8-issue, 256-preds.]
Figure 8.17: Speedup from minimization of program decision logic.

The overall benchmark speedups are presented in Figure 8.17. For each benchmark, two
results are reported. The first is the benchmark speedup on the target architecture. The
unweighted average speedup for all the benchmarks is 1.13. For some benchmarks, such as 022.li,
026.compress, 129.compress, and wc, the program decision height was significantly limiting
performance throughout the most frequently executed portions of the code; when this height is
reduced by our techniques, speedups of around 1.2 are achieved.
The second result presented for each benchmark, labeled "8-issue, 256-preds," is the speedup
on a hypothetical machine capable of issuing eight non-predicate-define instructions and up to
256 predicate defines per cycle. The significance of the second set of numbers is that they
reflect only the dependence height of predicate defines, while eliminating their resource
consumption characteristics. These results suggest a logical upper bound for gains possible with
more effective factorization techniques. In most benchmarks, the optimizer produced a number
of predicate defines that was appropriate for the schedule and machine model. However, in four
benchmarks, 008.espresso, cccp, 126.gcc, and lex, the optimizer was unable to balance height
reduction with resource consumption and performance was penalized. This effect was very
dramatic in 008.espresso because it is very decision height limited. Unfortunately, the excessive
optimization opportunity available in 008.espresso allowed the current minimization heuristic
to be overly aggressive in reducing height. With more advanced factorization techniques, the
number of predicate defines could be reduced in these instances, more closely approximating
the "8-issue, 256-preds" results.
Overall, the full benchmark results are encouraging. In most cases, the benefit of our
technique was limited solely by the bottleneck created by program computation height. During
our experimental exploration, we observed that as optimizations which target computation
height were improved, the decision logic became dominant and relative speedups improved. In
particular, data and memory dependences seemed to hide much of the program decision height
reduction in many important hyperblocks. As the various components of compiler technology
mature, the overall effectiveness of Boolean minimization will improve.

                             Original  Two-Level Synthesis    Factored Synthesis
Benchmark, Function          #pdi      #pdi   S(∞)   S(8)     #pdi   S(∞)   S(8)
008.espresso, essen_parts    39        1293   1.29   0.39     49     1.24   1.16
022.li, xleval               48        485    1.07   0.66     80     1.10   1.10
022.li, mark                 42        67     1.48   1.48     53     1.50   1.48
026.compress, compress       60        456    1.20   1.03     221    1.23   1.23
072.sc, update               141       240    1.15   1.15     159    1.23   1.23
099.go, getefflibs           98        1083   1.06   0.98     204    1.07   1.07
124.m88ksim, execute         41        47     1.12   1.12     40     1.12   1.12
124.m88ksim, goexec          176       175    1.10   1.09     155    1.09   1.08
124.m88ksim, load_data       42        54     1.30   1.30     53     1.30   1.30
124.m88ksim, loadmem         84        88     1.13   1.13     84     1.13   1.13
126.gcc, invalidate          89        202    1.27   1.24     125    1.22   1.21
126.gcc, flow_analysis       64        92     1.77   1.69     58     1.86   1.86
126.gcc, canon_hash          89        149    1.88   1.20     116    1.90   1.74
129.compress, compress       63        154    1.21   1.21     98     1.26   1.26
130.li, mark                 55        148    1.15   1.14     101    1.19   1.19
132.ijpeg, forward_DCT       31        47     1.46   1.35     32     1.46   1.43
cccp, skip_if_group          157       208    1.23   1.05     190    1.32   1.24
lex, cgoto                   236       330    1.31   1.10     260    1.18   1.14
wc, main                     56        48     1.22   1.31     48     1.22   1.22
Table 8.4: Speedup and predicate define count for selected functions.

To better understand the effect program decision logic minimization has on complete
programs, we measured the performance and code size characteristics of a number of selected
functions. Table 8.4 examines the performance of one or more functions from each of the
benchmarks. These functions were chosen based on two criteria: significant program execution
time and potential for optimization (e.g., the control height was significant relative to the
computation height). The table compares the effectiveness of two strategies for program logic
transformation: two-level predicate synthesis and factorization. For each strategy, the static
number of predicate define instructions, the performance gain on an 8-issue processor with
unconstrained predicate define resources (∞), and the performance gain on the 8-issue processor
are reported. In addition, the static number of predicate define instructions in the code before
minimization is reported.
From the table, the two-level synthesis approach shows mixed results. For the
unconstrained machine, the reduction in height translates directly into large speedups. However, the
unconstrained performance does not always translate into the same performance gain on the
8-issue processor. This is most pronounced in 008.espresso, essen_parts, where the 1.29 speedup
is sharply reduced to 0.39. The primary reason for this behavior is the large increase in the
number of predicate define instructions. The predicate defines that are created oversaturate
the processor resources and result in a loss of performance. Correspondingly, when the number
of predicate defines is not increased by a large amount, the unconstrained performance does
indeed translate directly into performance on the 8-issue processor. Clearly, factored synthesis
is necessary for successful optimization of program decision logic.
As shown in the table, the factored approach yields both larger and more consistent
speedups. Both methods reduce the predicate computation height, but the factored approach
dramatically reduces the number of predicate defines required for the optimization. The
function 126.gcc, canon_hash provides a good example of this behavior. Both methods achieve good
speedup for the unconstrained processor. However, the two-level synthesis approach requires
149 predicate defines to accomplish the improvement. For the 8-issue processor, most of the
performance gain is lost due to this increase in instructions. The factored approach reduces
the number of predicate defines to 116, increasing the 8-issue speedup to 1.74. The number
of predicate defines is still more than the original 89. Note, however, that simply increasing
the number of predicate defines from the original code is not necessarily viewed as a negative.
Boolean minimization approaches do this systematically to improve performance by identifying
condition subexpressions that can be computed early. This allows the final predicate to be
made available as soon as possible after the final condition is ready. However, the factored
approach is consistently more effective because it factors predicate expressions into multiple-level
structures which are less demanding of processor resources than two-cycle evaluations. Another
interesting result is that for some functions, such as update from 072.sc, the factored synthesis
method outperforms the two-level method, even at infinite issue. This is due to the
ability of the factorizer to generate expressions in one cycle rather than the two usually required
by the two-level synthesis approach.
The final experiment examines the effectiveness of the
new predicate types (conjunctive and disjunctive, described in Section 8.1) in the context of
Boolean minimization and justifies the need for the proposed architectural extensions. Table 8.5
presents the effects of the new predicate define types on the speedup for an 8-issue processor,
the dynamic predicate define count, and the static predicate define count. The conjunctive and
disjunctive types allow certain important logical combinations of predicates and conditions to
be expressed more efficiently. For all functions except 022.li, mark and 130.li, mark, the
performance gained from the program decision logic optimization is diminished when the proposed
predicate define types are not available. Further, in six of the nineteen functions, the
performance improvement is converted into a performance loss. The most dramatic example of this is
126.gcc, flow_analysis, in which a 46% performance improvement becomes an 8% performance
degradation.
The lack of the new predicate define types in the target architecture also causes a
code size penalty. In general, the additional predicate types allow significant reductions in both
the static and dynamic predicate define counts. In one case, 74% more predicate defines are
required if the new types are not available. Six functions do not exhibit this penalty. In these
functions, the majority of the predicate expressions are sums of single-term "products," making
the conjunctive type unnecessary for instantiating these functions.
8.5 Conclusion
This chapter presented the potential benefits of introducing predicated execution
into embedded processors, in terms of both control flow optimization and code size.
The proposed prefix-based predicated execution architecture framework has the potential
to significantly enhance the effectiveness of introducing predicated execution into embedded
microprocessors. For regions of non-predicated code, the prefix-based method offers better
code density characteristics than traditional models of predication support. For predicated
regions, the prefix-based method offers a performance improvement over an architecture without
predication support. It was illustrated that an optimizing compiler can enhance the
prefix-based predication model by performing aggressive instruction merging and predicate promotion
to reduce the number of predicated instructions by 30%. Overall, prefix-based predication
achieves a 12% performance improvement for code created with superscalar optimizations and
reduces code size by 25%.
Also, a new method for optimizing programmatic control flow was presented. This
approach provides a systematic methodology for reformulating program control flow for more
efficient execution on ILP processors. Control expressed through branches and predicate
defines is extracted and represented as a program decision logic network. Boolean minimization
techniques are applied to the network both to reduce dependence height and to simplify the
component expressions. Redundancy is controlled by employing a schedule-sensitive
factorization technique to identify intermediate logical combinations of conditions that can be shared.
After optimization, the network is reformulated into predicated code.

                             Speedup (8)       Pred. def. count penalty w/o ^t/^f
Benchmark, Function          with    without   dynamic   static
008.espresso, essen_parts    1.16    0.96      17.2%     17.8%
022.li, xleval               1.10    1.08      35.4%     35.0%
022.li, mark                 1.48    1.48      11.5%     11.3%
026.compress, compress       1.23    1.13      59.8%     60.2%
072.sc, update               1.23    0.98      4.3%      5.0%
099.go, getefflibs           1.07    1.06      17.1%     21.1%
124.m88ksim, execute         1.12    0.89      16.9%     10.0%
124.m88ksim, goexec          1.08    0.90      6.3%      6.5%
124.m88ksim, load_data       1.30    1.07      15.3%     11.3%
124.m88ksim, loadmem         1.13    1.02      74.1%     14.3%
126.gcc, invalidate          1.14    0.77      30.3%     22.4%
126.gcc, flow_analysis       1.86    0.93      0.1%      0.0%
126.gcc, canon_hash          1.74    1.60      11.4%     10.5%
129.compress, compress       1.26    1.10      53.4%     35.7%
130.li, mark                 1.19    1.19      18.2%     17.8%
132.ijpeg, forward_DCT       1.43    1.33      0.0%      0.0%
cccp, skip_if_group          1.24    1.20      16.8%     14.2%
lex, cgoto                   1.14    1.07      4.7%      10.8%
wc, main                     1.22    1.16      4.2%      4.2%
Table 8.5: Effects of conjunctive-type predicate defines on speedup and instruction count.

An extension to the HPL PlayDoh model of predication that allows more efficient
computation of the predicate expressions produced by the minimization techniques, namely the
conjunctive and disjunctive predicate assignment types, was also presented. Experimental
results show that in blocks of predicated code with significant control height, the application of
logic minimization techniques together with these architectural enhancements provides a
substantial performance benefit. Across the benchmarks studied, program decision logic minimization
provided an average overall speedup of 1.13 for an 8-issue processor. The new predicate
assignment types were also shown to significantly reduce the number of predicate define instructions
required. As compiler technology progresses to make more extensive and effective use of
predicated code, minimization of program decision logic is likely to become an increasingly
important part of total program optimization.
Chapter 9
Conclusion
This thesis investigated the benefits of synergistic hardware-compiler ILP architectures for
low-power processors. New solutions were proposed to integrate multiple-issue pipelines
into mobile architectures, and a detailed analysis of the resulting systems was carried out.
Chapter 2 gave a brief survey of the major ILP techniques that are used to enhance
performance. The main concepts of ILP were introduced and several architectures that exploit
ILP were described. Furthermore, the compiler support for such architectures was presented.
Chapter 3 strongly motivated the use of parallelism for the design of an energy-efficient
microprocessor. First, it gave an introduction to power consumption in CMOS circuits. Then,
several metrics and their meaning were described. Finally, the effect of parallelism on such
metrics was explained.
An overview of the state of the art in low-power 32-bit mobile processors was then
given in Chapter 4. It pointed out that, for embedded processors, ILP is generally exploited
only through pipelining techniques and that, surprisingly, VLIW architectures have not yet been
introduced in the mobile processor market, even though their inherent simplicity can offer low
power consumption and improved performance compared to scalar architectures.
Chapter 5 described a high-level evaluation of the benefits of VLIW for low-power
processors. It was demonstrated, through the use of high-level power consumption estimates, that the
introduction of VLIW architectures into low-power embedded 8-bit or 16-bit microcontrollers
yields a significant improvement of the energy efficiency during inner loop execution.
Motivated by these experimental results, Chapter 6 proposed a new VLIW architecture
called DEVIL, which targets the low-power mobile processor market. DEVIL includes a new fetch
mechanism that explicitly encodes the parallelism within an instruction bundle and supports
a variable instruction length mechanism. It was demonstrated that this mechanism allows savings of
up to 50% in code size compared to a standard VLIW fetch mechanism while keeping
performance unchanged. This is an important result since cost, a central concern in embedded
systems, depends directly on code size. Furthermore, this fetch mechanism allows a
significant reduction of the memory traffic (approximately 16%), proportionally decreasing the power
consumption required for the instruction fetches. The effect of superscalar optimizations on
performance was also investigated. It was shown that superscalar optimizations are required
to achieve good performance levels, implying a large increase in code size (58% on average) due
to code duplication (e.g., tail duplication). The effects of code size expansion are minimized
through the compaction technique offered by the DEVIL architecture. Note that the compiler
was not tuned to minimize code size and that DEVIL's partial predication support was not
used, meaning that code size could be optimized much further. In terms of performance, DEVIL
speeds up execution by a factor of 1.5 on average compared to a scalar processor.
This performance enhancement allows lower clock frequencies and power supply voltages, thus
reducing the circuit's power consumption.
In order to obtain accurate estimates of DEVIL's features in terms of complexity, circuit
speed, and power consumption, Chapter 7 described an implementation of the DEVIL
processor. Using this implementation, estimates of the circuit's complexity, speed, and power
consumption were computed, so as to complete the evaluation of the benefits of VLIW
architectures for low-power processors. In terms of circuit speed, DEVIL runs at 50 MHz, which is
quite slow for a 0.25 µm technology. This is due to the synthesis methodology, as well as the lack
of resources to optimize DEVIL's datapath. The complexity of DEVIL was estimated to be
around 125,000 transistors, categorizing DEVIL as a simple circuit that should have a small
die area; this shows that the simplicity of VLIW architectures is well adapted to embedded
systems. It was also shown that the dispatch unit introduced to handle the variable
instruction length increases the circuit complexity by only 4%, which is negligible compared to the
demonstrated benefits of such a mechanism. Based on the estimated features of the design
and on the overhead introduced by the VLIW architecture, it was shown that parallelism allows
an improvement of the energy efficiency of about 38% on average. This gives DEVIL the
attractive ability to execute code at the same speed as a scalar processor while
consuming much less power.
These results clearly validate the benefits of VLIW architectures for low-power mobile
processors. The major drawback is the code size penalty caused by the use of high-level languages
and superscalar optimizations. Chapter 8 made a first step toward predicated execution for
embedded processors and proposed new solutions to optimize the code size and the control flow.
A prefix-based predicated execution architecture framework has been proposed that has
the potential to significantly enhance the effectiveness of introducing predicated execution into
embedded microprocessors. Overall, prefix-based predication achieves a 12% performance
improvement for code created with superscalar optimizations and reduces code size by 25%. Also,
a new method for optimizing a program's control flow was presented. This approach provides
a systematic methodology for reformulating program control flows for more efficient execution
on ILP processors. Control expressed through branches and predicate defines is extracted and
represented as a program decision logic network. Boolean minimization techniques are applied
to the network both to reduce dependence height and to simplify the component expressions.
Redundancy is controlled by employing a schedule-sensitive factorization technique to identify
intermediate logical combinations of conditions that can be shared. After optimization, the
network is reformulated into predicated code.
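One ingredient of this optimization, the reduction of dependence height, can be illustrated on a plain conjunction of conditions. The sketch below is not the minimization algorithm presented in Chapter 8; it only re-associates an AND chain into a balanced tree and counts the resulting levels. The function names are illustrative, and both functions expect at least one condition.

```python
def chain_and(conds):
    """Evaluate c1 & c2 & ... & cn serially; returns (value, height)."""
    value, height = conds[0], 0
    for c in conds[1:]:
        # Each AND depends on the previous result: one more level.
        value, height = value and c, height + 1
    return value, height


def tree_and(conds):
    """Evaluate the same conjunction as a balanced tree; returns (value, height)."""
    level, height = list(conds), 0
    while len(level) > 1:
        # All pairs of one level are independent and could issue in parallel.
        nxt = [level[i] and level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element is carried to the next level
        level, height = nxt, height + 1
    return level[0], height
```

For eight conditions the serial chain has a height of 7 while the balanced tree has a height of 3, with the same logical result; on a multiple-issue machine the independent ANDs of one level can be scheduled in the same cycle.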
An extension to the HPL PlayDoh model of predication that allows more efficient
computation of the predicate expressions produced by the minimization techniques, namely the
conjunctive and disjunctive predicate assignment types, was also presented. Experimental
results show that, in blocks of predicated code with significant control height, the application of
logic minimization techniques together with these architectural enhancements provides a
substantial performance benefit. Across the benchmarks studied, program decision logic
minimization provided an average overall speed-up of 1.13 for an 8-issue processor. The new predicate
assignment types were also shown to significantly reduce the number of predicate define
instructions required. As compiler technology progresses to make more extensive and effective use of
predicated code, the minimization of program decision logic is likely to become an increasingly
important part of the total program optimization.
Future Work
During this thesis, a first prototype of a synergistic processor-compiler system was built. This
prototype made it possible to demonstrate the benefits of ILP architectures for low-power mobile
processors. However, the road is long before the fabrication of a commercial product can become
feasible.
DEVIL's design is currently a prototype and needs many more time-consuming design
optimizations. The current design estimates will guide the optimizations aimed at reducing the
power consumption and the critical path. Such optimizations should result in a lower power
consumption and a higher circuit speed, giving DEVIL a very attractive
energy-efficiency/performance ratio. A place and route of the circuit should then be carried out
to obtain accurate transistor-level estimates.
The current implementation of DEVIL suffers from a low circuit speed of 50 MHz. Even
with further optimizations, the maximum speed will be limited by the 3-stage pipeline. An
interesting step would be to loosen this constraint by increasing DEVIL's pipeline depth from
3 to 5 stages, allowing the processor to reach much higher clock frequencies. However, the
increase in complexity in terms of branch prediction and operand bypassing should be carefully
considered. According to the estimates extracted from the DEVIL implementation, a 5-stage
pipeline has good potential to improve the energy efficiency of DEVIL.
As stated at the beginning of this work, DEVIL is to be used in synergy with a compiler.
All the benefits of the DEVIL architecture rely on the quality of the code generated by the
compiler (remember that the parallelism is extracted at compile time). DEVIL's current
compiler is based on a new back-end for the IMPACT compiler, allowing it to generate
high-performance code. However, current optimizations incur a code size penalty that can affect the
total system cost, and therefore an effort should be made to optimize the IMPACT back-end
for DEVIL so that it generates smaller code while keeping the performance at the same level.
Currently, not all the features offered by DEVIL are supported by the back-end. For
example, DEVIL has a partial predication support that can lead to a significant code size reduction
with no performance loss. This feature can be used, for example, to avoid some tail duplications
during superblock formation. Also, the possibility of having operations that shift one of their
operands at no cost is not yet supported, but could also increase the code density.
DEVIL implements a very simple instruction set, allowing a fast and low-power pipeline.
However, the lack of complex instructions, such as multiply and accumulate or count leading
zeros, sometimes results in a severe performance degradation. As DEVIL uses a synthesis
methodology, an interesting approach would be to reserve some of the available opcodes for
instruction set specialization. The idea is to extend DEVIL to detect these definable instructions
and to send them to a dedicated datapath. The processor would then be resynthesized with the new
functionalities. This addition of an extra unit should be easy thanks to the modularity of the
VLIW architecture.
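The routing idea can be sketched as follows. This is a behavioural illustration in Python, not DEVIL's decoder: the opcode values, the table names, and the choice of count-leading-zeros as the custom operation are all hypothetical.

```python
def clz32(value):
    """Count leading zeros of a 32-bit value (returns 32 for zero)."""
    return 32 - (value & 0xFFFFFFFF).bit_length()


# Hypothetical opcode assignments, chosen only for this example.
BASE_UNITS = {0x01: lambda x: (-x) & 0xFFFFFFFF}   # e.g. a negate operation
CUSTOM_UNITS = {0x38: clz32}                       # reserved opcode -> custom unit


def execute(opcode, operand):
    """Send reserved opcodes to the custom datapath, all others to the base units."""
    if opcode in CUSTOM_UNITS:
        return CUSTOM_UNITS[opcode](operand)
    if opcode in BASE_UNITS:
        return BASE_UNITS[opcode](operand)
    raise ValueError(f"undefined opcode {opcode:#x}")
```

In this model, adding a specialized instruction amounts to registering one more entry in CUSTOM_UNITS, which mirrors the modularity argument made above.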
There are two ways to exploit this specialization mechanism. The first, and probably the
simplest one, is to have a library of coprocessor units that implement dedicated functions in
hardware, along with the compiler support to generate code using these new instructions. An
embedded system designer, knowing the requirements of the targeted application, could then
customize DEVIL and its compiler using the predefined library elements. Once the system
is tested and simulated, a new chip could be implemented with the adapted functionalities.
The second method is much more challenging and goes in the direction of software-hardware
codesign. The idea is to let the compiler decide how DEVIL's instruction set should be
customized. Once the decision is made, the compiler should generate the assembly code including
the custom instructions as well as the VHDL code that implements the custom functional unit.
In conclusion, ILP helps to improve the energy efficiency of low-power embedded
processors. The results obtained with our first prototype are very satisfactory and many improvements
are not only possible, but also fairly simple to introduce, motivating further investigations and
developments in this direction.
Appendix A
The DEVIL's Instruction Set Summary
A.1 Function Definitions
<c>        The operation can be conditionally executed on T or not(T)
SHF(op)    The operand op can be shifted; all the shifter functionalities are available
Sext(op)   The operand op is sign extended
Zext(op)   The operand op is extended with zeros
Oext(op)   The operand op is extended with ones
Zend(op)   The operand op is ended with zeros
Oend(op)   The operand op is ended with ones
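These operand functions can be made concrete with a small Python model. It is a sketch under two assumptions that are not stated in the list above: registers are 32 bits wide, and "ended with zeros/ones" means that the operand occupies the upper bits while the lower bits are filled with zeros or ones. The helper names simply reuse the names of the list in lower case.

```python
MASK32 = 0xFFFFFFFF  # assumed register width: 32 bits


def zext(op, bits=16):
    """Extend with zeros: keep the low `bits` bits, upper bits are 0."""
    return op & ((1 << bits) - 1)


def sext(op, bits=16):
    """Sign-extend the low `bits` bits of op to 32 bits."""
    op &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return ((op ^ sign) - sign) & MASK32


def oext(op, bits=16):
    """Extend with ones: upper bits are all 1."""
    return zext(op, bits) | ((MASK32 << bits) & MASK32)


def zend(op, bits=16):
    """End with zeros: op in the upper bits, lower bits are 0 (assumed meaning)."""
    return (zext(op, bits) << (32 - bits)) & MASK32


def oend(op, bits=16):
    """End with ones: op in the upper bits, lower bits are 1 (assumed meaning)."""
    return zend(op, bits) | ((1 << (32 - bits)) - 1)
```

Under these assumptions, a 16-bit immediate 0x1234 becomes 0xFFFF1234 with oext and 0x12340000 with zend, which is what the .l and .h forms of the logical instructions need in order to touch only one half of a register.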
A.2 Arithmetical Operations
15-bit version          30-bit version              Description
subi rsd, imm:5         subi rd, rs, imm:16         subtract immediate (unsigned)
  rsd = rsd - imm:5       rd = rs - imm:16
addi rsd, imm:5         addi rd, rs, imm:16         add immediate (unsigned)
  rsd = rsd + imm:5       rd = rs + imm:16
shli rsd, imm:5         shli rd, rs, imm:5          logical shift left
  rsd = rsd << imm:5      rd = rs << imm:5 <c>
shri rsd, imm:5         shri rd, rs, imm:5          logical shift right
  rsd = rsd >> imm:5      rd = rs >> imm:5 <c>
ashri rsd, imm:5        ashri rd, rs, imm:5         arithmetical shift right (signed)
  rsd = rsd >> imm:5      rd = rs >> imm:5 <c>
sub rsd, rs             sub rd, rs1, rs2            subtract registers
  rsd = rsd - rs          rd = SHF(rs1) - rs2 <c>
add rsd, rs             add rd, rs1, rs2            add registers
  rsd = rsd + rs          rd = SHF(rs1) + rs2 <c>
subsp imm:8             subsp imm:20                subtract immediate value from SP (unsigned)
  SP = SP - imm:8         SP = SP - imm:20
addsp imm:8             addsp imm:20                add immediate value to SP (unsigned)
  SP = SP + imm:8         SP = SP + imm:20
shl rsd, rs             shl rd, rs1, rs2            logical shift left
  rsd = rsd << rs         rd = rs1 << rs2 <c>
shr rsd, rs             shr rd, rs1, rs2            logical shift right
  rsd = rsd >> rs         rd = rs1 >> rs2 <c>
ashr rsd, rs            ashr rd, rs1, rs2           arithmetical shift right (signed)
  rsd = rsd >> rs         rd = rs1 >> rs2 <c>
neg rd, rs              neg rd, rs                  two's complement negation
  rd = -rs                rd = -rs <c>
cast.8 rd, rs           cast.8 rd, rs               cast to 8-bit unsigned
  rd = Zext(rs.8)         rd = Zext(rs.8) <c>
cast.16 rd, rs          cast.16 rd, rs              cast to 16-bit unsigned
  rd = Zext(rs.16)        rd = Zext(rs.16) <c>
ext.8 rd, rs            ext.8 rd, rs                byte (8-bit) sign extension
  rd = Sext(rs.8)         rd = Sext(rs.8) <c>
ext.16 rd, rs           ext.16 rd, rs               word (16-bit) sign extension
  rd = Sext(rs.16)        rd = Sext(rs.16) <c>
nop                     nop                         no operation
Table A.1: DEVIL's arithmetical instructions
A.3 Logical Operations
15-bit version          30-bit version              Description
-                       ori.l rd, rs, imm:16        logical OR
                          rd = rs | Zext(imm:16)
-                       andi.l rd, rs, imm:16       logical AND
                          rd = rs & Oext(imm:16)
-                       xori.l rd, rs, imm:16       logical XOR
                          rd = rs xor Zext(imm:16)
-                       ori.h rd, rs, imm:16        logical OR
                          rd = rs | Zend(imm:16)
-                       andi.h rd, rs, imm:16       logical AND
                          rd = rs & Oend(imm:16)
-                       xori.h rd, rs, imm:16       logical XOR
                          rd = rs xor Zend(imm:16)
or rsd, rs              or rd, rs1, rs2             logical OR
  rsd = rsd | rs          rd = SHF(rs1) | rs2 <c>
xor rsd, rs             xor rd, rs1, rs2            logical XOR
  rsd = rsd xor rs        rd = rs1 xor rs2
and rsd, rs             and rd, rs1, rs2            logical AND
  rsd = rsd & rs          rd = SHF(rs1) & rs2 <c>
not rd, rs              not rd, rs                  bit inversion
  rd = not(rs)            rd = not(rs) <c>
Table A.2: DEVIL's logical instructions
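The reason why andi.l and andi.h pad their immediate with ones (Oext, Oend) rather than zeros is that each instruction then leaves the other half of the register untouched, so that a full 32-bit mask can be applied with two instructions. The Python sketch below models this; it assumes 32-bit registers and reads Oend as "immediate in the upper half, ones in the lower half". The helper and_mask32 is hypothetical and only illustrates the pairing.

```python
def andi_l(rs, imm16):
    """andi.l: rd = rs & Oext(imm:16); the upper half of rs is preserved."""
    return rs & (0xFFFF0000 | (imm16 & 0xFFFF))


def andi_h(rs, imm16):
    """andi.h: rd = rs & Oend(imm:16); the lower half of rs is preserved."""
    return rs & (((imm16 & 0xFFFF) << 16) | 0x0000FFFF)


def and_mask32(rs, mask32):
    """Apply a full 32-bit mask as an andi.l / andi.h pair."""
    low_done = andi_l(rs & 0xFFFFFFFF, mask32 & 0xFFFF)
    return andi_h(low_done, (mask32 >> 16) & 0xFFFF)
```

The result of and_mask32(rs, m) equals rs & m for any 32-bit values, which is the property the ones-padding is meant to guarantee.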
A.4 Compare Operations
15-bit version              30-bit version                Description
testi.eq rs, imm:5          testi.eq rs, imm:20           test if equal (signed)
  T = (rs == Sext(imm:5))     T = (rs == Sext(imm:20))
testi.lt rs, imm:5          testi.lt rs, imm:20           test if less (signed)
  T = (rs < Sext(imm:5))      T = (rs < Sext(imm:20))
testi.le rs, imm:5          testi.le rs, imm:20           test if less or equal (signed)
  T = (rs <= Sext(imm:5))     T = (rs <= Sext(imm:20))
testi.sm rs, imm:5          testi.sm rs, imm:20           test if smaller (unsigned)
  T = (rs < Zext(imm:5))      T = (rs < Zext(imm:20))
testi.ss rs, imm:5          testi.ss rs, imm:20           test if smaller or equal (unsigned)
  T = (rs <= Zext(imm:5))     T = (rs <= Zext(imm:20))
btest rs, imm:5             btest rs, imm:5               bit test
  T = rs(bit(imm:5))          T = rs(bit(imm:5))
test.eq rs1, rs2            test.eq rs1, rs2              test if equal
  T = (rs1 == rs2)            T = (SHF(rs1) == rs2) <c>
test.lt rs1, rs2            test.lt rs1, rs2              test if less (signed)
  T = (rs1 < rs2)             T = (SHF(rs1) < rs2) <c>
test.le rs1, rs2            test.le rs1, rs2              test if less or equal (signed)
  T = (rs1 <= rs2)            T = (SHF(rs1) <= rs2) <c>
test.sm rs1, rs2            test.sm rs1, rs2              test if smaller (unsigned)
  T = (rs1 < rs2)             T = (SHF(rs1) < rs2) <c>
test.ss rs1, rs2            test.ss rs1, rs2              test if smaller or equal (unsigned)
  T = (rs1 <= rs2)            T = (SHF(rs1) <= rs2) <c>
Table A.3: DEVIL's compare instructions
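The "less" (test.lt) and "smaller" (test.sm) instructions compute the same relation on different interpretations of the register contents. The following Python sketch illustrates the difference, assuming 32-bit registers; the function names are illustrative and the optional shift of the first operand is ignored.

```python
def to_signed32(x):
    """Interpret a 32-bit pattern as a two's complement integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def cmp_lt_signed(rs1, rs2):
    """Model of test.lt: T = (rs1 < rs2) on signed values."""
    return to_signed32(rs1) < to_signed32(rs2)


def cmp_sm_unsigned(rs1, rs2):
    """Model of test.sm: T = (rs1 < rs2) on unsigned values."""
    return (rs1 & 0xFFFFFFFF) < (rs2 & 0xFFFFFFFF)
```

For the bit pattern 0xFFFFFFFF compared with 1, the signed test sets T (since the pattern represents -1), whereas the unsigned test clears it.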
A.5 Move Operations
15-bit version          30-bit version          Description
mov rd, rs              mov rd, rs              move register
  rd = rs                 rd = rs
ldi rd, imm:6           ldi rd, imm:20          move immediate (signed)
  rd = Sext(imm:6)        rd = Sext(imm:20)
cmovt rd, rs            cmovt rd, rs            conditional move
  rd = rs if T            rd = rs if T
cmovnt rd, rs           cmovnt rd, rs           conditional move
  rd = rs if not(T)       rd = rs if not(T)
movt rd                 movt rd                 move T flag to reg
  rd = Zext(T)            rd = Zext(T)
movnt rd                movnt rd                move not(T) flag to reg
  rd = Zext(not(T))       rd = Zext(not(T))
mov2mac mac, rs         mov2mac mac, rs         move reg to macro register
  mac = rs                mac = rs
movmac rd, mac          movmac rd, mac          move macro register to reg
  rd = mac                rd = mac
ion                     ion                     enable interrupts
ioff                    ioff                    disable interrupts
Table A.4: DEVIL's move instructions
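Conditional moves such as cmovt allow short branches to be replaced by straight-line code. The following Python sketch mirrors a three-instruction sequence (mov, test.lt, cmovt) that computes a maximum; the sequence is an illustrative example written for this appendix, not compiler output.

```python
def max_with_cmov(ra, rb):
    """Branch-free maximum, one Python statement per modelled instruction."""
    rd = ra                 # mov     rd, ra
    t = ra < rb             # test.lt ra, rb     T = (ra < rb), signed
    rd = rb if t else rd    # cmovt   rd, rb     rd = rb if T
    return rd
```

The equivalent branching code would need a conditional jump around the second move; the conditional move removes that jump and the code that may have to be duplicated behind it.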
A.6 Branch Operations
15-bit version    30-bit version    Description
jt_nt disp:10     jt_nt disp:25     jump if T, nullify next if taken
jt_nn disp:10     jt_nn disp:25     jump if T, nullify next if not taken
jnt_nt disp:10    jnt_nt disp:25    jump if not(T), nullify next if taken
jnt_nn disp:10    jnt_nn disp:25    jump if not(T), nullify next if not taken
jmp disp:10       jmp disp:25       unconditional jump
jsr disp:10       jsr disp:25       jump subroutine, save PC
jt_nt rs          jt_nt rs          jump if T, nullify next if taken
jt_nn rs          jt_nn rs          jump if T, nullify next if not taken
jnt_nt rs         jnt_nt rs         jump if not(T), nullify next if taken
jnt_nn rs         jnt_nn rs         jump if not(T), nullify next if not taken
jmp rs            jmp rs            unconditional jump
jsr rs            jsr rs            jump subroutine, save PC
ret               ret               return from subroutine
reti              reti              return from interrupt
Table A.5: DEVIL's branch instructions
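The four conditional jumps differ only in the tested condition and in when the instruction following the jump is nullified. The Python sketch below tabulates this behaviour; it reads "nullify next" as cancelling the instruction that immediately follows the jump, which is an interpretation of the table and not a description of the pipeline implementation. The function name is illustrative.

```python
def branch_outcome(mnemonic, t_flag):
    """Return (taken, next_executes) for jt_nt, jt_nn, jnt_nt and jnt_nn."""
    condition, nullify = mnemonic.split("_")
    if condition not in ("jt", "jnt") or nullify not in ("nt", "nn"):
        raise ValueError(f"not a conditional jump: {mnemonic}")
    # jt jumps when T is set, jnt when T is clear.
    taken = t_flag if condition == "jt" else not t_flag
    # _nt nullifies the next instruction if the jump is taken,
    # _nn nullifies it if the jump is not taken.
    nullified = taken if nullify == "nt" else not taken
    return taken, not nullified
```

For example, jt_nt with T set jumps and cancels the following instruction, while jt_nn with T set jumps and lets it execute.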
A.7 Data Memory Operations
15-bit version                    30-bit version                      Description
ld.8u rd, rs                      ld.8u rd, rs1, rs2                  load unsigned byte
  rd = Zext(mem.8[rs])              rd = Zext(mem.8[rs1+rs2])
ld.8 rd, rs                       ld.8 rd, rs1, rs2                   load signed byte
  rd = Sext(mem.8[rs])              rd = Sext(mem.8[rs1+rs2])
ld.16u rd, rs                     ld.16u rd, rs1, rs2                 load unsigned half word
  rd = Zext(mem.16[rs])             rd = Zext(mem.16[rs1+rs2])
ld.16 rd, rs                      ld.16 rd, rs1, rs2                  load signed half word
  rd = Sext(mem.16[rs])             rd = Sext(mem.16[rs1+rs2])
ld.32 rd, rs                      ld.32 rd, rs1, rs2                  load word
  rd = mem.32[rs]                   rd = mem.32[rs1+rs2]
st.8 rs1, rs2                     st.8 rs1, rs2, rs3                  store byte
  mem.8[rs1] = rs2                  mem.8[rs1+rs2] = rs3
st.16 rs1, rs2                    st.16 rs1, rs2, rs3                 store half word
  mem.16[rs1] = rs2                 mem.16[rs1+rs2] = rs3
st.32 rs1, rs2                    st.32 rs1, rs2, rs3                 store word
  mem.32[rs1] = rs2                 mem.32[rs1+rs2] = rs3
ld.8u rd, imm:5                   ld.8 rd, rs1, imm:16                load signed byte
  rd = Zext(mem.8[SP+imm:5])        rd = Sext(mem.8[rs1+imm:16])
ld.8 rd, imm:5                    ld.16u rd, rs1, imm:16              load unsigned half word
  rd = Sext(mem.8[SP+imm:5])        rd = Zext(mem.16[rs1+imm:16])
ld.16u rd, imm:5                  ld.16 rd, rs1, imm:16               load signed half word
  rd = Zext(mem.16[SP+2*imm:5])     rd = Sext(mem.16[rs1+2*imm:16])
ld.32 rd, imm:5                   ld.32 rd, rs1, imm:16               load word
  rd = mem.32[SP+4*imm:5]           rd = mem.32[rs1+4*imm:16]
st.8 rs, imm:5                    st.8 rs1, imm:16, rs3               store byte
  mem.8[SP+imm:5] = rs              mem.8[rs1+imm:16] = rs3
st.16 rs, imm:5                   st.16 rs1, imm:16, rs3              store half word
  mem.16[SP+2*imm:5] = rs           mem.16[rs1+2*imm:16] = rs3
st.32 rs, imm:5                   st.32 rs1, imm:16, rs3              store word
  mem.32[SP+4*imm:5] = rs           mem.32[rs1+4*imm:16] = rs3
Table A.6: DEVIL's load/store instructions
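In the 15-bit stack-relative forms, the 5-bit immediate is multiplied by the access size, which extends the reach of the short encoding for wider accesses. The Python sketch below computes the effective address according to the expressions of the table; it assumes that the immediate is unsigned and that addresses are 32 bits wide, and the function and table names are illustrative.

```python
ACCESS_BYTES = {"8": 1, "16": 2, "32": 4}  # scale factor per access width


def sp_relative_address(sp, imm5, width):
    """Effective address SP + scale * imm:5 of a 15-bit load or store."""
    if not 0 <= imm5 < 32:
        raise ValueError("imm:5 must fit in 5 bits (assumed unsigned)")
    return (sp + ACCESS_BYTES[width] * imm5) & 0xFFFFFFFF
```

With this scaling, a 32-bit access can reach up to 124 bytes above SP (31 words), whereas a byte access reaches 31 bytes.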
15-bit version    30-bit version                  Description
-                 ld.8 rd, label:15               load signed byte
                    rd = Sext(mem.8[label:15])
-                 ld.16u rd, label:15             load unsigned half word
                    rd = Zext(mem.16[label:15])
-                 ld.16 rd, label:15              load signed half word
                    rd = Sext(mem.16[label:15])
-                 ld.32 rd, label:15              load word
                    rd = mem.32[label:15]
-                 st.8 rs, label:15               store byte
                    mem.8[label:15] = rs
-                 st.16 rs, label:15              store half word
                    mem.16[label:15] = rs
-                 st.32 rs, label:15              store word
                    mem.32[label:15] = rs
Table A.7: DEVIL's load/store instructions (second part)
Bibliography
[1] Chart Watch: Mobile Processors. Microprocessor Report, March 29, 1999.
[2] Chart Watch: Workstation Processors. Microprocessor Report, January 25, 1999.
[3] Leadership in DSP Technology for Communication Applications. http://starcore-dsp.com,
1999.
[4] Motoroloa and Lucent Unveil First O�ering From Star*Core Joint DSP Design Team.
press release, April 1999. http://starcore-dsp.com.
[5] S. B. Akers. Binary Decision Diagrams. IEEE Transaction on Computers, C-27(8):509�516,
June 1978.
[6] J. R. Allen, K. Kennedy, C. Porter�eld, and J. Warren. Conversion of control depen-
dence to data dependence. In Proceedings of the 10th ACM Symposium on Principles of
Programming Languages, pages 177�189, January 1983.
[7] D. I. August, D. A. Connors, S. A. Mahlke, J. W. Sias, K. M. Crozier, B. Cheng, P. R.
Eaton, Q. B. Olaniran, and W. W. Hwu. Integrated predication and speculative execution
in the IMPACT EPIC architecture. In Proc. of the 25th International Symposium on
Computer Architecture, June 1998.
[8] D. I. August, W. W. Hwu, and S. A. Mahlke. A framework for balancing control �ow and
predication. In Proceedings of the 30th Annual International Symposium on Microarchitec-
ture, December 1997.
[9] Eduard Ayguade, Cristina Barrado, Antonio Gonzalez, Jesus Labarta, David Lopez, Josep
Llosa, Susana Moreno, David Padua, Fermin J. Reig, and Mateo Valero. Ictineo: A Tool
for Research on ILP. In Supercomputing'96, November 1996.
[10] Rich Belgard. Transmeta Exposed. Microprocessor Report, March 8, 1999.
[11] M. Berry, D. Chen, P. Koss, and D. Kuck. The Perfect Club Benchmarks: E�ective Perfor-
mance Evaluation of Supercomputers. Technical Report 827, Center for Supercomputing
Research and Development, November 1988.
[12] R. Bodik, R. Gupta, and M. L. So�a. Interprocedural conditional branch elimination. In
Proceedings of the ACM SIGPLAN 1997 Conference on Programming Language Design
and Implementation, pages 146�158, June 1997.
[13] R. E. Bryand. Graph-based algorithms for boolean function manipulation. IEEE Trans-
action on Computers, C-35(8):677�691, August 1986.
131
132 Bibliography
[14] R. E. Bryant. Symbolic boolean manipulation with ordered binary decision diagrams. Tech-
nical Report CMU-CS-92-160, Carnegie Mellon University, School of Computer Science,
Carnegie Mellon University, Pittsburgh, PA, October 1992.
[15] T. D. Burd and R. W. Brodersen. Processor Design for Portable Systems. Journal of VLSI
Signal Processing, 13(2/3):203�222, August/September 1996.
[16] Thomas D. Burd and Robert W. Brodersen. Energy e�cient CMOS microprocessor design.
In Proceedings of the 28th Annual HICSS Conference, volume 1, pages 288�297, January
1995.
[17] Brian Case. Philips Hopes to Displace DSPs with VLIW. Microprocessor Report, 8(16),
December 1994.
[18] G.H. Chaitin. Register Allocation and Spilling Via Graph Coloring. In Proc., ACM SIG-
PLAN Symp. on Compiler Construction, pages 98�105, June 1982.
[19] Anantha P. Chandrakasan and Robert W. Brodersen. Low Power Digital CMOS Design.
Kluwer Academic Publisher, 1995.
[20] Pohua P. Chang, Scott A. Mahlke, and Wen mei W. Hwu. Using Pro�le Information to As-
sist Classic Compiler Code Optimizations. Software Practice and Experience, 21(12):1301�
1321, December 1991.
[21] Enric Musoll Cinca. High-Level and Logic Synthesis Techniques for Low Power. PhD
thesis, Universitat Politènica de Catalunya, July 1996.
[22] V. Kathail et al. HPL PlayDoh architecture speci�cation: Version 1.0. Technical Report
HPL-93-80, Hewlett-Packard Laboratories, Palo Alto, CA, February 1994.
[23] Joseph A. Fisher. Trace scheduling: A technique for global microcode compaction. IEEE
Transansaction on Computers, c-30:478�490, July 1981.
[24] Ricardo Gonzalez and Mark Horowitz. Energy Dissipation In General Purpose Micropro-
cessors. IEEE Journal of Solid-State Circuits, 31(9):1277�1283, September 1996.
[25] Linley Gwennap. Intel, HP Make EPIC Disclosure. Microprocessor Report, 11(14), October
1997.
[26] Linley Gwennap. ARM10 Points to Set-Tops, Handhelds. Microprocessor Report, Novem-
ber 16, 1998.
[27] Linley Gwennap. Intel Discloses New IA-64 Features. Microprocessor Report, March 8,
1999.
[28] Tom R. Halfhill. Fujitsu FR-V Architecture Bets On VLIW. Microprocessor Report, 13(10),
August 1999.
[29] John L. Hennessy and David A. Patterson. Computer Architecture: a quantitative approach.
Morgan Kaufmann, 1996.
[30] Hitachi. The SH7750 Reference Manual.
[31] P. Y. Hsu and E. S. Davidson. Highly concurrent scalar processing. In Proceedings of the
13th International Symposium on Computer Architecture, pages 386�395, June 1986.
Bibliography 133
[32] W. W. Hwu and Y. N. Patt. HPSm, a high performance restricted data �ow architec-
ture having minimal functionality. In Proceedings of the 13th International Symposium on
Computer Architecture, pages 297�306, June 1986.
[33] Mike Johnson. Superscalar Miprocessor Design. Prentice-Hall, 1991. ISBN 0-13-875634-1.
[34] D. J. Kuck. The Structure of Computers and Computations. John Wiley and Sons, New
York, NY, 1978.
[35] C. Lee, M. Potkonjak, and W. Mangione-Smith. MediaBench: A tool for evaluating and
synthesizing multimedia and communications systems. In Proceedings of the 30th Annual
International Symposium on Microarchitecture, pages 330�335, December 1997.
[36] J. Llosa, M. Valero, and Ayguadé. Quantitative Evaluation of Register Pressure on Software
Pipelined Loops. International Journal of Parallel Programming, 26(2):121�142, 1998.
[37] J. Llosa, M. Valero, and E. Ayguadé. Heuristics for Register-Constrained Software Pipelin-
ing. In Proc. of the 29th Ann. Int. Symp. on Microarchitecture (MICRO-29), pages 250�261,
December 1996.
[38] Josep Llosa. Reducing the Impact of Register Pressure on Software Pipelined Loops. PhD
thesis, Universitat Politenica de Catalunya, January 1996.
[39] Josep Llosa, Antonio Gonzalez, Eduard Ayguade, and Mateo Valero. Swing Modulo
Scheduling: A Lifetime-Sensitive Approach. In Parallel Architectures and Compilation
Techniques (PACT'96), pages 80�86, October 1996.
[40] P. G. Lowney, S. M. Freudenberger, T. J. Karzes, W. D. Lichtenstein, R. P. Nix, J. S.
O'Donnell, and J. C. Ruttenberg. The Multi�ow Trace Scheduling Compiler. The Journal
of Supercomputing, 7(1):51�142, January 1993.
[41] S. A. Mahlke, W. Y. Chen, R. A. Bringmann, R. E. Hank, W. W. Hwu, B. R. Rau,
and M. S. Schlansker. Sentinel Scheduling: A Model for Compiler-Controlled Speculative
Execution. ACM Transactions on Computer Systems, 11(4), November 1993.
[42] S. A. Mahlke, W. Y. Chen, P. P. Chang, and W. W. Hwu. Scalar program performance
on multiple-instruction-issue processors with a limited number of registers. In Proceedings
of the 25th Annual Hawaii International Conference on System Sciences, pages 34�44,
January 1992.
[43] S. A. Mahlke, R. E. Hank, J.E. McCormick, D. I. August, and W. W. Hwu. A comparison
of full and partial predicated execution support for ILP processors. In Proceedings of the
22th International Symposium on Computer Architecture, pages 138�150, June 1995.
[44] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann. E�ective
Compiler Support for Predicated Execution Using the Hyperblock. In Proceedings of the
25th International Symposium on Microarchitecture, pages 45�54, December 1992.
[45] Wen mei W. Hwu, Scott A. Mahlke, William Y. Chen, Pohua P. Chang, Nancy J. Warter,
Roger A. Bringmann, Roland G. Ouellette, Richard E. Hank, Tokuzo Kiyohara, Grant E.
Haab, John G. Holm, and Daniel M. Lavery. The Superblock: An E�ective Technique
for VLIW and Superscalar Compilation. The Journal of Supercomputing, pages 229�248,
1993. Kluwer Academic Publishers.
134 Bibliography
[46] Motorola Inc. MMC2001 Reference Manual, 1998.
[47] F. Mueller and D. B. Whalley. Avoiding conditional branches by code replication. In
Proceedings of the ACM SIGPLAN '95 Conference on Programming Language Design and
Implementation, pages 55�66, June 1995.
[48] J. C. Park and M. S. Schlansker. On predicated execution. Technical Report HPL-91-58,
Hewlett Packard Laboratories, Palo Alto, CA, May 1991.
[49] Philips. TM1000 product pro�le. http://www.semiconductors.com/trimedia/products/.
[50] Christian Piguet, Jean-Marc Masgonty, Claude Arm, Serge Durand, Thierry Schneider,
F lavio Rampogna, Ciro Scarnera, Christian Iseli, Jean-Paul Bardyn, R. Pache, and Evert
Dijkstra. Low-Power Design of 8-b Embedded CoolRISC Microcontroller Cores. IEEE
Journal Of Solid-State Circuits, 32(7):1067�1078, July 1997.
[51] D. N. Pnevmatikatos and G. S. Sohi. Guarded execution and branch prediction in dy-
namic ILP processors. In Proceedings of the 21st International Symposium on Computer
Architecture, pages 120�129, April 1994.
[52] Jean-Michel Puiatti, Christian Piguet, Eduardo Sanchez, and Josep Llosa. Low-power
VLIW Processors: A High-level Evaluation. In 8th International Workshop on Power and
Timing Modeling, Optimization and Si mulation (PATMOS'98), October 1998.
[53] B. R. Rau and J. A. Fisher, editors. Instruction-Level Parallelism, volume 7. Kluwer
Academic Publishers, 1993. A Special Issue of The Journal of Supercomputing.
[54] B. R. Rau, D. W. L. Yen, W. Yen, and R. A. Towle. The Cydra 5 departmental super-
computer. IEEE Computer, 22(1):12�35, January 1989.
[55] M. Schlansker and V. Kathail. Acceleration of �rst and higher order recurrences on pro-
cessors with instruction level parallelism. In Proceedings of Languages and Compilers for
Parallel Computing, 6th International Workskop, August 1993.
[56] M. Schlansker and V. Kathail. Critical path reduction for scalar programs. In Proceedings
of the 28th International Symposium on Microarchitecture, pages 57�69, December 1995.
[57] J. Scott and B. Moyer L. Hwang Lee, J. Arends. Designing the low-power M-CORE
architecture. In Power Driven Microarchitecture Workshop, pages 102�106, June 1998.
http://www.cs.colorado.edu/�grunwald/LowPowerWorkshop/agenda.html.
[58] Dezsö Sima, Terence Fountain, and Péter Kacsuk. Advanced Computer Architectures: A
Design Space Approach. Addison Wesley Longman, 1997. ISBN 0-201-42291-3.
[59] J. E. Smith. A study of branch prediction strategies. In Proceedings of the 8th International
Symposium on Computer Architecture, pages 135�148, May 1981.
[60] James E. Smith and Andrew R. Pleszkun. Implementation of Precise Interrupts in
Pipelined Processors. In Proc. 12th Annual Symposium on Computer Architecture, pages
36�44, June 1985.
[61] Peter Song. M-Core for the Portable Millennium. Microprocessor Report, February 16,
1998.
Bibliography 135
[62] Texas Instrument. The TMS320C6201 Reference Manual.
http://www.ti.com/sc/docs/products/dsp/tms320c6201.html.
[63] James L. Turley. Thumb Squeezes ARM Code Size. Microprocessor Report, 9(4), March
1995.
[64] Jim Turley. Hitachi sh-3 hits 100 mips. Microprocessor Report, 9(3), March 1995.
[65] Jim Turley. ARM Grabs Embedded Speed Lead. Microprocessor Report, 10(2), February
1996.
[66] Jim Turley. ARM Tunes Piccolo for DSP Performance. Microprocessor Report, 10(15),
November 1996.
[67] Jim Turley. Hitachi SH-4 Gets Graphically Superscalar. Microprocessor Report, 10(14),
October 28, 1996.
[68] Jim Turley. LSI's TiniyRisc Core Shrinks Code Size. Microprocessor Report, 10(14),
October 28, 1996.
[69] Jim Turley. ARM9 Doubles ARM Performance in 98. Microprocessor Report, December
8, 1997.
[70] Jim Turley. M-Core Shrink Code, Power Budgets. Microprocessor Report, October 27,
1997.
[71] Jim Turley. Selecting a High-Performance Embedded Microprocessor. MicroDesign
Ressources, second edition, 1997. ISBN 1-885330.
[72] Jim Turley. M-Core M300 Gains Poise, Performance. Microprocessor Report, December 7,
1998.
[73] Jim Turley. MMC2001 launches M-Core odyssey. Microprocessor Report, March 30, 1998.
[74] Jim Turley and Harri Hakkarainen. TI's new 'C6x DSP screams at 1600 MIPS. Micropro-
cessor Report, 11(2), February 1997.
[75] J. F. Wakerly. Digital Design: Principles and Practices. Prentice Hall, Englewood Cli�s,
NJ, 1994.
[76] N. J. Warter, S. A. Mahlke, W. W. Hwu, and B. R. Rau. Reverse if-conversion. In
Proceedings of the ACM SIGPLAN 1993 Conference on Programming Language Design
and Implementation, pages 290�299, June 1993.
[77] W. Wolf. Modern VLSI Design: Systems on Silicon. Prentice Hall, New Jersey, 2nd edition,
1998.
[78] Ole Wolfe and Je� Bier. StarCore Launches First Architecture. Microprocessor Report,
October 26, 1998.
[79] M. Yang, G.-R. Uh, and D. B. Whalley. Improving performance by branch reordering. In
Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and
Implementation, June 1998.
[80] Gary K. Yeap. Practical Low Power VLSI Design. Kluwer Academic Publishers, 1998.
136 Bibliography
[81] T. Y. Yeh and Y. N. Patt. Two-level adaptive training branch prediction. In Proceedings
of the 24th Annual International Symposium on Microarchitecture, pages 51�61, November
1991.
Jean-Michel Puiatti
Personal Data
Born July 29, 1971 in Geneva, Switzerland. Single.
Citizenships: Swiss, Spanish, Italian.
Work Logic Systems Laboratory, Swiss Federal Institute of Technology,
IN-Ecublens, CH-1015 Lausanne, Switzerland.
Phone: +41-21-693-6630, Fax: +41-21-693-3705,
E-mail: Jean-Michel.Puiatti@epfl.ch, http://lslwww.epfl.ch/~puiatti
Home Maisonneuve 12D, CH-1219 Châtelaine, Switzerland.
Phone: +41-22-796-2908
Education
1995-1999 Swiss Federal Institute of Technology, Lausanne, Switzerland.
Ph.D. Candidate, Computer Science.
Thesis title: Instruction-level parallelism for low-power processors.
In collaboration with the Centre Suisse d'Electronique et de
Microtechnique (CSEM SA).
1991-1995 Swiss Federal Institute of Technology, Lausanne, Switzerland.
Diploma in Computer Engineering.
1986-1991 Engineering High School, Geneva, Switzerland.
Graduated with honors in June 1991 in Electrical Engineering.
Work Experience
1995-present Swiss Federal Institute of Technology, Lausanne, Switzerland.
Logic Systems Laboratory, Computer Science Department.
Research and Teaching Assistant in digital system design and computer
architecture.
Apr.98-Sep.98 University of Illinois at Urbana-Champaign, USA.
IMPACT group, Center for Reliable and High-Performance Computing.
Implementation of compiler optimization techniques for parallel predicated
code.
Aug.97-Oct.97 Universitat Politècnica de Catalunya, Barcelona, Spain.
Computer Architecture Department (DAC).
Implementation of performance and energy consumption estimators for the
parallel execution of loops in a VLIW CoolRISC architecture.
Mar.96-Jul.96 Centro Nacional de Microelectrónica, Barcelona, Spain.
Study of a distributed autonomous sensor system.
Grants and proposals
1999 Benefits of EPIC architecture for multimedia applications.
Contributed to securing a three-year project funded by Hewlett Packard.
1996-1999 Instruction-level parallelism for low-power processors.
Obtained a three-year grant from the Centre Suisse d'Electronique et de
Microtechnique for research on high-performance low-power processors.
Languages
French - native
Spanish - excellent
English - good
Italian - good
German - basic
Hobbies
Soccer, squash, rock climbing, photography, classical and electric guitar.
Publications
D. A. Connors, J.-M. Puiatti, D. I. August, K. M. Crozier, and W. W.
Hwu. An Architecture Framework for Introducing Predicated Execution
into Embedded Processors, to appear in Proceedings of Euro-Par, Septem-
ber 1999.
D. I. August, J. W. Sias, J.-M. Puiatti, S. A. Mahlke, D. A. Connors, K.
M. Crozier, and W. W. Hwu. The Program Decision Logic Approach to
Predicated Execution, 26th International Symposium on Computer Archi-
tecture, May 1999.
G. Ritter, J.-M. Puiatti, E. Sanchez. Leonardo and discipulus simplex: An
Autonomous, Evolvable Six-Legged Walking Robot, Reconfigurable Architectures
Workshop (RAW'99), 13th International Parallel Processing Symposium
& 10th Symposium on Parallel and Distributed Processing, San Juan
(Puerto Rico), April 1999.
J.-M. Puiatti, C. Piguet, E. Sanchez, J. Llosa. Low-Power VLIW Proces-
sors: A High-Level Evaluation, 8th International Workshop on Power and
Timing Modeling, Optimization and Simulation (PATMOS'98), Lyngby
(Copenhagen-Denmark), October 1998.
J.-M. Puiatti, E. Sanchez, C. Piguet, J. Llosa, VLIW Architectures for Low-
Power Processors: A First Evaluation, 24th European Solid-State Circuits
Conference (ESSCIRC'98), The Hague (Netherlands), September 1998.
E. Mosanya, J.-M. Puiatti, E. Sanchez. Hardware Implementation of
Generalized Profile Search on the GENSTORM Machine, IEEE Symposium on
FPGAs for Custom Computing Machines (FCCM'98), Napa Valley (CA,
USA), April 1998.