
THE PERCEPTION PROCESSOR

by

Binu K. Mathew

A dissertation submitted to the faculty of The University of Utah

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in

Computer Science

School of Computing

The University of Utah

August 2004


Copyright © Binu K. Mathew 2004

All Rights Reserved


THE UNIVERSITY OF UTAH GRADUATE SCHOOL

SUPERVISORY COMMITTEE APPROVAL

of a dissertation submitted by

Binu K. Mathew

This dissertation has been read by each member of the following supervisory committee and by majority vote has been found to be satisfactory.

Chair: Al Davis

John B. Carter

Ganesh Gopalakrishnan

Erik Brunvand

William C. Athas


THE UNIVERSITY OF UTAH GRADUATE SCHOOL

FINAL READING APPROVAL

To the Graduate Council of the University of Utah:

I have read the dissertation of Binu K. Mathew in its final form and have found that (1) its format, citations, and bibliographic style are consistent and acceptable; (2) its illustrative materials including figures, tables, and charts are in place; and (3) the final manuscript is satisfactory to the Supervisory Committee and is ready for submission to The Graduate School.

Date    Al Davis, Chair, Supervisory Committee

Approved for the Major Department

Christopher R. Johnson, Chair/Director

Approved for the Graduate Council

David S. Chapman, Dean of The Graduate School


ABSTRACT

Recognizing speech, gestures, and visual features is an important interface capability for future embedded mobile systems. Unfortunately, the real-time performance requirements of complex perception applications cannot be met by current embedded processors and often even exceed the capability of high performance microprocessors. The energy budget of current high performance processors is infeasible in the embedded space. The normal approach is to resort to a custom ASIC to meet performance and energy constraints. However, ASICs incur expensive and lengthy design cycles. They are so specialized that they are unable to support multiple applications or even evolutionary improvements in a single application. This dissertation introduces a VLIW perception processor that uses a combination of clustered function units, compiler controlled dataflow and compiler controlled clock gating in conjunction with hardware support for modulo scheduling, address generation units and a scratch-pad memory system to achieve very high performance for perceptual algorithms at low energy consumption. The architecture is evaluated using benchmark algorithms taken from complex speech and visual feature recognition, security, and signal processing domains. Since energy and delay are common design trade-offs, the energy-delay product of a CMOS implementation of the perception processor is compared against ASICs and general purpose processors. Using a combination of Spice simulations, real processor power measurements and architecture simulation, it is shown that the perception processor running at a 1 GHz clock frequency outperforms a 2.4 GHz Pentium 4 by a factor of 1.75. While delivering this performance, it simultaneously achieves a 159 times better energy delay product than a low power Intel XScale embedded processor.

The perception processor makes sophisticated real-time perception applications possible within an energy budget that is commensurate with the embedded space, a task that is impossible with current embedded processors.


This dissertation is dedicated to

A, T, C, G, 1 and 0, the building blocks of intelligence.

And to the pioneers uncovering the foundations of intelligence.


CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS
ACKNOWLEDGMENTS

CHAPTERS

1. INTRODUCTION
   1.1 The Problem
   1.2 The Solution
   1.3 Road Map

2. RELATED WORK
   2.1 Optimization and Characterization of Perception Applications
   2.2 Image and Neural Processors
   2.3 High ILP Processors for Perception
   2.4 Custom Hardware for Perception
   2.5 Balancing Performance and Power Consumption
   2.6 Distinguishing Features

3. PRINCIPLES BEHIND DYNAMIC POWER REDUCTION
   3.1 Dynamic Power Consumption
   3.2 Power Reduction Strategies
   3.3 Process Normalization
       3.3.1 Constant Field Scaling
       3.3.2 Voltage Scaling
       3.3.3 Frequency Scaling
   3.4 The ETⁿ Metric
   3.5 Energy Delay Squared Product

4. SPEECH RECOGNITION
   4.1 Front End
   4.2 Acoustic Model
   4.3 Language Model
   4.4 Overall Operation
   4.5 Architectural Implications

5. CHARACTERIZATION AND OPTIMIZATION OF SPHINX 3
   5.1 Memory System Behavior
   5.2 ILP in Sphinx
   5.3 Results of Software Optimizations
       5.3.1 Cache Optimizations
       5.3.2 Parallelization
   5.4 The HMM Phase

6. A CUSTOM GAUSSIAN ACCELERATOR
   6.1 Top Level Organization
   6.2 Coprocessor Datapath
   6.3 Implementation
   6.4 Applications
   6.5 Accelerator Evaluation
       6.5.1 Energy Savings
       6.5.2 Scalability
       6.5.3 Bandwidth Savings

7. VISUAL FEATURE RECOGNITION ALGORITHMS
   7.1 Flesh Toning
   7.2 Segmentation
   7.3 Rowley Face Detector
   7.4 Viola and Jones’ Detector
   7.5 Eigen Faces
   7.6 Architectural Implications

8. CHARACTERIZATION OF VISUAL FEATURE RECOGNITION
   8.1 Application Characteristics
   8.2 Optimization Opportunities

9. PERCEPTION PROCESSOR ARCHITECTURE
   9.1 Pipeline Structure
   9.2 Instruction Format
   9.3 Function Units
   9.4 Compiler Controlled Dataflow
   9.5 Interconnect
   9.6 Memory System Architecture
       9.6.1 Loop Unit
       9.6.2 Stream Address Generators
       9.6.3 Array Variable Renaming
       9.6.4 Addressing Modes
   9.7 Compiler Controlled Clock Gating
   9.8 Design Flow
   9.9 Programming Example

10. EVALUATION
    10.1 Benchmarks
    10.2 Metrics
    10.3 Experimental Method
    10.4 Results
        10.4.1 Instruction Level Parallelism
        10.4.2 Power Consumption
        10.4.3 Throughput
        10.4.4 Energy Consumption
        10.4.5 Energy Delay Product
        10.4.6 Energy Delay Squared Product
        10.4.7 Clock Gating
        10.4.8 The Cost of Generality
    10.5 Summary

11. CONCLUSIONS

12. FUTURE RESEARCH

REFERENCES


LIST OF FIGURES

1.1 Perception Performance
1.2 High Level Architecture
4.1 Signal Processing Front End
4.2 Triphone HMM
5.1 L1 Dcache Miss Rate
5.2 L2 Cache Miss Rate
5.3 L2 to Memory Bandwidth
5.4 GAU and GAU OPT IPC
5.5 HMM IPC
5.6 Measured Speedup on R12K
5.7 Cache Optimized Gaussian Algorithm
6.1 Top Level Organization of Gaussian Estimator
6.2 Gaussian Coprocessor
6.3 Channel Scaling
7.1 Algorithmic Stages of a Face Recognizer
8.1 Execution Time Break Down of Viola/Jones Detector Based Face Recognizer
8.2 Execution Time Break Down of Rowley Detector Based Face Recognizer
8.3 L1 Dcache Miss Rate
8.4 L2 Cache Hit Rate
8.5 IPC
8.6 Speedup or Slow Down Over Real Time
9.1 Perception Processor Organization
9.2 Pipeline Structure
9.3 Microinstruction Format
9.4 Function Unit Architecture
9.5 Interconnect Architecture
9.6 Loop Unit
9.7 Stream Address Generator
9.8 Loop Acceleration Example
9.9 Matrix Multiply Algorithm
9.10 Inner Product Accelerator
9.11 Assembly Code for Interleaved Inner Product
10.1 IPC
10.2 Power Consumption
10.3 Throughput Normalized to Pentium 4 Throughput
10.4 Process Normalized Energy Consumption
10.5 Process Normalized Energy Delay Product
10.6 Process Normalized Energy Delay Squared Product (ET²)
10.7 Impact of Clock Gating
10.8 Energy Consumption of PP+
10.9 Energy Delay Product of PP+
12.1 Generic Stream Function
12.2 Stream Processor


LIST OF TABLES

5.1 Experiment Parameters

8.1 Experiment Parameters


LIST OF ACRONYMS

ALU arithmetic logic unit

ANN artificial neural network

ASIC application-specific integrated circuit

CAD computer-aided design

CMOS complementary metal-oxide semiconductor

CMU Carnegie Mellon University

CPU central processing unit

DMA direct memory access

DRAM dynamic random-access memory

DSP digital signal processor

DTLB data TLB

Dcache data cache

EDP energy delay product

FFT fast Fourier transform

FIR finite-length impulse response

FPGA field-programmable gate array

FPU floating point unit

FU function unit

GCC GNU compiler collection

GNU GNU’s not Unix

HDL hardware description language

HMM hidden Markov model

HSV hue saturation value


Hub, Hub-4 a speech recognition benchmark created by the National Institute of Standards and Technology; also a speech model developed for this benchmark by CMU

IEEE Institute of Electrical and Electronics Engineers

ILP instruction level parallelism

IPC instructions per cycle

ISA instruction set architecture

ITLB instruction TLB

Icache instruction cache

L1 level 1

L2 level 2

MIPS million instructions per second; also, the company MIPS Inc.

MLP multilayer perceptron

MTCMOS multithreshold CMOS

NCC normalized color coordinates

NOP null/no operation

RAM random-access memory

RGB red green blue

RISC reduced instruction set computer

ROM read-only memory

SGI a company formerly known as Silicon Graphics Inc.

SIMD single instruction multiple data

SRAM static random-access memory

TLB translation look-aside buffer

TSMC Taiwan Semiconductor Manufacturing Company

VLIW very long instruction word



ACKNOWLEDGMENTS

Over the years I have discussed numerous half-baked ideas with my advisor, Prof. Al Davis, who has always been encouraging of new concepts. This dissertation is a testament to Al's willingness to let me explore territory that was totally outside the realm of his previous research. In spite of my naive protests he gently, but unrelentingly, insisted for several years until I finally grasped that reducing power consumption was a more important problem than high performance. I am very grateful for Al's flexible but insistent advising style. Mike Parker has contributed greatly to my research through numerous hours of consultation and help with power measurements. Ali Ibrahim deserves thanks for his help with a prototype that eventually led to the perception processor. Profs. John Carter and Sally McKee extensively schooled me in the technique of writing papers during my first two years of graduate school. They deserve all the credit for improving my writing skill. Any deficiencies are due to my stubborn adherence to my own style. Finally, my family and my friends, particularly BJP, Shilu, Asha Amal and Yea, teasingly convinced me over the last several years that a Ph.D. meant a lot more than anyone really believed. Thank you!


CHAPTER 1

INTRODUCTION

The term Perception Processing encompasses processor support for technologies that can enable computers to perceive the world the way we humans do with our sensory faculties. It targets areas like object detection, recognition and tracking, speech and gesture recognition, and multimodal abilities like lip reading to support speech recognition.

The applications for perception processing are both immense and diverse. More and more computing devices are being invisibly embedded into our living environment, and we notice their existence only when they cease to serve us. For this fledgling computing fabric to develop into tomorrow's ubiquitous computing environment, the primary means of interacting with it should be human-friendly ones like speech and gesture. Future mobile embedded environments need to support sophisticated applications such as speech recognition, visual feature recognition, secure wireless networking, and general media processing. Work environments from the boardroom to the garage will eventually feature human-friendly and hands-free interfaces to the computers embedded into those environments. Perception prosthetics are an important application too. Devices that listen to speech and then project text on a heads-up display worn by a deaf person, or an intelligent camera that gives audio cues like "Vehicle approaching" or "Stairs 10 feet ahead" to a blind wearer, are of particular interest. Another important application area is robotics: the opportunity to outfit both manned and autonomous vehicles, industrial and household robots and even machine tools with tireless vision presents boundless opportunity. Other areas that could benefit from perception processing include automated surveillance, translation of speech and a variety of assistive technologies.


1.1 The Problem

By their very nature, perception applications are likely to be most useful in mobile embedded systems. A fundamental problem that plagues these applications is that they require significantly more performance than current embedded processors can deliver. Most embedded and low-power processors, such as the Intel XScale, do not have the hardware resources and performance necessary to support a full featured speech recognizer. Even modern high performance microprocessors are barely able to keep up with the real-time requirements of sophisticated perception applications. The energy consumption that accompanies the required performance level is often orders of magnitude beyond typical embedded power budgets. This dissertation attempts to develop a specialized processor architecture that can provide high performance for perception applications in an energy-efficient manner.

Figure 1.1 shows actual measured performance of two perception applications: CMU Sphinx 3, a speech recognition system, and FaceRec, a face recognition application. The applications were run on Intel Pentium III and later processors with clock speeds varying from 900 MHz to 3 GHz. Details of these applications are presented in Chapters 4 and 7. The horizontal lines show the performance level required to achieve real-time targets. For the speech recognizer, this involves recognizing a 29.2 second long speech recording in the same interval of time. The workload for the face recognizer consists of processing 25 image frames in 5 seconds, corresponding to the real-time target of handling 5 frames of 320 × 200 pixel images every second.

Figure 1.1. Perception Performance (run time in seconds versus CPU frequency in MHz; measured points, real-time targets, and theoretical ideal-scaling curves for the speech and face recognizers)

Each of the smooth curves in the figure corresponds to the hyperbola obtained by assuming ideal scaling of performance with frequency. They are derived by starting with the data point corresponding to 900 MHz and assuming that run time varies inversely with frequency. It is evident that for speech recognition, the performance of the processor does not scale ideally. In theory, a 2.4 GHz processor should achieve real-time performance. In practice, a processor frequency of approximately 2.9 GHz is required to satisfy real-time requirements. This performance gap suggests that when moving to more complex future speech recognition workloads, higher frequencies alone are not the solution. Fundamental architectural improvements are called for. The face recognizer demands a higher level of performance than is currently available. Its real-time requirements demand a 4.8 GHz or faster processor. The complexity of both workloads is likely to increase significantly in the future. The results clearly show that perception applications stress the performance limits of high end processors, and low power embedded processors may never have the compute power required for perception applications.
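The ideal-scaling argument reduces to a two-line calculation: given a run time at 900 MHz, the frequency needed for real time follows directly if run time is assumed to vary inversely with frequency. The Python sketch below illustrates that arithmetic; the 900 MHz run times it uses are back-calculated from the 2.4 GHz and 4.8 GHz figures quoted above rather than taken from the measured data points, so they are illustrative placeholders only.

```python
# Back-of-the-envelope version of the ideal-scaling reasoning behind Figure 1.1.
# The 900 MHz run times are back-calculated from the frequencies quoted in the
# text (2.4 GHz for speech, 4.8 GHz for face), not read from the measured data.

def ideal_runtime(t900_s, freq_mhz):
    """Run time predicted if performance scaled perfectly with clock frequency."""
    return t900_s * 900.0 / freq_mhz

def freq_needed(t900_s, budget_s):
    """Clock frequency (MHz) required to meet a real-time budget under ideal scaling."""
    return 900.0 * t900_s / budget_s

speech_budget = 29.2      # recognize a 29.2 s recording in 29.2 s of wall-clock time
speech_t900 = 77.9        # implied (placeholder) run time at 900 MHz
print(freq_needed(speech_t900, speech_budget))   # ~2400 MHz; ~2.9 GHz was needed in practice

face_budget = 5.0         # 25 frames of 320 x 200 pixels in 5 s (5 frames per second)
face_t900 = 26.7          # implied (placeholder) run time at 900 MHz
print(freq_needed(face_t900, face_budget))       # ~4800 MHz
```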

Given Moore’s law performance scaling, the performance issue is not by itself a critical problem. However, two significant problems remain. First, the energy expended in high performance processors is intractable in the embedded space. Furthermore, the power requirements of new processors are increasing. The conclusion is that technology scaling alone cannot solve the problems faced by perception applications. Second, perception and security interfaces are by nature always operational. This limits the processor's availability for other compute tasks such as understanding what was perceived.


The usual solution to reducing power consumption while increasing performance is to use an Application Specific Integrated Circuit (ASIC). Given the complexity and the always on nature of perception tasks, a more relevant approach would be to use the ASIC as a coprocessor in conjunction with a low power host processor. As a part of this research, an ASIC coprocessor for one of the dominant phases of the CMU Sphinx speech recognition system was investigated. Details may be found in Chapter 6. This effort led to the usual realization that ASICs are costly and inflexible. Their high fabrication cost coupled with the costs associated with a lengthy design cycle are difficult to amortize. The inherent level of specialization in an ASIC makes it extremely difficult to support multiple applications, new methods, or even evolutionary algorithmic improvements. Given that embedded applications evolve rapidly and that embedded systems are extremely cost sensitive, these problems provide significant motivation to explore a more general purpose approach. The use of reconfigurable logic and FPGA devices is another common approach [31]. The inherent reconfigurability of FPGAs provides a level of specialization while retaining significant generality. However the reconfiguration time is relatively long, and FPGAs have a significant disadvantage both in performance and power when compared to either ASIC or CPU logic functions.

1.2 The Solution

This dissertation addresses the design of programmable processors that can handle sophisticated perception workloads in real time at power budgets suitable for embedded devices. Programmable processors optimized for the perception domain are intended to be used as coprocessors for general purpose host processors. A high level view of the architecture is shown in Figure 1.2.

Figure 1.2. High Level Architecture

A number of function units are organized as a cluster and embedded in a rich interconnection network that provides connection between function units in the cluster and four memories. The host processor moves data into or out of the coprocessor via double buffered input and output SRAMs. Local storage for the cluster is provided by the scratch SRAM, and the microcode program that controls the operation of the cluster is held in the u-Code SRAM. The execution cluster can be customized for a particular application by the selection of function units. In fact, the type and number of function units, SRAMs, address generators, bit widths and interconnect topology are specified using a configuration file. The hardware design (Verilog netlist) and a customized simulator are automatically generated by a cluster generator. Henceforth the term perception processor refers to the generic architecture behind any domain-specific processor created using the cluster generator tool.
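The configuration file format itself is not reproduced in this chapter, so the fragment below is only a hypothetical illustration, written as a Python data structure, of the kinds of parameters such a cluster description has to name; the field names are invented here and are not the cluster generator's actual syntax.

```python
# Hypothetical cluster configuration -- illustrative only; the real
# configuration-file syntax used by the cluster generator is not shown here.
cluster_config = {
    "function_units": [
        {"type": "integer_alu", "count": 2, "width_bits": 32},
        {"type": "multiplier",  "count": 1, "width_bits": 32},
        {"type": "fp_adder",    "count": 1, "width_bits": 32},
    ],
    "memories": {
        "input_sram":   {"size_kb": 4, "double_buffered": True},
        "output_sram":  {"size_kb": 4, "double_buffered": True},
        "scratch_sram": {"size_kb": 8},
        "ucode_sram":   {"size_kb": 2},
    },
    "address_generators": 2,
    "interconnect": "full_crossbar",   # topology is also a generator parameter
}
# From a description like this, a cluster generator would emit a Verilog
# netlist and a matching customized simulator.
print(len(cluster_config["function_units"]), "function unit types configured")
```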

Perception algorithms tend to be stream oriented, i.e., they process a sequence of similar data records where the data records may be packets or blocks of speech signals, video frames or the output of other stream processing routines. Each input packet is processed by a relatively simple and regular algorithm that often refers to some limited local state tables or history to generate an output packet. The packets have fixed or variable but bounded sizes. The algorithms are typically loop oriented with dominant components being nested for loops with flow-dependent bodies. Flow dependence implies that loop-carried dependences have constant distances in the iteration space of the nested loop structure. Processors that are optimized for this style of computation are called stream processors. While there are subtle differences in the nuances, the notion of streams and algorithm kernels described here is essentially the same as that developed by Dally et al. for the Imagine Stream Processor [82]. The perception processor developed in this research is a specialized stream processor optimized for speech recognition and vision. However, attempts will be made to show its generality to other stream oriented algorithms in Chapters 10 and 12.
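As a deliberately simple illustration of this loop structure (not one of the dissertation's benchmark kernels), the sketch below shows an FIR-style kernel: a bounded input packet processed by a regular nested loop whose only loop-carried dependence, the running accumulation, has a constant distance in the iteration space.

```python
# A stream kernel in the style described above: each fixed-size input packet
# is processed by a regular nested loop whose loop-carried dependence
# (the running accumulation) has a constant distance of one iteration.
def fir_packet(samples, coeffs):
    taps = len(coeffs)
    out = [0.0] * (len(samples) - taps + 1)
    for i in range(len(out)):          # outer loop over output samples
        acc = 0.0
        for j in range(taps):          # inner loop: constant bounds, flow dependent on acc
            acc += coeffs[j] * samples[i + j]
        out[i] = acc
    return out

packet = [0.1 * n for n in range(64)]              # stand-in for one frame of a stream
filtered = fir_packet(packet, [0.25, 0.5, 0.25])   # produces one output packet per input packet
print(len(filtered), filtered[:3])
```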

Fine-grained control of physical resources is provided by a horizontal microcode program. The architecture and the fine-grained control mechanism support data flows that resemble the custom computational pipelines found in ASICs. Software based control provides a significant level of generality. Any algorithm can be mapped onto the cluster, albeit with varying levels of efficiency. The result is a cluster that can be tailored to a particular domain and can support multiple applications or application phases. The approach includes a specialized microcode compiler that maps applications onto the perception processor. Currently, the input to the compiler is a tiny specialized language implemented on top of the Python scripting language. It supports constructs for various types of for loops, array access patterns, opcode mnemonics, loop unrolling and processor reconfiguration requests. Compilers for more general languages like C or C++ are definitely possible, but have not been implemented. The compiler uses hardware support for modulo-scheduled loops in conjunction with array address generators to deliver high throughput for flow dependent loops [81]. The microcode provides fine-grained control over data steering, clock gating and function unit utilization, and it permits single cycle reconfiguration of address generators.

cycle reconfiguration of address generators.

Energy efficiency is primarily the result of minimized communication and activity. The compiler uses fine-grained clock gating to ensure that each function unit is active only when required. Compiler-controlled dataflow permits software to explicitly address output and input stage pipeline registers of function units and orchestrate data transfer between them over software-controlled bypass paths. Data values are transported only if necessary, and the compiler takes care to ensure that value changes are visible on heavily loaded wires and forwarding paths only if a unit connected to that path needs the data value. By explicitly enabling pipeline registers the compiler is able to control the lifetime of function unit outputs and directly route data to other function units, avoiding unnecessary access to a register file. The resulting dataflows or active datapaths resemble custom computational pipelines found in ASICs, but have the advantage of flexibility offered by software control. This may be thought of as a means of exploiting the natural register renaming that occurs when a multistage pipeline shifts and each individual pipeline register gets a new value. However the active datapath in the cluster will utilize multiplexer circuits that provide generality at the cost of power, area and performance. These muxes and the associated penalties will not be present in a custom ASIC design.
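To make the interplay of compiler-controlled dataflow and clock gating concrete, the toy model below, which is illustrative only and not the dissertation's microarchitecture or simulator, steps two function units through a per-cycle microcode schedule: a unit's output pipeline register latches only in cycles where its enable bit is set, and operands are steered directly from other units' output registers rather than from a register file.

```python
# Toy model of compiler-controlled dataflow and clock gating.  Each microcode
# word says, per function unit, whether its output register latches this cycle
# and where its operands are steered from; there is no central register file.
class FunctionUnit:
    def __init__(self, name, fn):
        self.name, self.fn = name, fn
        self.out = 0                      # output-stage pipeline register

mul = FunctionUnit("mul", lambda a, b: a * b)
add = FunctionUnit("add", lambda a, b: a + b)
units = {"mul": mul, "add": add}

def read(src):
    # An operand is a literal or the name of another unit's output register.
    return units[src].out if isinstance(src, str) else src

# One horizontal microcode word per cycle; None means the unit's clock is gated
# off and its output register keeps its old value (no switching activity).
microcode = [
    {"mul": (3, 4),       "add": None},          # cycle 0: mul latches 3*4 = 12
    {"mul": None,         "add": ("mul", 10)},   # cycle 1: add reads mul's register directly
    {"mul": None,         "add": None},          # cycle 2: both units gated off
]

for cycle, word in enumerate(microcode):
    staged = {}
    for name, ctrl in word.items():
        if ctrl is not None:              # clock enabled for this unit this cycle
            a, b = ctrl
            staged[name] = units[name].fn(read(a), read(b))
    for name, value in staged.items():    # latch after all reads (edge-triggered behavior)
        units[name].out = value
    print(f"cycle {cycle}: mul.out={mul.out} add.out={add.out}")
```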

The resulting architecture is powerful enough to support complex perception algorithms at energy consumption levels commensurate with mobile device requirements. The approach represents a middle ground between general purpose embedded processors and ASICs. It possesses a level of generality that cannot be achieved by a highly specialized ASIC, while delivering performance and energy efficiency that cannot be matched by general purpose processor architectures.

1.3 Road Map

Chapter 2 will provide a brief introduction to previous research pertaining to the optimization and acceleration of perception applications. Chapter 3 describes the basic principles behind power reduction in CMOS circuits and introduces metrics that will be used later for evaluating the perception processor. This is followed by Chapters 4 and 5, which provide an introduction to the foundations of speech recognition and a performance analysis of the CMU Sphinx 3.2 speech recognition system, respectively. Chapter 6 presents the design and evaluation of an ASIC coprocessor for a dominant phase of Sphinx. Computer vision algorithms used in the FaceRec application mentioned previously are introduced in Chapter 7, and the application itself is characterized in Chapter 8. The architecture of the perception processor is elaborated in Chapter 9, and its performance and energy efficiency are analyzed in Chapter 10. Chapter 11 draws conclusions and highlights important results. Finally, Chapter 12 points out avenues for future research.


CHAPTER 2

RELATED WORK

While there has been a considerable body of research targeted at accelerating artificial neural networks (ANN) in general, very little work has been directed towards the architectural needs of perception processing and low power implementations of perception functions. Related areas of research can be classified broadly into optimization and characterization of perception workloads and various special purpose, parallel and low power coprocessor architectures. The following sections highlight representative work in each of these areas. In addition, general research in processors and reconfigurable logic that is not specifically targeted at perception yet is capable of sustained high performance at low power budgets will also be discussed.

2.1 Optimization and Characterization of Perception Applications

Perception processing, which encompasses a wide range of topics like computer vision, speech recognition and gesture recognition, is currently the focus of vigorous research. While it is common in the literature to see the relative merits and performance of algorithms compared, architecture level analysis of whole perception applications is extremely rare. Traditional research in perception has been geared towards improving accuracy. Performance is a secondary goal, and power efficiency has been largely ignored. For instance, the yearly Hub speech recognition evaluation reports typically emphasize improvements in recognition accuracy and mention improvements in performance as a multiple of "slow down over real time" [30, 92].

Ravishanker's research improved the performance of the Sphinx speech recognition system by trading off accuracy in a computationally intensive phase for faster run time and then recovered the lost accuracy by doing additional processing in a computationally cheaper phase of the application [74]. This research also reduced the memory footprint of speech recognition by using a disk based language model cached in memory by the software.

Agram, Burger and Keckler characterized the Sphinx II speech recognition system in a manner useful for computer architects [6]. They focused on ILP, as well as memory system characteristics such as cache hit rates and block sizes, and concluded that available ILP was low. They compared the characteristics of the Sphinx II system with those of Spec benchmarks and also hinted at the possibilities and problems associated with exploiting thread level parallelism.

Researchers at the Intel ICRC labs published a performance analysis of a speech recognition system for Mandarin Chinese [59]. This study focused on the run time and the size of the working set while executing the Intel speech recognition system on several different versions of the x86 processor. They reported a decrease in ILP with increased clock rate. IPC decreased from between 1 and 1.2 at 500 MHz to approximately 0.4 at 1.5 GHz, a clear indication that increasing clock rate is not the solution to improving speech recognition performance. The decrease in ILP was attributed to memory system behavior, but a detailed explanation was not provided. The ICRC speech system is not publicly available, but the underlying semicontinuous HMM technique is the same as that used by Sphinx. An experiment reported by the Intel researchers claimed to achieve faster than real time recognition: 1.14 times faster than real time on a 1 GHz processor and 1.33 times faster than real time on a 1.5 GHz Pentium 4 processor. The results from Figure 1.1 show that Sphinx (not the CMU version, but a heavily optimized version described in Chapter 5) is 2.5 times and 1.5 times slower than real time on 1 GHz and 1.8 GHz Intel Pentium processors respectively. It is possible that the workload and vocabulary used by the Intel researchers was considerably simpler than the one used with Sphinx. Ravishanker reported that for Sphinx II, the language model search consumed about 40% of the recognition time [74]. For the Intel researchers, the language model search is a very small fraction of the execution time. Details of the ICRC speech model are not available. The huge gap in performance between Sphinx and the numbers published by ICRC is possibly because the ICRC speech model is simpler than the Hub-4 speech model used to evaluate Sphinx.

Rabiner and Huang provide data on historical trends in the compute requirements of continuous speech recognition. They predict that in the post-year-2000 time frame it will require the compute power of 20 to 60 DSP processors, each delivering 1000 MIPS [79]. No published work on the power consumption characteristics of speech recognition is known to exist at this time.

Compared to speech recognition, the algorithms used for perceptual computer vision are far more diverse, and workload characterization results are almost nonexistent. The problem is exacerbated by the fact that research is split into image understanding applications like automatic navigation and nonunderstanding applications like face recognition and detection. A large volume of existing research emphasizes the parallelization and hardware acceleration of early vision primitives like convolution, thresholding, segmentation and connected component labeling [9, 105, 107]. Toolkits like Xvision and the Intel computer vision library provide optimized versions of such vision primitives [43, 52]. While there seems to be a consensus on early vision primitives for image understanding, there seems to be very little agreement and commonality in the higher level aspects of computer vision. Specialized systems for inspection of manufacturing defects, robot and vehicle navigation exist, but seem to be highly domain specific. Representative examples are commercial offerings by companies such as Cognex and Coreco, which provide application specific software for industrial applications such as visual inspection, security monitoring, motion detection, etc. [1, 2]. In contrast, nonunderstanding computer vision applications seem to have more in common with each other, and complete applications are more readily available. These also possess a synergistic nature: face detection and lip tracking can augment speech recognition and improve recognition accuracy [102].

Rowley described an optimization for his neural network based face detector that can process a single 320 × 200 image in 7.2 seconds on a 200 MHz R4400 [83]. He reported that combined with flesh tone detection, it might be possible to reduce this time to two to four seconds. Viola and Jones published a method of detecting faces that can perform at a rate of 15 frames per second on a 700 MHz Pentium [103]. Their rapid rate of detection depends on three fundamental advances. They propose a new image representation called integral image that allows features used by their detector to be computed rapidly. This image representation can be coupled with a learning algorithm, which can select a small number of critical features from a large set and thus reduce computation. They also describe a method to cascade increasingly complex classifiers that prunes away uninteresting background regions so that the algorithm can spend more time on the promising part of an image. Together, these optimizations claim a factor of 15 speedup over the Rowley detector. Connell of the ECVG research group at IBM reported being able to perform face detection at 90 frames per second on a 400 MHz Pentium II by correlating the output of a variety of inaccurate and computationally cheap face detectors [25]. Details of this system are currently not available.

There is a serious dearth of research characterizing the performance of face detectors. This lack of published analysis can be mainly attributed to two different factors. First, there is a wide variety of nonneural net based face detection techniques. The prominent examples are support-vector based methods, naive Bayesian classifiers, template matching and Eigen vector based techniques [110, 75]. Though each of these techniques has its ardent proponents, the field as a whole is too fractured. Anyone undertaking an architecture study is perplexed about which method is important. Second, most neural net face detectors are based on multilayer perceptrons (MLP). Because of their regular structure, it is simple to estimate the number of operations, bandwidth requirements, etc. of an MLP network. While performance is easy to estimate, the degree of numerical precision required, power consumption, die area, etc. are much more difficult to quantify. Face recognition shares the same problem as face detection in that no performance and power analysis studies are known to exist.

2.2 Image and Neural Processors

Neuromorphic system design pioneered by Carver Mead is a method of building electronic circuits inspired by biological systems. For example, Boahen and colleagues at the University of Pennsylvania designed the Visio 1, a chip that models photo-receptors and the four major ganglion cell types found in a retina [15]. This low power chip uses networks of ganglion cells to detect edges and distinguish directions of motion. Harrison and Koch at the California Institute of Technology built a chip that integrated photo-detectors and analog motion detectors to model the first three layers of the visual system of a house fly [45]. They successfully used this chip to steer the direction of an autonomous mobile robot in real time. While there are distinct power and performance advantages to such neuromorphic chips, their analog nature, limited reconfigurability and tight integration with photo-detectors make them unlikely candidates for integration into low power digital computers for perception.

The Xetal processor developed by researchers at Philips Research labs takes the approach of providing a low power programmable linear array of processors designed to accept digital video data [57]. Xetal consists of an array of 320 programmable processing elements that are laid out with communication channels and optimized to process 640 × 480 images at 30 frames per second. This processor is optimized for low power high performance computations like convolution, color conversion, noise reduction, template matching and image compression. No information is currently available on applying Xetal to perception processing.

Fang, a researcher from NASA JPL, describes a low power system on chip design that combines an on-chip camera with a neural net processor and a control microprocessor [34]. This system, developed for real-time vision applications in space exploration, was reported to be capable of functions like edge detection, connected component detection, motion estimation, etc. Actual power and performance results are not available.

The Simpil processor designed at Georgia Tech is a focal plane SIMD architecture for early vision applications like edge detection, image convolution and compression [24]. In Simpil, up to 16 pixels may be sampled by a SIMD node using A/D conversion and processed locally. Arrays of nodes perform localized computations over the entire focal plane. Estimated total power consumption for a 64 × 64 array of SIMD nodes fabricated in a 0.35µ process as four separate chips was 5.1 W while operating at 20 MHz.


2.3 High ILP Processors for Perception

The high performance microprocessor industry has devoted a lot of attention to developing short vector (SIMD) extensions like MMX, SSE, MDMX and VIS that cater to the needs of multimedia applications [26, 37]. An Intel publication described the use of SSE II instructions for Viterbi decoding of hidden Markov models [50]. Significant performance improvement is claimed, but not quantified. The Intel computer vision library provides SIMD optimized versions of commonly used vision algorithms [52]. Though vector machines have long been the workhorse of scientific computing, the relevance of short vector or SIMD optimizations to perception codes had not been appreciated fully until recently. These techniques have been shown to improve performance by up to an order of magnitude on DSP style algorithms and even on small speech processing codes [55]. The trend has in general been to use short vectors to utilize SIMD parallelism and to use the super-scalar scheduling infrastructure already available in modern out of order processors to keep the SIMD units occupied rather than using real vector issue and long vectors [11]. Shifting the task of identifying dependences and scheduling instructions from a vectorizing compiler to dynamic issue logic has the distinct disadvantage of increasing processor complexity as well as power consumption. Vector chaining has been traditionally used as a performance enhancement mechanism [85]. The compiler controlled dataflow approach developed in this dissertation can mimic vector chaining in a more general manner and with low hardware overhead.

There have been numerous attempts to implement digital neural network processors as vector or SIMD machines. CNAPS from Adaptive Systems and the NeuroMatrix DSP from Module Research Center are representative examples [44, 72]. While neural network algorithms have been a mainstay of perception research, the evaluation of such architectures for well defined perception tasks or whole perception applications is rarely found in the literature. A well known example is SPERT, a neural network and signal processing accelerator board for workstations, based on the Torrent 0 vector microprocessor jointly designed by the International Computer Science Institute and UC Berkeley [106]. Evaluation of SPERT focused on training of forward and back-propagation neural networks for tasks like probability estimation for a hidden Markov model based speech recognizer. Both processor speed and the complexity of the recognition task have increased greatly since the time of SPERT.

The performance of Multi-SPERT, a later design consisting of multiple SPERT boards, was measured to be over 530 million connection updates per second for a five node configuration performing neural network training for speech recognition [36]. Moreto analyzed SPERT's performance on a partial implementation of RASTA-PLP, a speech front-end signal processing program [73]. An implementation of RASTA for SPERT had a significant impact in its day. A recent study reported that RASTA-PLP computation took only 6.7% of the run time of a recognition task [6]. Clearly, the performance bottlenecks have shifted with advances in speech recognition technology.

2.4 Custom Hardware for Perception

Pihl at the Norwegian University of Science and Technology designed the PDF coprocessor, a custom coprocessor in a 0.8µ CMOS process to accelerate the computation of Gaussian observation probabilities in a hidden Markov model based speech recognizer [77]. This research concluded that memory bandwidth was a limiting factor for Gaussian computation. Pihl approached the memory bandwidth problem by using a new fixed point representation called the dynamical circular fixed-point format, which reduced the memory bandwidth requirement by half. The PDF coprocessor could evaluate 40,000 39-element Gaussian components in real time using this format at 154 MHz consuming 853 mW of power. The work was based on an early version of Sphinx. In the current Sphinx 3.2 version, the workload has worsened by a factor of 15.3. This number, as well as the bandwidth requirement, is expected to increase further in the future.

An earlier attempt to accelerate speech recognition may be found in the work of Anatharaman and Bisiani [10]. They present a custom architecture as well as a multiprocessor architecture for improving the performance of the beam search algorithm used by the CMU distributed speech recognition system.

Benedetti and Perona describe an FPGA based system that exploits memory locality for real-time low level vision [13]. Their system targeted the fast prototyping of low level vision techniques using observations about locality in pixel neighborhoods to achieve 2.8 GBytes/second bandwidth between SRAM components and FPGA based compute elements.

2.5 Balancing Performance and Power Consumption

Given the rising interest in mobile devices and the widespread use of embedded processors in control and monitoring applications, a large body of existing work has been devoted to achieving high computational performance while also improving power efficiency. The approach taken in this dissertation is to control a clock gated VLIW processor consisting of a cluster of execution units and a special purpose scratch-pad memory system at a very fine granularity using horizontal microcode. All communication within the cluster is scheduled under software control, a technique that will be referred to as compiler controlled dataflow. In addition, the clock signal to each function unit is controlled by the software on a cycle by cycle basis. This is called compiler controlled clock gating. The details appear later in Chapter 9, but this synopsis is useful in considering the relevance of preexisting approaches.

There are many vendors of high performance power efficient embedded processors such as the Philips Trimedia, TI C62xx, and Lucent DSP16000 that can be effectively scheduled to achieve reasonably low power operation [47, 100, 3]. Increasing performance via VLIW instruction scheduling and instruction width reduction techniques is a common theme in modern embedded systems [63, 108, 16, 8]. Efforts have demonstrated the benefit of VLIW architectures for customization and power management [89]. Optimization techniques for clustered VLIW architectures can also be found in the literature [56]. However, these efforts do not address low-level communication issues. Caliber uses an interesting software pipelining strategy that is targeted at reducing memory pressure in VLIW systems. The primary mechanism is to distribute the register file [8, 7]. In contrast, in this dissertation, the output stage pipeline registers of function units and the associated forwarding paths will be managed as if they constituted a small distributed register file.

Tiwari et al. have explored scheduling algorithms for less flexible architectures, which split an application between a general purpose processor and an ASIC [95]. Lee investigated the power benefits of instruction scheduling for DSP processors [61]. Eckstein and Krall focus on minimizing the cost of local variable access to reduce power consumption in DSP processors [33]. Application-specific VLIW clusters have been investigated by many researchers [60, 35]. Customizing a VLIW processor to minimize power and maximize performance by only including the necessary function units and specializing function units via operator fusion has been studied and utilized by the Tensilica Corporation in their Xtensa architecture [40]. The fine grain horizontal microcode approach taken in this dissertation can be viewed as a fine-grained extension of the VLIW concept. However the addition of sophisticated address generators, multiple address contexts per address generator, the removal of the register file, and the fine-grained steering of data are aspects presented in Chapter 9 that are not evident in these other efforts.

The MOVE family of architectures explored the concept of transport triggering, where computation is done by transferring values to the operand registers of a function unit and starting an operation implicitly via a move targeting a trigger register associated with the function unit [48]. As in the MOVE architecture, the concept of compiler directed data transfer between function units is used in this dissertation too, but the resultant architecture is a traditional operation triggered one and transport triggering is not used.

The RAW machine has demonstrated the advantages of low level scheduling of data movement and processing in function units spread over a two-dimensional space [104, 62]. The RAW work is similar to the research presented in this dissertation in many ways. Low-level architectural resources are fully exposed to the compiler. Custom data flows are scheduled by the compiler on resources that are inherently somewhat general purpose. The primary differences arise from the basic design target. The RAW effort is directed at demonstrating that high levels of performance can be achieved on an architecture consisting of many fine-grained tiles. This dissertation is directed at demonstrating that somewhat general purpose structures can be scheduled to achieve power efficiency that competes with special purpose ASIC designs.

The Imagine architecture is organized to exploit high levels of internal bandwidth in

order to achieve high performance levels on stream based data [82]. Scheduling issues


are similar, but the target is performance rather than low power. Given the poor wire

scaling properties of deep submicron CMOS processes, it is somewhat inevitable that

function unit clusters will need to be considered in order to manage communication

delays in high performance wide issue super-scalar processors. Current DSP processors

like the TMS320C6000 already have clustered datapaths and register files [94]. These

approaches however are all focused on providing increased performance. The approach

taken in this dissertation is to improve both power and performance while retaining a

large degree of programmability.

One popular approach to specialization is the use of reconfigurable logic to provide

customization. Techniques vary from power aware mapping of designs onto commer-

cially available FPGA devices to hybrid methods where specialized function blocks are

embedded into a reconfigurable logic array [20, 18, 69, 31]. Of particular relevance

are compiler directed approaches that are similar to the compiler-controlled dataflow

approach used in this research [70]. However, this dissertation targets custom silicon

implementations rather than the higher level FPGA domain. FPGA based approaches

have a significant advantage when the phases of an application persist long enough to

amortize the relatively long reconfiguration times. The generality of the FPGA approach

also leads to excessive energy loss. The approach taken here is commensurate with more

rapid reconfiguration and exhibits significantly better energy efficiency.

A number of researchers have tried to predict the energy consumption of an ap-

plication running on a particular processor [84]. Wattch is a well known example of

high level simulation based power estimation [17]. Such high level approaches have

a number of benefits. They are useful early in the design flow, and the simulations

are several orders of magnitude faster than low level estimation using tools like Spice.

The disadvantage is that Wattch-like systems need to be calibrated to use high level

power models that take into account all the implementation specific details. When the

actual implementation differs from the power model provided to the tool, the power

estimate will be meaningless. Since the perception processor architecture described later

in this dissertation is significantly different from general purpose architectures modeled

by Wattch, the power estimates reported in this work will be based on low-level Spice


simulation of actual circuits.

Clock power is often the largest energy culprit in a complex design such as a modern

microprocessor [41, 96]. This is primarily because the clock signal potentially goes

everywhere on the chip. Clock gating is a popular technique that selectively turns off the

clock to portions of the chip that are not used at a particular time. Krashinsky studied

the benefits of clock gating applied at various levels of aggression on a microprocessor

design [58]. Tseng and Asanovic describe a technique that conserves register file power

when the value will be supplied from a bypass path [98]. This is similar in spirit to

compiler-controlled dataflow used in this dissertation except that the architecture de-

scribed in Chapter 9 eliminates the register file altogether and uses the bypass paths

to forward all values. There are two disadvantages to clock gating: the enable signal

must arrive sufficiently ahead of the clock signal, and the use of additional gates in the

signal path will increase clock skew. Both effects reduce the maximum achievable clock

frequency. For low-power design objectives, this is seldom a serious issue.

Modulo scheduling is a well known software pipelining approach for VLIW proces-

sors [81]. It permits multiple loop bodies to be simultaneously in flight within a clustered

VLIW processor. The perception processor discussed in this dissertation relies heavily

on modulo scheduling to achieve high performance. The regular nature of modulo sched-

uled loops makes them amenable to algorithmic level power analysis and optimization.

While the compiler controlled clock-gating explored in this dissertation has been free

of problems, such fine grain management of power could lead to excessive power line

noise otherwise known as the di/dt effect. In such cases it is possible for a compiler to

introduce additional dummy operations into a modulo scheduled loop to reduce power

line disturbance. Yun and Kim present a power aware modulo scheduling algorithm that

could limit power fluctuations [112].

While using custom coprocessors to accelerate applications is a well established idea,

recently researchers have started emphasizing it as a means of reducing power consump-

tion. PipeRench is one such programmable datapath developed at CMU. PipeRench

uses self-controlled runtime reconfiguration and virtualization of hardware to execute

a 40 tap 16 bit FIR filter processing 41.8 million samples per second and the IDEA


encryption algorithm at 450 Mbps while operating at 120 MHz [87]. Power consumption

for 15-20 filter taps while operating at 33.3 MHz is in the 600-700 mW range. Pleiades

is a reconfigurable DSP architecture developed at UC Berkeley. It is a domain specific

processor that trades off the flexibility of a general purpose processor for higher energy

efficiency. The Pleiades designers report that their architecture consumes only 14 nJ per stage of an FFT computation while the Intel StrongARM and the Texas

Instruments TMS320C2xx consume 36 nJ and 15 nJ respectively after normalizing for

CMOS process parameters [4]. The opportunities for special purpose architectures to

improve on the power consumption and the performance of general purpose devices are

numerous. Direct comparison against such systems is often impossible because of their

unavailability and due to the difference in the domains they are targeted at. For this

reason, the approach described in this dissertation will be compared against commer-

cially available general purpose processors and ASIC implementation of algorithms, not

against domain specific accelerators in the literature.

2.6 Distinguishing Features

The perception processor is unique in its use of semiautonomous distributed address

generators and scratch-pad memories to efficiently deliver data to a cluster of execution

units. Like the perception processor, the Imagine Stream Processor and its successor,

the Streaming Supercomputer, are targeted at stream computing. However, they target

high performance optimizations for multimedia and scientific calculations and are less

concerned with power efficiency. The perception processor, on the other hand, targets

power efficient acceleration of speech recognition and vision applications. Compiler

controlled dataflow is used as a means to mimic custom ASICs, but unlike the transport

triggered MOVE architecture, the perception processor is operation triggered. Unlike

prior research, the fine-grain compiler controlled clock gating described in this disserta-

tion is used not only as a power saving method but also as a means to let software control

the lifetime of values held in pipeline registers. This leads to the ability to schedule

variables in both time and space and harvest the natural register renaming that happens

when a pipeline shifts. Traditional VLIW processors like the Intel Itanium use a rotating


register file to accelerate loops. In contrast to traditional architectures, the perception

processor uses a mechanism called array variable rotation to create the equivalent of

multiple virtual rotating registers, one per array variable accessed in a loop body. Most

importantly, an architecture level analysis and optimization of perception applications, combined with a power efficient yet programmable architecture designed for a variety of stream oriented perception and DSP algorithms, is the distinguishing mark of this dissertation.


CHAPTER 3

PRINCIPLES BEHIND DYNAMIC POWER

REDUCTION

Power consumption in CMOS circuits consists primarily of static power dissipated

by leakage currents and dynamic power, which is in turn comprised of short circuit

dissipation and the switching power consumed while charging and discharging load

capacitances. Though subthreshold leakage current was a small component of power

consumption in past processes with 0.25µ and larger feature sizes, it is fast becoming a

large component in processes with smaller feature sizes. Architectures that expose power

management to the operating system and application software can play an important role

in reducing leakage power. A combination of software and hardware mechanisms can

intelligently power down parts of a system that are not in active use. The most effective

solutions to the leakage current problem are at the circuit and process level. Circuit

design styles that use gated Vdd and stacked transistors have been shown to greatly

reduce the magnitude of the problem, but they also decrease performance [78, 53].

CMOS processes with multiple threshold voltages (MTCMOS) provide another solution

to the leakage power problem. They also contribute to design flexibility since fast leaky

transistors can be used in critical paths to enhance performance and slow energy efficient

transistors can be used in noncritical parts of the circuit. How to take advantage of this

flexibility in large circuits synthesized from a hardware description language (HDL) is

an area of active research [93].

While a CMOS gate is switching state, there is a short period of time during which

the N and P transistors are simultaneously on, which leads to short-circuit current flowing

between the power and ground terminals. The magnitude of this current increases with

reductions in Vt. It also increases when the rise and fall times of the input waveform


are slow [109]. As in the case of leakage current there is very little that can be done at

the architecture level to solve the problem. The process level solution of using high Vt

devices and circuit design styles that ensure rapid rise and fall times alleviate the severity

of the problem.

The architectural options developed in this research are evaluated using transistor

level circuit simulations. These Spice simulations consider both the short circuit and the

leakage components of power consumption. However, since there is not much that can

be done at the architecture level, this research is focused entirely on CMOS dynamic

power dissipated by repeated charging and discharging of load capacitances – a problem

for which architecture level solutions are possible.

3.1 Dynamic Power Consumption

To understand how architectural strategies can provide high performance for percep-

tion applications at low power levels, it is necessary to look at the CMOS circuit dynamic

power consumption equation:

$P = ACV^2F \qquad (3.1)$

P is the power consumed, A is the activity factor, i.e., the fraction of the circuit that is

switching, C is the switched capacitance, V is the supply voltage, and F is the clock

frequency [109]. If a capacitance of $C$ is charged and discharged by a clock signal of frequency $F$ and peak voltage $V$, then the charge moved per cycle is $CV$ and the charge moved per second is $CVF$. Since the charge packet is delivered at voltage $V$, the power dissipated is $CV^2F$. The data power for a clocked flip-flop, which can toggle at most once per cycle, will be $\frac{1}{2}CV^2F$. When capacitances

are clock gated or when flip-flops do not toggle every cycle, their power consumption

will be lower. Hence, a constant called the activity factor (0 ≤ A ≤ 1) is used to model

the average switching activity in the circuit. Equation 3.1 is derived by incorporating

this term into the power consumption. Custom ASICs can drastically reduce the power

consumption by using specialized circuit structures and concurrency to lower C and F

respectively. The drawback is that custom ASICs are inflexible and once fabricated, they


cannot be reprogrammed. Also, their high production costs and long design times often

make them an unattractive choice. While programmable perception processors are more

desirable than ASICs, ASICs still represent the “gold standard” against which perception

processors should be compared. This is because the specialized nature of an ASIC gives

it significant power, performance and die area advantages when compared to a general

purpose processor. So they represent the best possible implementation of a particular

algorithm for a given CMOS technology.

Assume that an application is required to perform N operations every t seconds to

keep up with real time. Then it should be the case that:

$\frac{N}{IPC_{avg} \times F} \leq t$

Here $IPC_{avg}$ refers to the average number of instructions issued per cycle across the whole application. Further, when $\frac{N}{IPC_{avg} \times F} < t$, the processor has too much performance, i.e., its frequency is too high and it wastes power. When handling constant rate

real-time workloads, it is not useful to finish the work early and power down the circuit

till the next real-time deadline. The overhead of reloading state holding data memories

and the instruction memory may be in the range of several thousand cycles. It is better

to slow down the processor to have just enough performance to meet real-time deadlines

rather than paying the reload penalty tens or hundreds of times per second depending on

the nature of the constant rate workload. Thus the ideal frequency of operation is:

$F_{ideal} = \frac{N}{IPC_{avg} \times t} \qquad (3.2)$

Substituting this back in the power equation we get:

$P = ACV^2\,\frac{N}{IPC_{avg} \times t} \qquad (3.3)$
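As a concrete illustration of Equations 3.1 through 3.3, the following C sketch computes the ideal operating frequency and the resulting dynamic power for a real-time workload. It is written purely for this discussion; the numeric values are illustrative and do not correspond to measurements reported elsewhere in this dissertation.

    #include <stdio.h>

    /* Dynamic power, Equation 3.1: P = A * C * V^2 * F */
    static double dynamic_power(double A, double C, double V, double F)
    {
        return A * C * V * V * F;
    }

    /* Ideal frequency, Equation 3.2: just enough cycles to retire N
     * operations at ipc_avg instructions per cycle within the period t. */
    static double ideal_frequency(double N, double ipc_avg, double t)
    {
        return N / (ipc_avg * t);
    }

    int main(void)
    {
        double A   = 0.25;   /* activity factor (illustrative)   */
        double C   = 2e-9;   /* switched capacitance in farads   */
        double V   = 1.2;    /* supply voltage in volts          */
        double N   = 10e6;   /* operations per real-time period  */
        double t   = 0.01;   /* real-time period: 10 ms          */
        double ipc = 2.0;    /* average instructions per cycle   */

        double F = ideal_frequency(N, ipc, t);
        printf("F_ideal = %.1f MHz\n", F / 1e6);
        printf("P       = %.3f W\n", dynamic_power(A, C, V, F));
        return 0;
    }

Running the circuit any faster than $F_{ideal}$ only increases the $ACV^2F$ term without improving real-time behavior, which is the argument made above for matching the clock frequency to the deadline.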

3.2 Power Reduction Strategies

Equations 3.1 and 3.3 point to several power reduction strategies. For instance,

power consumption can be reduced by increasing IPC. However, modern dynamically


scheduled processors also increase the value of C when they increase IPC due to the

introduction of large reorder buffers, complex cache structures, register renaming and

support for speculative execution. Architectures that can provide high IPC without

an inordinate rise in the value of C will lead to low power consumption. This can

be achieved at the cost of generality by using simple application domain specific ILP

enhancing mechanisms as well as by taking advantage of compiler driven static ILP

improvements. Increasing the issue width causes some increase in power consumption

because of the wider structures used to support multiple issue. Since most of the ILP

extraction is done at compile time, and because the additional logic can be tailored to

take advantage of domain specific optimizations, the strategy leads to a net power savings

in the end.

Another architectural means of reducing power consumption is to decrease the ac-

tivity factor A. Clock gating provides one method of reducing the activity factor [96].

Designing structures that isolate activity happening in one part from being visible in

other parts is another useful technique. A typical example is the forwarding paths of a

super-scalar microprocessor. A forwarding mux connected to the output of a function

unit makes the value changes occurring in the final stage of that unit visible at the inputs

of other function units even when the receiving units do not need the forwarded value.

This leads to unnecessary switching activity and power dissipation at the receiving side.

When the forwarding path is not needed, the mux select signals can be manipulated

so that unnecessary value changes are not visible at the receiving side. This strategy

called operand isolation was utilized in the IBM PowerPC 4xx embedded controllers

[27]. Operand isolation under compiler control is used as a power saving strategy for the

perception processor described in Chapter 9.

Lowering the ideal operating frequency also permits the use of a lower supply volt-

age, which results in power savings. If frequency is directly proportional to supply

voltage, Equation 3.1 predicts cubic power reduction. However, in reality, $f \propto \frac{(V - V_t)^{K_{ds}}}{V}$, where $K_{ds}$ is a device saturation constant whose value ranges from zero to two when ve-

locity saturation is not explicitly modeled [12]. Considering this relationship, quadratic

or linear power savings may be obtained by lowering the supply voltage and operating


frequency. This strategy capitalizes on the results produced by researchers exploring

ideal voltage selection and voltage scaling [76]. Equation 3.1 applies only within a

narrow, process specific, supply voltage range.

Ultimately, the average IPC available in an application is limited by the dependences

between instructions. Further improvements may be obtained by multithreading the

application, in which case IPCavg in Equation 3.3 corresponds to the aggregate IPCs

of the individual threads. Traditional high performance multiprocessors exact a high

energy price because of the complexities of memory system coherence and interthread

communication. By tailoring a multiprocessor system to the information flow and syn-

chronization patterns found in perception applications, it is possible to design simple

architectures that provide sufficient generality for the perception domain.

Perception applications are usually stream oriented. They consist of a pipeline of

algorithms, most of which are compute and memory intensive. Each phase typically

touches and discards a large data set in a block oriented manner, i.e., several input blocks

and a few blocks of local state are consulted to compute a block of output. There is little

or no reuse of the high bandwidth input data, which is comprised of both input signals

and massive knowledge bases that are too large to cache on-chip. One or more phases

may be executed on a processor, and multiple processors may be connected in a pipeline

fashion for efficient interphase communication while harvesting thread level parallelism.

3.3 Process Normalization

Comparing the power and performance advantages of any perception-optimized ar-

chitecture to its competition presents some problems. Typically, the competition is a

commercial general purpose processor that is implemented in a different CMOS process

than the one used to implement the perception processor. To make a fair comparison

possible, it is necessary to normalize power and delay of circuits for the minimum feature

size of the CMOS process. Three different scaling regimes will be used to evaluate

the different architectures in Chapter 10 : constant field scaling, voltage scaling and

frequency scaling.


3.3.1 Constant Field Scaling

In constant field scaling, when the minimum feature size is scaled from $\lambda$ to $s\lambda$, where

s is a scale factor, the length and width of the channel, the oxide thickness, substrate

concentration density and the operating voltage are all scaled by the same factor s so that

the electric field in the transistor remains constant. The net result is that the dynamic

power consumption $P$ is scaled to $s^2P$, circuit delay $T$ is scaled to $sT$ and operating frequency $F$ is changed to $F/s$ [109]. Correspondingly, energy consumption scales as $s^3$ and the energy delay product scales as $s^4$. Both the horizontal and vertical electric fields

within a transistor must scale by the same factor for this analysis to hold.

3.3.2 Voltage Scaling

The operating speed of a circuit depends on the supply voltage. It is common practice

to optimize the power performance ratio by adjusting the supply voltage. Noise margins,

transistor threshold voltage and punch through limit the allowable range of supply volt-

age. Within the allowable range, Equation 3.1 predicts that when the supply voltage is

scaled by a factor of $s_v$, the dynamic power consumption scales by $s_v^2$.

3.3.3 Frequency Scaling

Equation 3.1 predicts that when the supply voltage is held constant and the frequency

is scaled by a factor of $s_f$, the dynamic power consumption scales by $s_f$ too.

In reality, most commercial systems do not undergo pure constant field scaling. To

obtain higher performance than promised by constant field scaling, slightly higher supply

voltages and correspondingly higher frequencies are used. This situation can be modeled

as voltage/frequency scaling layered on top of constant field scaling. When the supply

voltage is scaled by a factor of $s_v$ and the operating frequency is scaled by $s_f$, the dynamic power consumption scales by $s_v^2 s_f$.
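The scaling rules of this section can be folded into a small normalization helper. The C sketch below is illustrative only; the function names and the example figures are not the normalization code or the numbers used in Chapter 10.

    #include <stdio.h>

    /* Constant field scaling from feature size lambda to s*lambda:
     * dynamic power scales by s^2 (delay scales by s, frequency by 1/s). */
    static double scale_power_constant_field(double power, double s)
    {
        return power * s * s;
    }

    /* Voltage scaling by s_v and frequency scaling by s_f layered on top
     * of constant field scaling: dynamic power scales by s_v^2 * s_f.    */
    static double scale_power_vf(double power, double s_v, double s_f)
    {
        return power * s_v * s_v * s_f;
    }

    int main(void)
    {
        /* Example: normalize a 2.0 W circuit from a 0.25 micron process to
         * a 0.13 micron process, then account for a supply voltage 10%
         * above and a clock 20% above what pure constant field scaling
         * would give.  All numbers are hypothetical.                      */
        double p = scale_power_constant_field(2.0, 0.13 / 0.25);
        p = scale_power_vf(p, 1.10, 1.20);
        printf("normalized power = %.3f W\n", p);
        return 0;
    }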

In the interest of obtaining satisfactory noise margin, circuits are typically designed

to operate at voltages that are several times higher than the threshold voltage of tran-

sistors. Since the threshold voltage does not scale as rapidly as transistor feature size,

supply voltage cannot be reduced considerably in the future if noise margins are to be


maintained. Combined with the issue of increasing leakage current, these factors indicate

that technology scaling alone may not be adequate to alleviate the power consumption

problems of future systems.

3.4 The $ET^n$ Metric

Power consumption, delay, throughput and energy consumption are metrics com-

monly used to compare systems. Considering each of these metrics in isolation does not

permit a fair comparison of systems because of the ability of CMOS circuits to trade

performance for energy. When multiple criteria need to be optimized simultaneously,

it is common to optimize their weighted product. In the case of energy and time, this

product may be represented as the metric M for a circuit configuration C such that:

$M(C) = ET^n$

Here $n$ is a weight that represents the relative importance of the two criteria. The $ET^n$

metric was first proposed by Martin, Nystroem and Penzes [64]. Since energy and time

can be traded off for each other, consider the infinitesimally small quantity of energy

∆E that needs to be expended to reduce the time for a computation by an infinitesimally

small amount ∆T . Using Newton’s binomial expansion and ignoring products and higher

powers of ∆E and ∆T we get:

$M(C') = (E + \Delta E)(T - \Delta T)^n \approx ET^n - nET^{n-1}\Delta T + T^n\Delta E$

If this new operating point is equivalent to the old operating point under the metric $M$:

$ET^n - nET^{n-1}\Delta T + T^n\Delta E = ET^n$

Rearranging this equation yields:

$\frac{\Delta E}{E} = \frac{n\,\Delta T}{T} \qquad (3.4)$


Intuitively, this means that a small reduction in time is considered n times more

valuable than a corresponding reduction in energy. For example, if n = 1, a 1% reduction

in time is considered worth paying a 1% increase in energy. If n = 2, then it is acceptable

to pay for a 1% increase in performance with a 2% increase in energy consumption. In

general, when n = 1, energy and delay are equally important, when n > 1 performance

is valued more than energy and when 0 < n < 1 energy savings are considered more

important than performance. The case of n = 0 optimizes just for energy and n = −1

optimizes for power. Other negative values of $n$ are not useful for optimization since the metric changes in opposite directions for improvements in energy and delay.
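The effect of the weight $n$ is easy to see numerically. The following C sketch compares two hypothetical design points for the same task under $E$, $ET$ and $ET^2$; the energy and delay figures are invented for illustration.

    #include <math.h>
    #include <stdio.h>

    /* ET^n metric: energy in joules, time in seconds, weight n. */
    static double etn(double energy, double time, double n)
    {
        return energy * pow(time, n);
    }

    int main(void)
    {
        /* Design A: slow but frugal.  Design B: fast but energy hungry. */
        double Ea = 1.0e-3, Ta = 2.0e-3;   /* 1 mJ, 2.0 ms */
        double Eb = 4.0e-3, Tb = 0.8e-3;   /* 4 mJ, 0.8 ms */

        for (int n = 0; n <= 2; n++) {
            double ma = etn(Ea, Ta, n);
            double mb = etn(Eb, Tb, n);
            printf("n=%d: A=%.3e  B=%.3e  -> %s is better\n",
                   n, ma, mb, (ma < mb) ? "A" : "B");
        }
        return 0;
    }

With these numbers the low energy design wins under $E$ and $ET$, while the high performance design wins under $ET^2$, which mirrors the way the three metrics are used to compare processors in Chapter 10.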

3.5 Energy Delay Squared Product

Martin, Nystroem and Penzes proposed $ET^2$ as a special case of the $ET^n$ metric that is voltage independent [64]. They proved mathematically that an $ET^n$ optimal design is optimal irrespective of the value of $n$. There are a few caveats to this result. It

applies only when the circuit is operating within its normal range, i.e., supply voltage

is not close to the threshold voltage or to the velocity saturated region of a transistor.

The intuition behind their formulation is that two circuits with different supply voltages,

power consumptions and performance may be compared by voltage/frequency scaling

the systems until either their supply voltage or their frequency matches. Then the system

with the better power consumption or performance may be picked. Unfortunately, if the

initial difference in performance is too large as in the case of a 2.4 GHz Pentium 4 and a

400 MHz XScale described in Chapter 10, the scaled voltage will be outside the operating

range. For example, if the Pentium operating at 1.6 volts has 10 times the performance of

the XScale, to equalize their performance the Pentium's voltage needs to be scaled

down to approximately 0.16 volts. This assumes that operating frequency scales linearly

with supply voltage, an approximation that applies only in an extremely narrow voltage

range. The new supply voltage of 0.16 volts is bound to be smaller than the threshold

voltage of the 0.13µ CMOS process in which the Pentium is fabricated. So the Pentium

will not operate correctly at that voltage. Since the scaled supply voltage is not within

the normal voltage range, the metric equivalent optimality promised by Martin et al. will


not apply.

Results presented in Chapter 10 use $E$, $ET$ and $ET^2$ as metrics. The choice of $E$

gives an advantage to systems like the XScale processor that stress energy efficiency

over performance. The choice of $ET$ favors systems like the perception processor that value both performance and energy efficiency. $ET^2$ favors high performance processors

like the Pentium whose design allocates a large expenditure of energy in return for

small improvements in performance. Since the range of supply voltage required to

equalize the performance of the XScale and Pentium systems is outside the operating

range for transistors in the 0.13µ technology in which the Pentium 4 is implemented, this

dissertation uses $ET^2$ merely as a metric that stresses performance over energy savings.

No claims are made about metric equivalent optimality of the circuits for values of n

other than two.


CHAPTER 4

SPEECH RECOGNITION

Modern approaches to large vocabulary continuous speech recognition are surpris-

ingly similar in terms of their high-level structure [111]. The work described herein is

based on the CMU Sphinx 3.2 system, but the general approach is applicable to other

speech recognizers [49, 74]. The explanation of large vocabulary continuous speech

recognition (LVCSR) in this chapter is based on a simple probabilistic model presented

in [80, 111]. The human vocal apparatus has mechanical limitations that prevent rapid

changes to sound generated by the vocal tract. As a result, speech signals may be

considered stationary, i.e., their spectral characteristics remain relatively unchanged for

several milliseconds at a time. DSP techniques may be used to summarize the spec-

tral characteristics of a speech signal into a sequence of acoustic observation vectors.

Typically, 100 such vectors will be used to represent one second of speech. Speech

recognition then becomes a statistical problem of deriving the word sequence that has

the highest likelihood of corresponding to the observed sequence of acoustic vectors.

This notion is captured by the equation:

$W = \arg\max_W P(W|Y) \qquad (4.1)$

Here, $W = w_1, w_2, ..., w_n$ is a sequence of $n$ words and $Y = y_1, y_2, ..., y_T$ is a sequence of $T$ acoustic observation vectors. Equation 4.1 may be read as: $W$ is the particular word sequence which has maximum a posteriori probability given the observation sequence $Y$. Using Bayes' rule, this equation may be rewritten as:

$W = \arg\max_W \frac{P(Y|W)\,P(W)}{P(Y)} \qquad (4.2)$


$P(Y|W)$ denotes the probability of the acoustic vector sequence $Y$ given the word sequence $W$. $P(W)$ denotes the probability with which the word sequence $W$ occurs in the language. $P(Y)$ denotes the probability with which the acoustic vector sequence $Y$ occurs in the spoken language. $P(Y)$ is independent of the word sequence, therefore $W$ can be computed without knowing $P(Y)$. Thus Equation 4.2 may be rewritten as:

$W = \arg\max_W P(Y|W)\,P(W) \qquad (4.3)$

The set of DSP algorithms that convert the speech signal into the acoustic vector se-

quence $Y$ is commonly referred to as the front end. The quantity $P(Y|W)$ is generated by evaluating an acoustic model. The term $P(W)$ is generated from a language model.

4.1 Front End

The signal processing front end summarizes the spectral characteristics of the speech

waveform into a sequence of acoustic vectors that are suitable for processing by the

acoustic model. Figure 4.1 shows the stages of this transformation.

Frame Blocking: The digitized speech signal is blocked into overlapping frames. It

is common to have 100 frames per second, so a new frame is started every 10 ms. A new

frame contains the last 7.5 ms of the previous frame’s data and the first 7.5 ms of the

next frame’s data. Thus, even though a new frame is made every 10 ms, each frame is 25

ms in duration. The overlap decreases problems that might otherwise occur due to signal

data discontinuity.

Preemphasis: This stage spectrally flattens the frame using a first order filter. The

transformation may be described as:

$Y_0[n] = x[n] - \alpha x[n-1], \qquad 0.9 \leq \alpha \leq 1, \quad 0 < n < \text{Samples per frame}$

Here, x[n] refers to the nth speech sample in the frame. Sphinx uses α = 0.97 and the

sampling rate is typically 8K or 16K 16-bit samples per second.


Figure 4.1. Signal Processing Front End

Hamming Window: In this stage a Hamming window is applied to the frame to

minimize the effect of discontinuities at the edges of the frame during FFT. The transfor-

mation is:

$Y_1[n] = x[n] \times H[n], \qquad 0 < n < \text{Frame size}$

The vector H[n] is computed using the following equation.

$H[n] = 0.54 - 0.46 \times \cos\left(\frac{2\pi n}{\text{Frame size} - 1}\right)$


The constants used in the H[n] transform were obtained from the Sphinx source code.
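The preemphasis and windowing stages reduce to simple element-wise loops over one frame. The following C sketch illustrates both steps using the constants given above; the frame size and the function names are illustrative and are not taken from the Sphinx sources.

    #include <math.h>

    #define FRAME_SIZE 400   /* 25 ms of speech at a 16 kHz sampling rate */

    /* Preemphasis: Y0[n] = x[n] - alpha * x[n-1] with alpha = 0.97.
     * The first sample is passed through unchanged in this sketch.   */
    static void preemphasize(const short *x, double *y0)
    {
        const double alpha = 0.97;
        y0[0] = x[0];
        for (int n = 1; n < FRAME_SIZE; n++)
            y0[n] = x[n] - alpha * x[n - 1];
    }

    /* Hamming window applied to the preemphasized frame:
     * Y1[n] = Y0[n] * (0.54 - 0.46 * cos(2*pi*n / (FRAME_SIZE - 1)))  */
    static void hamming_window(const double *y0, double *y1)
    {
        const double pi = 3.14159265358979323846;
        for (int n = 0; n < FRAME_SIZE; n++) {
            double h = 0.54 - 0.46 * cos(2.0 * pi * n / (FRAME_SIZE - 1));
            y1[n] = y0[n] * h;
        }
    }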

FFT: The frame is padded with enough zeroes to make the frame size a power of two

(call this N ) and a Fourier transform is used to convert the frame from the time domain

to the frequency domain.

$Y_2 = DFT(Y_1)$

The square of the magnitude is then computed for each frequency component. Thus the

results are real numbers rather than the complex output produced by a discrete Fourier

transform.

$Y_3[n] = \mathrm{real}(Y_2[n])^2 + \mathrm{imag}(Y_2[n])^2, \qquad 0 < n \leq N/2$

Mel Filter Bank: A set of triangular filter banks is used to approximate the frequency

resolution of the human ear. The Mel frequency scale is linear up to 1000 Hz and

logarithmic thereafter. A set of overlapping Mel filters are made such that their center

frequencies are equidistant on the Mel scale. The transformation is:

$Y_4[n] = \sum_{i=0}^{N/2} Y_3[i] \times MelWeight[n][i], \qquad 0 < n < \text{Number of filters}$

For 16 KHz sampling rate, Sphinx uses a set of 40 Mel filters.

Log Compression: The range of the values generated by the Mel filter bank is

reduced by replacing each value by its natural logarithm. This is done to make the

statistical distribution of the spectrum approximately Gaussian – a requirement for the

subsequent acoustic model. The transformation is:

$Y_5[n] = \ln(Y_4[n]), \qquad 0 < n < \text{Number of filters}$
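The Mel filter bank and log compression stages together amount to a weighted reduction over the power spectrum followed by a logarithm. A C sketch is shown below; the FFT size, the number of filters and the precomputed weight table are assumed to be set up elsewhere, and the names are illustrative rather than Sphinx internals.

    #include <math.h>

    #define NFFT        512   /* zero-padded FFT size N (assumed)         */
    #define MEL_FILTERS 40    /* 40 Mel filters at a 16 kHz sampling rate */

    /* y3:         power spectrum, N/2 + 1 points (output of the FFT stage)
     * mel_weight: precomputed triangular filter weights MelWeight[f][i]
     * y5:         log Mel filter bank energies (Y4 followed by Y5)         */
    static void mel_filter_bank_log(const double y3[NFFT / 2 + 1],
                                    const double mel_weight[MEL_FILTERS][NFFT / 2 + 1],
                                    double y5[MEL_FILTERS])
    {
        for (int f = 0; f < MEL_FILTERS; f++) {
            double acc = 0.0;
            for (int i = 0; i <= NFFT / 2; i++)
                acc += y3[i] * mel_weight[f][i];   /* Y4[f], the filter energy */
            y5[f] = log(acc);                      /* log compression stage    */
        }
    }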

DCT: The discrete cosine transform is used to compress the spectral information into

a set of low order coefficients. This representation is called the Mel-cepstrum. Currently


Sphinx compresses the 40 element vector Y5 into a 13 element cepstral vector. The

transformation is:

$Y_6 = DCT(Y_5)$

Numerical differentiation: Acoustic modeling assumes that each acoustic vector is

uncorrelated with its predecessors and successors. Since speech signals are continuous,

this assumption is problematic. The traditional solution is to augment the cepstral vector

with its first and second differentials. Since the Mel cepstral vector is 13 elements long

in Sphinx, after appending the differentials the final acoustic vector is 39 elements in length.

Summary: The Sphinx front end transforms a 25 ms speech sample into a 39 element

vector of real numbers that represents the spectral characteristics of the waveform in a

compact form. The speech signal is blocked into overlapping frames spaced 10 ms apart.

Thus the front end transforms one second of speech into a series of 100 acoustic vectors.

Even though the front end only occupies less than 1% of the compute cycles of Sphinx

3.2, it is very important for two reasons.

1. Understanding acoustic vectors is a crucial prerequisite to illustrate the operation

of the acoustic model.

2. The front end is dominated by floating point computations that make it very prob-

lematic to run on embedded processors without floating point hardware. Fixed

point versions are difficult to create and analyze, but have been studied in the

literature. Delaney described a fixed point speech front end for Sphinx which

performed 34 times better on an embedded processor than a floating point front

end that uses software emulated floating point operations [32].

4.2 Acoustic Model

Equation 4.3 needs the quantity P (Y |W ), the probability of an acoustic vector se-

quence Y given a word sequence W to find the most probable word sequence. A


simplistic approach to achieve this would be to obtain several samples of each possible

word sequence, convert each sample to the corresponding acoustic vector sequence and

compute a statistical similarity metric for the given acoustic vector sequence Y to the set

of known samples. For large vocabulary speech recognition this is not feasible because

the set of possible word sequences is very large. Instead words may be represented as

sequences of basic sounds. Knowing the statistical correspondence between the basic

sounds and acoustic vectors, the required probability can be computed.

The basic sounds from which word pronunciations can be composed are known as

phones or phonemes. Approximately 50 phones may be used to pronounce any word

in the English language. For example the CMU dictionary enlists the pronunciation for

dissertation as:

DISSERTATION D IH S ER T EY SH AH N

While phones are an excellent means of encoding word pronunciation, they are less than

ideal for recognizing speech. The mechanical limits of the human vocal apparatus leads

to co-articulation effects where the beginning and end of a phone are modified by the

preceding and succeeding phones. Recognizing multiple phone units in context tends

to be more accurate than recognizing individual phones. Current speech recognition

systems deal with three-tuples of phones called triphones. It is customary to denote

triphones as left context−current phone+right context. For example SH-AH+N is

a triphone that represents the context of the AH phone in the word dissertation. The final

N phone in “dissertation” can be modeled with a cross-word triphone whose right context

is the first phone in the next word or by the triphone AH-N+SIL where SIL is a special

phone that denotes silence. Although there are approximately 50 × 50 × 50 = 125,000

possible triphones, only about 60,000 actually occur in English.

The probability that an acoustic vector sequence corresponds to a particular triphone

may be estimated using a Hidden Markov Model (HMM). Current speech recognizers use

an HMM model with three internal states and an entry and an exit state. The topology

of the HMM is shown in Figure 4.2. An HMM is a probabilistic finite state machine

that generates observation sequences. If the model is in state $S_i$ at time step $t$, then it


Figure 4.2. Triphone HMM

has a probability $B_i(Y_t)$ of producing the acoustic vector $Y_t$ and it switches to state $S_j$ with probability $A_{ij}$. The problem of computing $P(Y|W)$ now becomes what is known

as the evaluation problem for HMMs – the problem of estimating the probability with

which a given HMM could have generated the observation sequence Y . The evaluation

problem can be solved using the Forward/Backward algorithm for HMMs, but since the

optimal state sequence is needed at a later stage, it is common to do a more expensive

Viterbi search which can compute the probability and uncover the optimal state sequence

simultaneously [80].
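One Viterbi time step over the three emitting states of the triphone HMM in Figure 4.2 can be written as a short double loop. The C sketch below works with log probabilities and assumes that the log transition scores and the per-state observation scores $B_i(Y_t)$ have already been computed; the type and function names are illustrative rather than Sphinx internals.

    #include <float.h>

    #define NSTATES 3   /* emitting states per triphone HMM */

    /* One Viterbi update.  score[i] holds the best log probability of being
     * in state i after the previous observation, logA[i][j] is the log
     * transition score from state i to state j, and logB[j] is the log
     * observation score of the current acoustic vector in state j.        */
    static void viterbi_step(double score[NSTATES],
                             const double logA[NSTATES][NSTATES],
                             const double logB[NSTATES],
                             int backptr[NSTATES])
    {
        double next[NSTATES];
        for (int j = 0; j < NSTATES; j++) {
            double best = -DBL_MAX;
            int arg = 0;
            /* Left-to-right topology: only states i <= j can reach state j. */
            for (int i = 0; i <= j; i++) {
                double cand = score[i] + logA[i][j];
                if (cand > best) { best = cand; arg = i; }
            }
            next[j] = best + logB[j];
            backptr[j] = arg;   /* remembered to recover the optimal state sequence */
        }
        for (int j = 0; j < NSTATES; j++)
            score[j] = next[j];
    }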

4.3 Language Model

The accuracy of the recognition hypotheses produced by the acoustic model can be fur-

ther enhanced using a language model. The acoustic model might produce several

alternate similar words that the language model helps to disambiguate. Language models

are also useful in limiting search time for beam search based acoustic models. N-gram

models which predict the probability of a word based on the previous N − 1 words are

a common and effective approach. Current systems like Sphinx and HTK favor models

with N=3, which are called trigrams. While there are alternatives to N-gram models that

rely on grammar, syntax, subject verb agreement and trigger words, N-gram models have

the distinct advantage of being easy to train since N-gram probabilities can be easily

estimated from a large corpus of text automatically. A trigram model may be trained


simply by using the equation:

$P(w_3|w_1, w_2) = \frac{F(w_1, w_2, w_3)}{F(w_1, w_2)}$

Here, $F(w_1, w_2, w_3)$ refers to the frequency of occurrence of the trigram $(w_1, w_2, w_3)$ in the training text and $F(w_1, w_2)$ refers to the frequency of occurrence of the bigram $(w_1, w_2)$. In practice, for a large vocabulary all possible trigrams will not be present in

the training corpus. In that case bigram or unigram probabilities are used in the place of

trigram probabilities after reducing the probability by a back-off weight, which accounts

for the fact that the next higher n-gram has not been seen and therefore has a lower chance

of occurring.
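In code, the back-off scheme amounts to a cascade of table lookups. The C sketch below shows the control flow; the lookup helpers and the NOT_FOUND convention are hypothetical placeholders rather than the actual Sphinx language model interface, and they are declared here only as prototypes.

    /* Each helper returns a log probability, or NOT_FOUND when the n-gram
     * was absent from the training text.  Log probabilities are <= 0, so a
     * positive sentinel is unambiguous.  (Hypothetical interface.)         */
    #define NOT_FOUND 1.0

    double trigram_logprob(int w1, int w2, int w3);
    double bigram_logprob(int w2, int w3);
    double unigram_logprob(int w3);
    double backoff_trigram(int w1, int w2);   /* back-off weight, trigram -> bigram  */
    double backoff_bigram(int w2);            /* back-off weight, bigram  -> unigram */

    /* log P(w3 | w1, w2) with back-off to lower order n-grams. */
    double lm_score(int w1, int w2, int w3)
    {
        double p = trigram_logprob(w1, w2, w3);
        if (p != NOT_FOUND)
            return p;
        p = bigram_logprob(w2, w3);
        if (p != NOT_FOUND)
            return backoff_trigram(w1, w2) + p;
        return backoff_trigram(w1, w2) + backoff_bigram(w2) + unigram_logprob(w3);
    }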

4.4 Overall Operation

HMMs are constructed for all known triphones. A pronunciation dictionary is used

to convert words into triphone sequences with overlapping contexts. For example the

isolated word dissertation whose pronunciation is the phone sequence D IH S ER T

EY SH AH N is expanded to SIL-D+IH, IH-S+ER, S-ER+T, ER-T+EY, T-EY+SH, EY-

SH+AH, SH-AH+N, AH-N+SIL. There are many more expansions corresponding to all

words that could possibly precede or succeed this word in a sentence. These are words

that could end in D+IH or start with AH-N. A data-structure known as a lexical tree

(Sphinx terminology) is constructed, and all words in the dictionary are entered in the

lexical tree. The roots of the tree correspond to the set of all triphones that start any

word in the dictionary. Each node in the tree points to the next triphone in the expanded

pronunciation of a word. Common triphone sequences may be shared within the tree.

The overall effect is that of combining all the triphone HMMs by adding null transitions

between the final states of one triphone HMM to the initial state of its successor. To

model continuous speech, null transitions are added from the final state of each word

to the initial state of all words. Triphones that occur at the end of a word are specially

marked so that a language model may be consulted at those points. Thus the lexical

tree is a multirooted tree where each node points to an HMM and a successor node. In

the case of word exit triphones there are multiple successors. Given an acoustic vector


sequence Y , each vector in the sequence is applied successively to the HMMs and the

probability that the HMM generated that vector is noted. Transitions are made in each

step to successor nodes. On reaching a word exit triphone, the state sequence history is

consulted to find the word that has been recognized. The last n words (usually n=3) are

checked against a language model for further analysis. The search is done by means of

a well known dynamic programming algorithm known as Viterbi beam search [74]. The

acoustic and language models are strongly coupled, though language model evaluation

may be deferred until the acoustic model has been evaluated. Together, they consume

almost 99% of the run time of Sphinx.

4.5 Architectural Implications

A basic understanding of the acoustic and language models is necessary to understand

the architectural implications and scaling characteristics of speech recognition. The

lexical tree is a complex data structure that results in considerable pointer chasing at

run time. The nodes that will be accessed depend very much on the sentences being

spoken. The size of the tree depends on the vocabulary size. However there is scope for

architectural optimization. The opportunity stems from the fact that acoustic vectors

are evaluated successively and on evaluating an HMM for the current vector, if the

HMM generates a probability above a certain threshold, the successors of the HMM

will be evaluated in the next time step. Thus there is always a list of currently active

HMMs/lextree nodes and a list of nodes that will be active next. Evaluating each HMM

takes a deterministic number of operations and thus a fixed number of clock cycles. This

information can be used to prefetch nodes ahead of when they are evaluated.
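A sketch of how such a prefetch-friendly traversal might look is given below. The node layout and the scoring call are placeholders for the lexical tree structures described above rather than the actual Sphinx definitions; the prefetch uses the GCC __builtin_prefetch intrinsic.

    /* Simplified active-list node; the real lexical tree node carries HMM
     * state scores, senone identifiers and language model information.    */
    typedef struct lextree_node {
        struct lextree_node *succ;   /* successor triphone node           */
        float path_score;            /* best path score entering this HMM */
        /* ... */
    } lextree_node;

    /* Fixed-work HMM evaluation kernel (prototype only in this sketch). */
    void evaluate_hmm(lextree_node *node, const float *acoustic_vec);

    /* Score every currently active HMM for one acoustic vector.  Because
     * each evaluation takes a predictable number of cycles, the next node
     * can be prefetched while the current one is being scored.            */
    void evaluate_active_list(lextree_node **active, int n_active,
                              const float *acoustic_vec)
    {
        for (int i = 0; i < n_active; i++) {
            if (i + 1 < n_active)
                __builtin_prefetch(active[i + 1], 0 /* read */, 1 /* low reuse */);
            evaluate_hmm(active[i], acoustic_vec);
        }
    }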

Given the fact that the number of triphones and words in a language are relatively

stable, it might appear that the workload will never expand. In reality this is not the

case due to the probability density function Bi(Yt). In the past, speech recognizers

used subvector quantized models, which are easy to compute. These methods use a

code book to store reference acoustic vectors. Acoustic vectors obtained from the front

end are compared against the code book to find the index c of the closest match. The

probability density function then reduces to a table lookup of the form B[i][c]. While


this is computationally efficient, the discretization of observation probability leads to

excessive quantization error and thereby poor recognition accuracy.

To obtain better accuracy, modern systems use a continuous probability density func-

tion and the common choice is a multivariate mixture Gaussian in which case the com-

putation may be represented as:

$B_i(Y_t) = \sum_{m=1}^{M} c_{im} \sum_{n=1}^{N} (Y_t[n] - \mu_{im}[n])^2 \times V_{im}[n] \qquad (4.4)$

Here, $\mu_{im}$ is the mean, $V_{im}$ the variance term and $c_{im}$ the weight of the $m$th mixture component. The Hub-4 speech model used for this research was obtained from CMU; its designers chose $M$ and $N$ to be 8 and 39 respectively. Note that the outer $\sum_{m=1}^{M}$ denotes an addition in the logarithmic domain. Normally the inner

term involves exponentiation to compute a weighted Mahalanobis-like distance, but it is

reduced to simple arithmetic operators by keeping all the parameters in the logarithmic

domain [91, 111]. Therefore the outer summation needs to be done in the logarithmic

domain. This may be implemented using table lookup based extrapolation. This strategy

is troublesome if the processor’s L1 D-cache is not large enough to contain the lookup

table.
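For reference, a log-domain evaluation of Equation 4.4 can be sketched in C as follows. The sign convention for the distance term and the use of log1p in place of the table lookup based log-addition are simplifications made for this illustration; this is not the Sphinx implementation.

    #include <math.h>

    #define M_MIX 8    /* mixture components per density */
    #define N_DIM 39   /* acoustic vector length         */

    /* Log-domain addition: log(e^a + e^b).  A real implementation replaces
     * the log1p/exp pair with a small lookup table based extrapolation.   */
    static double log_add(double a, double b)
    {
        if (a < b) { double t = a; a = b; b = t; }
        return a + log1p(exp(b - a));
    }

    /* Equation 4.4 with all parameters kept in the log domain: mean[m][n],
     * var[m][n] (variance-derived weights) and logc[m] (mixture weights). */
    static double log_gaussian_density(const double y[N_DIM],
                                       const double mean[M_MIX][N_DIM],
                                       const double var[M_MIX][N_DIM],
                                       const double logc[M_MIX])
    {
        double log_b = -HUGE_VAL;
        for (int m = 0; m < M_MIX; m++) {
            double d = logc[m];
            for (int n = 0; n < N_DIM; n++) {
                double diff = y[n] - mean[m][n];
                d -= diff * diff * var[m][n];   /* weighted distance term */
            }
            log_b = log_add(log_b, d);          /* outer sum, log domain  */
        }
        return log_b;
    }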

If each HMM state uses a separate probability density function, then the system is

said to be fully continuous. Thus the peak workload for an English speech recognizer

would correspond to the evaluation of about 60,000 probability density functions and

HMMs, as well as an associated lextree traversal that is proportional to the number of

words in the vocabulary. Fully continuous models are not popular for two reasons:

1. Their computational complexity makes them orders of magnitude slower than real

time on current processors.

2. Their parameter estimation problem and sparse training sets lead to low recognition

accuracy.

The parameter estimation problem is particularly difficult. For M = 8 and N = 39

Equation 4.4 needs $39 \times 2 \times 8 + 8 = 632$ parameters for the values of $\mu_{im}$, $V_{im}$ and $c_{im}$.


For a total of 60,000 triphones this adds up to 113.7 million parameters. The training data

is often insufficient to estimate that many parameters, so the use of continuous models

leads to increased word error rate. The usual solution is to cluster together HMM states

and share a probability density function among several states. Such clustering methods

are an area of active research. A speech recognition system that uses clustered probability

density functions is called a semicontinuous or tied-mixture system. Almost all advanced

large vocabulary speech recognizers currently fall in this category. The Hub-4 speech

model used to evaluate Sphinx 3.2 contains approximately 6000 probability density

functions representing an average of 30 HMM states sharing a single function. This

ratio could change when a model is trained on a larger data set leading to proportionately

increased compute complexity. Another possibility is an increase in M , the number of

mixtures per function, which will again proportionately increase the compute cycles. A

third possibility is increasing the size of the context from triphones to quinphones (five

phones, one current phone and two left and two right neighbors). The use of quinphones

will lead to an increase in the number of probability density functions that need to be

evaluated. This will be further multiplied by the number of quinphones in the language

vs the number of triphones.

Though traditional speech recognizers couple the evaluation of HMMs and Gaussians

tightly, in the interest of extracting greater levels of thread parallelism, it is possible to

decouple HMM and Gaussian evaluation, an approach that will be further investigated in

Chapter 5.


CHAPTER 5

CHARACTERIZATION AND OPTIMIZATION OF

SPHINX 3

Chapter 4 described the Front end (FE), Gaussian (GAU) and Search (HMM) phases

of the Sphinx 3.2 speech recognition system. To fully characterize the complex behavior

of Sphinx, it is necessary to study the individual phases separately. In addition to the

FE, GAU and HMM phases, Sphinx has a lengthy startup phase and extremely large

data structures which could cause high TLB miss rates on embedded platforms with

limited TLB reach. To avoid performance characteristics being aliased by startup cost

and the TLB miss rate, Sphinx was modified to support check-pointing and fast restart.

For embedded platforms, the check-pointed data structures may be moved to ROM in

a physically mapped segment similar to kseg0 in MIPS processors [71]. Results in this

chapter are based on this low startup cost version of Sphinx, referred to as original.

Previous studies have not characterized the three phases separately [6, 59]. To capture

the phase characteristics and to separate optimizations for embedded architectures, a

phased version of Sphinx was developed so that each of the FE, GAU and HMM phases

can be run independently with input and output data redirected to intermediate files.

In the rest of this chapter FE, GAU and HMM refer to the corresponding phase run in isolation while phased refers to all three chained sequentially with no feedback. In phased, FE and HMM are identical to original, while the workload of GAU is increased

by the lack of dynamic feedback from HMM. Breaking this feedback path exposes

parallelism in each phase and allows the phases to be pipelined. GAU OPT refers to

a cache optimized version of the GAU phase alone. PAR runs each of the FE, GAU OPT

and HMM phases on separate processors. It also uses the same cache optimizations as

GAU OPT.


Both simulation and native profiling tools were used to analyze Sphinx 3. Simulations

provide flexibility and a high degree of observability, while profiled execution on a real

platform provides realistic performance measures and serves as a way to validate the

accuracy of the simulator. The configurations used to analyze Sphinx 3 are shown in

Table 5.1.

A multi-GHz processor is required to operate Sphinx in real time. Parameters like L1

cache hit time, memory access time and floating point latency were measured on a 1.7

GHz AMD Athlon processor using the lmbench hardware performance analysis bench-

mark [68]. Numbers that could not be directly measured were obtained from vendor

microarchitecture references [51, 5]. The Simplescalar simulator was then configured to

reflect these parameters [19]. Unless mentioned otherwise, the remainder of this chapter

uses the default configuration from Table 5.1.

Native profiling indicates that the original Sphinx spends approximately 0.89%, 49.8%

and 49.3% of its compute cycles in the FE, GAU and HMM phases respectively. Another

recent study found that as high as 70% of another speech recognizer’s execution time was

Table 5.1. Experiment Parameters

Native Execution:
  SGI Onyx3, 32 R12K processors at 400 MHz
  32 KB 2-way IL1, 32 KB 2-way DL1, 8 MB L2
  Software: IRIX 64, MIPS Pro compiler, Perfex, Speedshop
Simulator (default configuration):
  SimpleScalar 3.0, out of order CPU model, PISA ISA
  8 KB 2-way IL1, 2 cycle latency; 32 KB 2-way DL1, 4 cycle latency
  2 MB 2-way L2, 20 cycle latency; 228 cycle DRAM latency
  L1 line size 64 bytes, L2 line size 128 bytes
  Software: gcc 2.6.3
ILP Experiment Configurations:
  Reasonable configuration: 32 KB DL1, 4 cycle latency; 2 MB L2, 20 cycle latency; 2 memory ports
  Aggressive configuration: 32 KB DL1, 2 cycle latency; 8 MB L2, 20 cycle latency; 4 memory ports


spent in Gaussian probability computation [59]. In the phased version approximately

0.74%, 55.5% and 41.3% of time was spent in FE, GAU and HMM respectively. Since

FE is such a small component of the execution time, the rest of this work excludes it and

concentrates on the analysis of the GAU and HMM phases.

5.1 Memory System Behavior

Figures 5.1 and 5.2 show the L1 Dcache and L2 cache miss rates for original, phased,

FE, HMM and GAU for a variety of configurations. Since earlier studies showed that

larger line sizes benefit Sphinx II, 64 byte L1 and 128 byte L2 cache line sizes were

chosen [6]. In addition, the L2 cache experiments assume a 32 KB L1 Dcache. Both

figures assume an 8 KB Icache. Since Sphinx has an extremely low instruction cache

miss rate of 0.08% for an 8 KB Icache, no other Icache experiments were done.

Figure 5.1. L1 Dcache Miss Rate (L1 data cache miss rate, in percent, for 8 KB to 64 KB simulated caches and the 32 KB SGI configuration; series: Original, Phased, Phased OPT, FE, GAU, GAU OPT and HMM)


Figure 5.2. L2 Cache Miss Rate (L2 cache miss rate, in percent, for 256 KB to 8 MB simulated caches and the SGI 8 MB configuration; series: Original, Phased, Phased Opt, FE, GAU, GAU OPT and HMM)

The SGI data provide a reality check since they represent results obtained using hardware

performance counters. The SGI L2 results are very similar in character to the 8 MB

simulation results in spite of the effects of out of order execution, memory system latency

and differences in cache replacement policy. The L1 results are not directly comparable

since the R12000 uses a 32 byte L1 line size and suffers from cache pollution induced

by abundant DTLB misses.

Figure 5.3 shows the average bandwidth required to process the workload in real

time. This is obtained by dividing the total L2 to memory traffic while Sphinx operates

on a speech file by the duration in seconds of the speech signal. The evidence suggests

that bandwidth starvation leading to stalls on L2 misses is the reason this application

is not able to meet real-time requirements. The memory bandwidth required for this

application is several times higher than what is available in practice. Note that available

bandwidth is always significantly less than the theoretical peak on most architectures. A

16-fold improvement in L2 size from 256 KB (the L2 size of a 1.7 GHz Athlon) to 8 MB

(SGI Onyx) produces only a very small decrease in the bandwidth requirement of GAU.


Figure 5.3. L2 to Memory Bandwidth (L2-to-memory bandwidth, in MB/s, required for real-time operation for 256 KB to 8 MB L2 sizes and the SGI 8 MB configuration; series: Original, Phased, Phased Opt, GAU, GAU Opt and HMM)

This phase essentially works in stream mode making 100 sequential passes per second

over a 14 MB Gaussian table. The speech signal itself contributes only 16 KB/s to the

total bandwidth requirements. Some computation saving heuristics in Sphinx also have

the beneficial side effect of helping to save bandwidth by not touching blocks that are

deemed improbable. Until the L2 size reaches 8 MB, long term reuse of Gaussian table

entries in the L2 is infrequent. It should be noted that the bandwidth requirement of GAU

in isolation is more severe than if it were operating inside original, since feedback driven

heuristics cannot be applied.

5.2 ILP in Sphinx

Before exploring special-purpose architecture extensions for speech, it is worthwhile

to investigate the limits of modern architectures. GAU is a floating point dominant code

while HMM is dominated by integer computations. GAU also appears to be easily vec-


torizable. Two simulation studies were undertaken to explore possibilities for extracting

ILP. For GAU, a surplus of integer ALUs was provided and the number of floating point

units was varied. Since this algorithm uses an equal number of multiplies and adds, the

number of floating point adders and multipliers was increased in equal numbers from

one to four, which corresponds to the X axis varying from two to eight FPUs in Figure

5.4. Two different memory system hierarchies were considered: a reasonable one for

a multi-GHz processor and an aggressive memory system with lower latencies. Both

configurations are summarized in Table 5.1.

The SGI-2+2f entry describes the measured total IPC on the R12000, which has two

integer and two floating point units. The SGI-2f entry is the measured floating point IPC

alone. In the case of GAU, IPC remains low because of insufficient memory bandwidth

to keep the FPUs active. In the case of the R12000, which can issue two floating

point operations per cycle, the IPC for this loop is an underwhelming 0.37. GAU OPT,

uncovers opportunities for ILP by virtue of its cache optimizations thereby improving

IPC greatly. However, the IPC saturates at 1.2 in spite of available function units. A

Figure 5.4. GAU and GAU OPT IPC (IPC for 2 to 8 FPUs under the reasonable and aggressive configurations, together with the measured SGI 2+2f and SGI 2f results)

Page 62: THE PERCEPTION PROCESSOR - CiteSeer

47

recently published study also indicated IPC in the range of 0.4 to 1.2 for another speech

recognizer [59]. Clearly, the architecture and compiler are unable to automatically extract

the available ILP, which again argues for custom acceleration strategies.

Figure 5.5 shows the corresponding experiment for the HMM phase. In this experi-

ment, the number of integer adders and multipliers are varied equally from one to four.

In spite of available execution resources, IPC remains low. It should be noted that in both

experiments, the SGI results are indicative of cases where the CPU to memory clock ratio

is low. This ratio will undoubtedly increase in the future.

The observations from sections 5.1 and 5.2 have several implications:

1. If speech is an “always on” background application, it could cause significant L2

cache pollution and memory bandwidth degradation to the foreground application.

To guarantee real-time processing, it might be better to stream data around the L2

rather than pollute it.

Figure 5.5. HMM IPC (HMM IPC plotted against the number of ALUs, from two to eight, for the Reasonable and Aggressive memory configurations, plus the measured SGI-2 data point)

2. Since the L2 cache is one of the largest sources of capacitance on the chip, ac-

cessing it for stream data incurs a large power overhead. Low power embedded

platforms may not need any L2 cache at all since dramatic increases in L2 size are

not accompanied by corresponding improvements in DRAM bandwidth require-

ments or performance.

3. Bandwidth reduction is important for its own sake as well as to reduce power

consumption. Bandwidth partitioning so that each phase has independent access

to its data set is important.

5.3 Results of Software Optimizations

Since Sphinx was shown to have poor cache behavior, cache optimizations were

investigated. To extract greater levels of parallelism the application was multithreaded.

5.3.1 Cache Optimizations

In Section 5.1, GAU was shown to be bandwidth starved. The GAU code in phased

was instrumented and found to require approximately twice the amount of computation

as in original. However, Figure 5.6 shows that phased slows down to only 0.85 times the speed
of original on an R12000. Clearly, a large fraction of the excess computation is hidden

by memory latency. With processor to memory speed ratios increasing in the future, an

out of order processor can hide an even larger amount of compute overhead. The key is

to improve the memory system behavior without an unreasonable increase in compute

requirements.

To achieve this goal, two transformations were performed on phased. First, a block-

ing optimization similar in spirit to loop tiling was performed, which delays the initial

speech signal by 100 ms or 10 frames. The Gaussian probabilities for all 10 frames are

computed by making a single pass over the Gaussian tables. This effectively reduces the

number of passes to 10 per second where original would have done 100. The blocking

factor is limited to 10 to avoid a perceptible real-time lag at the decoder output.

It should be noted that this is not a blocking or tiling transformation that a compiler

could perform. The software had to be restructured to accumulate 10 frames of the
speech signal and to process 10 frames in one pass.

Figure 5.6. Measured Speedup on R12K (speedup over the original sequential Sphinx: Original 1.00, Phased 0.85, Opt 1.05, Par 1.67, Amdahl 1.97, Real time 2.79)

Further, this became possible only

because the feedback between HMM and GAU was eliminated. Speech researchers ad-

vancing the state of their art are unlikely to be interested in or aware of architectural level

implications. Thus, it is imperative that architecture researchers analyze the performance

implications of important perception applications like speech recognition.

Sphinx allocates the mean and variance vectors used for Gaussian computation de-

scribed in Section 4.5 separately. Every component evaluation consumes one mean and

one variance vector. Since Sphinx originally allocated each table of vectors separately

and each is more than 7 MB, they potentially conflict with each other in the cache. To

avoid this, corresponding mean and variance vectors were interleaved and padded with

an additional 64 bytes to be exactly three L2 cache lines long. This padding strategy

consumes bandwidth but simplifies DMA transfers for the coprocessor architecture de-

scribed later. The optimized version appears in Figure 5.7. Note the interleaving of

vectors and a blocking loop that is not present in Equation 4.4. The optimized version

appears in Figures 5.1, 5.2, 5.3 and 5.6 as the data point GAU OPT.


for (senone = 0; senone < N; senone++)                  // Loop 0
    for (block = 0; block < 10; block++)                // Loop 1
        for (c = 0; c < 8; c++)                         // Loop 2
        {
            for (i = 0, sum = 0.0; i < 39; i++)         // Loop 3
            {
                t = X[block][i] - Gautable[senone][c].vector[i].Mean;
                sum += t * t * Gautable[senone][c].vector[i].Var;
            }
            sum = max(sum, MINIMUM_VALUE);
            sum = sum * Gautable[senone][c].FinalScale + Gautable[senone][c].FinalWeight;
            score[senone][block] = log_add(score[senone][block], sum);
        }

Figure 5.7. Cache Optimized Gaussian Algorithm
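The interleaved and padded layout can be pictured with the following declarations. This is a minimal sketch assuming 32-bit floats and assuming that FinalScale, FinalWeight and the padding share one record per mixture component; the field names follow Figure 5.7, the struct names are illustrative:

    struct MeanVar   { float Mean; float Var; };        /* interleaved Mean/Var pair    */
    struct Component {
        struct MeanVar vector[39];                       /* 39 pairs x 8 bytes = 312 B   */
        float FinalScale;                                /* scale and weight: 8 B        */
        float FinalWeight;
        char  pad[64];                                   /* 64 B of padding              */
    };                                                   /* total 384 B = 3 L2 lines     */
    struct Component Gautable[N][8];                     /* 8 mixture components/senone  */

Under this layout each component evaluation in the inner loop touches exactly three consecutive 128-byte L2 lines, which is what keeps the coprocessor DMA transfers simple.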

GAU OPT demonstrates the true streaming nature of GAU. Figure 5.3 shows that

GAU OPT uses a factor of 4.7 to 3.9 less bandwidth than GAU in simulation with a

factor of 4.2 improvement obtained on a real machine. This supports the claim that GAU

processing can be done without an L2 cache. With a 256 KB L2 cache, the GAU OPT

bandwidth is 174 MB/s. Calculations show that without a heuristic, and without an L2
cache, GAU OPT can meet its real-time requirements with 180 MB/s of main memory
bandwidth (the roughly 18 MB interleaved and padded Gaussian table streamed 10 times per
second). This has important implications for the scalability of servers that process

speech.

Figures 5.1 and 5.2 show dramatic reduction in the cache miss rates in both simulation

and native execution. The L2 native execution results are better than simulation results.

The large variation in the L1 results is due to the 32 byte L1 line size on the R12000 and

also possibly because of an extremely large number of TLB misses. The software TLB

miss handler could easily pollute the L1 cache. The important point is that Figure 5.6

shows that OPT, a version of phased with the GAU OPT blocking optimization, achieves

a slight speedup over original despite performing a larger number of computations.


In summary, to be able to extract parallelism, the feedback loop was broken, which

approximately doubled the GAU workload. With cache optimizations (which are not

possible with feedback), the loss due to the extra GAU workload is recovered and the

exposed parallelism is now open for further optimization.

5.3.2 Parallelization

Based on the percentage of execution time, Amdahl’s law predicts a factor of 1.97

speedup if GAU and HMM processing could be entirely overlapped. It is clear that a

special-purpose architecture for GAU can have significant speedup, as well as power and

scaling benefits. Sphinx was multithreaded to see if there were any practical impediments

to achieving good speedup. The parallel version of Sphinx, called PAR, runs each of

the FE, GAU OPT and HMM phases on separate processors. In effect, this models an

SMP version of Sphinx 3 as well as the case where each processor could be replaced by

a special-purpose accelerator. As shown in Figure 5.6, the parallel version achieves a

speedup of 1.67 over the original sequential version. A custom accelerator will likely be

even better. The HMM phase was further multithreaded to use four processors instead of

one, but the resulting five processor version was slower than the two processor version

due to high synchronization overhead.

5.4 The HMM Phase

The HMM related data structure used in Sphinx consists of two components, the

actual Markov model data and lexical tree information attached to each node. While the

data layout itself seems to be well suited for a Dcache, separating out the lexical and

Markov model information could possibly lead to better cache behavior. Since such a

change would entail major restructuring of the application, it was not studied. HMM

evaluation can also benefit from special-purpose acceleration. To avoid having to

rewrite Sphinx entirely, the HMM related data was transcribed to a new database and the

HMM routine was accelerated in isolation. The results may be seen in Chapter 10.


CHAPTER 6

A CUSTOM GAUSSIAN ACCELERATOR

Chapter 4 introduced the use of multivariate mixture Gaussians in the acoustic model

evaluation of Sphinx 3.2 and indicated that this computation is common to other speech

recognition systems like HTK and the ICRC recognizer [59, 111]. Chapter 5 showed

that 55.5% of the execution time of Sphinx 3.2 was spent in Gaussian computation

when using the Hub-4 speech model. The high percentage of execution time spent

in this computation together with its applicability to a variety of speech recognizers

argues for special acceleration hardware for mixture Gaussians. Accelerators may be

implemented as custom nonprogrammable circuits or as domain specific programmable

processors. The custom circuit option will represent a practical upper bound on achiev-

able performance and energy efficiency. The programmable option which sacrifices some

performance and energy to gain generality will be explored in Chapter 9. This chapter

describes how a high throughput custom datapath is able to achieve area, power and

bandwidth efficiency as well as scalability by means of:

1. Reducing floating point precision.

2. Restructuring the computation.

3. Sharing memory bandwidth.

The Sphinx source code uses floating point computation sparingly, favoring scaled in-

teger arithmetic wherever possible. GAU and FE are the only floating point dominant

computations in Sphinx. An attempt was made to convert GAU to use fixed point integer

arithmetic. This failed because GAU requires a high dynamic range, which cannot

be provided by 32-bit scaled integer arithmetic. Fortunately, the scores of the highly


probable states are typically several orders of magnitude higher than those of the less

likely ones, indicating that a wide range is more important than precision.

Earlier work by Pihl explored the use of special-purpose floating point formats in

Gaussian estimation to save memory bandwidth [77]. Special floating point formats

should be almost invisible to the application so that speech models may be developed

without access to any special hardware. A custom software floating point emulation

library was developed to conduct an empirical search for the precision requirements of

the GAU phase. The library supported multiplication, addition, MAC, and (a − b)^2 oper-

ations on IEEE 754 format floating point numbers. The approach was to experimentally

reduce mantissa and exponent sizes without changing the output results of the Sphinx 3

recognizer. The result was a reduced precision floating point format similar to the IEEE

754 format which has a sign-bit, an 8-bit excess 127 exponent and a hidden one-bit in its

normalized mantissa. Unlike IEEE 754, which has 23 explicit-bits in the mantissa, the

new format used only 12 bits. Conversion between the reduced precision representation

and IEEE 754 was done by truncating the extra mantissa bits when converting from

IEEE 754 to the new format and concatenating additional 0 bits when converting from

the new format to IEEE 754. Such a transformation can be done within a floating point

unit without any changes being visible to the application. Though this work was done

independently, it is worthwhile to note that a previous study arrived at similar conclusions

based on an earlier version of Sphinx [97]. However that research used digit serial

multipliers, which cannot provide the kind of throughput required for GAU computation.

Hence the accelerator discussed here uses fully pipelined reduced precision multipliers

instead.
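A minimal sketch of the truncate-and-extend conversion appears below; it illustrates the idea in software rather than reproducing the emulation library, and the helper names are illustrative:

    #include <stdint.h>
    #include <string.h>

    static uint32_t float_bits(float f)    { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
    static float    bits_float(uint32_t u) { float f;    memcpy(&f, &u, sizeof f); return f; }

    /* IEEE 754 single precision carries 23 explicit mantissa bits; the reduced
     * format keeps the sign, the 8-bit exponent and only the top 12 mantissa
     * bits, so conversion truncates the low 11 bits. */
    static float to_reduced(float ieee)
    {
        return bits_float(float_bits(ieee) & ~((1u << 11) - 1));
    }

    /* Converting back concatenates zero bits, which the truncation above has
     * already left in place, so the stored pattern is reused unchanged. */
    static float to_ieee754(float reduced)
    {
        return reduced;
    }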

Another key insight is that current high performance microprocessors provide a fused

multiply add operation that would benefit GAU. However, GAU also needs an add mul-

tiply (subtract-square) operation. There is scope for floating point circuit improvements

relying on the nature of (a − b)^2 always returning a positive number. Further gains can
be obtained in area, latency, power and the magnitude of the numerical error by fusing
the operations (a − b)^2 × c. This is the approach used in this research.


6.1 Top Level Organization

Figure 6.1 illustrates the system context for the GAU accelerator. Figure 6.2 shows

the details of the accelerator itself. Loops 1, 2 and 3 from the optimized GAU algorithm

in Figure 5.7 are implemented in hardware. The outer loop and the log add step, which

consists of integer subtract, table lookup and integer add, are implemented in software.

The max operation can be folded into the de-normal floating point number handling

section of the floating point adder without additional latency, but empirically it can be

discarded without sacrificing recognition accuracy. The organization in Figure 6.1 is

essentially a decoupled access/execute architecture [88]. The outer loop runs on a host

processor and instructs a DMA engine to transfer X, Mean and Var vectors into the

accelerator’s input memory. A set of 10 input blocks are transferred into the accelerator

memory and retained for the duration of a pass over the entire interleaved Mean/Var

table. The Mean/Var memory is double buffered for simultaneous access by the DMA
engine and the accelerator. The accelerator sends results to an output queue where they
are read by the host processor using its coprocessor access interface.

Figure 6.1. Top Level Organization of Gaussian Estimator

Figure 6.2. Gaussian Coprocessor

6.2 Coprocessor Datapath

Figure 6.2 shows the architecture of the accelerator. The datapath consists of an

(a − b)^2 × c floating point unit, followed by an adder that accumulates the sum as well as a
fused multiply add (a × b + c) unit that performs the final scaling. Given that X, Mean, and

Var are 39-element vectors, a vector style architecture is suggested. The problem comes

in the accumulation step, since this operation depends on the sum from the previous

cycle, and floating point adders have multicycle latencies. For a vector length of N and

an addition latency of M, a straightforward implementation takes (N − 1) ×M cycles.

Binary tree reduction (similar to an optimal merge algorithm) is possible, but even then

the whole loop cannot be pipelined with unit initiation interval.

This problem is solved by reordering Loops 1, 2, 3 to a 2, 3, 1 order. This cal-
culates an (X − M)^2 × V term for each input block while reading out the mean and

variance values just once from the SRAM. Effectively this is an interleaved execution of

10 separate vectors on a single function unit, which leaves enough time to do a floating


point addition of a partial sum term before the next term arrives for that vector. The cost is

10 internal registers to maintain partial sums. Loops 2,3,1 can now be pipelined with unit

initiation interval. In the original algorithm, the Mean/Var SRAM is accessed every cycle

whereas with the loop interchanged version this 64-bit wide SRAM is accessed only once

every 10 cycles. Since SRAM read current is comparable to function unit current in the

CMOS technology used for this design, the loop interchange also contributes significant

savings in power consumption.
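A software view of the interchanged loops is sketched below, assuming the data layout and names of Figure 5.7; the sum[] array stands in for the 10 partial-sum registers, and the max() clamp is omitted as the text notes it can be discarded:

    for (c = 0; c < 8; c++)                                  /* Loop 2 */
    {
        float sum[10] = {0.0f};                              /* 10 partial sums          */
        for (i = 0; i < 39; i++)                             /* Loop 3 */
        {
            float mean = Gautable[senone][c].vector[i].Mean; /* one SRAM read ...        */
            float var  = Gautable[senone][c].vector[i].Var;  /* ... reused by 10 blocks  */
            for (block = 0; block < 10; block++)             /* Loop 1, now innermost    */
            {
                float t = X[block][i] - mean;
                sum[block] += t * t * var;                   /* (a-b)^2 * c, accumulate  */
            }
        }
        for (block = 0; block < 10; block++)
            score[senone][block] = log_add(score[senone][block],
                sum[block] * Gautable[senone][c].FinalScale
                           + Gautable[senone][c].FinalWeight);
    }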

The Final Sigma unit in Figure 6.2 works in a similar manner, except that instead of a

floating point adder, it uses a fused multiply add unit. It scales the sum and adds the final

weight. This unit has a fairly low utilization since it receives only 8 × 10 inputs every

39 × 10 × 8 cycles. To save power this unit is disabled when it is idle. In a multichannel

configuration it is possible to share this unit between multiple channels. To reduce the

number of reads the processor needs to perform to fetch results from the accelerator, this

unit may be made to accumulate the final score. This also serves to reduce the outgoing

bandwidth from the processor by a factor of eight. In that case, due to the interleaved

execution this unit also requires 10 intermediate sum registers. Log domain addition can

be implemented using an integer subtract, table lookup and an integer add operation.

The state machine needs to be adapted to recirculate the results through the integer

add/subtract unit within the floating point adder. The lookup table used for extrapolation

is constant and can therefore be implemented as optimized logic within the state machine.

In this design, log domain addition is implemented in software.
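For reference, a log-domain addition of two scaled-integer log scores can be sketched as follows; the table size and scaling are assumptions for illustration, not the constants used in Sphinx:

    /* Returns log(a + b) given la = log(a) and lb = log(b) as scaled integers,
     * using log(a + b) = max(la, lb) + log(1 + e^-(|la - lb|)), with the
     * second term taken from a precomputed table indexed by the difference. */
    int log_add(int la, int lb)
    {
        int d = la - lb;                     /* integer subtract               */
        if (d < 0) { d = -d; la = lb; }      /* keep the larger score in la    */
        if (d >= LOG_TABLE_SIZE)             /* smaller term is negligible     */
            return la;
        return la + log_table[d];            /* table lookup, then integer add */
    }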

6.3 Implementation

The datapath shown in Figure 6.2 was implemented using a datapath description

language (Synopsys Module Compiler Language) and is subsequently synthesized for

a 0.25µ CMOS process. The control sections were written in Verilog and synthesized

using the Synopsys Design Compiler. The gate level netlist is then annotated with worst

case wire loads calculated using the same wire load model used for synthesis. The netlist

is then simulated at the Spice level using Synopsys Nanosim and transistor parameters

extracted for the same 0.25µ MOSIS process. Energy consumption is estimated from


the RMS supply current computed by Spice. The unoptimized fully pipelined design can

operate above 300 MHz at the nominal voltage of 2.5 volts with unit initiation interval. At

this frequency the performance exceeds the real-time requirements for GAU, indicating

an opportunity to further reduce power. A lower frequency and voltage can be used to

further reduce power.

A low power processor similar to a MIPS R4600 was designed for use as a control

processor. The MIPS was chosen because it is commonly used in embedded systems

and also because high performance implementations of the MIPS ISA, like the R12K,

were readily available for experiments. The design of this processor was done in such a

way that it could be easily modified for tight integration with ASIC coprocessors. The

Gaussian accelerator was designed and attached to the control processor as a custom

coprocessor, and the combination was then simulated. The control processor is a sim-

ple in-order design that uses a blocking L1 Dcache and has no L2 cache. To support

the equivalent of multiple outstanding loads, it uses the MIPS coprocessor interface to

directly submit DMA requests to a low priority queue in the on-chip memory controller.

The queue supports 16 outstanding low priority block read requests with block sizes that

are multiples of 128 bytes. A load request specifies a ROM address and a destination –

one of the Feat, Mean or Var SRAMs. The memory controller initiates a queued memory

read and transfers the data directly to the requested SRAM index. A more capable out

of order processor could initiate the loads directly. Software running on the processor

core does the equivalent of the GAU OPT phase. It accumulates 100 ms or 10 frames

of speech feature vectors (1560 bytes) into the Feat SRAM whenever the accelerator

has finished processing the previous block of input. Currently, the accelerator functions

faster than its real-time requirement. It is possible to slow down the accelerator so that

it completes the processing of each block just by the time the next block of input is

ready, but this has not been attempted. The data transfer uses the memory controller

queue interface. Next, it loads two interleaved Mean/Var vectors from ROM into the

corresponding SRAM using the queue interface. A single transfer in this case is 640

bytes. The Mean/Var SRAM is double buffered to hide the memory latency. Initially,

the software fills both the buffers. It then queues up a series of vector execute commands


to the control logic of the Gaussian accelerator. A single command corresponds to

executing the interchanged loops 2,3,1. The processor then proceeds to read results from

the output queue of the Gaussian accelerator. When 10 results have been read, it is time

to switch to the next Mean/Var vector and refill the used up half of the Mean/Var SRAM.

This process continues until the end of the Gaussian ROM is reached. When one cache

line of results has been accumulated, they are written to the output queue where another

phase or an I/O interface can read them.
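The control flow just described can be summarized by the hypothetical driver loop below; dma_queue_read(), gau_exec() and gau_read_result() stand in for the memory-controller queue and coprocessor-interface operations described above and are not real API names, and the constants are illustrative:

    void gau_opt_driver(void)
    {
        int buf = 0;
        dma_queue_read(FEAT_SRAM, feat_addr, 1560);              /* 10 frames of features   */
        dma_queue_read(MEANVAR_SRAM(0), gauss_rom_addr(0), 640); /* fill both halves of the */
        dma_queue_read(MEANVAR_SRAM(1), gauss_rom_addr(1), 640); /* double-buffered SRAM    */

        for (int v = 0; v < NUM_TRANSFERS; v++) {
            gau_exec(buf);                       /* run the interchanged loops 2,3,1       */
            for (int i = 0; i < 10; i++)         /* one result per buffered input frame    */
                results[v][i] = gau_read_result();
            if (v + 2 < NUM_TRANSFERS)           /* refill the half just consumed while    */
                dma_queue_read(MEANVAR_SRAM(buf), gauss_rom_addr(v + 2), 640); /* the other */
            buf ^= 1;                            /* half is being processed                */
        }
    }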

Calculations based on the throughput of the accelerator showed that it needed to

operate at 202 MHz to achieve real-time speech processing. To simplify the electrical

interface between the processor and the coprocessor, both circuits need to operate at the

same clock frequency. Since the processor runs a general purpose operating system,

events like clock ticks and background tasks sometimes interrupt the main program that

transfers data between main memory and the input and output queues. Additional head-

room is required so that these interruptions do not prevent real-time processing of the

speech data. The extra performance required from the processor depends on the mix

of control tasks running on the processor. When the accelerator is scaled to process

multiple channels the processor needs to have commensurate processing ability too. So

the operating frequency of the system was chosen to be as high as possible subject to

the limitations of the 0.25µ process. The maximum frequency at which the circuits were

stable was 300 MHz. A cycle accurate simulator was developed and validated by running

it in lock step with the processor’s HDL model. The simulator was detailed enough to

boot the SGI Linux 2.5 operating system and run user applications in multitasking mode.

The resulting system accurately models the architecture depicted in Figures 6.2 and 6.1.

The GAU OPT application for this system is a simple 250 line C program with fewer than

10 lines of assembly language for the coprocessor interface. Loop unrolling and double

buffering were done by hand in C. The application was compiled using MIPS GCC 3.1

and run as a user application under Linux inside the simulator. It was able to process 100

ms samples of a single channel in 67.3 ms and scale up to 10 channels in real time. The

actual data may be seen in Section 6.5.2.


6.4 Applications

Though the Gaussian estimator was designed for Sphinx 3 and the MIPS-like embed-

ded processor, the results are widely applicable to other architectures and recognizers.

There are several levels at which this system may be integrated into a speech recognition

task pipeline similar to Phased. For example, an intelligent microphone may be created

by using a simple low power DSP to handle the A/D conversion and FE phase, and then

a GAU coprocessor attached to the DSP may be used for probability estimation. The

probability estimates can then be sent to a high-end processor or custom accelerator that

does language model computation. The GAU coprocessor can then hide more than 50%

of the compute effort required for speech recognition. On desktop systems, the Gaussian

accelerator may be part of a sound card or the Gaussian accelerator may be directly

attached to the main processor. On commercial voice servers, the Gaussian estimator

may be directly built into the line cards that interface to the telephone network thereby

freeing up server resources for language model and application processing. This also has

important implications for server scalability, discussed in Section 6.5.2.

6.5 Accelerator Evaluation

The main contributions of the coprocessor architecture are energy savings, server

scalability and bandwidth savings. Each of these advantages is elaborated in the follow-

ing sections.

6.5.1 Energy Savings

The Spice simulation results from the fully synthesized coprocessor architecture

were compared against an actual 2.4 GHz Pentium 4 system that was modified to allow

accurate measurement of processor power. Without considering the power consumed by

main memory, the GAU accelerator consumed 1.8 watts while the Pentium 4 consumed

52.3 watts during Gaussian computation, representing a 29-fold improvement. The

performance of the Pentium 4 system exceeded real-time demands by a factor of 1.6

while the coprocessor approach exceeded real time by 1.55. However the Pentium

4 is implemented in a highly tuned 0.13µ process whereas the GAU accelerator was


automatically synthesized for a generally available TSMC 0.25µ process. When normal-

izing for process differences, the advantage of the GAU coprocessor approach increases

significantly. After normalizing for the process, the coprocessor’s throughput is 187%

higher than the Pentium 4, while consuming 271 times less energy. It is important to

note that energy consumption vs. performance is a common design trade-off. A more

valid comparison is the energy-delay product. The GAU coprocessor improves upon the

energy-delay product of the Pentium 4 processor by a factor of 507.
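As a rough consistency check, reading the normalized throughput advantage above as a 1.87x ratio, the energy-delay improvement is approximately the product of the energy and delay advantages: 271 × 1.87 ≈ 507.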

However the processor is only part of any system. Main memory is an impor-

tant consideration as well. This includes the power dissipated by a memory controller,

DRAM chips and the memory bus. It is difficult to estimate this accurately. Since

the XScale processor has an on-chip memory controller, the power consumption on an

XScale system accessing DRAM at peak bandwidth was measured. The main memory

component of power consumed by Gaussian computation was calculated based on that

measurement at the rate of 0.47 W per 64 MB/s of DRAM bandwidth. When the memory

is included, the GAU coprocessor approach improves upon the Pentium’s energy-delay

product by a factor of 196 and has an energy advantage of a factor of 104, and the

throughput performance stays the same as the processor-only results.

A Pentium 4 was used as the comparison because embedded processors like the

XScale do not have either the floating point instructions or the performance required

for the benchmarks. Software emulated floating point could possibly bloat the energy

delay product of the XScale and make a meaningful comparison impossible. Another

reason for the choice was simply the technical feasibility of measuring processor power.

For example, the Intel XScale development platform used in this research had a pro-

cessor module board with FPGA, Flash memory, etc., integrated on it, and isolating the

processor power was difficult. The particular Pentium 4 system was chosen because

the layout of the printed circuit board permitted modifications to permit measuring the

energy consumption of the processor core alone.


6.5.2 Scalability

As natural human interfaces become more common, scalability of servers that pro-

cess speech will become an important issue. This will be particularly important for

systems like call centers and collaborative work environments. In addition to having

energy advantages, the design is also scalable. Figure 6.3 shows that the system can be

scaled to process five independent speech channels in real time. The main limitation is

the in-order processor with its simple blocking cache mode. This is evident from the

difference in performance between the first and second bars in each data set. At six

channels, the system is seen to be slightly slower than real time. However, an ideal L1

D-cache which always reports a cache-hit and never writes data back to memory is seen

to scale up to 10 channels or more. A Final Sigma stage that implements log domain

addition enables the design to scale even with blocking caches due to the removal of

destructive interference between the cache and the DMA engine. The Final Sigma stage

reduces the number of results that need to be stored in the cache by a factor of eight. With

this optimization the system is able to process 10 or more channels of speech signals.

Figure 6.3. Channel Scaling (processing time per 10 frames of speech, in milliseconds, versus the number of channels from 1 to 10, for the No Sigma/Real DL1, No Sigma/Ideal DL1 and Sigma/Real DL1 configurations)

For embedded designs, the power required to support multiple speech channels may be

excessive, but such an organization is likely in a server. One channel of speech feature

vectors contributes about 16 KB/s to the memory bandwidth. The outgoing probabilities

consume 2.3 MB/s.

By setting a threshold on acceptable Gaussian scores and selectively sending out the

scores, this can be significantly reduced. The dominant bandwidth component is still

the Gaussian table. Additional Feat SRAMs and Gaussian accelerator datapaths may be

included. Since the Gaussian tables are common for all channels, all datapaths can share

the same Var and Mean SRAMs and thereby reuse the same 180 MB/s vector stream.

With a higher frequency implementation of the Gaussian datapath, multiple channels can

also be multiplexed on the same datapath. In a server, the Gaussian estimation of several

channels can be delegated to a line card, which operates out of its own 18 MB Gaussian

ROM. The partitioning of bandwidth, the 50% reduction in server workload per channel, and
the reduced cache pollution all lead to improved server scalability.

6.5.3 Bandwidth Savings

The Hub-4 speech model used in this study has 49,152 interleaved and padded Mean/

Var vectors each occupying three L2 cache lines of 128 bytes or a total of 384 bytes per

pair of vectors. Thus the total size of the Gaussian table is 18 MB. Sphinx processes this

table 100 times every second, but uses a subvector quantization heuristic to cut down the

processing requirement, which in turn leads to lower DRAM bandwidth utilization. To

guarantee real-time processing, the Gaussian accelerator may be used at a low power for

brute force evaluation. Because of the blocking optimization GAU OPT, the data needs

to be processed only 10 times per second with a peak bandwidth of 180 MB/s, which can

be further reduced by applying the subvector quantization (nonfeedback) heuristics in

Sphinx. Not only does this design bring the bandwidth requirements to limits possible on

embedded systems, it also drastically improves the power consumption. On a 400 MHz

Intel XScale development system where the processor itself consumes less than 1 W,

peak memory bandwidth of 64 MB/s was obtained. Achieving this bandwidth consumed

an additional 0.47 W. The factor of four or more bandwidth savings is significant for


the embedded space since it indicates that a 52-watt server can be replaced by a 1-watt

embedded processor.
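To make the arithmetic explicit: 49,152 padded vector pairs × 384 bytes ≈ 18 MB, and streaming that table 10 times per second requires about 180 MB/s, versus roughly 1.8 GB/s if the full table were streamed at the original rate of 100 passes per second.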

The Gaussian coprocessor takes advantage of the simple loop structure and the lim-

ited precision requirements of the GAU algorithm to make real-time processing of speech

signals possible at greatly reduced power budget. However, its design is quite inflexible

and difficult to adapt to other algorithms like neural net evaluation which involve similar

loops and summation operations. The experience underscores the potential benefits

of programmable accelerators which can use domain specific optimizations to provide

power and performance advantages similar to ASICs.


CHAPTER 7

VISUAL FEATURE RECOGNITION

ALGORITHMS

Visual feature recognition systems vary significantly based on the type of feature that

is being recognized. Relatively simple recognizers are regularly employed in industrial

visual inspection systems. On the other hand, human face recognition is an extremely

complex task given the huge possibility space of facial features and skin tones. Facial

recognition systems clearly have utility in security and surveillance domains, and other

visual recognizers play key roles in gesture interfaces, lip reading to support speech

recognition, and robotics. Interest in face recognition is motivated by the difficulty of

the problem, which cannot be currently supported by embedded systems. This is evident

from Figure 1.1, which showed that a high performance 4.8 GHz processor was required

to satisfy the real-time requirements of the FaceRec application. Furthermore the face

detection algorithms like the neural network based Rowley detector and the rectangle

feature based Viola/Jones detector used in this study are generic approaches for object

detection [83, 103]. They appear to be easily adapted to address other visual feature

recognition tasks. The main differences for these other tasks are a different training

regimen and different frame rate requirements. For example, the Rowley method of

face detection described in Section 7.3 has been applied to license plate detection [83].

Thus, research in accelerating face detection and recognition also helps the detection and

recognition of other objects.

The FaceRec application studied here can be viewed as a pipeline of three major

functional components. A flesh tone detector is used to isolate areas of a frame where

a face is likely to be present. The next stage is a face detector that determines whether

a face is present or not in each area of interest. The final phase is a face recognizer.


Each of these components is based on well known algorithms that have been adapted

or reimplemented to fit into a unified framework. Some algorithmic optimization and

restructuring has been done to suit benchmarking purposes, but the basic approach has

been developed by other researchers.

Interestingly, the face recognition system, when viewed from a structural perspective,
comprises a series of increasingly discriminating filters. Early stages of the sequence

must inherently filter the entire image. As the process proceeds downstream, each stage

needs to examine less image data since previous stages have eliminated certain areas from

the probable candidate list. The result is an interesting balance of simple algorithms that

analyze lots of data early in the sequence and more sophisticated algorithms that only

need to analyze limited amounts of data late in the process. The result is a structure that

is amenable for implementation as an embedded system.

Figure 7.1 shows the major steps in face recognition. The input is a low-resolution

video stream such as 320 × 200 pixel images at 10 frames per second. The stream

is processed one frame at a time, and sufficient state is maintained to perform history

sensitive tasks like motion tracking. The process is essentially a pipeline of filters that

reduce the data and attach attributes to frames for the use of down stream components.

Typically each filter is invoked at the frame rate. This underlines the soft real-time nature

of this application. Additional data is required since filters may access large databases

or internal tables. These additional data streams add to the aggregate bandwidth require-

ment of the system. The periodic nature of the application domain often makes it possible

to easily estimate the worst case requirements.

Object recognition typically proceeds in two steps: object detection and the actual

object identification. Most approaches to object identification require a clearly marked

area, normalized to a particular size, and the location of key features. Object detectors

find the area where the desired feature is likely to reside, scale the area to meet the

normalization requirement, and then create a location and boundary description for that

area. False positives and negatives occur, but the algorithms try to minimize their

occurrence.


Figure 7.1. Algorithmic Stages of a Face Recognizer

Object detectors also often work at a fixed scale. The detector is swept across the

image recording all positions at which a detection was reported. The image is then

subsampled or scaled down by a small factor (typically 0.8), and the process is repeated

until the frame is below the size of the detector. A decision procedure is then applied

to all the predicted hits to decide which ones are the most likely. Detectors often have

much lower compute cost per subwindow than their corresponding identifying routines.

Since they are swept across the entire image, a significant portion of the application’s

execution time might be spent in the detector. In contrast, even though identifying filters

are more compute intensive, they are applied only to the high probability regions of the

frame, so their contribution to the overall execution time might be low. Though object

detectors are less compute intensive, they are much more difficult to design due to their

generality. For example a face identifier chooses from one of N known faces, but a face

detector has to distinguish between the infinite sets of faces and nonfaces.

Since detection is time consuming, it is common to structure an object detector as a

cascade of filters with cheaper heuristics upstream identifying potential regions for more

expensive heuristics downstream. An extreme case of this is the Viola/Jones method,

which trains a sequence of about 200 increasingly discriminating filters [103]. A more

common approach when dealing with faces and gestures is to identify the flesh colored

regions of an image and apply a more sophisticated detector to those regions.


The identifier receives candidate regions from the detector along with other infor-

mation like probability, scale and feature locations. It typically employs some type of

distance metric from known references to provide a positive identification. In the face

recognizer, the first level of detection is provided by flesh toning which is followed by

an image segmenting algorithm. These are followed in turn by a more complex detector,

voting for high probability regions, an eye locater and finally a face identifier.

7.1 Flesh Toning

Flesh toning identifies flesh colored pixels in an image. The commonly used RGB

color space is not well suited for flesh toning because skin color occupies a wide range

in primary color space. Variations due to lighting and ethnicity are hard to deal with and

skin-like colors on walls and clothing are harder to discriminate. However, skin colors

are tightly clustered in color spaces like HSV. Flesh toning can be done by converting

pixels from sample images into the chosen color space and making a scatter plot with

two colors, one for flesh pixels and one for nonflesh pixels. A boundary is then drawn

around flesh tone clusters. This boundary is then approximated by curves, which can

be described by simple geometric equations. In the image under test, any pixel that lies

inside this new approximated but easily described boundary is considered to be a flesh

pixel.

The base algorithm involves transforming the RGB color space into the NCC (Nor-

malized Color Coordinates) space using the simple equations r = R/(R + G + B),
g = G/(R + G + B). In this space flesh pixels occupy a region bounded by two
parabolas and maximum and minimum x-axis values. Applying two inequalities of the
form ax^2 + bx + c to the color coordinates will predict if the pixel is flesh colored or not

[90]. While this algorithm is simple and achieves good discrimination, it was observed

that it tends to classify certain shades of blue found in clothing as a skin color. A second

algorithm was used to transform the RGB value of a pixel to an HSV (Hue, Saturation,

Value/Luminance) value. In the HSV space, flesh color is tightly clustered allowing the

use of four simple inequalities for flesh tone [14]. In practice the HSV based algorithm

generates too many false positives. However, the consensus of the HSV and NCC space


algorithms produces good results. The output of this phase is a bit mask of the same size

as the image where a bit is set if the corresponding pixel is flesh colored.
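A minimal sketch of the NCC-space test appears below; the parabola coefficients and range bounds (A1, B1, C1, A2, B2, C2, R_MIN, R_MAX) are placeholders for trained constants that are not given in the text:

    /* Returns nonzero if the pixel is classified as flesh in NCC space. The HSV
     * test described above is applied in the same per-pixel fashion and the two
     * decisions are AND-ed together to form the consensus bit mask. */
    int is_flesh_ncc(unsigned char R, unsigned char G, unsigned char B)
    {
        float s = (float)R + (float)G + (float)B;
        if (s == 0.0f)
            return 0;
        float r = R / s;                           /* normalized color coordinates */
        float g = G / s;
        float upper = A1 * r * r + B1 * r + C1;    /* upper bounding parabola      */
        float lower = A2 * r * r + B2 * r + C2;    /* lower bounding parabola      */
        return r > R_MIN && r < R_MAX && g < upper && g > lower;
    }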

7.2 Segmentation

Segmentation is the process of clumping together individual pixels into regions where

an object might be found. A common approach is to do a connected component analysis,

which typically forms irregular regions. Since the Viola and Rowley algorithms used

for face detection need rectangular regions, instead of connected component analysis, a

simple algorithm to cut apart the flesh tone bit mask into rectangles was used instead

[103, 83].

Two operators from mathematical morphology are applied to the bit mask: a 3 × 3

erosion operator followed by a 5×5 dilation operator. This has the effect of cutting away

small connections and regions that are likely to be false positives and then smoothing the

bit mask by filling in any small holes in the middle of an otherwise acceptable sized

region. A logical OR of all the rows in the image is then performed to make a single row.

This step is called vertical separation. Runs of “1” values in the single row represent

vertical stripes of the image that contain objects of interest. Runs of “0” values represent

vertical stripes that may be discarded. For each vertical stripe, the columns are logically

OR-ed to create a single column. This is called horizontal separation. Runs of “1”

represent the region of interest. This algorithm can be recursively applied to isolate the

rectangular regions of interest. In the actual implementation, the horizontal separation

steps for all the vertical stripes are done together in an interleaved manner. This has the

effect of converting the column walk across the bitmap into a row walk giving better

cache performance. Recursion is stopped after two levels since this has empirically

provided adequate results. The flesh tone bitmap is discarded at this stage. The output

of this stage is a list of coordinates of the top left and bottom right corners of rectangular

regions of interest and a gray scale version of the image.
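The vertical separation step can be sketched as follows, assuming the bit mask is stored one byte per pixel and that W and H are compile-time constants; the function and array names are illustrative:

    /* OR all rows of the (eroded and dilated) flesh mask into a single row, then
     * record runs of 1s as [start, end] column ranges: the vertical stripes that
     * may contain objects of interest. Horizontal separation repeats the idea on
     * the columns of each stripe. */
    int vertical_separation(const unsigned char mask[H][W], int stripes[][2])
    {
        unsigned char row_or[W] = {0};
        int n = 0;

        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++)
                row_or[x] |= mask[y][x];

        for (int x = 0; x < W; x++) {
            if (row_or[x] && (x == 0 || !row_or[x - 1]))
                stripes[n][0] = x;                  /* a run of 1s begins */
            if (row_or[x] && (x == W - 1 || !row_or[x + 1]))
                stripes[n++][1] = x;                /* the run ends       */
        }
        return n;                                   /* number of stripes  */
    }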


7.3 Rowley Face Detector

Henry Rowley’s neural net based face detector is well known as a pioneering con-

tribution [83]. Its implementation was provided by the Robotics Institute at CMU. This

detector is designed to determine if a 30 × 30 pixel image contains a face or not. Face

detection is done by sweeping the detector over the image and computing the decision

at each pixel location. Then the image is scaled and reduced in size by a factor of 0.8

and the procedure is repeated. The resulting series of images and detection locations is

called an image pyramid. In the case of real faces, a detection will be reported at several

nearby pixel locations at one scale and at corresponding locations in nearby scales. False

positives do not usually happen with this regularity. Hence a voting algorithm can be

applied to the image pyramid to decide the site of any true detections.

In each window the detector first applies a correction for varying lighting conditions

followed by histogram equalization to expand the range of intensity values. The prepro-

cessed window is then applied to a multilayer neural network where the input layer has

retinal connections to the image window. Neural net evaluation can be represented as:

Y = tanh( Σ_{i=1}^{N} W[i] × Image[Connection[i]] )

W[] is a set of weights associated with each neural connection and Connection[]

represents the image locations to which the neuron is connected. In practice, Image

contains additional storage following the actual stored image, and the outputs of neurons

are stored to the additional locations. Thus a multilayer network can be evaluated as

if it is a flat retinally connected array of neurons if it is ensured that neurons in deeper

layers follow neurons closer to the retinal layer. The tanh function acts as a sigmoid

shaped nonlinearity and computing it is expensive. Rowley’s original implementation

uses the tanh() implementation provided by the C-library. In the version developed for

this dissertation, it was replaced with an 800 entry lookup table which has produced

identical output to the original for the test images. This simple optimization improved

the performance of the algorithm by a factor of 2.5 on a 2.4 GHz Pentium processor.
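A sketch of one neuron evaluation with the table-based nonlinearity is shown below; the table range and indexing scheme are assumptions for illustration rather than the layout used in the dissertation's implementation:

    /* Indirect dot product over the neuron's connections, followed by a lookup
     * into an 800-entry tanh table covering [-TANH_RANGE, TANH_RANGE]. */
    float eval_neuron(const float *W, const int *Connection, int N, const float *Image)
    {
        float sum = 0.0f;
        for (int i = 0; i < N; i++)
            sum += W[i] * Image[Connection[i]];

        if (sum <= -TANH_RANGE) return -1.0f;    /* saturate outside the table */
        if (sum >=  TANH_RANGE) return  1.0f;
        int idx = (int)((sum + TANH_RANGE) * (800 - 1) / (2.0f * TANH_RANGE));
        return tanh_table[idx];                  /* precomputed tanh() values  */
    }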


The retinal layer is followed by a hidden layer comprised of three classes of units.

Four units look at 10 × 10 subwindows, 16 units look at 5 × 5 subwindows and 6 units

look at overlapping 30 × 5 horizontal stripes. The final output of the network indicates

if the 30 × 30 window contains a face or not.

The voting algorithm notes the location and scale of each detection in an image

pyramid. The next step, called spreading, replaces each location in the pyramid with the

count of the number of detections in a neighborhood. The neighborhood of a location

extends an equal number of pixels along the position and scale axes. The values are

then thresholded and the centroids of all remaining locations are found. Centroids are

examined in descending order of the number of detections per centroid and other cen-

troids that represent a face overlapping the current face are eliminated. The remaining

centroids represent the location of faces found in the image. To further reduce false

positives, multiple neural nets each trained separately may be applied to the image and

their consensus can represent a more accurate detection.

7.4 Viola and Jones’ Detector

Viola and Jones present a new and radically faster approach to face detection based

on the AdaBoost algorithm from machine learning [103]. They claim a factor of 15

speedup over the Rowley detector for their implementation, but they run their detector

directly on entire images without using flesh toning to cut down the search area. Since

their source code is proprietary, their algorithm was reimplemented by the author and

Robert Evans based on example code obtained from Peter Carbonetto at the University

of British Columbia. The visual feature recognizer used for this dissertation research uses

flesh-toning for both the Rowley and the Viola/Jones detectors. Flesh-toning cuts down

the region of the image that the detector needs to process. Under these circumstances a

visual feature recognition system using Rowley’s method performs as well as a system

that uses Viola and Jones’ method. To understand the Viola/Jones detector, the concept

of boosting needs to be explained first.

A random guess to a yes or no question stands the chance of being correct 50% of the

time. If a heuristic can improve the odds by a very small amount then it is called a weak


learner. It is possible to generate weak learners for several tasks in a semiautomated

manner by enumerating a huge set of heuristics generated on the basis of combinations

of simple rules and evaluating their performance on a set of samples. A heuristic that can

improve the odds of a guess by a significant amount is called a strong learner. Boosting

is a method of combining several weak learners to generate a strong learner. AdaBoost is

a well known algorithm to generate strong learners from weak learners, while providing

statistical bounds on the training and generalization error of the algorithm [86].

The weak learners in the Viola/Jones algorithm are based on features of three kinds.

A two-rectangle feature is the difference between the sums of the pixel values in two adjacent
rectangular windows. A three-rectangle feature considers three adjacent rectangles and
computes the difference between the sum of the pixels in the extreme rectangles and the
sum of the pixels in the middle rectangle. A four-rectangle feature considers a 2 × 2 set
of rectangles and computes the difference between the sum of pixels in the rectangles that

constitute the main and off diagonals. For a 24 × 14 subwindow there could be more

than 180,000 such features. The task of the AdaBoost algorithm is to pick a few hundred

features and assign weights to each using a set of training images. Face detection is

reduced to computing the weighted sum of the chosen rectangle-features and applying a

threshold. As in the case of the Rowley algorithm a 30 × 30 detector is swept over every

pixel location in the image, and the image is rescaled. Rowley’s voting algorithm is used

to decide the final detection locations.

Computing rectangle features is a simple but slow operation based on the sum or

difference of pixels in adjacent rectangular regions. Recomputing these sums for each

pixel location is very expensive. A major contribution of the Viola/Jones approach is

an intermediate image representation called the integral image. The sum of the pixels

in a rectangular window can be computed easily using the intermediate representation.

The integral image value at pixel location (x,y) in an image is defined as the sum of

all pixels to the left and above the pixel (x,y). This is computationally prohibitive. By

expressing the same relationship as a pair of recurrences, it is possible to compute the

integral image with just one pass over the image. Given the integral image, computing a

feature F reduces to:

S = Σ_{i=1}^{9} W[i] × IntegralImage[F.Index[i]]

F.score = abs(S − F.mean_face) < abs(S − F.mean_nonface)

W[] is a set of weights that depends only on the type of the feature being computed.

These are known constants unlike the trained weights of a neural network. Similar to

neural networks F.Index[] is a set of indices denoting connections to specific locations

within the integral image. These are trained for each selected feature. F.mean face

represents the average distance from feature F to a set of rectangles known to contain

faces. Similarly F.mean nonface represents the average distance from feature F to a

set of rectangles known to be devoid of faces. Thus the score for the feature depends

on whether the feature is closer to the population of faces or nonfaces. The decision of

whether an image window contains a face or not is based on the computation:

IsFace = ( Σ_{i=1}^{N} F[i].score ) > threshold

N is the number of features used for recognition and threshold is determined by the

AdaBoost algorithm.
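The integral image recurrences and the feature score above can be sketched as follows; the Feature structure and image dimensions are illustrative, and W[] holds the known per-feature-type constants mentioned in the text:

    #include <math.h>

    typedef struct {
        float W[9];             /* constant weights for the feature type        */
        int   Index[9];         /* trained connections into the integral image  */
        float mean_face;        /* average distance to the face population      */
        float mean_nonface;     /* average distance to the nonface population   */
    } Feature;

    /* One pass: ii(x,y) = ii(x,y-1) + s(x,y), where s(x,y) is the cumulative
     * sum of row y up to column x. */
    void make_integral_image(const unsigned char *img, unsigned int *ii, int w, int h)
    {
        for (int y = 0; y < h; y++) {
            unsigned int s = 0;
            for (int x = 0; x < w; x++) {
                s += img[y * w + x];
                ii[y * w + x] = (y > 0 ? ii[(y - 1) * w + x] : 0) + s;
            }
        }
    }

    /* F.score: 1 if the feature response is closer to the face population. */
    int feature_score(const Feature *F, const unsigned int *IntegralImage)
    {
        float S = 0.0f;
        for (int i = 0; i < 9; i++)
            S += F->W[i] * IntegralImage[F->Index[i]];
        return fabsf(S - F->mean_face) < fabsf(S - F->mean_nonface);
    }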

The original slow approach described in the Viola/Jones paper uses 200 features.

They then go on to describe a faster approach where they cascade many such detectors

with more complex detectors following simpler ones. A window is passed to a detector

only if it was not rejected by the preceding detector. Since training this cascade is a

laborious process, the workload characteristics of this algorithm are modeled with a 100

feature detector.

7.5 Eigen Faces

Eigenfaces is a well known Principal Component Analysis (PCA) based face recog-

nition algorithm developed by researchers at MIT [99]. A reimplementation of the

Eigenfaces algorithm from researchers at Colorado State University was used in this


research [28]. Though the mathematical underpinnings of Eigenfaces are complex, the

entire algorithm is simple and has a structure quite amenable to streaming and high

statically schedulable ILP. Training images are represented as a set of flattened vectors

and assembled together into a single matrix. The Eigen vectors of the matrix are then

extracted and stored in a database. The training face images are projected onto a feature

space, called face space, defined by the Eigen vectors. This captures the variation

between the set of faces without emphasis on any one facial region like the eyes or

nose. The projected face space representation of each training image is also saved to a

database. To identify a face, the test image is projected to face space using the saved

Eigen vectors. The projected test image is then compared against each saved projected

training image for similarity. The identity of the person in the test image is assumed to be

the same as the person depicted in the most similar training image. The actual algorithm

that defines the face space is:

Make Eigen Vectors(ImageList, N, M): ImageList is a set of N training images,

where each image is W × H pixels. M is the number of Eigen vectors that needs to be

generated.

1. Flatten each image into a WH element vector by concatenating all the rows. Let

ImageMatrix be the N ×WH matrix containing all the flattened images.

2. Sum up all the rows of ImageMatrix and divide by N to get an average flattened

image. Call this WH element vector ψ.

3. Subtract the average image ψ from the flattened images in ImageMatrix. Let the

new N ×WH matrix be φ.

4. Compute dot products of all possible image pairs. Let L be the new N ×N matrix

where L[i][j] = dot product of φ[i] and φ[j].

5. Compute the N Eigen values and corresponding Eigen vectors of L. Pick the M

Eigen vectors corresponding to the highest Eigen values. Each Eigen Vector is N

elements long.


6. Do a matrix multiplication of each of the selected M Eigen vectors against φ and

save the resulting set of 1×WH sized matrices as a combined M ×WH element

EigenMatrix in a database. Save the average image ψ also to the database.

The projection algorithm follows:

Project to Face Space(Image): Image is W ×H pixels in size.

1. Let img be the flattened WH element vector form of Image.

2. Load the average image ψ and the EigenMatrix from the database.

3. Subtract the average image ψ from img to create a new image img′.

4. Take the dot product of img′ against each row of EigenMatrix to obtain an M element vector img′′.

5. Let norm = √( Σ_{i=1}^{M} img′′[i] × img′′[i] ). Divide each element of img′′ by norm.
This is the face space representation of Image.
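A minimal sketch of Project to Face Space follows, assuming the database has already been loaded into psi[] and EigenMatrix[][] and that WH and M are compile-time constants; the names are illustrative:

    #include <math.h>

    void project_to_face_space(const float img[WH], const float psi[WH],
                               const float EigenMatrix[M][WH], float out[M])
    {
        float diff[WH];
        for (int i = 0; i < WH; i++)          /* step 3: subtract the average image  */
            diff[i] = img[i] - psi[i];

        for (int m = 0; m < M; m++) {         /* step 4: M dot products of length WH */
            float dot = 0.0f;
            for (int i = 0; i < WH; i++)
                dot += diff[i] * EigenMatrix[m][i];
            out[m] = dot;
        }

        float norm = 0.0f;                    /* step 5: normalize the projection    */
        for (int m = 0; m < M; m++)
            norm += out[m] * out[m];
        norm = sqrtf(norm);
        for (int m = 0; m < M; m++)
            out[m] /= norm;
    }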

Learning is a matter of projecting all the known faces to the face space and saving the

projected representations and the identity of each person.

Learn Faces(ImageList, N, M): ImageList is a set of N training images, where each

image is W ×H pixels. A person’s name is attached to each image. M is the number of

Eigen vectors needed.

1. Call Make Eigen Vectors(ImageList, N, M)

2. For each image in ImageList, call Project to Face Space(image) and save the

resulting projected faces to a database.

Identification is then a simple matter of projecting the test image to face space and

computing a similarity score.

Identify(Image): Image is W ×H pixels in size.

1. Load the saved known projected faces from the database.


2. proj = Project to Face Space(Image)

3. Take the dot product of proj against each known projected face. Call this the score.

4. The known projected face that gets the highest score is considered the identity of

the test image.

7.6 Architectural Implications

To report the algorithmic complexity of the various phases of the visual feature

recognizer, an n×n pixel square image is assumed in this section. The flesh tone detector

applies algebraic transformations and inequalities on each pixel. Thus its complexity

is O(n^2). The face detectors sweep a basic detector across all pixel locations in the
image and then rescale the image, so their complexity grows as O(n^2 log(n)). This will
be compounded by the complexity of the base detector itself. For the Rowley method
using N neurons of length L, the base detector complexity is O(LN), giving an overall
complexity of O(n^2 log(n) L N). The complexity of the Viola/Jones style base detector with
N rectangle features (not cascaded, and ignoring the integral image computation) is O(N),
which yields an overall complexity of O(n^2 log(n) N). For
each region where a face is likely to be present, EigenFaces performs O(M n^2 + KM)
operations where M is the number of Eigen vectors and K is the number of known
faces. This complexity is due to the M dot products on vectors of length n^2 done while

projecting the test image and the K dot products on M element vectors done while

finding the most similar known face.

In all cases, the workload scales faster than n² as the image size is increased, so high

performance architectures are necessary for larger images. For increased accuracy both

the Rowley and Viola detectors need a larger number of neurons and features respectively

leading to a linear increase in compute requirements. For EigenFaces, increasing the

discrimination by using a larger number of Eigen vectors leads to a linear increase in the

compute requirements as does increasing the number of known faces to check against.

Each of the phases is a natural fit for a streaming architecture. Since the flesh toner

works on one pixel at a time, the image may be streamed one pixel or one raster line at a time

¹Features not cascaded. Ignoring the integral image computation.


through the processor. The face detectors work on rectangular regions of an image, thus

the ability to hold a 30 × 30 image window on chip and stream the neurons or features

through the processor is important. For a modest increase in on chip storage to about

16 KB both the neuron and feature descriptions can be held within the processor and

image windows may be streamed through the processor. Since both detectors sweep

their image windows row by row and column by column, the ability to hold 30 raster

lines on chip will greatly reduce the number of image window fetch operations. Since

they both work on gray scale images, the additional SRAM required is merely 9.3 KB

for 320 × 200 sized images. While they have very different conceptual backgrounds,

the base detectors of the Viola and Rowley algorithms are remarkably similar and both

involve indirect vector access and dot product operations. The Viola algorithm uses

a short vector length of nine and uses integer multiply accumulate operations while

the Rowley method uses longer vectors with lengths ranging from 11 to 151 with the

sizes 101, 151 and 26 covering 87.8% of all evaluated neurons. Currently, neural net

evaluation involves floating point multiply accumulate operations. Given the limited

range of weights and histogram equalized image pixels, this could possibly be converted

to scaled integer arithmetic.

Similarly, EigenFaces is dominated by floating point dot products which in turn

depend on floating point multiply accumulate operations. Each test image needs to be

projected to face space based on the stored Eigen vectors. This is a series of dot product

operations, and each stored Eigen vector may be simply streamed through the processor

while holding the flattened image within the processor. The vector length is equivalent to

the number of Eigen vectors. Values of 50 or more are required in practice. Identification

can be done by holding the projected test image constant in the processor and streaming

the known projected images for computing dot products. Thus it can be seen that, on the whole, visual feature algorithms lend themselves to efficient stream processor implementations.


CHAPTER 8

CHARACTERIZATION OF VISUAL FEATURE

RECOGNITION

This chapter provides a detailed characterization of the visual feature recognition

system described in Chapter 7. Native execution, profiling using processor performance

counters, and simulation were used to characterize the application. The native execution

results were obtained using SGI SpeedShop on a 666 MHz R14K processor. Unlike

the results presented in Chapter 5, which used the SimpleScalar 3.0 simulator, results in

this chapter are based on ML-RSIM, an out of order processor simulator derived from

the Rice University RSIM simulator. This change was motivated by two reasons. First,

the visual feature recognition application is implemented in C++, but the compiler used

by SimpleScalar does not support C++. Since ML-RSIM accepts binaries compiled for

SunOS, it was possible to generate the application binary on a Sun workstation. Second,

a stable version of ML-RSIM was not available at the time the experiments in Chapter 5

were conducted.

A derivative of the NetBSD operating system was run within the simulator. An

application binary compiled for SunOS was used without any modification since the OS

emulates the SunOS system call interface. Two different configurations were simulated:

a multi-GHz processor whose parameters like L1 cache hit time, memory access time,

floating point latencies, etc., were measured on a 1.7 GHz AMD Athlon processor using

the lmbench hardware performance analysis benchmark and an embedded configuration

which is modeled after an Intel XScale 400 MHz processor except for the fact that it

uses a Sparc ISA and has a floating point unit [68]. Since ML-RSIM could not be

configured without an L2 cache, an inclusive L2 cache equivalent in size to the combined

L1 instruction and data caches was added. Since the cache is inclusive and the same size


as the sum of the L1 caches, this configuration behaves similarly to a machine with no

L2 cache. Numbers that could not be directly measured were obtained from vendor

microarchitecture references. ML-RSIM was configured to reflect the parameters shown

in Table 8.1. Unless mentioned otherwise, the remainder of this chapter uses the default

configuration.

The application is studied in five configurations: a) full pipeline using the Rowley

face detector, b) full pipeline using the Viola/Jones face detector, c) only the Rowley face

detector with flesh toning and image segmentation, d) only the Viola/Jones face detector

with flesh toning and image segmentation, e) only the Eigenfaces recognizer. The last

three configurations are important from an energy savings perspective since running the

individual algorithms on separate low frequency processors or hardware accelerators can

lead to significant energy savings.

Table 8.1. Experiment Parameters

Native Execution:
  SGI Onyx3, R14K processors at 666 MHz
  32 KB 2-way IL1, 32 KB 2-way DL1, 8 MB L2
  Software: IRIX 64, MIPS Pro compiler, Perfex, Speedshop
Simulator (default configuration):
  Sparc V8 ISA, out of order CPU model, 2 GHz
  16 KB 2-way IL1, 2 cycle latency, 16 KB 2-way DL1, 2 cycle latency
  2 MB 2-way L2, 20 cycle latency, 228 cycle DRAM latency
  L1 line size 64 bytes, L2 line size 128 bytes
  Issue width: 4 integer + 4 floating point, Max 4 graduations/cycle
  DRAM interface: 600 MHz, 64 bits wide
  Software: gcc 2.6.3
Embedded Configuration:
  Sparc V8 ISA, 400 MHz
  32 KB 32-way IL1, 1 cycle latency, 32 KB 32-way DL1, 1 cycle latency
  64 KB inclusive L2 cache
  L1 line size 64 bytes, L2 line size 128 bytes
  Issue width: 1 integer or 1 floating point, Max 1 graduation/cycle
  DRAM interface: 100 MHz, 32 bits wide
  Software: gcc 2.6.3


8.1 Application Characteristics

Figures 8.1 and 8.2 show the relative execution times of each algorithm when the

application is run using the Rowley detector and the Viola/Jones detector. In both cases,

the face detector is the dominant component. Since the detectors are heuristic in nature,

the face regions identified by them may differ. This in turn leads to differences in

the runtime of other algorithms that depend on the detector’s output. Figures 8.3 and

8.4 show the L1 Dcache miss rate and the L2 cache hit rates for all five application

configurations. Since the caches are inclusive, the L2 hit rate is defined as the L1 misses

that hit in the L2 cache divided by the total number of accesses made by the application.

Since this application achieves 99.8% Icache hit rate with a 16 KB Icache, no other

Icache configurations were studied. Figure 8.5 shows IPC for a variety of execution unit

configurations, and Figure 8.6 shows the run times normalized to real time.

For the entire application, the L1 cache hit rate is consistently greater than 92% for Dcaches of 16 KB and above. This indicates that the streaming pipelined model for

composing the algorithms is a good fit for the problem. Each 320 × 200 pixel color

image is 187.5 KB long and the corresponding gray scale versions are about 64 KB.

The images clearly will not fit in the L1 cache. The explanation is that the color image

is accessed in streaming mode, i.e., each pixel is touched exactly once for flesh toning.

Image segmentation works on the flesh tone bitmap (approximately 64 KB) making at

[Execution time breakdown: Flesh tone 3.9%, Viola 59.7%, Eye locator 17.1%, Eigenfaces 19.4%]

Figure 8.1. Execution Time Break Down of Viola/Jones Detector Based Face Recognizer


[Execution time breakdown: Flesh tone 6.2%, Rowley 64.6%, Eye locator 10.4%, Eigenfaces 18.8%]

Figure 8.2. Execution Time Break Down of Rowley Detector Based Face Recognizer

most two passes over it. Since these accesses touch at most two image rows at a time,

good cache utilization is ensured. Subsequently, only small windows into the image are

used. Since objects in these images are typically smaller than 50× 50 pixels, each object

is only about 2.5 KB in size. The downstream algorithms make several passes over each

object, but only a small part of each object needs to be cache resident at each time.

For example, the integral image computation in the Viola/Jones algorithm is based on a

recurrence that involves two adjacent image rows and an additional row for intermediate

storage and has an L1 cache footprint of about 4.4 KB. The Rowley algorithm touches at

most 30 rows of the object at the same time. However, as it sweeps across the image left

to right and top to bottom only a 30 × 30 pixel window needs to be cache resident at a

time. Since it shifts its position one pixel at a time, a 29 × 29 region of this window will

be reused by the next iteration contributing to high L1 cache hit rate. A similar pattern

occurs in the later phase of the Viola/Jones algorithm on a 30×30 region. The Eigenfaces

algorithm uses a projected image of the object to be recognized as well as basis, mean

and projected image matrices corresponding to each reference object. The target object is

reused while it is compared against each candidate. Each candidate, however, is accessed

only once per target object.

The objects and their attributes from each stage are typically touched again by the

next stage. The auxiliary information used by the algorithms is somewhat small. Both


L1 Dcache miss rate (percent) by L1 data cache size:

                 8 KB    16 KB    32 KB    64 KB
Viola App        9.22     6.40     4.16     2.82
Rowley App       9.97     7.11     5.11     3.12
Viola            9.52     6.19     3.83     2.24
Rowley          10.77     7.68     5.54     3.16
Eigenfaces       6.62     3.90     2.50     1.93

Figure 8.3. L1 Dcache Miss Rate

detector algorithms use fixed size data structures. The worst case is the Viola/Jones

algorithm, which needs a weight and a type for each feature, corresponding to 100 × 2 × 4 = 800 bytes of L1 cache. The data set for the Eigenfaces algorithm, on the other hand, is linear in the number of reference faces. Since these could potentially be streamed

into the L1 Dcache once per target object (or once per frame), the footprint is small. Only

the projected target object and a small part of the basis/mean/projected reference images

need to be resident in the L1 Dcache. From Figure 8.4 it is seen that the L2 cache is

largely ineffective since it is accessed infrequently due to the low L1 miss rate.

From a cache footprint perspective, both the detector algorithms and the entire appli-

cation appear to be a good match for embedded processors with limited cache resources.

Since images are accessed left to right, multiple rows at a time, sequential prefetch (or

strided prefetch) would hide memory access latencies even when the L1 Dcache is small.


L2 cache hit rate (percent) by L2 cache size:

                256 KB   512 KB   1024 KB   2048 KB
Viola App         0.82     1.08      1.30      1.41
Rowley App        0.82     1.10      1.38      1.52
Viola             0.84     1.13      1.37      1.46
Rowley            0.86     1.23      1.46      1.56
Eigenfaces        0.66     0.82      0.96      0.82

Figure 8.4. L2 Cache Hit Rate

Quite a different view unfolds on examination of the IPC and speedup graphs. Figure 8.5

shows IPC for a variety of execution unit configurations. IPC is seen to saturate early on

for two main reasons. The first is caused by dependences in the loop bodies. For example,

neural net evaluation involves computing Σ_{i=0..n} Weight[i] × Image[Connection[i]]. In

addition to the loop carried dependence on the sum, each of the inputs is accessed

indirectly via a pointer since an input to one neuron could be the output of another

neuron. Second, the high ratio of array variable accesses to arithmetic operations causes

saturation of the Dcache ports.
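For reference, the neuron evaluation loop discussed above has roughly the following shape (a C sketch with illustrative names). The running sum is the loop carried dependence, and the Connection[] table produces the indirect loads that, together with the Weight[] accesses, keep the Dcache ports busy.

float evaluate_neuron(const float *Weight, const float *Image,
                      const int *Connection, int n)
{
    float sum = 0.0f;                              /* loop carried dependence */
    for (int i = 0; i < n; i++)
        sum += Weight[i] * Image[Connection[i]];   /* indirect access         */
    return sum;
}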

Figure 8.6 shows the run times normalized to real time. Here, 1.0 represents min-

imum real-time performance corresponding to 5 frames per second. For example, in

Figure 8.6 in the 1 ALU + 1 FPU configuration, the Rowley algorithm is 1.13 times


IPC by execution unit configuration (ALUs + FPUs):

               Embedded    1+1    2+2    3+3    4+4
Viola App          0.48   0.65   0.69   0.72   0.72
Rowley App         0.49   0.65   0.67   0.70   0.70
Viola              0.56   0.71   0.76   0.78   0.78
Rowley             0.52   0.67   0.69   0.71   0.71
Eigenfaces         0.52   0.74   0.80   0.89   0.89

Figure 8.5. IPC

slower than real time while the Eigenfaces algorithm processes 5 frames in 0.69 seconds.

The graph clearly shows that embedded processors are inadequate to handle the workload in real time. In this case instruction throughput is the culprit. Even when function units are available, dependences and contention for the Dcache ports cause low IPC.

The power budgets required for real-time performance are beyond what is available

on normal low power embedded platforms. Thermal dissipation is a problem even on

high performance processors and energy saving solutions are important for real-time

workloads like visual feature recognition. Hardware accelerators that use specialized

data paths and stream array operands out of multiple SRAM buffers stand a good chance

of accelerating these algorithms at embedded power budgets.


Run time normalized to real time, by configuration (ALUs + FPUs):

               Embedded (400 MHz)   Embedded (2 GHz)    1+1    2+2    3+3    4+4
Viola App                   19.69               3.94   2.88   2.72   2.59   2.59
Rowley App                   9.93               1.99   1.52   1.47   1.40   1.40
Viola                       10.89               2.18   1.71   1.59   1.54   1.54
Rowley                       7.29               1.46   1.13   1.09   1.06   1.06
Eigenfaces                   4.96               0.99   0.69   0.64   0.58   0.58

Figure 8.6. Speedup or Slow Down Over Real Time

8.2 Optimization Opportunities

One recurring theme in image processing is computing a kernel that operates on an

M × N subwindow of a larger W × H image. The kernel is recomputed for every possible

pixel location within the larger image. This resembles sliding the M × N subwindow

over the W × H image. There is significant scope for compiler based reordering of

computations in such kernels. Here are two concrete examples.

As described in Section 7.4, the heuristics used by the Viola/Jones algorithm are

based on the sum/difference of pixels in adjacent rectangular regions. Recomputing these

sums for each pixel location is very expensive. A major contribution of their approach

is an intermediate image representation called the integral image. The sum of the pixels

in a rectangular window can be computed easily using the intermediate representation.


The integral image value at pixel location (x,y) in an image is defined as the sum of

all pixels to the left and above the pixel (x,y). Computing this sum independently at every pixel location is computationally prohibitive. By

expressing the same relationship as a pair of recurrences, it is possible to compute the

integral image with just one pass over the image. This transformation required careful

study and insight from the originators of the algorithm. Given the fact that the sums

of rectangular subwindows of the larger image are recomputed at each pixel location, a

compiler based tool aware of the access pattern and rules of arithmetic may be designed

to deduce the recurrences.
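For illustration, the pair of recurrences can be written in C roughly as follows. This is a sketch of the standard integral image computation rather than the application's actual code; the names and the 32-bit accumulator width are assumptions.

/* One pass integral image: rowsum accumulates pixels along the current
 * row, and each ii entry adds the entry directly above it. */
void integral_image(const unsigned char *img, unsigned int *ii, int w, int h)
{
    for (int y = 0; y < h; y++) {
        unsigned int rowsum = 0;
        for (int x = 0; x < w; x++) {
            rowsum += img[y * w + x];
            ii[y * w + x] = rowsum + (y > 0 ? ii[(y - 1) * w + x] : 0);
        }
    }
}

Once ii is available, the sum of the pixels in any rectangular window follows from four lookups into ii (two additions and two subtractions).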

The standard deviation of pixel values within a 30× 30 pixel window starting at each

pixel location within the probable face rectangle is required during the computation of

Viola/Jones heuristics. An initial implementation that simply recomputed the standard

deviation function at each pixel location was seen to occupy between 10-15% of the

compute time of the whole application. When going from one pixel to the next, the

windows overlap by 29 × 30 pixels and the mean and sum of squares for one pixel can

be easily calculated from its predecessor's values by adjusting for the nonoverlapping

pixels alone. By defining a set of recurrences for the mean and mean square for 30 × 30

subwindows over a wider region, it is possible to compute the standard deviations in one

pass over the image thereby reducing the execution time of this component to less than

1%. Currently, such transformations require a lot of attention from the programmer and

insight into the algorithm and are error prone because of corner cases. This bolsters the

argument in favor of compiler based loop restructuring that can apply axioms of algebra

to deduce the right set of recurrences.
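A C sketch of this sliding window recurrence for one row of window positions is shown below; the 30 pixel window constant, the names, and the double precision accumulators are illustrative, and boundary handling is simplified.

#include <math.h>
#define WIN 30

/* Standard deviation of every WIN x WIN window whose top row is y0.
 * Only the first window is computed in full; each later window adjusts
 * the running sum and sum of squares for the leaving and entering columns. */
void window_stddev_row(const unsigned char *img, int w, int y0,
                       float *stddev_out /* w - WIN + 1 results */)
{
    double sum = 0.0, sumsq = 0.0;
    for (int y = y0; y < y0 + WIN; y++)
        for (int x = 0; x < WIN; x++) {
            double p = img[y * w + x];
            sum += p;
            sumsq += p * p;
        }
    for (int x0 = 0; ; x0++) {
        double n = (double)(WIN * WIN);
        double mean = sum / n;
        double var = sumsq / n - mean * mean;
        stddev_out[x0] = (float)sqrt(var > 0.0 ? var : 0.0);
        if (x0 + WIN >= w)
            break;
        for (int y = y0; y < y0 + WIN; y++) {    /* slide right by one pixel */
            double out = img[y * w + x0];
            double in  = img[y * w + x0 + WIN];
            sum   += in - out;
            sumsq += in * in - out * out;
        }
    }
}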

Another possible optimization is to reorder the computation so that data may be

streamed through a set of execution units and results computed in the minimum number

of passes while observing limits on the amount of intermediate storage used. Compiler

based tools that consider parameters like the size of the image and the subwindow, and

the size of the intermediate storage and automatically transform algorithmic kernels for

optimum stream performance would be desirable.

As seen in Figures 8.5 and 8.6, wide issue clearly helps performance. In traditional

architectures, wide issue usually comes at the cost of increased hardware complexity and


could potentially limit the clock frequency as well as exceed a limited energy budget.

This application is embarrassingly parallel in most sections due to the intrinsic data

parallelism in the pixel processing. One way of achieving good performance at low

power is to use a cluster of function units operating in parallel with a very small quantity

of SRAM for local storage and no cache. This approach is investigated in the next

chapter.


CHAPTER 9

PERCEPTION PROCESSOR ARCHITECTURE

Chapter 3 explained that achieving high IPC was critical to realizing high-performance,

low-power perception processors. Chapters 4 and 7 described the structure of typical

perception algorithms, which are characterized by simple multilevel nested loops where

the majority of arithmetic and floating point operators have array and vector operands.

Operand availability is therefore critical to achieving high IPC. It was also seen that per-

ception applications may be expressed as a pipeline of algorithms. These facts motivate

the choice of architectures that embody function unit clusters for high ILP and simple

communication mechanisms that permit chaining multiple processors to implement a

pipeline of algorithms. Perception processors that are general enough to be able to

execute multiple algorithms yet are small enough to conserve energy and die area would

be ideal. An empirical search for a processor architecture that satisfies the generality,

high IPC, and low resource utilization criteria led to an initial architecture [67] that was

successively refined. The end result of this evolutionary process is depicted in Figure

9.1.

The perception processor architecture consists of a set of clock gated function units,

a loop unit, three dual ported SRAMs, six address generators (one for each SRAM

port), local bypass paths between neighboring function units as well as a cluster wide

interconnect. A register file is conspicuously absent because the combination of compiler

controlled dataflow and a technique called array variable renaming makes a register file

unnecessary. Though none of the clusters described here need a register file, it is possible

to incorporate one into a function unit slot. Clusters can be configured to maximize the

performance of any particular application or set of applications. Typically there will be a

minimum number of integer ALUs as well as additional units that are more specialized.


Figure 9.1. Perception Processor Organization

Hardware descriptions for the cluster and the interconnect are automatically generated

by a cluster generator tool from a configuration description. Details may be found in

Section 9.8.

To understand the rationale behind this organization it is important to know that

typical stream oriented loop kernels found in perception algorithms may be split into

three components. They consist of control patterns, access patterns and compute patterns.

The control pattern is typically a set of nested for loops. Access patterns seen in these

algorithms are row and column walks of 2D arrays, vector accesses and more complex

patterns produced when simple array accesses are interleaved or software pipelined.

Compute patterns correspond to the dataflow between operators within the loop body.

For example, the compute pattern of a vector dot product is a multiply-accumulate flow

where a multiplier and an adder are cascaded and the adder's output is fed back as one of


its inputs.

The perception processor has programmable hardware resources that accelerate each

of the three patterns found in loops. The loop unit accelerates control patterns while

the address generators cover access patterns. The interconnect and the function units

together implement compute patterns. The execution cluster operates in a VLIW manner

under the control of horizontal microcode stored in the microcode SRAM. The mi-

crocode provides the opportunity to clock gate each resource individually on a cycle by

cycle basis leading to low energy consumption. Together, these features provide the mix

of high performance and hardware minimality that is crucial to perception applications.

9.1 Pipeline Structure

The perception processor architecture was designed to be able to emulate dataflows

that typically occur within custom ASIC accelerators. To this end, it has a simple

and rather different pipeline structure from a traditional processor. In sharp contrast to

the typical five-stage Instruction Fetch/Instruction Decode/Execute/Memory/Write Back

(IF/ID/EX/MEM/WB) pipeline of a MIPS like RISC processor, the perception processor

pipeline consists of just three stages: Fetch/Decode/Execute [46]. The number of actual

stages in the final execute phase depends on the function unit. The pipeline structure is

shown in Figure 9.2. Conspicuous departures from the RISC model include the absence

of register lookups in the decode stage and the lack of memory and write back stages.

In the perception processor, the microinstructions are fetched from a very wide in-

struction memory which is more than 200 bits wide. The decode stage is minimal – it is

limited to performing sign or zero extensions to constants, generating NOPs for function

units while the memory system is being reconfigured, and generating clock enable signals

for active function units. The wide instruction is then dispatched to a set of function units,

a loop unit, and a set of address generators. All resources, including the actual function

units and SRAM ports, appear as peers in the EX stage. The final output of all these peer

units can be transferred back to the input of the units by an interconnect network. The

latency of transfers depends on proximity. Nearest neighbors can be reached in the same

cycle while reaching a nonneighboring unit incurs an additional cycle of latency.


Figure 9.2. Pipeline Structure

In the MIPS RISC execution model, every single instruction implicitly encodes a

path through the pipeline. An integer instruction takes the IF/ID/EX/MEM/WB path while a

floating point instruction takes a detour through the FPU in the EX stage. There is also an

implicit hardware controlled timing regime that dictates the relative cycle time at which

an instruction reaches each stage subject to dependences checked by interlocks.

In the perception processor, instructions do not encode any such implicit paths. The

instructions are called microcode because they serve the traditional horizontal microcode

function where individual bits directly control hardware functions like mux selects and

register write enables. To get the functionality implied by a MIPS instruction, the stage


by stage functionality of the MIPS instruction must be identified and the equivalent

microinstruction bits set in several successive microinstruction words. The advantage

of this lower level approach is that the hardware can be controlled in a fine grained

fashion, which is impossible in the MIPS case. For example, interconnect muxes may

be set to route data between selected function units and memory in a manner which

directly represents the dataflow graph of an algorithm, and data may be streamed through

the dynamically configured structure. The ability to reconfigure the structure through

microcode on a cycle by cycle basis means that the function units may be virtualized to

map flow-graphs that are too large to fit the processor. This manifests itself as higher

initiation intervals and a larger number of temporary results that need to be saved or

rerouted when compared to a processor that has enough physical resources to allocate

to the entire flow-graph. Performance degrades gracefully under virtualization. The

perception processor supplants the instruction centric RISC execution model with a data

centric execution model, which lends it the flexibility to efficiently mimic the styles of

computation found in VLIW and vector processors as well as custom ASIC datapaths.

9.2 Instruction Format

To understand the following discussion on the internals of the perception processor

a quick introduction to the microinstruction format and the instruction fetch mechanism

is necessary. Figure 9.3 shows the constitution of a typical instruction word. While

Figure 9.3. Microinstruction Format


the instruction word width and format are fixed for a given configuration, they will

vary between configurations depending on the type and number of function units and

interconnect paths. The type field specifies whether the instruction is a normal VLIW

style instruction bundle or a reconfiguration command. Reconfiguration commands are

used to dynamically modify the working of the address generators and the loop unit.

The type field is followed by instruction packets for each function unit. If the type

field specifies a reconfiguration command, the instruction packet fields have alternate

interpretations. In that case, the decoder makes NOP packets for all the function units.

Each instruction packet consists of an opcode, mux selects for the A and B operands

selection muxes of a function unit and enable signals for the A and B input registers. The

registers latch new values only when their enable signals are asserted. These FU opcode

packets are followed by address generator operations each of which specify a load, store

or NOP and the address context register to be used for the load or store operation. These

are in turn followed by mux select signals for the interconnect muxes. Finally, there

are a set of constant fields to support constants used in the code. The constant fields

have different interpretations (e.g., one 16-bit constant, two 8-bit constants, four 4-bit

constants, etc.) depending on the context. The decoder can perform modifications like

sign or zero extension before the constants are presented to the function units. The

instruction memory has 1 cycle latency. The decoder adds another cycle of latency. This

2 cycle fetch delay is accounted for in branch instructions and the loop unit logic. Since

the actual bit positions of various fields depend on the configuration, the instruction

fetch logic and the decoder are automatically generated by a netlist generator tool based

on the processor configuration and bundling constraints.

9.3 Function Units

Function units follow the generic organization shown in Figure 9.4. Their operands

may be the output of their own final stage or the output of their left or right neighbor.

Forwarding the output of the unit to its input allows efficient execution of reduction

operators like Σ and Π and polynomial terms like Axⁿ. Nearest neighbor connections

capitalize on the short delay of local wires to implement chained operations in a manner


Figure 9.4. Function Unit Architecture

similar to vector chaining. In addition an operand may also arrive over the interconnect,

in which case the transferred value is first latched in a register. The interconnect register

can also hold semistatic operands like constants used for scaling an operand stream.

Several types of function units are used in this study.

Integer ALUs perform common operations like add, subtract, xor, etc. ALUs also

have compare instructions, which not only return a value, but also set condition codes

local to the particular ALU. Conditional move operations may be predicated on the

condition codes set by previous compare instructions to route one of the two ALU inputs

to the output. This makes if-conversion and conditional data flows possible. All ALU

operations have single cycle latency.


FPUs support floating point add, subtract, multiply, compare and integer to floating

point convert operations. While the FPU is IEEE 754 compatible at its interfaces, for

multiply operations it internally uses a reduced precision of 13 bits of mantissa since the

target applications work well with this precision [66]. Reduced precision in the multiplier

contributes significant area and energy savings. All FPU operations have 7 cycle latency.

Multiply units support 32-bit integer multiply operations with 3 cycle latency.

In order to illustrate the advantages of fine grain pipeline control and modulo support

and to demonstrate the generality claims, no application specific instructions have been

added to the function units with two exceptions: the reduced precision of floating point

multiplies and byte select/merge instructions, which select an individual byte from a

word. The latter is similar to the pack/unpack instruction in Intel’s IA-64 architecture or

the AL/AH register fields in the IA-32 architecture. These instructions significantly ease

dealing with RGB images.

9.4 Compiler Controlled Dataflow

As CMOS technology scales, wire delays grow relative to logic delays. The

cluster interconnect reflects the belief that future architectures will need to explicitly

address communication at the ISA level. Traditional architectures are based on implicit

communication. For example the MIPS instruction addi r1, r2, 10 depends on the

hardware to keep track of the last location where the operand r2 was present and transfer

it to where it is consumed. The location could be a renamed register or a pipeline stage.

In a wide issue clustered processor, it is advantageous to have operands to a function unit

be sourced from nearby function units to hide the effects of long wire delays. This is

possible if communication is explicitly orchestrated by the compiler. In the perception

processor all communication is explicitly orchestrated by the compiler. In the example

above, the compiler would pick a function unit to execute the addi instruction, transfer

the output of the function unit that last produced the value corresponding to the r2

operand to the A input of the chosen function unit, transfer the constant 10 to the B input

and schedule the actual addition to happen the cycle when both inputs are available.

In the perception processor, pipeline registers at the interfaces of every unit including


function units and SRAM ports are named and accessible to software. Data is explicitly

transferred from the output pipeline register of a producer to the input registers of its

consumers. Unlike traditional architectures where pipelines shift under hardware control,

a compiler for the perception processor can use clock gating to control pipeline shifting

and thereby control the lifetime of values held in pipeline registers. This ensures that a

result will be alive till all its consumers have received a copy. This explicit management

of result lifetime and communication is called compiler controlled data flow.

Explicit communication leads to the ability to overlap communication with com-

putation with almost no hardware overhead. A significant number of bits in the wide

microinstruction word are devoted to controlling the interconnect. While the interconnect

can be controlled on a cycle by cycle basis, the compiler may elect to dedicate certain

interconnect muxes to flows on a longer term basis. For example, while adding two

vectors it is possible to dedicate separate interconnect muxes for the two operands for

the duration of the vector addition. The compiler also attempts operand isolation, i.e., it

tries to set unused muxes to states that reduce the amount of activity visible to the rest of

the circuitry leading to lower power consumption.

9.5 Interconnect

The local bypass muxes in each function unit are intended for fast, frequent com-

munication with the immediate function unit neighbors. The interconnect supports com-

munication with nonneighbor function units and SRAMs. Such communications have

a latency of one cycle. In a multicluster configuration, intercluster communication will

incur even larger delays. Values transferred via the interconnect to the input registers of

a function unit may be held indefinitely which is useful for caching common constants.

In modulo scheduled loops, each resource may be used only during one modulo

period. Reusing a resource later will render the loop body unschedulable. It is common

to find a lot of data reads early in the loop body and a few stores toward the end that

correspond to computed values graduating. Conflicts in the interconnect often make

modulo scheduling difficult. Partitioning the interconnect muxes by direction has the

potential to reduce scheduling conflicts. Incoming muxes transfer data between function


units and from SRAM ports to function units while outgoing muxes are dedicated to

transferring function unit outputs to SRAM write ports.

The high level architecture of the interconnect is remarkably simple. Assume an

organization with N incoming muxes and M outgoing muxes as shown in Figure 9.5.

Each incoming mux is logically a 16-to-1 mux which selects the output of one of the eight

function units, six SRAM ports or constant fields within the microinstruction. There is

some hierarchy in the actual circuit to optimize size and delay. There is currently an

unused port in the 16-to-1 mux which is reserved for inter cluster communication in

future multicluster configurations. As seen in Figure 9.4, there are two interconnect

pipeline registers at the input of each function unit. Half of the N muxes feed the A input

registers of function units. The muxes are connected to the input registers in round robin

manner. The other N/2 muxes serve the B input registers. The muxes are partitioned by

input register so that both operands of a function unit may be delivered from elsewhere

in the cluster without conflict. The M outgoing muxes are 8-to-1 muxes that connect

the function unit outputs to the SRAM write ports. Again, the muxes are connected in a

round robin manner to the SRAM data inputs. Upon specifying values for N and M, a

netlist generator tool developed as a part of this research generates Verilog HDL for the

processor and the interconnect. While the simple round robin connections have worked

Figure 9.5. Interconnect Architecture


well for the benchmarks used in this research, it is possible to manually specify any

custom topology for the interconnect. The choice of interconnect parameters depends on

the dataflow within the algorithm kernels and the number of computed results that need

to be retired per cycle. It is possible to implement compiler based instruction scheduling

algorithms that are topology neutral by describing communication paths as a weighted

graph structure, an approach which was used in an earlier version of this architecture

[67]. The actual processor configurations that are evaluated in Chapter 10 use four

incoming muxes and one outgoing mux.

It is possible that two operands need to be made available at a function unit as part of

a dataflow but interconnect conflicts make such a transfer impossible. In such cases it is

possible to transfer one operand in an earlier cycle and freeze its destination interconnect

register using clock gate control till both operands arrive and can be consumed. The

conflict can thus be resolved and a feasible schedule attained, but latency and loop

initiation interval increase somewhat as congestion increases. This method of staging

during separate cycles, transfers that are logically simultaneous, is called interconnect

borrowing.

9.6 Memory System Architecture

Perception applications are stream oriented with a large number of 2D array and

vector accesses per elementary operation. These accesses typically occur within tight

loops with known bounds. Traditional processors have a limited number of load/store

ports, and this limits overall performance because of the high number of array accesses,

which is the reason DSPs traditionally partition their memory resources. A large number

of SRAM ports are required to efficiently feed data to function units. Increasing the

number of ports on a single SRAM or cache increases access time and power consump-

tion. This motivates the choice of multiple small software managed scratch SRAMs. It

is also possible to power down SRAMs that are not required. For low leakage processes

a large fraction of the energy consumption is in the sense amplifiers of the SRAM ports.

They consume approximately 50% of the processor energy in the 0.25µ implementation.


Mechanisms to efficiently use these expensive resources are important for both perfor-

mance and energy conservation.

Hardware performance counter based measurements on a MIPS R14K processor

showed that 32.5% (Geometric mean) of the executed instructions were loads/stores

for a set of perception benchmarks described later in Section 10.1. The high rate of

load/store operations combined with the regular array access patterns makes it possible

to overlap computation and SRAM access using hardware accelerators. A

large fraction of the remaining 67.5% execution component is array address calculations

that support load/store operations. Significant optimizations are possible by associating

each SRAM port with an address generator that deals with common access patterns of

streaming applications. The access patterns include 2D array and vector accesses in

modulo scheduled or software pipelines loops. Details may be found in Section 9.6.4.

Four new instructions are required to take advantage of the optimizations:

write context context index, src:

Reconfigure an address generator by transferring a description of an access pattern

into a context register within the memory system. This instruction when applied to the

loop unit similarly transfers the parameters of a loop into a loop context register.

load.context dest, context index and

store.context context index, src:

These are loads/stores that use the address generation mechanism. The context index

encoded into the immediate constant field of the instruction specifies the address gener-

ator to be used and the index of a context register within it.

push loop context index:

Let the memory system know that a new loop is starting.

9.6.1 Loop Unit

The index expressions of array accesses in a multilevel nested loop will depend

on some subset of the loop variables. The purpose of the loop unit is to compute and

maintain the loop variables required for address generation in the memory system while

the loop body itself is executed in the function units. Figure 9.6 shows a simplified orga-


Figure 9.6. Loop Unit

nization of the loop unit. The loop unit offers hardware support for modulo scheduling,

a software pipelining technique that offers high levels of loop performance in VLIW

architectures [81].

A brief introduction to some modulo scheduling terminology is necessary to under-

stand the functioning of the loop unit. Assume a loop body which takes N cycles to

execute. Modulo scheduling allows starting the execution of a new instance of this loop

body every II (Initiation Interval) cycles where II is less than N . A normal loop that

is not modulo scheduled may be considered a modulo scheduled loop with II = N. How II

is determined and the conditions that must be satisfied by the loop body are described

in [81]. The original loop body may be converted to a modulo scheduled loop body by

replicating instructions such that every instruction that was originally scheduled in cycle

n is replicated so that it also appears in all possible cycles (n + i × II) mod N where i

is an integer. This has the effect of pasting a new copy of the loop body at intervals of

II cycles over the original loop body and wrapping around all instructions that appear


after cycle N. If a particular instruction is scheduled for cycle n, then n/II is called its

modulo period.
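A small worked illustration of these terms, assuming a loop body of N = 6 cycles and an initiation interval II = 2 (the numbers are chosen purely for illustration):

#include <stdio.h>

/* Print, for each original cycle n, its modulo period n/II and the other
 * cycles (n + i*II) mod N into which the operation is replicated. */
int main(void)
{
    const int N = 6, II = 2;
    for (int n = 0; n < N; n++) {
        printf("op at cycle %d (modulo period %d) also appears at cycles:", n, n / II);
        for (int i = 1; i * II < N; i++)
            printf(" %d", (n + i * II) % N);
        printf("\n");
    }
    return 0;
}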

The compiler configures static parameters including II and loop count limits into

loop context registers. The corresponding dynamic values of the loop variables are

held in the loop counter register file. The only other piece of information required is

which loop body is currently pointed to by the program counter. A four-entry loop stack

captures this information. In this implementation, the loop unit can keep track of four

levels of loop nest at a time, which is sufficient for the benchmarks used in this research.

For larger loop nests the address expressions that depend on additional outer loops may

be done in software as in a traditional processor. A four-entry loop context register

file holds the encoded start and end counts and the increment of up to four innermost

for loops. Loops are a resource that can be allocated and managed just like one would

allocate memory on a traditional architecture. The loop unit maintains a counter for

each loop nest and updates it periodically. It also modifies the program counter and

admits new loop bodies into the pipeline in the case of modulo loops. In that case it also

does additional manipulation of the loop counter to drain the pipeline correctly on loop

termination. On entering a new loop any previous loop is pushed on a stack, though its

counter value is still available for use by address generators. Loop parameters may be

loaded from memory. This permits modulo scheduling of loops whose loop counts are

not known at compile time. Appropriate loop parameters may be loaded from SRAM at

run time depending on the size of input data.

Just before starting a loop intensive section of code, loop parameters (perhaps dynam-

ically computed) are written into the context registers using write context instructions.

On entry into each loop body, a push loop instruction pushes the index of the context

register for that loop onto the stack. At any given moment, the top of the stack represents

the innermost loop that is being executed at that time. An II counter repeatedly counts

up to the initiation interval and then resets itself. Every II cycles, the loop increment

is added to the loop variable that is held in the loop counter register file. This is done

automatically. No loop increment instructions are required. When the end count of the

loop is reached, the innermost loop will have completed. The top entry is automatically


popped off the stack, and the process is repeated for the enclosing loop. Note from

Figure 9.6 that the registers and datapaths have small widths of 4 and 9 bits that cover

most common loops. These widths are parameters specified in the perception processor

configuration. The netlist generator tool can generate perception processors which use

any user specified widths. The choices in Figure 9.6 were sufficient to cover benchmarks

used in this research. Loops that are incompatible with a particular perception processor

configuration can always be done in software, so the reduced bit-widths save energy in

the common case.

9.6.2 Stream Address Generators

Most perception algorithms have a high ratio of array variable accesses to operators.

Multiple SRAM ports are essential for high throughput. The three dual ported SRAMs

in Figure 9.1 together have a read/write power consumption approximately equal to the

total function unit power consumption. Since each additional SRAM port introduces area

and energy overhead, utilizing them effectively is essential for performance. A previous

version of the architecture which used generic integer ALUs for address generation was

unable to maximize SRAM port utilization [67]. This is because generating an address

for a 2D array access involves multiply/shift and add operations which incur multiple

cycles of latency in a traditional processor. When a tight loop body involves several

array accesses a significant fraction of the function units and registers will need to be

allocated for address calculation rather than computing results. Since address calculation

for arrays is a stylized operation, it is possible to design distributed semiautonomous

address generation hardware that frees up function unit resources for the actual result

computation and improves data delivery and throughput. In the perception processor,

dedicated address generators are attached to each SRAM port. They handle commonly

occurring address sequences like vector and strided access as well as 2D array accesses

including row and column walks. They can handle address generation under regular,

modulo and unrolled loops and can deal with special situations that occur when multiple

loop bodies are in flight simultaneously.


Before entering into a loop intensive section of code, the compiler uses write context

instructions to write descriptions of array access patterns into the address context register

files of address generators. For increased throughput the same access pattern may be

written into multiple address generators. Each address context includes the row and

element sizes, the base address as well as the loop counter indices that correspond to the

array’s loop variables. The loop counter indices may be used to retrieve the value of loop

count variables generated by the loop unit in Figure 9.6. In the current implementation

there are four context entries in each address generator corresponding to a total of 24

access patterns simultaneously. Since write context is a single cycle operation, dynamic

reconfiguration has very low overhead. The parameters for an array access pattern are

packed into a single 32-bit word with the base address at the least significant bit. So

arithmetic can be done on the packed word to update the base address dynamically.

Address computation for array and vector references issued to an SRAM port are

handled by its attached stream address generator. The operation of the address generator

depends on loop counters from the loop unit and array parameters like base address

and row size that are stored in its address context register file. Figure 9.7 shows the

internal structure of an address generator. To understand how this simple structure can

accomplish a variety of address calculations, it is essential to understand how a compiler

generates addresses for array references. Consider the 2D arrays declared and used in C

as shown in Figure 9.8.

To simplify the discussion, assume word oriented addressing. Let the size of the

Complex struct be denoted as elem size. Then, the size of one row of A is row size = elem size × M. If the offset of imag within the struct is 1 and the base address of A is Base_A, then the base address of the imag field will be Base_imag = Base_A + 1. So the address expression corresponding to the load into t1 is Base_imag + i × row size + j × elem size, since C stores arrays in row major order. A vector is a single-dimensional

array, so its address expression is just a special case where row size = 0. For more

complex index expressions of the form P × i+Q, the factors P and Q may be absorbed

into the row size and base address respectively. A column-walk of the form A[j][i] can

be evaluated similarly. By constraining the row and element sizes to be powers of two,


Figure 9.7. Stream Address Generator

struct Complex A[N][M];
struct Complex B[N][K];
...
for(i=0; i<N; i++) { ...
  for(j=0; j<M; j++) { ...
    t1 = A[i][j].imag; ...
    for(k=0; k<K; k++) { ...
      t2 = B[i][k].real;
      ...

Figure 9.8. Loop Acceleration Example


the address expression reduces to the form address = Base + ((i << x)|(j << y)).

For cases where row size cannot be a power of two, to help pack more data into the

scratch memory, row size may be picked as the sum of two powers of two and separate

expressions may be used to access the bulk of the array and the residue. For arrays

with n > 2 dimensions, the base address is repeatedly recalculated to account for n − 2

dimensions and the last two levels of loop nest are left to the hardware to deal with. Not

all array accesses need to use the same loop variables. In the example, the access of

B depends on i, k unlike A which depends on i, j. The address generator is capable of

picking the correct loop variables and plugging them into the address expression.
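A simple software model of this calculation might look as follows; the structure fields and names are illustrative and do not reflect the actual hardware encoding of an address context.

/* One address context: power-of-two row and element sizes are stored as
 * shift amounts, and two indices select which loop counters feed the
 * expression address = base + ((i << row_shift) | (j << elem_shift)). */
struct addr_context {
    unsigned base;        /* base address of the array (including any field offset) */
    unsigned row_shift;   /* log2(row size)                                          */
    unsigned elem_shift;  /* log2(element size)                                      */
    int      loop_i;      /* index of the loop counter used as the row index         */
    int      loop_j;      /* index of the loop counter used as the column index      */
};

unsigned gen_address(const struct addr_context *c, const unsigned *loop_counters)
{
    unsigned i = loop_counters[c->loop_i];
    unsigned j = loop_counters[c->loop_j];
    return c->base + ((i << c->row_shift) | (j << c->elem_shift));
}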

Each address generator has a designated partner ALU in the cluster with several

address generators possibly sharing the same partner. In cases where the address gener-

ator is not equipped to compute the array index function, it is possible to directly issue

an address computed by its partner ALU. The partner ALU can also compute address

contexts on the fly and reconfigure an address generator. The combination of an address

generator and its partner ALU can effectively deal with indirect access streams of the

type A[B[i]]. Address generation adds 1 cycle latency to load/store operations.

When the compiler emits code for an array access, the index of an address generator

and the index of an address context register within that generator are encoded into the

context index field of the load/store instruction. The selected address generator then uses

the context index field to retrieve the array parameters from the context register file as

shown in Figure 9.7. The retrieved context entry specifies the loop variables to be used

for calculating the address. The muxes at the top right of the figure use this information

to select the appropriate loop counters. The mux inputs are connected to the loop count

registers of the loop unit shown in Figure 9.6. The shifters then shift the selected loop

variables, and the result is OR-ed and added to the base address to generate an address.

To improve the processor’s cycle time, pipeline registers have been inserted just before

the final add operation.

Several special cases are handled in the address generator. It is common to unroll

loops by a small factor and software pipeline them for performance. In that case, instead

of using two loop variables, it is possible to use one loop variable and one unroll factor


to compute the address. The unroll factor is packed into the immediate field of the

instruction and selected in lieu of the loop variable using the upper 2-to-1 mux in Figure

9.7. When the access pattern is too complex to be handled by the address generator, the

lower 2-to-1 mux selects an address that is computed by an ALU. To handle vectors and

ALU generated addresses with one or zero loop variables respectively, the loop unit has

a special loop counter which is always zero.

9.6.3 Array Variable Renaming

Setting the modulo period field in load.context/store.context instructions to a nonzero

value unlocks a performance enhancing feature called Array Variable Renaming. Modulo

scheduling makes it possible to overlap the execution of multiple instances of the inner

loop body. Assume that the k loop from Figure 9.8 has a latency of 30 cycles and that

after satisfying resource conflicts and data dependences it is possible to start a new copy

of the loop body every 5 cycles. Then, up to 6 copies of the loop body could be in flight

through the execution pipeline. To get data dependences correct for new loop bodies, the

loop variable should be incremented every 5 cycles. However, when it is incremented,

old instances of the loop body that are in flight will get the wrong value and violate

dependences for load/store instructions that happen close to the end of the loop body.

The traditional solution is to use multiple copies of the loop variable in conjunction

with the VLIW equivalent of register-renaming – a rotating register file. Multiple address

calculations are performed, the appropriate values loaded into the register file and the

register file is rotated. For long latency loop bodies with short initiation intervals, this

leads to increased register pressure. The solution to this problem is to increment a single

copy of the loop variable every initiation interval and compensate for the increment in

older copies of the loop body which are in flight. The compensation factor, which is

really the modulo period, is encoded into the immediate field of load/store instructions.

It is subtracted from the loop variable’s value to cause dependences to resolve correctly.

This has the effect of rotating the array variable and letting a generic expression like A[i][j] be rebound to separate addresses. Array variable renaming effectively

converts the entire scratch pad memory into a rotating register file with separate virtual


rotating registers for each array accessed in a loop. Array variable renaming is much

more powerful than register rotation, but it can also be used in conjunction with a rotating

register file. This could be useful in cases in which it is possible to custom design rotating

register files that have lower latency than the SRAM and address generator combination

used to implement array renaming. Such a combination of array renaming and register

rotation can capitalize on the flexibility provided by array renaming and the low latency

provided by a custom designed rotating register file. The perception processor does not

have an architected register file at all – it merely uses array variable renaming in the place

of register-renaming to achieve very high throughput at low power.

9.6.4 Addressing Modes

The address generator can directly compute array references of the form A[i×P+Q][j×R+S].field and vector accesses when both loop variables are nested loops, when

one loop has been unrolled, and more importantly when the inner loop has been modulo-

scheduled. For higher dimensional arrays, the base address is repeatedly recomputed

using an ALU, and the last two dimensions are handled by the address generator.
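As a purely illustrative sketch of this mode (parameter names are invented and an element size of 4 bytes is assumed), the address generated for such a reference on a row-major array can be written as:

def strided_address(base, i, j, P, Q, R, S, row_size, field_offset=0, elem_size=4):
    # Sketch of the directly supported pattern A[i*P + Q][j*R + S].field,
    # assuming a row-major array with row_size elements per row.
    row = i * P + Q
    col = j * R + S
    return base + (row * row_size + col) * elem_size + field_offset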

Another important access pattern is indirect access of the form A[B[i]]. This is a

common ingredient of neural network evaluation and can be used to implement bit-

reversed addressing for FFT. It is also a generic access pattern – any complex access

pattern can be precomputed and stored in B[] and used at runtime to access the data

in A[ ]. Vector indirect style accesses may be done by passing an ALU generated B[i]

address through the adder in Figure 9.7 thereby offsetting it with the base address of A[ ].

The ALU address can be computed, or it can be streamed into the ALU from SRAM

by another address generator. Using two address generators and an ALU, complicated

access patterns may be realized with high throughput. If the cost in terms of SRAM and

function unit usage becomes too high, the address generator may be extended for other

application specific access patterns. The stream address generator effectively converts

the scratch-pad memory into a vector register file that can operate over complex access

patterns and even interleave vectors for higher throughput. From an operational per-

spective, associating stream address generators with small scratch-pad memories unifies


vector and VLIW architectures.
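For instance, the indirection vector B[] used for bit-reversed FFT addressing can be precomputed once and kept in scratch-pad memory; a minimal sketch, assuming the transform size is a power of two, is:

def bit_reversed_vector(n):
    # Sketch: B[i] is i with its log2(n) index bits reversed, so the vector
    # indirect access A[B[i]] walks A[] in bit-reversed order.
    bits = n.bit_length() - 1
    vec = []
    for i in range(n):
        r = 0
        for b in range(bits):
            r = (r << 1) | ((i >> b) & 1)
        vec.append(r)
    return vec

# Example: bit_reversed_vector(8) == [0, 4, 2, 6, 1, 5, 3, 7]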

9.7 Compiler Controlled Clock Gating

In a traditional architecture, a function unit pipeline always shifts unless a stall

situation happens. Operands enter the pipeline, and results exit it under hardware control.

A distinguishing feature of the perception processor architecture is that a compiler can

manage pipeline activity on a cycle by cycle basis. Microinstructions contain an opcode

field for each function unit in the cluster. The fetch logic enables the pipeline shift

and clock signals of a function unit only if the corresponding field is not a NOP. It can

also generate a NOP when the opcode field is used for another purpose. The net result

is that a function unit pipeline makes progress only during cycles when operations are

issued to it and stalls by default. The scheme provides fine grain software control over

clock gating while not requiring additional bits in the instruction to enable or disable a

function unit. When the result of an N-cycle operation is required, but the function unit

is not used after that operation, dummy instructions are inserted by the compiler into

following instruction slots to flush out the required value. To avoid excessive power-line

noise, a compiler may keep a function unit active even when it has nothing to compute.

The regular nature of modulo scheduled loops makes them good candidates for analytical

modeling and reduction of power-line noise [112].
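A hedged sketch of the decision the fetch logic makes each cycle (field and unit names below are illustrative only): a function unit's clock and pipeline-shift signals are asserted exactly when its opcode field carries a real operation.

NOP = 0  # assumed NOP encoding for this sketch

def pipeline_enables(instruction_word, units):
    # instruction_word maps each function unit name to its opcode field; a
    # unit's pipeline advances only if its field is not a NOP (a field reused
    # for another purpose is treated as a NOP for that unit).
    return {u: instruction_word.get(u, NOP) != NOP for u in units}

# Example: only fpu1 shifts this cycle.
enables = pipeline_enables({"alu0": NOP, "fpu0": NOP, "fpu1": 0x3}, ["alu0", "fpu0", "fpu1"])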

Fine grain compiler directed pipeline control has two main purposes. First, the

compiler has explicit control over the lifetimes of values held in a pipeline unlike a

traditional architecture where values enter and exit the pipeline under hardware control

and only quantities held in architected registers may be explicitly managed. In the

perception processor, pipeline registers and the associated bypass paths may be managed

as if they were a small register file, and dataflows found in custom hardware can be

easily mimicked. Second, it lets the compiler control the amount of activity within a

cluster. Software control of dynamic energy consumption makes energy vs ILP trade-offs

possible. The resulting activity pattern can approximate the ideal condition where each

function unit has its own clock domain and runs with just the right frequency.


9.8 Design Flow

The hardware netlist for a perception processor is automatically generated from a

configuration description using a specially developed netlist compiler tool. The configu-

ration description is created manually based on an analysis of benchmarks. Of particular

importance to the analysis is the relative importance of various types of operators within

an algorithm. This determines the mix of function units incorporated into a perception

processor. Also important is the dataflow within loop bodies, which determines the

interconnect topology and size and number of SRAMs. It may be possible to perform this

analysis in a semiautomated manner in the future. Based on benchmark analysis an archi-

tect creates a configuration description expressed as a Python script. The configuration

script selects a set of function units from a library of components like ALUs, multipliers

and floating point units implemented using Verilog HDL and Synopsys module compiler

languages. Each function unit in the library is annotated with attributes like latency,

opcode width and names of input and output ports. Each function unit is provided a

name and a position in the eight slots available for function units. The architect also

selects the number of input and output muxes used to create the interconnect. Depending

on the type and number of function units and SRAMs the actual HDL code for the muxes

will be generated by the netlist compiler. The architect then specifies the topology of the

interconnect. This is done by specifying the names of the function units connected to

each of the input and output muxes.
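The configuration scripts themselves are not reproduced here; the fragment below is only a guess at their general flavor, with invented names and a plain data structure standing in for the real script interface, to make the flow concrete.

# Hypothetical sketch (invented names) of what a configuration description
# captures; the real Python scripts drive the netlist compiler directly.
config = {
    "slots": {
        0: {"unit": "alu", "latency": 1},
        4: {"unit": "fpu", "latency": 7},
    },
    "srams": {"input": 8192, "scratch": 8192, "output": 2048},
    # Interconnect topology: which producers feed each input mux.
    "input_muxes": {
        "fpu0.a_reg": ["input.port_a", "alu0.result"],
        "fpu0.b_reg": ["input.port_b", "fpu1.result"],
    },
}

def mux_select_width(cfg, mux_name):
    # The netlist compiler sizes each generated mux from its source count.
    n = len(cfg["input_muxes"][mux_name])
    return max(1, (n - 1).bit_length())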

The architect also describes an instruction format in symbolic form. This is a tree

structure that defines the relative position of opcode bits for each function unit and

interconnect mux within a wide instruction word. Each field is then recursively split

into subfields. It is possible to define alternate interpretations for bitfields. For example,

the opcode slots of several function units may also be used to contain reconfiguration

information for the loop unit. A shared instruction type field in each instruction word

determines which of the interpretations should be used. The netlist compiler tool converts

the configuration description into the top level HDL description of a perception proces-

sor. It generates a small instruction decoder based on the instruction format specified by

the architect. It also creates the interconnect and its constituent muxes and connects the


ports of various hardware modules together to create a complete perception processor

implementation.

The generated processor netlist along with HDL descriptions of various components

is processed by a series of commercial ASIC design tools. The Synopsys design compiler

maps the HDL description into a gate level netlist. A suite of specially developed

gate level netlist processing scripts analyze the input and output connectivity of each

gate in the netlist to derive heuristic estimates for wire lengths. These scripts also

modify the netlist and insert an RC component on each wire. Each RC component

is named uniquely, and the wire length associated with each component is saved to a

text database. The modified netlist and a wrapper HDL design which instantiates the

processor, SRAMs, clock generator, self-checking routines, etc., are simulated using

Synopsys Nanosim, a transistor level Spice simulator. Spice transistor models for a

0.13µ CMOS process are also provided to Nanosim. Based on the saved wire lengths

and the resistance and capacitance of the lowest level metal layer, the resistance and

capacitance of each wire in the design are computed. A script then instructs the Nanosim

simulator at run time to annotate these computed values onto the RC elements that were

inserted previously. A test-bench then loads a microprogram binary into the instruction-

SRAM. Nanosim then performs a low level simulation of the entire circuit. It periodically

samples and records the supply current to a text database. The simulation repeatedly

executes the same microprogram. At the end of each execution, self-checking routines in

the test bench verify that the results present in the output SRAM match results that were

precomputed by running a C or Python implementation of the algorithm. Simultaneously,

a specially developed numerical integration program uses the supply current database to

compute power and energy consumption. When the average power consumption result

converges, the Nanosim simulation is terminated.

The configuration description written by the architect is also used as an input to the

microcode compiler so that the compiler knows the actual configuration of the pro-

cessor it is generating code for. The compiler translates a microprogram expressed

in a limited subset of Python into a microcode binary. It then configures a generic

perception processor simulator to represent the parameters specified in the configuration


description. Each microprogram file also includes an additional pure Python reference

implementation of the algorithm and some test data. The microcode binary is simulated

using the test data, and output vectors are generated and saved. The simulator then runs

the reference implementation of the algorithm and verifies that the simulation results

match the reference implementation. It then saves the output vectors in a form suitable

for use with the Verilog self-checking routines described previously. Another result of the

simulation is a log of read, write and idle cycles of each SRAM. The simulator uses this

log along with SRAM power consumption information provided by the CAD tool which

generated the SRAM macrocell to compute the energy consumption of each SRAM. The

SRAM power consumption is then added to the processor power consumption computed

using numerical integration of the Nanosim output database to arrive at the overall power

consumption.

9.9 Programming Example

This section illustrates the operation of the perception processor using a simple kernel

that is mapped into microcode. The algorithm to multiply two 16 × 16 floating point

matrices is shown in Figure 9.9. The control pattern consists of 3 level nested for

loops. Assuming that the matrices are stored in row major order, the inner product

computation will access array A along the row while B will be accessed along the column

causing a base stride access pattern. The compute pattern consists of multiply accumulate

operations, which form the core of the inner product function.

Figure 9.10 outlines a simple custom hardware accelerator for this algorithm. Ad-

dress generator A fetches the rows of matrix A. Address generator B generates the base

stride pattern for the columns of matrix B. Corresponding rows and columns are fetched

and applied to the floating point multiplier. The output of the multiplier is accumulated

in a scratch register by the floating point adder. When an inner product sum is ready, it

is written to a result SRAM, which is not shown in the figure.

In theory, this simple pipeline could compute one inner product every 16 cycles.

However, the final accumulation of the inner product value creates a pipeline problem.

def inner_product(A, B, row, col):
    sum = 0.0
    for i in range(0, 16):
        sum = sum + A[row][i] * B[i][col]
    return sum

def matrix_multiply(A, B, C):
    # C is the result matrix
    for i in range(0, 16):
        for j in range(0, 16):
            C[i][j] = inner_product(A, B, i, j)

Figure 9.9. Matrix Multiply Algorithm

Figure 9.10. Inner Product Accelerator

The floating point add takes 7 cycles and since the output is accumulated, a new product value can only be handled every 7 cycles. Hence each inner product takes 16 × 7

cycles. Interleaving the computation of 7 or more inner products relieves this bottleneck.

However, this interleave complicates address generation. The additional functionality

required to fix this problem includes: a) address generator B needs to be able to generate

multiple interleaved base-stride patterns, b) address generator A needs to hold each row

element long enough for all the interleaved inner products, and c) several scratch registers

are required to hold the intermediate sums. If the interleave factor is the same as the

latency of the floating point adder, no scratch registers are required. The output of the


adder may be fed back as an input and the intermediate sums will circulate through the

pipeline registers of the adder.
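Functionally, the interleaved schedule computes the same thing as the following sketch (plain Python rather than microcode), in which seven partial sums advance in lock step so that the adder can accept a new product every cycle:

def interleaved_inner_products(A, B, row, first_col, interleave=7, n=16):
    # Minimal functional sketch of the 7-way interleave: each element of A's
    # row is reused across the in-flight inner products, and the seven
    # partial sums stand in for the values circulating through the adder.
    sums = [0.0] * interleave
    for i in range(n):
        a = A[row][i]
        for k in range(interleave):
            sums[k] += a * B[i][first_col + k]
    return sums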

Compilers for high performance architectures attempt to approximate the dataflow in

the custom accelerator. In vector processors, vector chaining creates a similar dataflow

and reduction operators help alleviate some of the performance penalty caused by the

floating point accumulate operation. By selecting independent adds and multiplies, which

are ready for issue from its instruction window, an out of order processor will work some-

what like a vector processor that can be time sliced across several interleaved vectors. In

addition, a combination of software pipelining and branch prediction ensures that the

pipeline has as few wasted cycles as possible. Address generation will be handled by

generic ALUs which send computed addresses to available load/store ports. Some form

of register renaming will also be required to enable software pipelining to work well in

nontrivial kernels.

Figure 9.11 shows the cleaned up perception processor assembly code for the inter-

leaved inner product. For brevity the outer loops, which invoke the interleaved inner

product, are not shown. This code is capable of sustaining the same throughput (7 inner

products every 16 × 7 cycles) as the refined custom hardware accelerator. Performance

and energy efficiency are achieved by a combination of techniques.

The inner product loop i_loop is marked for hardware modulo loop acceleration, and its parameters are configured into a free context in the loop unit. Two address contexts A_ri and B_ic are allocated and the address generators attached to the input SRAM ports are reconfigured. Both contexts are tied to the loop i_loop. B_ic is set to generate a column walk indexed by i_loop, with the starting offset specified in a constant field in the load opcode. A_ri is set to access the matrix row by row in conjunction with an outer

loop. The address contexts effectively implement array variable renaming functions, a

fact which is not evident in the code.

On entering i_loop the previous loop is pushed on a stack, though its counter value is still available for use by the address contexts, particularly A_ri.

counter every 7 cycles and admits new loop bodies into the pipeline. This is not a branch

in a traditional sense and there is no branch penalty.

i_loop = LoopContext(start_count=0, end_count=15,
                     increment=1, II=7)

A_ri = AddressContext(port=inq.a_port,
                      loop0=row_loop, rowsize=16,
                      loop1=i_loop, base=0)

B_ic = AddressContext(port=inq.b_port,
                      loop0=i_loop, rowsize=16,
                      loop1=Constant, base=256)

for i in LOOP(i_loop):
    t0 = LOAD( fpu0.a_reg, A_ri )
    for k in range(0,7):  # Will be unrolled 7x
        AT(t0 + k)
        t1 = LOAD(fpu0.b_reg, B_ic, loop1_constant=k)
        AT(t1)
        t2 = fpu0.mult( fpu0.a_reg, fpu0.b_reg )
        AT(t2)
        t3 = TRANSFER( fpu1.b_reg, fpu0 )
        AT(t3)
        fpu1.add( fpu1, fpu1.b_reg )

Figure 9.11. Assembly Code for Interleaved Inner Product

Communication is explicit and happens via load/store instructions or via interfunc-

tion unit data transfers, both of which explicitly address pipeline registers. In the ex-

ample A[r][i] and B[i][c] are allocated to pipeline registers fpu0.a_reg and fpu0.b_reg

respectively. In fact, it is more appropriate to say that B[i][c + k], where k refers to the

kth interleaved inner product, resides in fpu0.b_reg at time t0 + k. No scratch registers

are required for the sum. The intermediate sums are merely circulated through the long

latency FPU adder. This notion of allocating variables both in time and space is central

to programming the perception processor.

The return value of each opcode mnemonic is the relative time at which its result

is available. The AT pseudo op is a compile time directive that controls the relative

time step in which following instructions are executed. Dataflow is arranged by referring

to the producer of a value and the time step it is produced in. Such a reference will


be translated by the compiler into commands for the forwarding logic. More complex

programs are written as several independent execution streams. The streams are then

made to rendezvous at a particular cycle by adjusting the starting time of each stream.

The example shows that compile time pseudo ops can perform arithmetic on relative

times to ensure correct dataflow without the programmer needing to be aware of the

latencies of the actual hardware implementation.

The loop body for i_loop will consist of 7 inner loop bodies created by loop unrolling.

Each inner loop body before unrolling takes 18 cycles to execute. Since i_loop has been

specified to have an initiation interval of 7 cycles, a total of 3 i_loop bodies corresponding

to 21 of the original loop bodies will be in flight within the cluster at a time. It is the

modulo aware nature of the address generators that permits each of these loop bodies

to refer to array variables in a generic manner like A[r][i] and get the reference that is

appropriate for the value of r and i which were current at the time that loop body was

started. Without special purpose address generation, such high levels of ILP will not be

possible. A previous version of the architecture without modulo address generators had

limited ILP because generic function units and registers were used for address generation

[67].

For this example, interleaving 7 inner products at a time results in two left over

columns. They are handled by a similar loop to the one shown in Figure 9.11 except that

it will have more idle slots. The adder needs to be active all the time, but the multiplier

needs to work only 2 out of every 7 cycles. Since the multiplier pipeline will not shift

5 out of 7 cycles, the dynamic energy consumption resembles an ideal circuit where

the adder runs at full frequency and the multiplier runs at 2/7 of the frequency thereby

consuming less energy.

The overall effect is that the dataflow and throughput of the perception processor

matches the custom hardware but in a more programmable manner. The address gen-

erators transfer data between the SRAMs and execution units in a distributed and au-

tonomous manner similar to the custom accelerator in Figure 9.10. The output of the

multiplier is directly forwarded to the input of the adder. As in the case of the accelerator,

no scratch registers are used. The intermediate sums are circulated through the pipeline


registers in the adder. All together, the microcode and the interconnect provide a level of

programmability while retaining a level of hardware economy close to that of the ASIC.


CHAPTER 10

EVALUATION

The benefits of the perception processor architecture are tested on 10 benchmarks

that were chosen both for their perceived importance in future embedded systems as well

as for their algorithmic variety. In order to compare the approach to the competition,

four different implementations of benchmarks are considered:

1. Software running on a 400 MHz Intel XScale processor. The XScale represents an

energy efficient embedded processor.

2. Software running on a 2.4 GHz Intel Pentium 4 processor. The Pentium 4 is

optimized for performance rather than energy efficiency since more efficient pro-

cessors cannot currently support real-time perception tasks such as speech recog-

nition.

3. A microcode implementation running on the perception processor.

4. Four of the benchmarks have been implemented as custom ASICs since ASICs

represent a high level of performance and energy efficiency that general purpose

processors are seldom able to match.

10.1 Benchmarks

The first two algorithms called GAU and HMM described in Chapter 4 are dominant

components of the Sphinx 3.2 speech recognizer. The next five algorithms named Row-

ley, Fleshtone, Erode, Dilate and Viola are components of the visual feature recognition

system described in Chapter 7. The last three algorithms are FFT, FIR and Rijndael

and these are taken from the DSP and encryption domains. The DSP algorithms were

added to test the generality of our approach. DSP functions like FFT and FIR are


important components of speech recognition front ends and image processing algorithms.

Encryption is of increasing importance to secure embedded systems. Rowley, GAU, FFT

and Fleshtone are floating point intensive. The remaining benchmarks are integer only

computations. Some components of GAU, Rowley and Fleshtone may be vectorized

while the rest of the algorithms cannot. HMM is intensive in data dependent branches

which may be if-converted.

Several source level optimizations have been made to the software versions that run

on the Pentium and XScale to boost their performance as much as possible [66]. The

optimizations included hand unrolled loops, partial specialization of functions when

some arguments are known statically, replacing expensive functions with table lookups,

reshaping data structures for better cache locality and a variety of algorithm optimiza-

tions discussed in Chapters 5 and 7. No SIMD optimizations were made in order to

keep the comparison fair. The perception processor could use SIMD floating point units,

just like SSE on the Pentium, but widening datapaths makes isolating the impact of

architectural options like compiler controlled dataflow impossible. A brief description of

the benchmarks follows.

GAU and HMM represent Gaussian probability density evaluation and hidden Markov

model evaluation respectively. GAU occupies 57.5% and HMM consumes 41.5% of the

execution time of the Sphinx 3.2 speech recognition system. Both Gaussian distribu-

tions and hidden Markov models are components of most mature speech recognizers

[59, 111, 91]. GAU computes how closely a 10 ms frame of speech matches a known

Gaussian probability distribution. One input packet corresponds to evaluating a single

acoustic model state over 10 frames of a speech signal. A real-time recognizer needs to

process 600,000 invocations of the GAU algorithm every second. The HMM algorithm

performs a Viterbi search over a hidden Markov model corresponding to one model state.

One input packet to the HMM implementation consists of 32 five-state hidden Markov

models. While the GAU algorithm is entirely floating point, the HMM algorithm is

dominated by integer compare and select operations. Its average rate of invocation varies

significantly with context, but to guarantee real-time performance it is assumed in this

research that all HMM models are evaluated thereby brute forcing a large component of


speech processing.
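Although the exact arithmetic is detailed in Chapter 4, the core of GAU is the familiar diagonal-covariance Gaussian log-likelihood; a minimal sketch (not the Sphinx 3.2 code) for one mixture component is:

import math

def diag_gaussian_log_likelihood(x, mean, var, log_mix_weight=0.0):
    # Sketch: log of a diagonal-covariance Gaussian evaluated at feature
    # vector x, plus the log mixture weight of this component.
    ll = log_mix_weight
    for xi, mi, vi in zip(x, mean, var):
        ll -= 0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return ll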

Rowley represents a neural network based visual feature detector [83]. In the face

recognizer a multilayer neural network is swept over 30 × 30 rectangular regions of

an image. Each individual neuron is evaluated by the function tanh(Σ_{i=1}^{n} Weight_i × Image[Connection_i]). Neurons have multiple sizes for their fan-in (n), and each layer

depends on the preceding layer’s output. The software implementation developed for

this dissertation used hand unrolled, specialized versions of neuron evaluation functions

for each input size. Also, tanh() was implemented via table lookup whereas Rowley’s

original implementation used the tanh() function in the C library. This optimization

boosted the Pentium’s performance by a factor of 2.5. A 30 × 30 image as well as the

outputs of all the neurons are maintained within the perception processor. Depending on

the sizes of the neurons an input packet consisting of the weights and connections of 7

to 64 neurons is streamed through the perception processor. All computations involve

single precision floating point numbers.
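A minimal sketch of one neuron evaluation (plain Python rather than the hand-specialized versions described above):

import math

def evaluate_neuron(weights, connections, image):
    # Weighted sum of the pixels this neuron is connected to, followed by
    # tanh; the optimized software version replaces tanh with a table lookup.
    total = 0.0
    for w, c in zip(weights, connections):
        total += w * image[c]
    return math.tanh(total)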

Fleshtone represents a skin toning algorithm typically used as a preprocessing step

to find skin colored regions of an image so that a more sophisticated object detector like

the Rowley detector may be applied to it. The benchmark converts RGB pixels to another

color space and checks if the projected pixel falls in between two parabolic curves [90].

This algorithm represents a case that is difficult to vectorize since there are far more

floating point operators per pixel than the number of FPUs present in the cluster. This

necessitates multiple passes and saving of intermediate results. It also contains multiple

if statements in the body. Each input packet consists of a single raster line of a 320×200

24-bit color image. The output is a 320-entry bitmap whose elements are set where flesh

color is found.

Erode and Dilate represent two operators from mathematical morphology that help

in image segmentation. Erode sweeps a 3 × 3 pixel filter over the bitmap produced by

Fleshtone and cuts away weakly connected regions, i.e., it blacks out a pixel unless all pixels

within the filter are set. Dilate does the opposite: it sweeps a 5 × 5 pixel filter over

a bitmap and fills in pixels if any of the pixels are set. Fleshtone, Erode and Dilate are

used for image segmentation in a visual feature recognition system [65]. Erode works on


three raster lines and dilate works on five raster lines of a 320 × 200 image.
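As an illustration of the two operators (a simple sketch that ignores the raster-line streaming used in the benchmark), erosion over a bitmap looks like:

def erode(bitmap):
    # Clear a pixel unless every pixel in its 3x3 neighborhood is set;
    # dilate is the dual, setting a pixel if any pixel in a 5x5 window is set.
    rows, cols = len(bitmap), len(bitmap[0])
    out = [[0] * cols for _ in range(rows)]
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            out[y][x] = int(all(bitmap[y + dy][x + dx]
                                for dy in (-1, 0, 1)
                                for dx in (-1, 0, 1)))
    return out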

Viola is a reimplementation of the Viola and Jones’ method of object detection based

on a well known machine learning algorithm known as AdaBoost [103]. The algorithm

relies on computing features or wavelets which are the weighted sum or difference of

rectangular regions within a 30 × 30 window into an image. The coordinate and weight

information for 100 features are maintained within the perception processor. Each input

packet contains a 30 × 30 pixel image. The output contains the evaluation of all 100

features over the 30 × 30 image.
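A hedged sketch of evaluating one such feature (the rectangle representation below is illustrative; the actual method, including its integral-image optimization, is described in [103] and Chapter 7):

def evaluate_feature(window, rectangles):
    # Each feature is a weighted sum or difference of rectangular regions of
    # a 30x30 window; rectangles is a list of (x, y, width, height, weight).
    total = 0
    for x, y, w, h, weight in rectangles:
        region_sum = sum(window[r][c]
                         for r in range(y, y + h)
                         for c in range(x, x + w))
        total += weight * region_sum
    return total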

FFT implements a 128 point complex to complex Fourier transform on floating point

data. The Fourier coefficients are maintained within the perception processor. Input and

output packets consist of 128 complex numbers where each complex number consists

of two single-precision floating point numbers. FFT represents a common algorithm for

which many DSP processors implement ISA extensions. FFT also represents a case that

causes bad interconnect conflicts on our architecture. Good performance depends on

the interconnect borrowing technique described in Section 9.5. The software version on

the Pentium is based on FFTW, a highly tuned FFT implementation which used dynamic

programming techniques to adapt itself to the processor architecture [38]. The microcode

implementation on the other hand uses a simple radix-2 algorithm and no ISA extensions.

Since FFTW cannot be used on the XScale, the simple radix-2 algorithm is used instead.

FIR is a 32 tap finite impulse response filter, a common primitive in DSP appli-

cations. Impulse response coefficients are maintained inside the perception processor.

Input packets of various sizes may be applied to the filter, which successively evaluates

each input and outputs one integer corresponding to every input word.
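For reference, the computation performed per output sample is the usual FIR convolution; a minimal sketch over one input packet (warm-up samples omitted) is:

def fir_filter(samples, coeffs):
    # 32-tap FIR when len(coeffs) == 32: each output is the dot product of
    # the coefficients with the most recent taps of the input stream.
    taps = len(coeffs)
    out = []
    for n in range(taps - 1, len(samples)):
        acc = 0
        for k in range(taps):
            acc += coeffs[k] * samples[n - k]
        out.append(acc)
    return out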

Rijndael is the advanced encryption standard. The particular version implemented

here uses 128 bit keys and works on 16 byte blocks [29]. Input blocks are 576 bytes long

to simulate network level encryption. The default maximum size of Internet packets

is 576 bytes. The key as well as the encryption S-boxes are maintained within the

perception processor.


10.2 Metrics

The trade-off between energy consumption and performance is a common modern

design choice. Increasing performance almost always involves increasing the energy

requirements. As a result, it is misleading to compare solely on the basis of either energy

or performance. This dilemma is even more meaningful for the real-time embedded

perception applications that are the driving force for this work. The ability to process

faster than real time simply means that power is being wasted. Therefore a common

tactic in such cases is to either reduce clock frequency, supply voltage, or both. The

fine grain scheduling capability of the perception processor also enables the work rate to

be scheduled, which is a more intuitive mechanism and achieves results similar to clock

frequency scaling.

An attractive and intuitive metric is to compare designs based on the energy expended

to perform work at some rate [21]. Gonzalez and Horowitz showed that Spec²/Watt, or

its inverse, the energy delay product, is a good metric of architectural merit [39]. Both

architecture and semiconductor process influence the energy delay product. Since the

feature size of the process, λ, has such a large impact it is necessary to normalize any

design comparison to the same process. The normalization techniques applied to the

results were described in Section 3.3.

The perception processor and the Pentium 4 are both implemented in 0.13µ CMOS

technology and their results need not be normalized. The XScale and the custom ASICs

are implemented using 0.18µ and 0.25µ technologies respectively, and their results are

normalized using this method to a 0.13µ technology. The metrics used for evaluating

the perception processor are: IPC, power, throughput, energy consumed to process each

input packet, energy delay product and ET².
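For concreteness, both composite metrics reduce to simple products of the per-packet energy and the per-packet processing delay (smaller is better for both):

def energy_delay_metrics(energy_joules, delay_seconds):
    # Energy delay product (EDP) and energy delay squared product (ET^2).
    edp = energy_joules * delay_seconds
    et2 = energy_joules * delay_seconds ** 2
    return edp, et2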

10.3 Experimental Method

Hardware netlists for two different perception processor configurations were gener-

ated for this evaluation. They will henceforth be referred to as the integer cluster and

the floating point cluster. The integer cluster consists of four ALUs, two multiply units,

and the remaining two slots are unused. The floating point cluster contains four ALUs


and four FPUs. All of the integer benchmarks except FIR and Viola would run equally

well on the floating point cluster. FIR and Viola require integer multiply operations.

The hardware for each configuration (the entire organization shown in Figure 9.1) was

generated. The input and scratch SRAMs are sized at 8 KB each and the output SRAM

is 2 KB in size. The design is simulated at the transistor level using Spice while running

the microcode for the benchmarks. The Spice simulation provides a supply current

waveform with one sample per 100 pico seconds. This information along with the

supply voltage is used to compute instantaneous power consumption. Then numerical

integration of power over time is performed to compute energy consumption.
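The numerical integration step amounts to summing V·I over the sampled supply-current waveform; a minimal sketch, assuming the fixed 100 ps sampling interval and a constant supply voltage, is:

def energy_from_current_samples(currents_amps, vdd_volts, dt_seconds=100e-12):
    # Energy = integral of instantaneous power (V * I) over time,
    # approximated as a sum over fixed-width sample intervals.
    return sum(vdd_volts * i * dt_seconds for i in currents_amps)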

The dual-ported SRAMs are macrocells generated by an SRAM generator tool and

simulating the entire SRAM array using Spice is not feasible. For the SRAMs each

read, write and idle cycle were logged. The normalized energy consumption was then

computed based on the read, write and idle current reported by the SRAM generator.

Each benchmark is run for several thousand cycles until the energy estimate converges.

This chapter assumes a framework similar to Figure 1.2 where a host processor and

memory controller combination transfers data into and out of the perception proces-

sor’s local SRAMs. The perception processor operates only on data present in local

SRAM and has no means of accessing main memory. To isolate main memory system

power consumption and compare the merits of the processors in a fair manner, both the

perception processor and the general purpose processors are forced to repeatedly reuse

data which has already been transferred into on-chip memory. The host processor is not

simulated.

The function units are described in Verilog and the Synopsys module compiler lan-

guage. The overall cluster organization and interconnection between function units is

automatically generated by the compiler. The whole design is then synthesized to the

gate level and a clock tree is generated. The net list is then annotated with heuristic worst

case RC wire loads assuming all routing happened on the lowest metal layer. The energy

measurements are therefore likely to be pessimistic. Exact measurements are extremely

sensitive to wire routing decisions, and as a result wire capacitance calculations were

based on the worst-case wiring layer. The microcode corresponding to the benchmark


is loaded into program memory and the Spice model is simulated in NanoSim, a com-

mercial VLSI tool with Spice-like accuracy. The circuits were originally designed for a

0.25µ CMOS process but were subsequently retargeted to a 0.13µ process [22, 23]. Only

the 0.13µ results are reported here.

The software version of each benchmark was compiled with the GNU GCC com-

piler using the O3 optimization level and run on a 2.4 GHz Intel Pentium 4 processor.

This system has been modified at the board level to permit measuring average current

consumed by the processor module using a digital oscilloscope and nonintrusive current

probe. Several million iterations of each benchmark algorithm were run with the same

input data to ensure that the input data always hits in the L1 Cache. So the L2 Cache and

memory system effects are isolated as much as possible and the measurement represents

core power. For the XScale system a similar approach is used except that software control

is used to turn off unnecessary activity and the difference between the quiescent state and

the computation is measured. This method could slightly inflate the processor power,

but measuring the core power alone is not technically feasible on this system due to

packaging constraints. The choice of both systems was based on the technical feasibility

of PCB modifications to permit measuring energy consumption.

Embedded processors like the XScale do not have floating point instructions that

are required for some of the benchmarks. Software emulated floating point will bloat

the energy delay product of the XScale and make a meaningful comparison impossible.

Therefore the comparison is done against an ideal XScale, which has FPUs that have the

same latency and energy consumption as an integer ALU. This is done by replacing

each floating point operator in the code with a corresponding integer operator. The

code is then run on a real XScale processor. Henceforth, the name XScale refers to the

idealized XScale implementation. Floating point units typically incur several times the

latency and power overheads of their integer counterparts. The results computed by the

algorithm after replacing floating point operators with integer operators are meaningless,

but the performance and energy consumption represent a lower bound for any real XScale

implementation with FPUs. This makes the XScale results look better than they really

are.


10.4 Results

The design goal of the perception processor was to achieve high performance for

perceptual algorithms at low power. For stream computations, a very important consid-

eration is whether a system has sufficient throughput to be able to process the data rate

in real time. Since dynamic energy consumption is directly proportional to operating

frequency, one method for achieving this goal is to exploit high levels of instruction level

parallelism for stylized applications without paying a high price in terms of hardware

complexity. The details of this approach were discussed in Chapter 3. Before the results

are presented, a few preliminary points should be noted.

1. With the exception of Figures 10.1 and 10.7, the Y axis of all graphs uses a loga-

rithmic scale on account of the large range of data.

2. Energy, energy delay product, energy delay squared product and power numbers

in all the graphs in this chapter are normalized to a 0.13µ process. Since the

perception processor and the Pentium are both implemented in 0.13µ processes,

their normalized results correspond to the actual results. Only the XScale and

ASIC numbers are actually scaled.

3. In this chapter the terms average and mean refer to the geometric mean.

10.4.1 Instruction Level Parallelism

Figure 10.1 shows the IPC of the perception processor compared against the IPC

measured using native performance counters on an SGI R14K processor. The bench-

marks were compiled for the R14K using the highly optimizing SGI MIPSpro compiler

suite. The perception processor achieved a mean improvement in IPC of 3.3 times

over the sophisticated super-scalar out of order processor. Figure 10.1 also shows the

breakdown of IPC between execution units and the memory system. It may be seen

that a large fraction of the IPC improvement may be directly attributed to the memory

system, which can transfer data at a high rate into and out of the function units. This leads

to high function unit utilization and high IPC. Since each load/store instruction triggers

an address calculation operation, the two are counted as separate instructions.

Figure 10.1. IPC (bars: R14K, Perception Proc MEM IPC, Perception Proc EX IPC)

Though

an address calculation is counted as a single instruction it should be understood that it

does the equivalent of several shift, mask, and add operations on a regular processor as

explained in Section 9.6.2. The results clearly demonstrate that the design goal of high

throughput through ILP has been achieved.

10.4.2 Power Consumption

Figure 10.2 shows the process normalized steady state power consumption of the

different implementations. It is seen that even though the perception processor harvests

high levels of ILP, its power consumption in the integer configuration is lower than the

single issue XScale embedded processor, and the power consumption of the floating point configuration exceeds that of the XScale by at most 14.4%.

Figure 10.2. Power Consumption (bars: XScale, Pentium 4, Perception Processor, ASIC)

in reality, the XScale’s power consumption for the floating point benchmarks can never

be as low as the values shown in Figure 10.2. As mentioned in Section 10.3, for the

floating point applications, the experiments represent an ideal XScale processor where

a floating point operation consumes only as much power as its integer counterpart. An

XScale implementation with a floating point unit would likely consume more power for

the floating point benchmarks. To be fair to the competition, it is worth noting that the

XScale is significantly more general than the perception processor since it has a TLB,


caches and a memory controller. The benchmarks do not exercise the memory controller.

The perception processor lacks that level of generality, but possesses eight function units,

address generators, loop accelerators and scratch-pad memory, which are not present in

the XScale.

Both the Pentium and the perception processor exhibit significant variability in power

consumption depending on the application whereas the power consumed by the XScale

is relatively independent of the application. For example, among the floating point

algorithms run on the perception processor, GAU has the highest power consumption

of 0.757 W while Fleshtone has the lowest at 0.67 W. This corresponds to an 11.5% energy

optimization achieved through compiler controlled data flow and compiler controlled

clock gating. For the integer configuration the application dependent power variation is

even larger: there is a 27.9% power saving between HMM and FIR. In contrast,

the maximum application dependent power variation in the XScale happens between

Rijndael and FIR corresponding to 4.6% power savings. The Pentium achieves a 17.1%

power difference between Fleshtone and HMM. The perception processor thus possesses

a superior ability to capitalize on application dependent power saving opportunities.

10.4.3 Throughput

Figure 10.3 shows the throughput of the perception processor, the Pentium 4 and

the XScale processors as well as ASIC implementations. Throughput is defined as the

number of input packets processed per second and the results shown in Figure 10.3 are

normalized to the throughput of the Pentium 4. The perception processor operating at

1 GHz outperforms the 2.4 GHz Pentium 4 by a factor of 1.75 (Geometric Mean). The

perception processor’s mean throughput is 41.4% of that of the ASIC implementations

(GAU, Rowley, FIR, Rijndael). This is severely skewed by the fact that the ASIC

implementations, particularly Rijndael, expend vastly more hardware resources than the

perception processor. This is evident from Figure 10.2, which shows that in the case of

Rijndael, the ASIC consumes more than twice the power of the perception processor.

For the set GAU, Rowley and FIR, the perception processor in fact achieves on average

84.6% of the throughput of the ASIC implementation. These results clearly demonstrate

the benefit of the perception architecture to the problems posed by perceptual algorithms.

Figure 10.3. Throughput Normalized to Pentium 4 Throughput (bars: XScale, Pentium 4, Perception Processor, ASIC)

Two of the benchmarks demand further explanation. FFT is the only benchmark

where the Pentium outperforms the perception processor. This is due to the fact that

the version of FFT used on the Pentium is based on FFTW, one of the fastest FFT

libraries in existence. It uses a mixture of processor specific measurements and dy-

namic programming optimizations to adapt itself to the specific system it is run on.

The perception processor on the other hand uses a simple radix-2 algorithm as does

the XScale implementation. This is on account of the fact that FFTW is implemented

as a large C library and is difficult to reimplement manually in microcode without the


aid of a C compiler that targets the perception processor. XScale lacks the floating

point hardware to support FFTW. The radix-2 algorithm is not particularly well suited

for the perception processor since it causes bad interconnect conflicts that lead to too

high an initiation interval for the main loop. In spite of these adversities the perception

processor implementation achieves 64% of the performance of the Pentium at less than

half its clock frequency. DSP processors typically implement a bit-reversed address

space to improve the performance of FFT [42]. The main reason for the reasonable

FFT performance of the perception processor is that it uses hardware support for vector

indirect accesses to implement a bit-reversed addressing mode for this application. An

indirection vector that corresponds to bit-reversed array indices is kept and used from the

scratch SRAM.

The other outlier is Fleshtone, the benchmark on which the perception processor

performs the best. Though this is a relatively simple algorithm, it involves numerous

floating point operations. Since the number of operators far exceeds the number of

function units available on the perception processor, the dataflow graph of the algorithm

was split into several small subgraphs, and multiple passes were made over an input

packet (320 pixel raster line) to fully evaluate the algorithm. Numerous temporary values

are generated in the process, and these are stored in the SRAMs between successive

passes. The Pentium version on the other hand fully evaluates the algorithm on each pixel

before moving on to the next pixel in the input packet. The floating point register stack

in the x86 architecture is inadequate to capture the number of temporary results created.

This results in several unnecessary moves, exchanges, loads and stores of intermediate

values. The main loop body generated by GCC contains over 80 instructions and takes

more than 208 cycles on average per iteration. In the case of the perception processor,

compiler controlled dataflow reduces the number of temporaries and the SRAM memory

permits storage of a very large number of intermediate results – over 1600 values in six

passes. Ultimately, this leads to the perception processor outperforming the Pentium by

a factor of 6.4.


10.4.4 Energy Consumption

In battery powered systems, the energy consumed to complete a task is often a more

relevant metric than power. Circuit designers often have the ability to trade-off power

for performance. Thus it is possible for a high power system, which rapidly completes a

task, to consume less energy than a low power system that steadily draws power for

an extended period to complete the same task. Battery life for mobile systems can

be extended by being energy efficient, not necessarily by being low power. Figures

10.2 and 10.3 showed that the perception processor has low power consumption and

high performance. This in turn translates to a high degree of energy efficiency. Fig-

ure 10.4 shows the per packet energy consumption of the perception processor and its

competition. While delivering 11.8 times the performance of the XScale processor, the

perception processor consumes 13.5 times less energy than the XScale for each input

packet. General purpose processors exact a high energy cost for their generality and

programmability when compared to ASICs. From the results in Figure 10.4 it is possible

to compute that on average the XScale consumes 79.3 times more energy per input packet

when compared to the ASIC implementations (Gau, Rowley, FIR, Rijndael). In sharp

contrast, the perception processor’s energy consumption is only five times larger than

that of the ASIC. The perception processor thus radically improves energy efficiency

while retaining a high level of generality and programmability.

10.4.5 Energy Delay Product

Though CMOS circuits often have the ability to trade energy for performance, it is

quite difficult to improve both energy and performance simultaneously. Gonzalez and

Horowitz argue that the process normalized energy delay product (EDP) or alternately,

Spec²/(Watt·λ²), which corresponds to the inverse of EDP, is a relatively implementa-

tion neutral metric [39]. They demonstrate that this metric causes the architectural

improvements that contribute the most to both performance and energy efficiency to

stand out. For example, their results demonstrate that pipelining is of fundamental

importance to processor performance and energy efficiency, but super scalar issue is

a lesser contribution.

Figure 10.4. Process Normalized Energy Consumption (bars: XScale, Pentium 4, Perception Processor, ASIC)

Figure 10.5 shows the process normalized energy delay product

(EDP) of the four different designs. It may be seen that in spite of their radically different

architectures, the XScale’s EDP is within 31.4% of the EDP of the Pentium if we ignore

the outliers FFT and Fleshtone. The FFT result is different because the XScale uses a

simple radix-2 algorithm instead of the optimized FFTW library used on the Pentium.

The Fleshtone result underlines the fact that for this floating point benchmark, the XScale

is modeled as an ideal implementation. The floating point version of this algorithm has

a performance problem on the Pentium as explained in Section 10.4.3.

It is evident from Figure 10.5 that the perception processor has a radically better

EDP, which is often one or two orders of magnitude better than its competition.

Figure 10.5. Process Normalized Energy Delay Product (bars: XScale, Pentium 4, Perception Processor, ASIC)

It is

particularly noteworthy that in the case of FFT where the perception processor achieved

only 64% of the throughput of the Pentium, it improves EDP by a factor of 24.5. This

may be largely attributed to the higher energy efficiency of the perception processor. The

perception processor on average improves on the EDP of the XScale by a factor of 159

and is only 12 times worse than the ASIC. The perception processor is thus able to bridge

the wide gap in EDP between CPUs and ASICs.


10.4.6 Energy Delay Squared Product

Martin, Nystroem and Penzes argue that ET² is a voltage independent metric that

is better than the energy delay product [64]. For reasons explained in Section 3.5 this

research uses ET² merely as a metric that favors performance at the cost of energy.

Figure 10.6 compares the ET² efficiency of the perception processor against its

competition. Since this metric favors performance over energy, in most cases the Pentium

outperforms the XScale unlike the situation in Figure 10.5. The perception processor

outperforms the Pentium on average by a factor of 405 while it is 1869 times better than

the XScale. The ASIC is only 29 times better than the perception processor.

Figure 10.6. Process Normalized Energy Delay Squared Product (ET²) (bars: XScale, Pentium 4, Perception Processor, ASIC)

10.4.7 Clock Gating

Figure 10.7 shows the synergistic effect of applying clock gating to a cluster that

supports compiler controlled datapaths. Compiler controlled datapaths provide energy

reduction by decreasing datapath activity and avoiding register file and SRAM accesses.

To implement it, the load enable signal of each pipeline register should be controlled by

software. Since compiler controlled data flow demands circuits with software controlled

pipeline register enable signals, it is a trivial extension to clock gate pipeline registers

using the same signals. It is seen in the graph that on average this saves 39.5% power

when compared to the implementation without clock gating. These results are affected

by two factors: a) SRAM power adds a large constant factor to both cases, and b)

multicycle datapaths like the FPUs are not clock gated because of limitations of the CAD

tools. Further reduction is possible by clock gating multicycle datapaths.

Figure 10.7. Impact of Clock Gating (bars: Clock gated, Not clock gated)

10.4.8 The Cost of Generality

It could be argued that the perception processor achieves impressive power sav-

ings because it lacks the level of generality possessed by the Pentium or the XScale.

The perception processor is believed to be Turing complete since it has instructions

for integer arithmetic, comparisons, conditional moves, loads, stores and direct and

indirect branches. However, Turing completeness is no measure of the ability to execute

arbitrary programs efficiently. While it is possible to modify the perception processor

for efficiency in the general case by traditional means like adding caches and branch

prediction, consider the simpler alternative of using a perception processor to augment

a general purpose processor. The generic sections of perception applications run on a

host processor, and the perception specific algorithms run on the perception processor

attached to the host processor. How efficient could such an organization be?

Consider the case where the host processor is an XScale. This scenario represents a

complete system since the XScale contains its own memory controller. It is true that

additional interface circuits will be required between the XScale processor core, the

memory controller and the perception processor. However, such additional circuitry is

likely to be a very small portion of the hardware of the complete system and should not

affect the results presented here significantly. It is also the case that the XScale is ill

suited for this application since it consumes too much power for its performance level

and possesses too much generality. A low power DSP might be a better choice for a host

processor. But choosing an inefficient host processor makes the results presented in this

section very conservative.

Figure 10.2 shows that the process normalized peak power consumptions of the

XScale and the perception processor are 0.675 W and 0.757 W respectively. Consider a

chip multiprocessor called PP+ consisting of an XScale core and a perception processor

on the same die. PP+ will then have a peak power consumption of 1.4 W. To make the

results conservative assume that PP+ consumes 1.4 W of power for all the benchmarks

even though in reality the application specific power savings will be significant. Figure

10.8 shows the energy consumed by PP+ to process each input packet. It may be seen

that in spite of the addition of a host processor, PP+ has a significantly lower energy

Page 150: THE PERCEPTION PROCESSOR - CiteSeer

135

(Bar chart omitted: energy per input in mJ, log scale, for the XScale, Pentium 4, PP+, and ASIC implementations across the benchmarks.)

Figure 10.8. Energy Consumption of PP+

consumption than the XScale and the Pentium. This is because energy is the integral of power over time. Even though PP+ has a higher power consumption than the XScale, its superior performance allows it to complete tasks faster and thus consume less energy. In particular, PP+ consumes 5.5 and 53.6 times less energy per packet than the XScale and the Pentium respectively. It is only a factor of 12.4 worse than the ASIC implementations.
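The bookkeeping behind these ratios is direct: with the conservative assumption that PP+ draws a constant 1.4 W, the energy per packet is simply that power multiplied by the time needed to process one packet. The per-packet latencies in the short sketch below are hypothetical placeholders, not the measured values plotted in Figure 10.8.

P_PP_PLUS = 1.4      # W, assumed constant for PP+ (conservative)
P_XSCALE  = 0.675    # W, process-normalized XScale peak power

def energy_per_packet(power_watts, seconds_per_packet):
    # For a constant power draw, energy is power integrated over time: E = P * t.
    return power_watts * seconds_per_packet

t_pp_plus = 4.0e-6   # s per packet, hypothetical
t_xscale  = 30.0e-6  # s per packet, hypothetical (PP+ finishes much sooner)

e_pp = energy_per_packet(P_PP_PLUS, t_pp_plus)
e_xs = energy_per_packet(P_XSCALE, t_xscale)
print(f"PP+ {e_pp * 1e3:.2e} mJ/input, XScale {e_xs * 1e3:.2e} mJ/input, ratio {e_xs / e_pp:.1f}x")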

Figure 10.9 shows the energy-delay product of the PP+. Since the power consumption of the PP+ is slightly larger than twice the power consumed by the perception processor, the energy-delay product is expected to be a scaled-down version of Figure 10.5. This


(Bar chart omitted: energy-delay product in J*1e-9 s, log scale, for the XScale, Pentium 4, PP+, and ASIC implementations across the benchmarks.)

Figure 10.9. Energy Delay Product of PP+

is indeed the case, with PP+ outperforming the XScale and the Pentium by factors of 64.1 and 93.6 respectively, while underperforming the ASIC implementations by a factor of 30. The results clearly demonstrate the benefit of using perception processors as coprocessors to general purpose processors.
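The scaling argument can be made explicit. With the per-packet delay unchanged, the energy-delay product is the constant power multiplied by the square of the delay, so assuming 1.4 W for PP+ instead of 0.757 W for the perception processor alone scales every benchmark's figure by the same ratio. The delay below is a hypothetical placeholder.

def energy_delay_product(power_watts, delay_seconds):
    energy_joules = power_watts * delay_seconds   # E = P * t for constant power
    return energy_joules * delay_seconds          # EDP = E * t

P_PP, P_PP_PLUS = 0.757, 1.4     # W: perception processor alone vs. PP+
delay = 4.0e-6                   # s per packet, hypothetical, identical for both
print(energy_delay_product(P_PP_PLUS, delay) / energy_delay_product(P_PP, delay))
# prints ~1.85, i.e. P_PP_PLUS / P_PP, independent of the assumed delay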

10.5 Summary

The architectural features of the perception processor enable it to provide 1.74 times

the throughput of a Pentium 4 while consuming 13.5 times less energy than an XScale

embedded processor. Its architectural efficiency allows it to reach 41.4% of the throughput of the ASIC at five times the energy consumption of the ASIC – a small price for its

generality and programmability. Since the processor circuits were evaluated at the netlist

level and not laid out, rigorous area estimates were not made. Approximate estimates

show that the die area is dominated by the amount of SRAM used in the design, while the function units and interconnect occupy only a small fraction of the overall area. For typical high-performance embedded systems, having adequate compute ability at a low energy budget is the critical factor, not area. The microprograms for the benchmarks discussed in this chapter took approximately 10 to 20 man-hours each to develop. The effort required could be drastically reduced if a high-level language compiler were developed. In contrast, ASIC implementations of benchmarks like FFT and Fleshtone might take several man-months of effort. Altogether, these radical improvements suggest that in cases where high performance, low design time, and low energy consumption need to be

addressed simultaneously, the perception processor could be an attractive alternative.


CHAPTER 11

CONCLUSIONS

Natural human interfaces built on technologies like speech recognition, gesture recog-

nition, object detection and tracking are central to the widespread acceptance of future

embedded systems. The chances for today's isolated embedded devices to develop into tomorrow's ubiquitous computing environment also depend on services like secure

wireless networking, media processing and integration with visual and audio interfaces.

The levels of performance and power efficiency required to achieve these goals are orders

of magnitude beyond the ability of current embedded processors. Application specific

processor architectures can effectively solve some of these challenges.

The performance characteristics of a face recognition system based on well-known

algorithms and a leading research speech recognition system were analyzed. By recasting

these perception algorithms as well as DSP and encryption algorithms onto an architecture optimized for stream processing, high levels of ILP and energy efficiency were demonstrated. The perception processor uses a combination of VLIW execution clusters, compiler-directed dataflow and clock gating, hardware support for modulo scheduling, and special-purpose address generators to achieve high performance at low power for perception algorithms. Operationally, the combination of stream address generators and scratch-pad memories represents a unification of VLIW and vector styles of execution.

The perception processor is a fairly minimal, yet programmable hardware substrate that

can mimic the dataflow found in ASICs. It exceeds the throughput of a Pentium 4 by a factor of 1.75, with an energy-delay product that is 159 times better than that of an XScale embedded processor. Its energy-delay product is just 12 times worse than that of an

ASIC implementation. This approach has a number of advantages:

1. Its energy-delay efficiency is close to what can be achieved by a custom ASIC.


2. The design cycle is extremely short when compared to an ASIC since it substitutes

circuit design with interconnect topology selection and microcode programming.

3. The perception processor architecture is simple and regular. Hardware netlists for

perception processor configurations are automatically generated. Once the netlist

generator and the basic architectural components are proven to be correct, percep-

tion processor configurations should be easier to implement correctly compared to

ASICs. The perception processor architecture provides very fine-grained control over hardware resources, making workarounds for hardware problems and software bug fixes easy.

4. Since applications are implemented in microcode, post-deployment bug fixes are

trivial.

5. It retains a large amount of generality compared to an ASIC.

6. It is well suited for rapid automated generation of domain specific processors.

A larger set of applications needs to be analyzed in the future to ensure that the ar-

chitectural primitives of the perception processor have sufficient generality to cover the

perception domain comprehensively. Automated architecture exploration and application

analysis, programming language support for perceptual primitives and streaming, and

formal methods to ensure real-time response will be important directions for future

research.

It has been shown that fine-grained management of communication and storage re-

sources can improve performance and reduce energy consumption, whereas simultane-

ously improving on both these axes using a traditional microprocessor approach has been

problematic. The perception processor is an attractive choice when performance, power

efficiency, programmability and rapid design cycles are important. For the first time,

sophisticated real-time perception applications appear to be possible within an energy

budget that is commensurate with the embedded space.


CHAPTER 12

FUTURE RESEARCH

The architecture of the perception processor presented in this dissertation gradually

evolved from analyzing and observing the characteristics of speech recognition and vi-

sion algorithms and trying to design ASICs and traditional processors to accelerate these

tasks. The design process has led to the realization that it may be possible to systemati-

cally derive power-efficient, high-performance processors for a wider class of algorithms.

This chapter outlines possible directions for future extensions to the perception processor

architecture. In this chapter, the term stream processor refers to the extended version of

the architecture so as to clearly distinguish it from the perception processor presented in

Chapter 9.

The term stream processing refers to real-time computations on high-bandwidth data streams. Examples include link-level encryption in networks, video transcoding, and compression of video streams. Perceptual algorithms tend to be stream-oriented. An important direction for future research is the architecture of generic, high-performance, low-power stream processors that can accelerate both perception algorithms and streaming

algorithms from other domains.

Figure 12.1 shows an abstract representation of a stream function. It is a generaliza-

tion of the map(), reduce() and filter() list processing functions and list comprehensions

found in the Python and Haskell languages [101, 54]. Analogues exist in Lisp and similar

languages. It applies a side-effect-free function lambda_func() to arguments gathered from a set of input variables and stores the result to a set of output variables. The input and output variables may be scalars, vectors, multidimensional arrays, or more complex aggregates. The procedure input_iterator() is history-sensitive. Each time it is invoked,

it returns a tuple consisting of input data gathered from the various input variables. The


StreamFunc(input_iterator, input_predicate,
           output_iterator, output_predicate,
           lambda_func) -> output_data

input_iterator() -> input_tuple
input_predicate(input_tuple) -> true|false
lambda_func(input_tuple) -> output_tuple
output_predicate(output_tuple) -> true|false
output_iterator(output_tuple)   /* Stores output_tuple */

Figure 12.1. Generic Stream Function

input_predicate() function examines the input tuple gathered by the iterator and decides whether further processing is required. If it is, lambda_func() is used to transform the input tuple into an output tuple. The function output_predicate() examines an output tuple and decides whether it needs to be saved. If the result needs to be saved, the history-sensitive output_iterator() procedure scatters the output tuple over the output variables. Complex streaming algorithms may be expressed as the composition of several StreamFunc() instantiations, with the outputs of earlier instances used as the inputs of later instances. Some restrictions like constant dependence distance or flow dependence may need to be imposed to map such functions onto stream processors with limited on-chip memory.
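The semantics just described can be captured in a few lines of Python. The sketch below follows the names in Figure 12.1 but is otherwise illustrative: it treats input_iterator() as an ordinary Python iterator and output_iterator() as a procedure that stores one tuple per call.

def stream_func(input_iterator, input_predicate,
                output_iterator, output_predicate, lambda_func):
    for input_tuple in input_iterator:            # gather
        if not input_predicate(input_tuple):      # drop inputs that need no processing
            continue
        output_tuple = lambda_func(input_tuple)   # side-effect-free transform
        if output_predicate(output_tuple):        # keep only results worth saving
            output_iterator(output_tuple)         # scatter / store

# Small usage example: square the even elements of a vector.
data, result = [1, 2, 3, 4, 5, 6], []
stream_func(iter(data),
            lambda x: x % 2 == 0,   # input predicate
            result.append,          # output iterator
            lambda y: True,         # output predicate
            lambda x: x * x)        # lambda function
print(result)                       # [4, 16, 36]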

The structure of StreamFunc() lends itself to a highly parallel hardware implemen-

tation. Figure 12.2 shows the logical organization of a generic stream processor. Its

architecture is reminiscent of a hydraulic system and fluid flow analogies apply to the

throughput of the system. The input iterator unit pumps or gathers data from a set of

SRAMs. The input predicate examines the data and either passes it to the execution

cluster or drops it. The execution cluster constantly transforms the data being pumped

into it. The output predicate then examines the transformed results and either drops them or passes them on to the output iterator, which saves them to output memory. The structure is

highly parallel and capable of sustaining high throughput. The gathering, transformation

and scattering of data are staged under the control of microcode.
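A behavioral model of this organization, again only a sketch, can be written as a chain of generator stages whose composition mirrors the gather, filter, transform, filter, scatter flow described above; a second stream function fed by the first shows how instances compose. All names and data are illustrative.

def gather(memory, addresses):               # input iterator over a scratch-pad array
    for addr in addresses:
        yield memory[addr]

def keep_if(stream, predicate):              # input/output predicate stage
    return (x for x in stream if predicate(x))

def transform(stream, lambda_func):          # execution cluster
    return (lambda_func(x) for x in stream)

def scatter(stream, memory, addresses):      # output iterator into output memory
    for addr, value in zip(addresses, stream):
        memory[addr] = value

sram_in  = [3, -1, 4, -1, 5, -9, 2, 6]
sram_out = [0] * len(sram_in)
stage1 = transform(keep_if(gather(sram_in, range(len(sram_in))), lambda x: x > 0),
                   lambda x: x * x)
stage2 = keep_if(stage1, lambda y: y < 30)   # a second stream function consumes the first
scatter(stage2, sram_out, range(len(sram_out)))
print(sram_out)                              # [9, 16, 25, 4, 0, 0, 0, 0]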


Figure 12.2. Stream Processor

The perception processor described in Chapter 9 is less generic than this stream processor. The input and output iterator functionality is provided by the

Loop Unit and the Address Generators, but they are limited to accelerating simple nested

for loops as well as array and vector accesses. A stream processor needs high-performance yet generic mechanisms for implementing more complex loop nests and data

access patterns. The perception processor does not implement input or output predicates

though conditional moves in the execution cluster permit selection of alternative results.

In the perception processor, hardware acceleration is limited to lambda functions that correspond to the loop bodies of modulo-schedulable loops. Other types of code may

be used, but with no significant advantage over what a normal VLIW processor might

provide. A generic stream processor may need to support complex lambda functions that

involve conditional execution and hardware acceleration for scheduling regimes other

than modulo scheduling. Like the perception processor, the stream processor will also

need to behave like a normal processor when operating outside the stream function so as

to efficiently implement loop prologues, epilogues and assorted processing that does not


fit the stream function model.

Research in scheduling algorithms that can produce good mappings for stream func-

tions onto stream processors with a specified configuration will be important from a

code generation perspective as well as for automated architecture exploration. Such

algorithms will need to perform both power and performance optimization as well as

ensure that parameters like supply current variation meet design constraints. Algorithms

for splitting and composing complex stream functions expressed as combinations of basic

stream functions so as to make the best use of the limited number of function units and

storage resources available in a particular stream processor configuration will also be

important.

The structure of perception applications is suitable for a pipeline of perception pro-

cessors. Stream processors should support more complex communication and synchro-

nization modes. Chapters 5 and 6 indicated that DRAM bandwidth reservation or in-

dependent DRAM buses for individual algorithmic phases may be required to ensure

adequate bandwidth for perception applications. Chapter 3 explained that the IPC im-

provement provided by thread-level parallelism can be an important source of power

savings. Together these factors indicate that research into chip multiprocessors con-

sisting of clusters of stream and RISC processors, a stream-optimized interconnect and

multiple DRAM buses could be extremely beneficial. Finally, tools to characterize the

global dataflow within complex applications, refactor applications to ease mapping on

to heterogeneous chip multiprocessors and programming language support for streams

could be important directions for future research.


REFERENCES

[1] Cognex Inc. http://www.cognex.com/, 2004.

[2] Coreco Inc. http://www.coreco.com/, 2004.

[3] AARTS, B., BARRETEAU, M., BODIN, F., BRINKHAUS, P., CHAMSKI, Z., CHARLES, H.-P., EISENBEIS, C., GURD, J. R., HOGGERBRUGGE, J., HU, P., JALBY, W., KNIJNENBURG, P. M. W., O'BOYLE, M. F. P., ROHOU, E., SAKELLARIOU, R., SCHEPERS, H., SEZNEC, A., STOHR, E., VERHOEVEN, M., AND WIJSHOFF, H. A. G. OCEANS: Optimizing compilers for embedded applications. In European Conference on Parallel Processing (1997), pp. 1351–1356.

[4] ABNOUS, A., SENO, K., ICHIKAWA, Y., WAN, M., AND RABAEY, J. M. Evaluation of a low-power reconfigurable DSP architecture. In IPPS/SPDP Workshops (1998), pp. 55–60.

[5] ADVANCED MICRO DEVICES, INC. AMD Athlon Processor x86 Code Optimization Guide, k ed., Feb. 2002.

[6] AGARAM, K., KECKLER, S. W., AND BURGER, D. A characterization of speech recognition on modern computer systems. In Proceedings of the 4th IEEE Workshop on Workload Characterization (Dec. 2001).

[7] AKTURAN, C., AND JACOME, M. F. FDRA: A software-pipelining algorithm for embedded VLIW processors. In Proceedings of the 13th International Symposium on System Synthesis (2000), pp. 34–40.

[8] AKTURAN, C., AND JACOME, M. F. CALiBeR: A software pipelining algorithm for clustered embedded VLIW processors. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design (2001), pp. 112–118.

[9] ALNUWEIRI, H. M., AND PRASANNA, V. K. Parallel architectures and algorithms for image component labelling. IEEE Transactions on Pattern Analysis and Machine Intelligence 14, 10 (Oct. 1992), 1014–1034.

[10] ANANTHARAMAN, T., AND BISIANI, R. A hardware accelerator for speech recognition algorithms. In Proceedings of the 13th International Symposium on Computer Architecture (June 1986).

[11] ASANOVIC, K. The Computer Engineering Handbook. CRC Press, Dec. 2001, ch. Vector Processors.


[12] ATHAS, W., YOUNGS, L., AND REINHART, A. Compact models for estimating microprocessor frequency and power. In Proceedings of the 2002 International Symposium on Low Power Electronics and Design (2002), ACM Press, pp. 313–318.

[13] BENEDETTI, A., AND PERONA, P. A novel system architecture for real-time low-level vision. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS) (1999), pp. 500–503.

[14] BERTRAN, A., YU, H., AND SACCHETTO, P. Face detection project report. http://ise.stanford.edu/2002projects/ee368/Project/reports/ee368group17.pdf, 2002.

[15] BOAHEN, K. Retinomorphic chips that see quadruple images. In Microelectronics for Neural, Fuzzy and Bio-Inspired Systems, 1999. MicroNeuro '99 (1999), pp. 12–20.

[16] BONA, A., SAMI, M., SCIUTO, D., SILVANO, C., ZACCARIA, V., AND ZAFALON, R. Energy estimation and optimization of embedded VLIW processors based on instruction clustering.

[17] BROOKS, D., TIWARI, V., AND MARTONOSI, M. Wattch: a framework for architectural-level power analysis and optimizations. In ISCA (2000), pp. 83–94.

[18] BUDIU, M., AND GOLDSTEIN, S. C. Fast compilation for pipelined reconfigurable fabrics. In ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Monterey, CA, 1999), S. Kaptanoglu and S. Trimberger, Eds., ACM Press, pp. 195–205.

[19] BURGER, D., AND AUSTIN, T. M. The SimpleScalar tool set, version 2.0. SIGARCH Computer Architecture News 25, 3 (1997), 13–25.

[20] CALLAHAN, T., AND WAWRZYNEK, J. Adapting software pipelining for reconfigurable computing. In Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES) (San Jose, CA, 2000), ACM.

[21] CAMPBELL, M. Evaluating ASIC, DSP, and RISC architectures for embedded applications. In Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers, and Tools for Embedded Systems (1998), Springer-Verlag, pp. 261–265.

[22] CAO, Y., SATO, T., SYLVESTER, D., ORSHANSKY, M., AND HU, C. New paradigm of predictive MOSFET and interconnect modeling for early circuit design. In Proceedings of the IEEE Custom Integrated Circuits Conference (CICC) (June 2000), pp. 201–204.

[23] CAO, Y., SATO, T., SYLVESTER, D., ORSHANSKY, M., AND HU, C. Predictive technology model. http://www-device.eecs.berkeley.edu/~ptm, 2002.


[24] CAT, H. H., EBLE, J. C., WILLS, D. S., DE, V. K., BROOKE, M., AND JOKERST, N. M. Low power opportunities for a SIMD VLSI architecture incorporating integrated optoelectronic devices. In Proceedings of GoMAC (Mar. 1996).

[25] CONNELL, J. Face finding. http://www.research.ibm.com/ecvg/jhc proj/faces.html, June 2002.

[26] CONTE, T. M., DUBEY, P. K., JENNINGS, M. D., LEE, R. B., PELEG, A., RATHNAM, S., SCHLANSKER, M. S., SONG, P., AND WOLFE, A. Challenges to combining general-purpose and multimedia processors. IEEE Computer 30, 12 (1997), 33–37.

[27] CORREALE, JR., A. Overview of the power minimization techniques employed in the IBM PowerPC 4xx embedded controllers. In Proceedings of the 1995 International Symposium on Low Power Design (1995), ACM Press, pp. 75–80.

[28] D. BOLME, R. BEVERIDGE, M. T., AND DRAPER, B. The CSU face identification evaluation system: Its purpose, features and structure. In International Conference on Vision Systems (April 2003), pp. 304–311.

[29] DAEMEN, J., AND RIJMEN, V. The block cipher Rijndael. Smart Card Research and Applications, LNCS 1820 (2000), 288–296.

[30] DAVID PALLETT, J. G. F., AND PRZYBOCKI, M. A. 1996 preliminary broadcast news benchmark tests. In Proceedings of the 1997 DARPA Speech Recognition Workshop (Feb. 1997).

[31] DEHON, A. DPGA-coupled microprocessors: Commodity ICs for the early 21st century. In IEEE Workshop on FPGAs for Custom Computing Machines (Los Alamitos, CA, 1994), D. A. Buell and K. L. Pocek, Eds., IEEE Computer Society Press, pp. 31–39.

[32] DELANEY, B., JAYANT, N., HANS, M., SIMUNIC, T., AND ACQUAVIVA, A. A low-power, fixed-point front-end feature extraction for a distributed speech recognition system. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2002).

[33] ECKSTEIN, E., AND KRALL, A. Minimizing cost of local variables access for DSP-processors. In LCTES'99 Workshop on Languages, Compilers and Tools for Embedded Systems (Atlanta, 1999), Y. A. Liu and R. Wilhelm, Eds., vol. 34(7), pp. 20–27.

[34] FANG, W.-C. A system-on-chip design of a low-power smart vision system. In Proceedings of the IEEE Workshop on Signal Processing Systems (1998), pp. 63–72.

[35] FARABOSCHI, P., BROWN, G., FISHER, J. A., DESOLI, G., AND HOMEWOOD, F. Lx: a technology platform for customizable VLIW embedded processing. In The 27th Annual International Symposium on Computer Architecture 2000 (New York, NY, USA, 2000), ACM Press, pp. 203–213.

[36] FARBER, P., AND ASANOVIC, K. Parallel neural network training on Multi-Spert. In Proceedings of the Third IEEE International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP) (Dec. 1997).

[37] FERRETTI, M. Multimedia extensions in super-pipelined microarchitectures. A new case for SIMD processing? In Fifth IEEE International Workshop on Computer Architectures for Machine Perception (2000), pp. 249–258.

[38] FRIGO, M., AND JOHNSON, S. G. FFTW: An adaptive software architecture for the FFT. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (Seattle, WA, May 1998), vol. 3, pp. 1381–1384.

[39] GONZALEZ, R., AND HOROWITZ, M. Energy dissipation in general purpose microprocessors. IEEE Journal of Solid-State Circuits 31, 9 (September 1996), 1277–1284.

[40] GONZALEZ, R. E. Xtensa: A configurable and extensible processor. IEEE Micro 20, 2 (March 2000), 60–70.

[41] GOWAN, M. K., BIRO, L. L., AND JACKSON, D. B. Power considerations in the design of the Alpha 21264 microprocessor. In Design Automation Conference (1998), pp. 726–731.

[42] GRADY, T. Bit-reversed addressing in C on the C3x. In TMS320 DSP Designer's Notebook, vol. SPRA204. Texas Instruments, 1992.

[43] HAGER, G. D., AND TOYAMA, K. X vision: A portable substrate for real-time vision applications. Computer Vision and Image Understanding: CVIU 69, 1 (1998), 023–037.

[44] HAMMERSTROM, D. A VLSI architecture for high-performance, low-cost, on-chip learning. In International Joint Conference on Neural Networks (1990), pp. 537–544.

[45] HARRISON, R. R. An Analog VLSI Motion Sensor Based on the Fly Visual System. PhD thesis, California Institute of Technology, May 2000.

[46] HENNESSY, J., AND PATTERSON, D. Computer Architecture: A Quantitative Approach, 3rd ed. Morgan Kaufmann, 2002.

[47] HOOGERBRUGGE, J., AND AUGUSTEIJN, L. Instruction scheduling for TriMedia. Journal of Instruction-Level Parallelism, 1(1) (Feb. 1999).

[48] HOOGERBRUGGE, J., CORPORAAL, H., AND MULDER, H. MOVE: a framework for high-performance processor design. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (1991), ACM Press, pp. 692–701.


[49] HUANG, X., ALLEVA, F., HON, H.-W., HWANG, M.-Y., LEE, K.-F., AND ROSENFELD, R. The SPHINX-II speech recognition system: an overview. Computer Speech and Language 7, 2 (1993), 137–148.

[50] INTEL CORPORATION. Using streaming SIMD extensions 2 (SSE2) to evaluate hidden Markov model with Viterbi decoding. Tech. Rep. AP-946, Intel Corporation, 2000.

[51] INTEL CORPORATION. Intel Pentium 4 Processor Optimization Reference Manual, 2002.

[52] INTEL CORPORATION. Open source computer vision library. http://www.intel.com/research/mrl/research/opencv/, 2002.

[53] JOHNSON, M. C., SOMASEKHAR, D., AND ROY, K. Leakage control with efficient use of transistor stacks in single threshold CMOS. In Proceedings of the 36th ACM/IEEE Design Automation Conference (1999), ACM Press, pp. 442–445.

[54] JONES, S. P. Haskell 98 Language and Libraries. Cambridge University Press, Cambridge, UK, 2003.

[55] JOSHI, S. M. Some fast speech processing algorithms using Altivec technology. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (Mar. 1999), pp. 2135–2138.

[56] KARL, W. Some design aspects for VLIW architectures exploiting fine-grained parallelism. In Parallel Architectures and Languages Europe (1993), pp. 582–599.

[57] KLEIHORST, R., ABBO, A., VAN DER AVOIRD, A., OP DE BEECK, M., SEVAT, L., WIELAGE, P., VAN VEEN, R., AND VAN HERTEN, H. Xetal: A low-power high-performance smart camera processor. In The IEEE International Symposium on Circuits and Systems (ISCAS) (2001), pp. 215–218.

[58] KRASHINSKY, R. Microprocessor energy characterization and optimization through fast, accurate, and flexible simulation. Master's thesis, Massachusetts Institute of Technology, May 2001.

[59] LAI, C., LU, S.-L., AND ZHAO, Q. Performance analysis of speech recognition software. In Proceedings of the Fifth Workshop on Computer Architecture Evaluation using Commercial Workloads (Feb. 2002).

[60] LAPINSKII, V., JACOME, M., AND DE VECIANA, G. Application-specific clustered VLIW datapaths: early exploration on a parameterized design space. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 21, 8 (Aug. 2002), 889–903.

[61] LEE, C., LEE, J. K., HWANG, T., AND TSAI, S.-C. Compiler optimization on instruction scheduling for low power. In Proceedings of the 13th International Symposium on System Synthesis (ISSS'00) (2000), IEEE Computer Society, p. 55.


[62] LEE, W., BARUA, R., FRANK, M., SRIKRISHNA, D., BABB, J., SARKAR, V., AND AMARASINGHE, S. Space-time scheduling of instruction-level parallelism on a Raw machine. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (1998), ACM Press, pp. 46–57.

[63] LEUPERS, R. Instruction scheduling for clustered VLIW DSPs. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT) (Oct. 2000), pp. 291–300.

[64] MARTIN, A. J., NYSTROEM, M., AND PENZES, P. ET2: A metric for time and energy efficiency of computation. Tech. Rep. CaltechCSTR:2001.007, Caltech Computer Science, 2001.

[65] MATHEW, B., DAVIS, A., AND EVANS, R. A characterization of visual feature recognition. In Proceedings of the IEEE 6th Annual Workshop on Workload Characterization (WWC-6) (October 2003), pp. 3–11.

[66] MATHEW, B., DAVIS, A., AND FANG, Z. A low-power accelerator for the Sphinx 3 speech recognition system. In Proceedings of the International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES '03) (October 2003), pp. 210–219.

[67] MATHEW, B., DAVIS, A., AND IBRAHIM, A. Perception coprocessors for embedded systems. In Proceedings of the Workshop on Embedded Systems for Real-Time Multimedia (ESTIMedia) (October 2003), pp. 109–116.

[68] MCVOY, L. W., AND STAELIN, C. lmbench: Portable tools for performance analysis. In USENIX Annual Technical Conference (1996), pp. 279–294.

[69] MEMIK, S., BOZORGZADEH, E., KASTNER, R., AND SARRAFZADEH, M. SPS: A strategically programmable system. In Proceedings of the Reconfigurable Architectures Workshop (RAW) (Apr. 2001).

[70] MEMIK, S. O., BOZORGZADEH, E., KASTNER, R., AND SARRAFZADE, M. A super-scheduler for embedded reconfigurable systems. In Proceedings of the International Conference on Computer-Aided Design (ICCAD) (Nov. 2001), p. 391.

[71] MIPS TECHNOLOGIES, INC. MIPS R4000 Microprocessor User's Manual, Second Edition, April 1993.

[72] MODULE RESEARCH CENTER. NeuroMatrix NM6403 digital signal processor. Tech. Rep. 431282.001D2, Module Research Center, 2000.

[73] MORETTO, P. Mapping of speech front-end signal processing to high performance vector architectures. Tech. Rep. TR-95-063, International Computer Science Institute, University of California at Berkeley, 1995.

[74] MOSUR, R. Efficient Algorithms for Speech Recognition. PhD thesis, Carnegie Mellon University, May 1996. CMU-CS-96-143.


[75] PENTLAND, A. Looking at people: Sensing for ubiquitous and wearable computing. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 22, 1 (Jan. 2000), 107–118.

[76] PERING, T., AND BRODERSON, R. Dynamic voltage scaling and the design of a low-power microprocessor system. In Proceedings of the International Symposium on Computer Architecture ISCA'98 (June 1998).

[77] PIHL, J., SVENDSEN, T., AND JOHNSEN, M. H. A VLSI implementation of PDF computations in HMM based speech recognition. In Proceedings of the IEEE Region Ten Conference on Digital Signal Processing Applications (TENCON'96) (Nov. 1996).

[78] POWELL, M., YANG, S.-H., FALSAFI, B., ROY, K., AND VIJAYKUMAR, T. N. Gated-Vdd: a circuit technique to reduce leakage in deep-submicron cache memories. In Proceedings of the 2000 International Symposium on Low Power Electronics and Design (2000), ACM Press, pp. 90–95.

[79] RABINER, L., AND JUANG, B.-H. Fundamentals of Speech Recognition. Prentice Hall, 1993, ch. 9, p. 494.

[80] RABINER, L. R. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77, 2 (Dec. 1989), 257–286.

[81] RAU, B. R. Iterative modulo scheduling: an algorithm for software pipelining loops. In Proceedings of the 27th Annual International Symposium on Microarchitecture (1994), ACM Press, pp. 63–74.

[82] RIXNER, S., DALLY, W. J., KAPASI, U. J., KHAILANY, B., LOPEZ-LAGUNAS, A., MATTSON, P. R., AND OWENS, J. D. A bandwidth-efficient architecture for media processing. In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture (MICRO-31) (Nov. 1998), pp. 3–13.

[83] ROWLEY, H. A., BALUJA, S., AND KANADE, T. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 1 (1998), 23–38.

[84] RUSSELL, J., AND JACOME, M. Software power estimation and optimization for high performance, 32-bit embedded processors.

[85] RUSSELL, R. M. The CRAY-1 computer system. Communications of the ACM 21, 1 (1978), 63–72.

[86] SCHAPIRE, R. E. The boosting approach to machine learning: An overview. In MSRI Workshop on Nonlinear Estimation and Classification (2002).

[87] SCHMIT, H., WHELIHAN, D., TSAI, A., MOE, M., LEVINE, B., AND TAYLOR, R. PipeRench: a virtualized programmable datapath in 0.18 micron technology. In Proceedings of the IEEE Custom Integrated Circuits Conference (2002), pp. 63–66.


[88] SMITH, J. E. Decoupled access/execute computer architectures. In Proceedings of the 9th Annual Symposium on Computer Architecture (1982), IEEE Computer Society Press, pp. 112–119.

[89] SMITH, M. D., LAM, M., AND HOROWITZ, M. A. Boosting beyond static scheduling in a superscalar processor. In Proceedings of the 17th Annual Symposium on Computer Architecture (1990), pp. 344–354.

[90] SORIANO, M., MARTINKAUPPI, B., HUOVINEN, S., AND LAAKSONEN, M. Using the skin locus to cope with changing illumination conditions in color-based face tracking. In Proceedings of the IEEE Nordic Signal Processing Symposium (2000), pp. 383–386.

[91] SRIVASTAVA, S. Fast Gaussian evaluations in large vocabulary continuous speech recognition. M.S. thesis, Department of Electrical and Computer Engineering, Mississippi State University, Oct. 2002.

[92] STERN, R. M. Specification of the 1996 HUB 4 broadcast news evaluation. http://www.nist.gov/speech/publications/darpa97/pdf/stern1.pdf, 1996.

[93] SUNDARARAJAN, V., AND PARHI, K. K. Low power synthesis of dual threshold voltage CMOS VLSI circuits. In Proceedings of the 1999 International Symposium on Low Power Electronics and Design (1999), ACM Press, pp. 139–144.

[94] TEXAS INSTRUMENTS. TMS320C6000 CPU and Instruction Set Reference Guide, spru189f ed., Oct. 2000.

[95] TIWARI, V., MALIK, S., WOLFE, A., AND LEE, M. Instruction level power analysis and optimization of software. In Proceedings of the Ninth International Conference on VLSI Design (Jan. 1996), pp. 326–328.

[96] TIWARI, V., SINGH, D., RAJGOPAL, S., MEHTA, G., PATEL, R., AND BAEZ, F. Reducing power in high-performance microprocessors. In Proceedings of the 35th Annual Design Automation Conference (1998), ACM Press, pp. 732–737.

[97] TONG, Y. F., RUTENBAR, R., AND NAGLE, D. Minimizing floating-point power dissipation via bit-width reduction. In Proceedings of the 1998 International Symposium on Computer Architecture Power Driven Microarchitecture Workshop (1998).

[98] TSENG, J. H., AND ASANOVIC, K. Energy-efficient register access. In Proceedings of the 13th Symposium on Integrated Circuits and Systems Design (SBCCI'00) (2000), IEEE Computer Society, p. 377.

[99] TURK, M., AND PENTLAND, A. Face recognition using Eigenfaces. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (June 1991), pp. 586–591.

[100] UNGER, S., AND MUELLER, F. Handling irreducible loops: Optimized node splitting vs. DJ-graphs. Lecture Notes in Computer Science 2150 (2001), 207+.


[101] VAN ROSSUM, G. Python Reference Manual, 2.3.3 ed., Dec. 2003.

[102] VERMA, A., FARUQUIE, T., NETI, C., BASU, S., AND SENIOR, A. Late integration in audio-visual continuous speech recognition. In Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU) (1999).

[103] VIOLA, P., AND JONES, M. Rapid object detection using a boosted cascade of simple features. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Dec. 2001).

[104] WAINGOLD, E., TAYLOR, M., SRIKRISHNA, D., SARKAR, V., LEE, W., LEE, V., KIM, J., FRANK, M., FINCH, P., BARUA, R., BABB, J., AMARASINGHE, S., AND AGARWAL, A. Baring it all to software: Raw machines. IEEE Computer 30, 9 (1997), 86–93.

[105] WANG, C.-L., BHAT, P. B., AND PRASANNA, V. K. High performance computing for vision. Proceedings of the IEEE 84, 7 (July 1996), 931–946.

[106] WAWRZYNEK, J., ASANOVIC, K., KINGSBURY, B., BECK, J., JOHNSON, D., AND MORGAN, N. SPERT-II: A vector microprocessor system and its application to large problems in backpropagation training. In Advances in Neural Information Processing Systems (1996), D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, Eds., vol. 8, The MIT Press, pp. 619–625.

[107] WEEMS, C. C. The second generation image understanding architecture and beyond. In Proceedings of Computer Architectures for Machine Perception (Nov. 1993), pp. 276–285.

[108] WEISS, M., AND FETTWEIS, G. Dynamic codewidth reduction for VLIW instruction set architectures in digital signal processors, 1996.

[109] WESTE, N. H. E., AND ESHRAGHIAN, K. Principles of CMOS VLSI Design, A Systems Perspective, second ed. Addison Wesley, 1993.

[110] YANG, M.-H., KRIEGMAN, D., AND AHUJA, N. Detecting faces in images: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 24, 1 (2002), 34–58.

[111] YOUNG, S. Large vocabulary continuous speech recognition: A review. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (Dec. 1995), pp. 3–28.

[112] YUN, H.-S., AND KIM, J. Power-aware modulo scheduling for high-performance VLIW processors. In Proceedings of the 2001 International Symposium on Low Power Electronics and Design (2001), ACM Press, pp. 40–45.