TEL AVIV UNIVERSITY

The Iby and Aladar Fleischman Faculty of Engineering

The Zandman-Slaner School of Graduate Studies

IMPLEMENTATION OF THE COCHLEA

MODEL IN VLSI

A thesis submitted toward the degree of

Master of Science in Electrical and Electronic Engineering

by

Udi Shtalrid

May 2005

This research was carried out in the Department of Electrical Engineering - Systems

under the supervision of Prof. Miriam Furst Yust

Table of Contents

Table of Contents

List of Tables

List of Figures

Abstract

Acknowledgements

Introduction

1 The Ear: Anatomy and Model

1.1 Structure of the Ear

1.2 The Development of Models

1.3 Related Hardware Model Implementations

1.4 Motivation of the Present Study

2 The Model Description

2.1 Cochlear Fluid Dynamics

2.2 Existing Software Solution

2.2.1 The model’s equations

2.2.2 The software algorithm solution

2.3 Summary

3 The Hardware Model

3.1 Introduction

3.2 Determination of the time step size and spatial resolution

3.3 The Hardware Model Description

3.3.1 Eunit

3.3.2 Gunit

3.3.3 Punit

3.3.4 Dunit

3.3.5 MEunit

3.4 Computational Analysis

3.5 Pipeline Architecture

3.6 Delta Architecture

3.7 Summary

4 Evaluation of the Hardware Algorithm

4.1 Punit Integration

4.2 Results for Different Configurations

4.3 Determining the Variables’s Presentation

4.4 Timing Analysis

4.4.1 Fast Addition

4.4.2 High Speed Multiplication

4.5 Power Consumption Analysis

5 FPGA Design and Simulation

5.1 The Design

5.2 Simulation Results

5.3 Summary

6 Discussion

A List of Symbols and parameters

B Mathematical Methods

B.1 The Finite Difference Method

B.2 Initial condition problem numerical solution

B.2.1 Euler Method

B.2.2 Modified Euler Method

C Booth Recoding

D Delay Calculation

E FPGA Instruction Code

Bibliography

List of Tables

3.1 Comparison of Eunit work-load and latency between software and hardware implementations

3.2 List of original model Parameters

3.3 Comparison of Gunit work-load and latency between software and hardware implementations

3.4 Comparison of Punit work-load and latency between software and hardware implementations

3.5 Comparison of Dunit work-load and latency between software and hardware implementations

3.6 Comparison of MEunit work-load and latency between software and hardware implementations

3.7 Total work-load in the software model

3.8 Total work-load in hardware model

3.9 Total work-load in hardware model for 5×3 configuration

3.10 The critical path in hardware model

4.1 Synthetic input signals

4.2 The Hebrew words input signals

4.3 The lower and upper bounds of the variables for the hardware model

4.4 Fix point representation for the delta model

4.5 Number of operation for critical path

4.6 Tadd and fadd for different hardware model configurations

4.7 Power consumption for different hardware model configurations

5.1 The contents of the FPGA Register Banks

5.2 Number of instructions per unit in the FPGA design

5.3 Relative error of C vs. VHDL implementations

5.4 Amount of logic needed for an adder and multiplier in FPGA

A.1 List of symbols

E.1 The Instruction code for FPGA Controller

List of Figures

1 The number of people suffering from hearing loss

2 Hearing aid market penetration

1.1 Human ear: The outer, middle and inner ear

1.2 A Lateral view of a chinchilla cochlea

1.3 Stylized mammalian cochlea

1.4 Radial segment of the cochlea duct

1.5 A scheme of the Organ of Corti

2.1 Cochlear model geometry

2.2 An equivalent electrical circuit model of the outer-hair cell

2.3 Software design flow

2.4 Software convergence unit

3.1 MSE of "tz" as a function of the time step size

3.2 MSE of "tz" as a function of the spatial resolution

3.3 Hardware flow chart

3.4 The punit as a bottle-neck

3.5 The parallel punit

3.6 Jacobi matrix convergence

3.7 Hardware flow chart

3.8 A Pipeline architecture

4.1 An output example of the hardware model

4.2 Relative error for different time iterations

4.3 Relative error for different p iterations

4.4 Relative error for different combinations of time and p iterations

4.5 Relative error for different combinations of time and p iterations for synthetic signals

4.6 Relative error for different configurations when noise is applied

4.7 The influence of the time step

4.8 Variables’s representation

4.9 Histogram of the basilar membrane velocity

4.10 Histograms of the basilar membrane acceleration

4.11 Relative error for different quantization of the model’s variables

4.12 The system architecture

4.13 The processor architecture

4.14 Asic design uncoiled

4.15 Power consumption vs. relative error

5.1 FPGA design diagram

5.2 A waver of the FPGA design

5.3 The energy of the output signal

B.1 Euler and Modified Euler approximation method

Abstract

A one-dimensional cochlear model with embedded outer hair cells (OHC) was recently

developed by Cohen and Furst [4]. The cochlear model’s output is used to reconstruct

speech signals and improve the signal-to-noise ratio (Weisz and Furst [37]). The

reconstructed speech signals can be used in various applications such as hearing aids,

cellular communication and voice recognition where we seek an improvement in the

signal-to-noise ratio. It is the purpose of this study to test whether the cochlear model

can be used as a significant preprocessor for different speech analysis systems.

The cochlear model software solution is unsuitable for real-time applications. It

requires massive computations, which are power consuming, and has a long computational

latency. Its main equation is solved serially, which makes it the bottleneck of the algorithm.

Hardware implementation is attractive when power efficiency and real-time performance

are design considerations. It may offer orders of magnitude of improvement

in performance.

The cochlear model algorithm was investigated and modified to fit parallel and

pipeline architectures for hardware design, in order to reduce the computational

latency and the amount of computation. The serial solution was converted to an

iterative parallel solution that made real-time performance feasible. We have found

a basic parallel configuration that obtains a relative error of less than 1% compared to

the original algorithm on a set of tested stimuli. The configurations tested have shown

that the cochlear model can be implemented in real time, with a clock frequency

ranging from 10 to 250 MHz and a reasonable power consumption ranging

from 0.06 to 1.13 W, depending on the specific configuration. Other hardware

simplifications were tested, such as the determination of the necessary wordlengths

for converting the floating-point version of the model into a reduced floating-point

version and a fixed-point representation. The modified algorithm was evaluated and

validated against the original one using synthetic input signals and a set of recorded

Hebrew words.

The hardware model algorithm was written in VHDL and implemented on an FPGA

simulator. An actual architecture was planned for an ASIC design.

Acknowledgements

I would like to thank my fellows at the Auditory Signal Processing Laboratory: Azi

Cohen, Oren Bahat, Vered Weisz, Tomer Goshen, Nir Fink, and especially Ronen

Akerman, for his help in this research. I would like to thank my friend Maor Shitrit

for his help in programming.

I am grateful to my family for their support and love.

Finally, I would like to thank my advisor Professor Miriam Furst for providing

guidance and keeping me on track.

The work presented in this thesis was supported by "RAMOT", Tel-Aviv University.

Udi Shtalrid

May, 2005

Introduction

Hearing is one of our greatest gifts. The auditory sense in mammals, particularly in

humans, is capable of almost unbelievable feats. The ear does an exquisite job of

transforming acoustic signals, varying enormously in amplitude and waveform, into a

regimented neural code. With hearing, we monitor our environment in all directions,

communicate with one another, and listen to the music of instruments and babbling

children. The loss of this treasure can bring severe behavioral deficits as well as untold

personal agony. A poignant example is provided in a moving letter from the 31-year-old

Beethoven to his brothers, in which he attempts to describe his misery due to a

progressive hearing loss. Not only was he unable to appreciate performances of his

music, he found it difficult to communicate with other people and became a virtual

recluse [12].

Nobody knows the exact number of hearing-impaired people. Professor Adrian

Davis of the British MRC Institute of Hearing Research estimates that there were

440 million hearing-impaired people world-wide in 1995 [35]. He also predicts that

the total number of people suffering from hearing loss of more than 25 dB will exceed

560 million by 2005. In developing countries where people are more exposed to noisy

environments the numbers are twice as large. The estimation of the number of people

who will suffer from hearing loss of more than 25 dB is shown in Figure 1.

Figure 1: A forecast of the number of people suffering from hearing loss of more than 25 dB made by Adrian Davis.

Current hearing aids do not perform well in a noisy background. They are very

helpful for severe and profound hearing impairment but not for most people who

suffer from mild-to-moderate hearing loss. People with mild-to-moderate hearing loss

suffer mainly from misunderstanding of speech in a noisy background.

A hearing-aid industry market tracking survey from 1984-2000 conducted by Sergei

Kochkin [35], shown in Figure 2, indicates that hearing-aid market penetration

has been low. Only one in five people who need a hearing aid purchases one.

About 10 to 20 percent of the people who own a hearing aid abandon it.

Figure 2: Hearing aid market penetration survey conducted by Sergei Kochkin

To understand how the ear can go wrong, we must start by explaining the physiology

of hearing and how the ear operates when it is functioning normally. During the

last century, many experiments and studies, pioneered by G. Von Bekesy in 1928 in work

that earned him a Nobel Prize, contributed to the construction of computational

cochlear models of the auditory system. In recent years, significant progress has

been made in understanding the contribution of the mammalian cochlear outer hair

cells (OHC) to normal auditory signal processing. The outer hair cells, which

act as local amplifiers, were mathematically modeled. Creating a reliable

cochlear model that correctly mimics the functionality of the inner ear is of great

importance. If we could mimic the inner ear well, a hearing-aid instrument could

be developed for hearing-impaired people in which an electronic cochlea substitutes for the

damaged human cochlea.

In the work of Cohen and Furst [4] a classical one-dimensional cochlear model has

been modified to include the OHC activity. This model is solved in the time domain

and it does not require any assumptions on the stationarity of the input signal. The

model simulates audiograms for normal ears and ears with OHC loss. The model,

while characterizing phonal trauma by a random loss of outer hair cells along the

cochlear partition, succeeds in explaining a well-known phenomenon, in which a loss of

sensitivity at 4 kHz is found independently of the type of noise exposure. The output

of the cochlea model is a time-frequency matrix which represents the partition

velocities along the cochlear basilar membrane. We use the reconstruction algorithm

which was developed by Weisz and Furst [37] to re-synthesize the speech signal from

the cochlear representation. This algorithm includes an estimation of both the travel-

ing wave delays and the amplification factors for each of the cochlear partitions. The

reconstructed signal is obtained in the time domain as a weighted shifted sum of the

cochlear partition responses. The algorithm is based on applying a time-frequency

mask on the cochlear representation before reconstruction. The mask acquisition is

based on following the energy modulation across the cochlear partitions.

In this work, we modify the cochlear model algorithm presented in the work

of Cohen and Furst [4] and propose a special hardware algorithm to fit hardware

design. A digital implementation seems potentially suitable. However, because of

its inherent complexity and massive computations, applying the auditory

model in a system poses a significant engineering challenge. For most applications,

such as hearing implants and hearing aids, the system is constrained to be real-time, low

power, and low cost. Currently, computing the cochlear model on a general-purpose

workstation takes about 1000 times longer than real time. Therefore, an approach based on

parallel and pipeline architectures is applied.

This thesis is organized as follows: We start by describing the anatomy and phys-

iology of the cochlea in chapter one followed by an explanation of its mathematical

equations and time domain solution of the cochlear model in chapter two. Chapter

three describes the proposed hardware model. The simulation results and analysis are

discussed in chapter four. Finally, chapter five describes a VHDL implementation of

the hardware model on an FPGA simulator, and the conclusions are drawn in chapter

six.

Chapter 1

The Ear: Anatomy and Model

In this chapter we review the anatomy of the ear. A profound understanding of the

cochlear anatomy enables researchers to develop models that mimic the ear. We also

review related hardware model implementations and present the motivation for this

thesis.

1.1 Structure of the Ear

The mammalian ear is composed of three regions, the outer, middle and inner ear

regions as sketched in Figure 1.1. The outer region includes the pinna and external

canal. The pinna functions as a "collecting horn". It intercepts sound waves from the

free space and channels them via the external ear canal to the eardrum. The sound

pressure arriving at the eardrum is amplified at all frequencies, with a gain exceeding

a factor of 5.6 (15 dB) over almost a two-octave frequency range (2-6 kHz). The pinna

significantly modifies the incoming sound at medium and high frequencies.
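
As a quick check of the figures quoted above, and assuming that 5.6 denotes a sound-pressure ratio (the text does not state this explicitly), the conversion to decibels can be written as follows. The symbols $p_{\text{eardrum}}$ and $p_{\text{free}}$ are introduced here only for illustration.

```latex
G_{\mathrm{dB}}
  = 20 \log_{10}\!\left(\frac{p_{\text{eardrum}}}{p_{\text{free}}}\right)
  = 20 \log_{10}(5.6)
  \approx 15\ \mathrm{dB}
```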

The eardrum forms the boundary between the outer and the middle ear. The

resulting eardrum vibrations are transmitted through an air-filled middle ear by a

three-bone structure (ossicles) to a membrane covered opening in the bony wall of

Figure 1.1: A sketch of the human ear, displaying the outer, middle and inner ear regions.

the spiral-shaped structure of the inner ear called the cochlea. This opening is called

the oval window and it forms the boundary between the middle and the inner ear.

The three ossicles are the smallest bones in the body. The transmission

of sound energy through the middle ear, in humans, is most efficient at frequencies

between 0.5 and 4 kHz.

The inner ear, also called the cochlea, consists of a fluid-filled duct coiled like a

snail shell or corkscrew. A photomicrograph of a partially dissected chinchilla cochlea

is shown in Figure 1.2. From the acoustic point of view, the curvature of the scalae

is negligible. The propagation of the sound waves in the cochlea is almost exactly as

it would be in a straight cochlea or an "uncoiled" one. The mammalian cochlea is

illustrated in Figure 1.3 as an uncoiled cochlea, having a longitudinal, vertical and

Figure 1.2: Lateral view of a chinchilla cochlea with the bony shell removed. Arrows point to remnants of cochlear partition in the various turns. H, helicotrema; M, modiolus; OW, oval window; RW, round window; S, stapes; ST, scala tympani; SV, scala vestibuli [33].

radial dimensions.

The perilymphatic space has the shape of an elongated U, the top arm of which

is called the scala vestibuli and the bottom arm of which is called the scala tympani. The space

between the two arms of the mammalian perilymphatic space is the endolymphatic

space, labeled scala media. An extremely thin Reissner’s membrane separates the

scala media from the scala vestibuli, and the cochlear partition, a flexible structure

that contains the sensory hair cells, separates the scala media and the scala tympani.

At the apical end is the helicotrema, a short duct connecting the two perilymphatic

scalae. Thus, when the stapes pushes the oval window inward, the U-shaped column of

perilymph is free to slide through its casing and push the round window outward. Such

movements result in pressure differences between both sides of the basilar membrane

Figure 1.3: Stylized mammalian cochlea, shown as if the cochlear partition were straight [9].

causing the flexible cochlear partition to vibrate.

The region of the cochlea adjacent to the oval window is called the base and the

region farthest away from the stapes is appropriately named the apex. The basic

structure of the cochlear partition is shown in Figure 1.4. Forming the basic platform

of the cochlear partition is the basilar membrane, which is attached on one side to

the bony spiral lamina and on the farthest side to the spiral ligament. The basilar

membrane is narrower and thicker in the base than it is in the apex. These longitudinal

differences in the structure of the basilar membrane are presumed to account in

large part for the different resonances measured at different points along the cochlear

partition.

Resting on the basilar membrane is a small but complicated superstructure, known

as the Organ of Corti, which contains the sound-sensing cells. The tectorial membrane

extends from the lip of the spiral limbus to overlie the apical surface of the Organ of

Corti. An expanded view of the Organ of Corti is shown in Figure 1.5.

Figure 1.4: Radial segment of the cochlea duct, showing all three scalae and the basic divisions of the cochlear partition.

The sound sensing cells are called hair cells because they appear to have tufts of

hairs, called stereocilia, protruding on their top. The hair cells are divided into inner

and outer hair cells. The inner hair cells form a single row running from base to apex

whereas the outer hair cells form up to five rows. In humans, there are about 3,500

inner hair cells, each with about 40 stereocilia and 15,000 outer hair cells, each with

140 stereocilia protruding from it. When the basilar membrane moves up and down,

a shearing motion is created: the tectorial membrane moves sideways relative to the

tips of the hair cells. As a result, the stereocilia of the hair cells move and rotate. The

movement of the stereocilia leads to flow of electrical current through the hair cells,

which leads to the generation of action potentials. These potentials give rise to nerve

spikes in the neurones of the auditory nerve. The inner hair cells act to transduce the

mechanical movement into neural activity whereas the outer hair cells change their

Figure 1.5: Typical Organ of Corti. 1, basilar membrane; 5, outer hair cells; 12, tectorial membrane; 15, bony spiral lamina; 20, inner hair cells

length and size due to these potentials. Thus, the outer hair cells affect the physical

properties of the basilar membrane, which presumably accounts for better filtering along

the cochlear partition.

1.2 The Development of Models

The first recognized model of the cochlea was published by Helmholtz in 1862 [19]

in an appendix of "On the Sensations of Tone". Helmholtz likened the cochlea to a bank

of highly tuned resonators, which were selective for different frequencies, much like

a piano or a harp, with each resonator representing a different place on the basilar

membrane. The model he proposed was not very satisfying since many important

features were left out. The most important of these is the cochlear fluid, which

couples the mechanical resonators together. But, given the publication date, it is an

impressive contribution by this early great master of physics and psychophysics.

The next major contribution was made by Wegel and Lane [32], and stands in

a class of its own even today. The paper was the first to quantitatively describe

the details of the upward spread of masking, and to propose a "modern" model of the

cochlea. If Wegel and Lane had been able to solve their model’s equations, they would

have predicted cochlear traveling waves.

It was the experimental observations of the Hungarian researcher G. Von Bekesy,

starting in 1928 on human cadavers’ cochleae, which unveiled the physical nature

of the basilar membrane traveling wave [16]. Von Bekesy found that the cochlea is

analogous to a "dispersive" transmission line where different frequency components,

which make up the input signal, travel at different speeds along the basilar membrane,

thereby isolating those various frequency components at different places along the

basilar membrane. He properly named this dispersive wave a "traveling wave". He

observed the traveling wave using stroboscopic light in dead human cochleae at sound

levels well above the pain threshold, namely above 140 dB SPL. These high sound

pressure levels were required to obtain displacement levels that were observable under

his microscope. Von Bekesy’s pioneering experiments were considered so important

that in 1961 he received the Nobel Prize.

Over the intervening years these experiments have been greatly improved, but

Von Bekesy’s fundamental observations of the traveling wave still stand. Today, we

find that the traveling wave has a more sharply defined location on the basilar membrane

for a pure-tone input than observed by Von Bekesy. In fact, according to measurements

made over the last 20 years, the response of the basilar membrane to a pure

tone can change in amplitude by more than five orders of magnitude per millimeter

of distance along the basilar membrane.

Among the most common models today are the transmission line models, also

called one-dimensional models. The one-dimensional model is built from cascaded

sections of inductors, resistors and capacitors, which represent the mass of the fluids

of the cochlea, the partition resistance and the partition stiffness, respectively.
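
The electrical analogy can be made concrete with a small sketch. It assumes the usual mechanical-to-electrical mapping (mass to inductance, resistance to resistance, stiffness to inverse capacitance) and, as a simplification of the cascade described above, lumps one effective mass with the partition resistance and stiffness into a single series branch. The function names and numeric values are illustrative and are not taken from the thesis.

```python
import math


def section_resonance_hz(mass, stiffness):
    """Undamped resonant frequency of one section.

    In the circuit analogue this is 1 / (2*pi*sqrt(L*C)),
    with L = mass and C = 1 / stiffness.
    """
    return math.sqrt(stiffness / mass) / (2.0 * math.pi)


def section_impedance(omega, mass, resistance, stiffness):
    """Impedance of a series R-L-C branch at angular frequency omega (rad/s)."""
    return complex(resistance, omega * mass - stiffness / omega)


if __name__ == "__main__":
    # Hypothetical values, chosen only so that the section resonates at 1 kHz.
    mass = 1.0
    stiffness = (2.0 * math.pi * 1000.0) ** 2
    resistance = 50.0
    f0 = section_resonance_hz(mass, stiffness)
    z = section_impedance(2.0 * math.pi * f0, mass, resistance, stiffness)
    print(f0, z)
```

At the resonant frequency the mass and stiffness terms cancel and only the resistive part remains, which is the sense in which each section of such a cascade is tuned to its own frequency.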

Two-dimensional [29] and three-dimensional [13] models were also introduced. The two-dimensional

model argues that the long wave approximation is not fulfilled in the

region of maximum response of the membrane. The three-dimensional model takes

into account that the pressure and fluid flow can vary across the width of the cochlear

partition. Both the two- and three-dimensional models are more complex and involve

complicated mathematics, and are thus harder to solve numerically. The one-dimensional

model simulations have gained more appreciation because they require less memory

and fewer computations than the two- and three-dimensional models, and yet are

successful in predicting a large number of phenomena.

1.3 Related Hardware Model Implementations

The field of neuromorphic engineering has the long term objective of taking architec-

tures from our understanding of biological systems to develop novel signal processing

systems. There have been several implementations of the electronic cochlea in VLSI

technology.

The first electronic cochlea model was implemented in analog VLSI. The electronic

cochlea, first proposed by Lyon and Mead [30], was a cascade of biquadratic

filter sections which mimic the qualitative behavior of the human cochlea. The original

implementation was published in 1988 and used continuous-time subthreshold

transconductance circuits to implement the cascade of 480 stages. In 1992, Watts et al.

reported a 50-stage version with improved dynamic range, stability, matching and

compactness [36]. In addition, a switched-capacitor cochlea filter was proposed by

Bor et al. in 1996 [20]. Although touted for their low power consumption, analog

VLSI subthreshold circuits are fraught with difficulty due to variations in process and

temperature which affect the stability, accuracy and size of the filters.

In spite of these difficulties that plague analog VLSI, the amount of work done so

far in developing digital implementations has been scanty.

Several digital VLSI cochlea implementations were reported. Starting in 1992,

Summerfield and Lyon reported an application-specific integrated circuit (ASIC) implementation

which employed bit-serial second order filters [10]. In 1997, Lim et al.

reported a VHDL-based pitch detection system which used first-order Butterworth

bandpass filters for cochlea filtering [34]. The hardware test of this design has not

been reported. Later in 1998, Brucke et al. designed a VLSI implementation of a

speech preprocessor which used gammatone filter banks to mimic the cochlea [27].

The design was apparently submitted for fabrication, but test results of the actual

hardware have not been presented. Recently, Leong et al. [28] presented an FPGA-based

implementation of Lyon and Mead’s electronic cochlea filter and its application

to a real-time cochleagram display. The filter was generated by a tool which takes

filter coefficients to compile an application-optimized design with arbitrary precision.

This implementation, like that of Brucke et al., used fixed-point arithmetic, and both

studies also explored tradeoffs between wordlength and precision.

All of these implementations are fairly simplistic and in most cases even their

target performance compares poorly with biological data. In this thesis we use the

enhanced cochlear model developed by Cohen and Furst [4], which also integrates the

outer hair cell function. This model is solved in the time domain unlike the solutions

of all the other hardware implementations which were solved in the frequency domain.

In all of the VLSI implementations developed, the output may be a cochleagram or

the gain of a specific frequency channel. Our model, in contrast, not only displays

a cochleagram but also takes a further step and uses a reconstruction algorithm on

the cochleagram to produce a reconstructed output signal.

1.4 Motivation of the Present Study

The newly developed cochlear model algorithm from the work of Cohen and Furst [4]

solves the equations of the cochlea for the input signal and displays a detailed

frequency-time domain representation of the output signal. The reconstruction algo-

rithm, which was developed by Weisz and Furst [37], uses this representation format

to reconstruct the output signal. Together, the two algorithms form a system which can be

used as a hearing-aid device as it mimics the functionality of the ear. We hope to

achieve better performance compared to other state-of-the-art speech enhancement

devices and hearing-aids available today.

The development of a new hearing-aid system must be planned to work in real-

time. Therefore, the feasibility of implementing the new algorithm as a real-time

application must be investigated. In this work we evaluate the cochlear model algo-

rithm and modify it to be more efficient for real-time and hardware implementation.

Parallel and pipeline architectures are proposed and verified.

Chapter 2

The Model Description

In this chapter the basic mathematics of the algorithm is explained. The model is

a one-dimensional cochlear model with an embedded outer hair cell model developed

by Cohen and Furst [4]. The model is analyzed for low level stimuli where it can

be treated as a linear model. The solution of the cochlear model equations was

implemented in software. The software algorithm solution is described.

2.1 Cochlear Fluid Dynamics

In the simple one-dimensional model (Zwislocki [21]; Zweig et al [17]; Viergever [26];

Furst and Goldstein [25]), the cochlea is considered as an uncoiled structure with

two fluid-filled rigid-walled compartments separated by an elastic partition. The

basic equations are obtained by applying fundamental physical principles such as

conservation of mass and the dynamics of deformable bodies.

Cohen and Furst [4] integrated the one dimensional cochlear model with the outer

hair cell model. These two models control each other through cochlear partition move-

ment and cochlear partition cross pressure variables. Figure 2.1 illustrates an uncoiled

cochlea approximated by two fluid-filled rigid-walled compartments separated by an

elastic partition.

Figure 2.1: Cochlear model geometry (labels: helicotrema, oval window, round window, basilar membrane, scala tympani, scala vestibuli; the x axis runs from base to apex)

In order to arrive at a mathematically tractable model, simplifying assumptions

are inevitable. An extensive mathematical one dimensional cochlear model can be

found in [26].

Let x be the longitudinal coordinate such that at the basal end x = 0 and at the

apical end x = ℓ, where ℓ is the uncoiled cochlea length. Let t be the time variable.

Let Pv(x, t) and Pt(x, t) be the pressure through the scala vestibuli and through the

scala tympani, respectively.

The intermediate channel between the scala vestibuli and the scala tympani is

named the scala media and is represented by the elastic partition. The vertical

displacement of the partition along the x dimension is denoted by ξbm(x, t). The fluid

velocity along the x dimension is Uv(x, t) and Ut(x, t) for the scala vestibuli and the


scala tympani, respectively.

The principle of conservation of mass yields the equations:

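$$A\frac{\partial U_v}{\partial x} - \beta\frac{\partial \xi_{bm}}{\partial t} = 0, \tag{2.1.1}$$

$$A\frac{\partial U_t}{\partial x} + \beta\frac{\partial \xi_{bm}}{\partial t} = 0, \tag{2.1.2}$$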

where β(x) is the basilar membrane width and A(x) is the scalae cross section area.

Intuitively, the mass of perilymph compressed by the membrane vertically is pushed

horizontally to the neighboring cross section.

Both scalae tympani and vestibuli contain perilymph, which is assumed to be

an incompressible and inviscid fluid. The motion equations for each scala using Newton’s

second law are written as:

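$$\frac{\partial P_v}{\partial x} + \rho\frac{\partial U_v}{\partial t} = 0, \tag{2.1.3}$$

$$\frac{\partial P_t}{\partial x} + \rho\frac{\partial U_t}{\partial t} = 0, \tag{2.1.4}$$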

where ρ is the perilymph density. The difference in the pressure between neighboring

sections is the force which pulls the mass of the perilymph of the section.

This set of equations is completed by the equation of motion of the cochlear

partition. The partition is, mechanically, a flexible structure embedded in a rigid

framework. It is assumed that the flexible part, the basilar membrane, and the

structure above it have point-wise mechanical properties. This means that the velocity

at any point of the partition is related to the pressure difference across the partition

at that point only and not at neighboring points.

The pressure difference across the cochlear partition is defined as:

$$P = P_t - P_v \tag{2.1.5}$$


The cochlear partition is regarded as a flexible boundary between scala tympani

and scala vestibuli, whose mechanical properties are describable in terms of point-wise

mass density, stiffness and damping. Thus at every point along the cochlear duct, the

partition’s velocity is driven by the pressure difference P across the partition. From

the conservation of mass principle we can derive the relationship between the fluid

velocity and the basilar membrane displacement ξbm.

Combining Eqs. 2.1.1 to 2.1.5 yields the differential equation for P:

$$\frac{\partial^2 P}{\partial x^2} - \frac{2\rho\beta(x)}{A}\frac{\partial^2 \xi_{bm}}{\partial t^2} = 0 \tag{2.1.6}$$
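The combination step can be made explicit. The following is a sketch of the algebra that uses only Eqs. 2.1.1 to 2.1.5, assuming that ρ is constant and that the order of the x and t derivatives can be interchanged:

```latex
\begin{aligned}
\text{(2.1.2)} - \text{(2.1.1)}:\quad
  & A\,\frac{\partial (U_t - U_v)}{\partial x}
    + 2\beta\,\frac{\partial \xi_{bm}}{\partial t} = 0, \\
\text{(2.1.4)} - \text{(2.1.3)},\ \text{with } P = P_t - P_v:\quad
  & \frac{\partial P}{\partial x}
    + \rho\,\frac{\partial (U_t - U_v)}{\partial t} = 0, \\
\text{differentiating the second line in } x:\quad
  & \frac{\partial^2 P}{\partial x^2}
    = -\rho\,\frac{\partial}{\partial t}
       \left[\frac{\partial (U_t - U_v)}{\partial x}\right]
    = \frac{2\rho\beta}{A}\,\frac{\partial^2 \xi_{bm}}{\partial t^2}.
\end{aligned}
```

Moving the right-hand side to the left gives the form of Eq. 2.1.6.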

and the boundary conditions:

$$\begin{aligned} P(x,t) &= S(t), & x &= 0 \\ P(x,t) &= 0, & x &= \ell \end{aligned} \tag{2.1.7}$$

where S(t) is the pressure difference at the stapes, i.e. the input stimulus. Since the cochlea starts from rest, the initial conditions for all x ∈ [0, ℓ] are:

$$\xi_{bm}(x,0) = 0, \qquad \frac{\partial \xi_{bm}}{\partial t}(x,0) = 0 \tag{2.1.8}$$

The model includes the pressure produced by the OHCs, Pohc, therefore:

$$P_{bm} = P + P_{ohc} \tag{2.1.9}$$

The third equation imitates the basilar membrane as an electrical transmission line.

$$P_{bm}(x,t) = m(x)\frac{\partial^2 \xi_{bm}}{\partial t^2} + r(x)\frac{\partial \xi_{bm}}{\partial t} + s(x)\,\xi_{bm} \tag{2.1.10}$$

where m(x), r(x) and s(x) represent the basilar membrane mass, resistance, and stiff-

ness per unit area, respectively.


The complete model of the cochlea integrates the outer hair cell model. The OHC

membrane is divided into two regions, the apical part facing scala media and the

basolateral part embedded in the organ of Corti. The basic outer hair cell model represents

these two cell membrane segments as two parallel resistance and capacitance

circuits. Figure 2.2 represents an equivalent electrical circuit model for the OHC.

Figure 2.2: An equivalent electrical circuit model of the outer-hair cell (circuit elements: ψ0, Gb, Cb, ψ, Ga, Ca, Vsm)

Changes in the outer hair cell length are controlled by the voltage change across

the outer hair cell basolateral membrane ψ. Solving the electrical circuit in Figure 2.2

yields a differential equation for ψ [11]:

$$\frac{d\psi}{dt} + \omega_{ohc}\,\psi = \lambda\left(\frac{dC_a}{dt} + G_a\right) + \omega_{ohc}\,\psi_0 \tag{2.1.11}$$

where Ca and Ga are the capacitance and conductance of the apical part, respectively.

ωohc and λ are defined as:

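$$\omega_{ohc} = \frac{G_a + G_b}{C_a + C_b} \approx \frac{G_b}{C_b} = \text{const.} = 2\pi \cdot 1000$$

$$\lambda = \frac{V_{sm}}{C_b + C_a} \approx \frac{V_{sm}}{C_b} = \text{const.}$$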

The capacitance Ca and conductance Ga of the apical part are affected by the stere-

ocilia movement. They undergo changes due to active opening of ion channels in the

apical part of the outer cell. The outer hair cell stereocilia are shallowly but firmly

embedded in the under-surface of the tectorial membrane. Since the tectorial membrane is attached on one side to the basilar membrane, a shear motion arises between the tectorial membrane and the organ of Corti as the basilar membrane moves up and down (Pickles [22]). The model assumes Ga and Ca are functions of ξbm (the basilar membrane vertical displacement).

The voltage variation across the basolateral part of the OHC causes a length change (∆ℓohc) in the OHC. Thus, the force Fohc that an OHC exerts due to the voltage change is given by

$$F_{ohc} = K_{ohc}\left(\Delta\ell_{ohc}(\psi) + \xi_{bm}\right) \tag{2.1.12}$$

The pressure that the OHCs contribute to the basilar membrane pressure is derived

from,

$$P_{ohc} = \gamma(x)\,F_{ohc} \tag{2.1.13}$$

where γ(x) is the relative density of healthy OHCs per unit area along the cochlear

duct. γ(x) is referred to as the OHC gain, whose value ranges from 0 to 1.

When linear dependencies are assumed, i.e.,

$$G_a \propto \xi_{bm}, \qquad C_a \propto \xi_{bm}, \qquad \Delta\ell_{ohc} \propto \psi \tag{2.1.14}$$

and by the substitution of the linear assumptions 2.1.14 in equations 2.1.11, 2.1.12 and 2.1.13 we derive the differential equation for Pohc, [11], [4]:

$$\frac{dP_{ohc}}{dt} + \omega_{ohc}\,P_{ohc} = \gamma(x)\left[\alpha_2\frac{d\xi_{bm}}{dt} + \alpha_1\,\xi_{bm}\right] \tag{2.1.15}$$

where the values of α1(x) and α2(x) are:

$$\alpha_1(x) = -\frac{r(x)\,s(x)}{m(x)}, \qquad \alpha_2(x) = r(x)\,\omega_{ohc}$$

2.2 Existing Software Solution

In this section we summarize the cochlear model equations and describe the existing

software solution in the time domain.

2.2.1 The model’s equations

The cochlea model is described by three equations.

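The pressure difference P along the cochlear partition is computed by Eq. 2.1.6,

$$\frac{\partial^2 P}{\partial x^2} - \frac{2\rho\beta(x)}{A}\frac{\partial^2 \xi_{bm}}{\partial t^2} = 0$$

with the boundary conditions as stated in Eq. 2.1.7,

$$\begin{aligned} P(x,t) &= S(t), & x &= 0 \\ P(x,t) &= 0, & x &= \ell \end{aligned}$$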

The second equation describes the basilar membrane as an electrical transmission line. The equation according to Eq. 2.1.10 is:

$$P_{bm}(x,t) = m(x)\frac{\partial^2 \xi_{bm}}{\partial t^2} + r(x)\frac{\partial \xi_{bm}}{\partial t} + s(x)\,\xi_{bm}$$

with the following initial values for all x ∈ [0, ℓ]:

$$\xi_{bm}(x,0) = 0, \qquad \frac{\partial \xi_{bm}}{\partial t}(x,0) = 0$$

The third and last equation imitates the behavior of the outer hair cells. The equation was developed in Eq. 2.1.15:

$$\frac{dP_{ohc}}{dt} + \omega_{ohc}\,P_{ohc} = \gamma(x)\left[\alpha_2\frac{d\xi_{bm}}{dt} + \alpha_1\,\xi_{bm}\right]$$

The contribution of the pressure generated by the outer hair cells was given in Eq. 2.1.9,

$$P_{bm} = P + P_{ohc}$$

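When substituting Pbm in the second equation (Eq. 2.1.10) we get:

$$P + P_{ohc} = m(x)\frac{\partial^2 \xi_{bm}}{\partial t^2} + r(x)\frac{\partial \xi_{bm}}{\partial t} + s(x)\,\xi_{bm} \tag{2.2.1}$$

The velocity of the basilar membrane is defined as:

$$v_{bm} = \frac{\partial \xi_{bm}}{\partial t} \tag{2.2.2}$$

so we can rewrite Eq. 2.2.1 and get the expression for the membrane acceleration:

$$v'_{bm} = \frac{1}{m}\left[P + P_{ohc} - r\,v_{bm} - s\,\xi_{bm}\right] \tag{2.2.3}$$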

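Substituting the acceleration expression Eq. 2.2.3 in the model’s first equation,

$$\frac{\partial^2 P}{\partial x^2} - \frac{2\rho\beta(x)}{A}\frac{\partial^2 \xi_{bm}}{\partial t^2} = 0$$

yields:

$$\frac{\partial^2 P}{\partial x^2} - \frac{2\rho\beta}{mA}\left[P + P_{ohc} - r\,v_{bm} - s\,\xi_{bm}\right] = 0 \tag{2.2.4}$$

We define Q(x) as a function of the spatial variable x,

$$Q(x) = \frac{2\rho\beta}{m(x)A} \tag{2.2.5}$$

and,

$$G(x,t) = P_{ohc} - r(x)\,v_{bm} - s(x)\,\xi_{bm} \tag{2.2.6}$$

Substituting Eq. 2.2.5 and Eq. 2.2.6 in Eq. 2.2.4 yields,

$$\frac{\partial^2 P}{\partial x^2} - QP = QG \tag{2.2.7}$$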

2.2.2 The software algorithm solution

In this subsection we describe the solution of the cochlear model algorithm as im-

plemented in software. We divide the solution into units. Each unit is responsible

for solving a different equation. The description is given in high level in order to

understand the flow of the solution. We analyze the software solution in the next

chapter where we introduce the hardware solution and compare it to the software.

The time domain model equations are solved numerically. Two of the model’s equations are initial value problems and one equation is a boundary value problem. The solution is performed in two sequential steps [25]. The initial value problems are solved by an iterative method and the boundary value problem is solved by the finite difference method using a variation of the LU-decomposition method.

The existing software solution is divided into units as seen in figure 2.3.

All of the model’s variables are calculated for each time step. The time variable

step size is defined as ht. In the software algorithm ht is not a fixed number. The

spatial variable step size is hx = l/N , where l is the basilar membrane length and N

is the number of sections. Each point along the basilar membrane is marked by xi.

The algorithm solution starts by assuming we know the following variables Pohc, vbm

and ξbm for a particular time t = T for every xi . At t = 0 the variables are initialized

according to the initial conditions to zero. We approximate these variables for the

next time point at t = T + ht using the Euler method (Appendix B.2.1):

$$\begin{aligned}
\xi_{bm}(x,t) &= \xi_{bm}(x,T) + h_t \times v_{bm}(x,T), \\
v_{bm}(x,t) &= v_{bm}(x,T) + h_t \times v'_{bm}(x,T), \\
P_{ohc}(x,t) &= P_{ohc}(x,T) + h_t \times P'_{ohc}(x,T)
\end{aligned}$$

These calculations are done in the Eunit.

Figure 2.3: Software design flow (blocks: Input block, Eunit, Gunit, Punit, Dunit, MEunit, Cunit, Output block; decisions: "first iteration?" and "converge?"; the loop is repeated over the time iterations)
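The Euler prediction performed by the Eunit can be sketched in a few lines. This is an illustrative sketch only, not the thesis software: the function name `euler_predict`, the use of NumPy arrays with one entry per section, and the toy input values are assumptions made for the example.

```python
import numpy as np

def euler_predict(xi, v, p_ohc, dv, dp_ohc, h_t):
    """Explicit Euler prediction for time T + h_t.

    Every argument except h_t is an array with one value per cochlear
    section, evaluated at time T (dv and dp_ohc are the time derivatives
    of v and p_ohc).
    """
    xi_next = xi + h_t * v
    v_next = v + h_t * dv
    p_ohc_next = p_ohc + h_t * dp_ohc
    return xi_next, v_next, p_ohc_next

# Toy state for four sections (arbitrary values, for illustration only).
xi = np.zeros(4)
v = np.array([1.0, 2.0, 0.0, -1.0])
p_ohc = np.array([0.5, 0.5, 0.5, 0.5])
dv = np.array([10.0, 0.0, -10.0, 0.0])
dp_ohc = np.array([-1.0, 1.0, 0.0, 2.0])

xi1, v1, p1 = euler_predict(xi, v, p_ohc, dv, dp_ohc, h_t=1e-3)
```

Each section is updated independently of its neighbours, which is the property that later allows this unit to be computed in parallel.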

The following unit is the Gunit . We calculate G(x, t) since we need it for the

next unit which solves the boundary value problem. The G vector is calculated using

eq: 2.2.6:

$$G(x,t) = P_{ohc}(x,t) - r(x)\,v_{bm}(x,t) - s(x)\,\xi_{bm}(x,t)$$

The next unit, which we call the Punit, solves the boundary value differential equation. We find an approximation to the pressure difference P for every nodal point xi for the time t = T + ht. The boundary value differential equation is described by Eq. 2.2.7:

$$\frac{\partial^2 P}{\partial x^2} - QP = QG$$

with the boundary conditions:

$$\begin{aligned} P(x,t) &= S(t), & x &= 0 \\ P(x,t) &= 0, & x &= \ell \end{aligned}$$

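This differential equation is represented as a linear set of equations, AP = B, where,

$$P = \begin{pmatrix} P_0 \\ P_1 \\ \vdots \\ P_{N-1} \\ P_N \end{pmatrix}, \qquad
B = \begin{pmatrix} S(T) \\ 0 \\ \vdots \\ 0 \\ 0 \end{pmatrix} + h_x^2 \begin{pmatrix} 0 \\ G_1 Q_1 \\ \vdots \\ G_{N-1} Q_{N-1} \\ 0 \end{pmatrix}$$

and the matrix A is:

$$A = \begin{pmatrix}
1 & 0 & & & \\
1 & -(2 + h_x^2 Q_1) & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1 & -(2 + h_x^2 Q_{N-1}) & 1 \\
 & & & 0 & 1
\end{pmatrix}$$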
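The interior rows of A and B are what one obtains from a second-order central difference. The text only states that the finite difference method is used, so the specific stencil below is an assumption, but it reproduces the rows shown above:

```latex
\begin{aligned}
\left.\frac{\partial^2 P}{\partial x^2}\right|_{x_i}
  &\approx \frac{P_{i-1} - 2P_i + P_{i+1}}{h_x^2}, \\
\frac{P_{i-1} - 2P_i + P_{i+1}}{h_x^2} - Q_i P_i &= Q_i G_i, \\
P_{i-1} - \left(2 + h_x^2 Q_i\right) P_i + P_{i+1} &= h_x^2\, Q_i G_i,
  \qquad i = 1, 2, \dots, N-1,
\end{aligned}
```

with the first and last rows replaced by the boundary conditions $P_0 = S(T)$ and $P_N = 0$.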

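This linear set of equations is solved by LU-decomposition (see Appendix B.1), a direct method that is well suited to tridiagonal systems. The matrix A is rewritten as A = LU where:

$$L = \begin{pmatrix}
\alpha_0 & & & & \\
1 & \alpha_1 & & & \\
 & \ddots & \ddots & & \\
 & & 1 & \alpha_{N-1} & \\
 & & & 1 & \alpha_N
\end{pmatrix}, \qquad
U = \begin{pmatrix}
1 & \gamma_0 & & & \\
 & 1 & \gamma_1 & & \\
 & & \ddots & \ddots & \\
 & & & 1 & \gamma_{N-1} \\
 & & & & 1
\end{pmatrix}$$

and,

$$\begin{aligned}
\alpha_0 &= -\left(1 + h_x^2\,\frac{Q_0}{2}\right) \\
\alpha_i &= -(2 + h_x^2 Q_i) - \gamma_{i-1}, & i &= 1, 2, \cdots, N-1 \\
\gamma_i &= \frac{1}{\alpha_i}, & i &= 0, 1, 2, \cdots, N-1 \\
\alpha_N &= 1 - \gamma_{N-1}
\end{aligned}$$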

The desired vector P is then computed in two recursive steps (Appendix B.1). Thus, it takes 2N + 2 serial steps to complete the computation of the pressure vector P, which requires 2N + 1 multiplications and 2N additions. Because it is done serially, this computational method for the boundary equation is a bottleneck for hardware design.
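The serial nature of a direct tridiagonal solve is easy to see in code. The sketch below uses the standard Thomas algorithm (forward elimination followed by back substitution), which is a generic stand-in and not necessarily the exact LU variant of Appendix B.1. The helper names and the toy values of Q, G, S and h_x are assumptions made for illustration.

```python
import numpy as np

def build_pressure_system(Q, G, S, h_x):
    """Return the three diagonals and right-hand side of A P = B."""
    n = len(Q)                       # N + 1 nodal points
    lower = np.ones(n)               # lower[i] multiplies P[i-1] in row i
    upper = np.ones(n)               # upper[i] multiplies P[i+1] in row i
    diag = -(2.0 + h_x**2 * Q)
    rhs = h_x**2 * Q * G
    # Boundary rows: P[0] = S and P[N] = 0.
    diag[0], upper[0], rhs[0] = 1.0, 0.0, S
    diag[-1], lower[-1], rhs[-1] = 1.0, 0.0, 0.0
    return lower, diag, upper, rhs

def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm: each loop iteration depends on the previous one."""
    n = len(diag)
    c = np.zeros(n)
    d = np.zeros(n)
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):            # forward sweep (serial)
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = np.zeros(n)
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):   # backward sweep (serial)
        x[i] = d[i] - c[i] * x[i + 1]
    return x

# Toy problem with 9 nodal points (arbitrary values).
n_points = 9
Q = np.linspace(5.0, 1.0, n_points)
G = np.sin(np.arange(n_points))
lower, diag, upper, rhs = build_pressure_system(Q, G, S=1.0, h_x=0.125)
P = solve_tridiagonal(lower, diag, upper, rhs)

# Dense reference solution for comparison.
A = np.diag(diag) + np.diag(lower[1:], -1) + np.diag(upper[:-1], 1)
P_ref = np.linalg.solve(A, rhs)
```

Both sweeps carry a dependency from one index to the next, so the latency grows linearly with N regardless of how many arithmetic units are available.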

The next unit is called Dunit . It calculates the membrane acceleration v′bm and

the derivative of the outer hair cell pressure P ′ohc using eq 2.2.3:

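$$v'_{bm} = \frac{1}{m}\left[P + P_{ohc} - r\,v_{bm} - s\,\xi_{bm}\right]$$

and Eq. 2.1.15:

$$P'_{ohc} = \gamma(x)\left[\alpha_2\frac{d\xi_{bm}}{dt} + \alpha_1\,\xi_{bm}\right] - \omega_{ohc}\,P_{ohc}$$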

In order to improve the convergence of the initial value differential equations, the Modified Euler method is used. It is an iterative method whose number of iterations depends on the accuracy required (Appendix B.2.2).

$$\begin{aligned}
\xi_{bm}(T + h_t) &= \xi_{bm}(T) + \frac{h_t}{2}\left[v_{bm}(T) + v_{bm}(T + h_t)\right], \\
v_{bm}(T + h_t) &= v_{bm}(T) + \frac{h_t}{2}\left[v'_{bm}(T) + v'_{bm}(T + h_t)\right], \\
P_{ohc}(T + h_t) &= P_{ohc}(T) + \frac{h_t}{2}\left[P'_{ohc}(T) + P'_{ohc}(T + h_t)\right]
\end{aligned}$$

This procedure is indicated as MEunit in Figure 2.3.
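The predictor-corrector structure can be illustrated on a single section with a constant driving pressure. This is a simplified sketch and not the thesis code: in the full algorithm the pressure P is recomputed by the Punit in every iteration, whereas here it is held fixed, and the parameter values m, r, s and the function names are assumptions chosen for the example.

```python
def acceleration(xi, v, pressure, m=1.0, r=0.5, s=4.0):
    """Membrane acceleration from m*xi'' + r*xi' + s*xi = pressure."""
    return (pressure - r * v - s * xi) / m

def modified_euler_step(xi, v, pressure, h_t, n_iter=3):
    """Euler predictor followed by n_iter trapezoidal corrector passes."""
    a0 = acceleration(xi, v, pressure)
    # Predictor (Euler).
    xi_new = xi + h_t * v
    v_new = v + h_t * a0
    # Corrector (Modified Euler), repeated a fixed number of times.
    for _ in range(n_iter):
        a_new = acceleration(xi_new, v_new, pressure)
        xi_new = xi + 0.5 * h_t * (v + v_new)
        v_new = v + 0.5 * h_t * (a0 + a_new)
    return xi_new, v_new

# A single step from rest with one corrector pass.
xi_1, v_1 = modified_euler_step(0.0, 0.0, pressure=1.0, h_t=0.1, n_iter=1)

# Integrate a step response long enough to reach steady state.
xi, v = 0.0, 0.0
for _ in range(6000):
    xi, v = modified_euler_step(xi, v, pressure=1.0, h_t=0.01)
steady_state = 1.0 / 4.0   # pressure / s
```

With a constant pressure the displacement settles at pressure / s, which gives a simple check that the corrector loop is wired correctly.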

The magnitude of the variables, including the input, may undergo significant changes during the computation process. The dynamic range of the variables is about 200 dB. In order to keep the computation error bounded, the approximations of the membrane’s velocity and acceleration are checked at each computational iteration. If the convergence test represented by the Cunit fails, the algorithm recomputes the basilar membrane variables until the variables converge. The convergence test

unit (Cunit) also controls the time step variable ht according to the approximation

error. If the approximation error of the variables is small we may take a larger time

step size ht for the next time point. The decision block diagram is illustrated in

Figure 2.4.

Figure 2.4: Software convergence unit. The unit compares the values of v(x,t) and v'(x,t) with the values received in the previous iteration and branches on how close they are (very close, close enough, not close enough, not close at all). The actions shown are: advance to the next time step, increase the step size, run another iteration, and restart this time step with a smaller step size.

2.3 Summary

The cochlea model consists of one boundary value equation and three first-order

initial value equations. The solution of the algorithm is done in two phases. The

boundary value problem is solved directly with the finite difference method using the LU method, and the initial value equations are solved numerically using the Euler and Modified Euler methods. The software solution block diagram is shown in Figure 2.3.

The algorithm uses a fixed resolution along the cochlea, but variable time steps.

The time step size is controlled by the convergence unit (Cunit) which compares the

variables ξbm, ξ′bm values to the values received in the previous iteration and decreases

or increases the time step size if needed. The algorithm continues to the next time

point if the estimated truncation error for all nodal points is less than some threshold.

Chapter 3

The Hardware Model

3.1 Introduction

In the previous chapter we introduced the software algorithm solution for the cochlear

model in the time domain. The simulation of the algorithm yielded a processing time

ratio of about 1 to 1,000, thus it takes about 1,000 seconds to process 1 second of

speech on a Pentium 4 computer. It was clear that this kind of solution was unsuitable

for real-time application due to long latency.

In order to shorten simulation time for future research applications and for in-

vestigation of feasibility and potential use in hearing aids, a special hardware model

is proposed. A hardware implementation offers a low cost and high speed capability

that would appear to be an attractive approach. As the VLSI technology is getting

smaller and faster we believe the algorithm may work in real-time.

In this chapter we modify and fit the solution of the cochlear model for hardware

design. Each unit is discussed, examined and its modifications are explained.

We concentrate on reducing the algorithm work-load and mainly its large compu-

tational latency which makes it unsuitable for a real-time application. Examination


of the existing software solution described in the last chapter revealed two major

problems.

The first problem encountered was the variation of the number of computational

iterations in the solution of the algorithm. The need for a constant throughput in

a hardware design is essential and the number of time iterations had to be fixed.

Moreover, we define a constant time step size ht, in contrast to the software

solution where the number of iterations and the specific time step size were determined

by the convergence unit (Cunit) each iteration. The need for an unvarying time step

size is also essential for the hardware design in order to have a constant throughput.

Fixing the time step size and the number of computational iterations simplifies the model solution significantly and makes it more suitable for hardware design. We look for the time step size and the number of time iterations for which the solution converges and gives good results.

The second problem encountered was the large latency of the method used to solve the boundary condition problem in the software solution. The boundary condition problem in software is solved by a variation of the LU method, which is done serially. This serial solution is considered the bottleneck of the algorithm. A new iterative solution is proposed for the hardware solution. The new solution is solved numerically and had to be integrated into the algorithm. We then had two iterative numerical solutions nested one inside the other, and had to fix the number of iterations of each so that the computation is deterministic.

In the proposed hardware algorithm we managed to reduce the work-load and to

reduce the critical-path latency of the algorithm. The modifications of the algorithm

were verified against the original software algorithm.


In the following section, we use the software simulations of the cochlear model to

analyze and determine the optimum spatial resolution hx and the optimum constant

time step size ht for the hardware model.

3.2 Determination of the time step size and spatial

resolution

The analysis [7] focused on determining the preferred working points, which are the

optimum time step size ht and the optimum cochlear resolution hx.

In order to estimate the optimum time step size ht, the model was tested for

different sets of step sizes with different input signals. A reference configuration of the software algorithm was chosen as ht = 10^-7 sec and N = 4096 sections. Figure 3.1 shows the relative error as a function of the maximum time step size for different numbers of cochlear sections. The results are plotted for the phoneme "tz". A significant change can be seen when ht = 1 µsec. Similar results were obtained for other input signals.

Figure 3.1: Relative error of the word "tz" as a function of the time step size

We have run similar simulations in order to estimate the best cochlear partition. Figure 3.2 shows the relative error as a function of the number of cochlear sections for different time step sizes. The results are for the phoneme "tz". Similar results were obtained for other input signals.

Figure 3.2: Relative error of the word "tz" as a function of the spatial resolution

If we choose a relative error of 10^-3 as the maximum permissible error then:

$$h_t \le 1\,\mu\text{sec}, \qquad N = 512, \quad \text{thus} \quad h_x = \ell/N = 3.5/512$$

We have modified the values of ht and hx to be powers of 2 in the hardware model to simplify multiplications by these parameters.

$$h_t = 2^{-20} \approx 1\,\mu\text{sec}, \qquad h_x = 2^{-7} \approx 3.5/512$$

3.3 The Hardware Model Description

This section describes the units of the cochlear model algorithm as shown in figure 3.3.

We describe the functionality of each unit and discuss the modifications made for the

hardware model.

There are several key points which must be applied when approaching a hardware

design. The algorithm presented in chapter two, which is the basis for the hardware design algorithm, was not planned for short latency or efficient use.

Our first key point was to convert the parameters to powers of 2. This way, every multiplication of a variable by a parameter becomes a shift operation, which is much easier to implement than a multiplication and has almost no delay. This key point also covers the parameters ht and hx, which were changed to powers of 2. xi represents the ith coordinate along the cochlear membrane, where 0 ≤ xi ≤ 3.5 cm. We define

$$x_i = x_0 + i \cdot h_x$$

where x0 = 0. We chose hx = 2^-7, thus xN = 3.5 for i = 448, and i is defined for 0 ≤ i ≤ 448. We have chosen the number of sections, N, to be 448. ht was changed to 2^-20, which is very close to 1 µsec.
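The arithmetic behind these choices can be checked directly. The short sketch below only restates numbers given in the text (cochlear length 3.5 cm, hx = 2^-7, ht = 2^-20, and the software value N = 512); the variable names are illustrative.

```python
cochlea_length = 3.5            # cm, from the text
h_x = 2.0 ** -7                 # cm, hardware spatial step
h_t = 2.0 ** -20                # sec, hardware time step

# Number of sections implied by the power-of-two spatial step.
n_sections = cochlea_length / h_x

# Spatial step of the 512-section software configuration, for comparison.
h_x_software = cochlea_length / 512

# Relative deviation of the hardware time step from 1 microsecond.
h_t_rel_diff = abs(h_t - 1e-6) / 1e-6

print(n_sections)      # 448.0
print(h_x_software)    # 0.0068359375
print(h_t)             # 9.5367431640625e-07
```

Rounding hx up to 2^-7 therefore coarsens the grid from 512 to 448 sections, and rounding ht to 2^-20 shortens the time step by a little under 5 percent relative to 1 µsec.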

The second key point is to design a synchronous design which will have a constant

throughput. The software algorithm uses the convergence unit to decide about the

next time step size, moreover, it decides the number of computational iterations for

each time step. This way, the computation for each time step takes different time.

In the hardware design we have fixed the time step size to be constant and equal to

ht = 2−20. The number of time iterations was also studied and fixed.

Figure 3.3: Flow chart of the time domain solution algorithm for the hardware model (blocks: Input block, Eunit, Gunit, Punit, Dunit, MEunit, Output block; an inner loop of P iterations around the Punit and an outer loop of T iterations).

The third key point is to design the algorithm with a minimal latency. As massive

computations are needed for each time step, our goal is to design a real-time appli-

cation. In order to shorten the processing time, a parallel architecture is planned.

Moreover, we also plan and investigate a pipeline architecture. Design of parallel and

pipeline architectures certainly shortens the computation latency but as expected,

there are tradeoffs. In the following sections we describe each of the algorithm’s units

in detail.

3.3.1 Eunit

The Eunit is the Euler computation unit. It predicts the basilar membrane displace-

ment ξbm, velocity vbm and the outer hair cell pressure Pohc variables for all of the

sections along the basilar membrane for the next time step T + ht. It uses the

Euler method as explained in chapter two. The three equations used in the software

design for x0 ≤ x ≤ xN are:

$$\begin{aligned}
\xi_{bm}(x, T + h_t) &= \xi_{bm}(x,T) + h_t \times v_{bm}(x,T) \\
v_{bm}(x, T + h_t) &= v_{bm}(x,T) + h_t \times v'_{bm}(x,T) \\
P_{ohc}(x, T + h_t) &= P_{ohc}(x,T) + h_t \times P'_{ohc}(x,T)
\end{aligned} \tag{3.3.1}$$

Fixing ht to be 2^-20 in the hardware design replaces the multiplications in these equations by shift operations for x1 ≤ x ≤ xN−1:

$$\begin{aligned}
\xi_{bm}(x, T + h_t) &= \xi_{bm}(x,T) + \mathrm{shift}(v_{bm}(x,T), -20) \\
v_{bm}(x, T + h_t) &= v_{bm}(x,T) + \mathrm{shift}(v'_{bm}(x,T), -20) \\
P_{ohc}(x, T + h_t) &= P_{ohc}(x,T) + \mathrm{shift}(P'_{ohc}(x,T), -20)
\end{aligned} \tag{3.3.2}$$

where x represents the basilar membrane partition. The membrane variables com-

puted above are irrelevant at the boundary points x0 and xN since the membrane’s


pressure P is already known at these points. Thus, they are not computed. As ex-

plained before, the hardware design uses 448 sections unlike the software which uses

512 sections.

As seen from equations 3.3.1, the software Eunit requires three multiplications and

three additions for every x coordinate for one time step while the hardware design

only needs three additions. The Eunit is only calculated once for each time step.

Since there is no dependency in the Eunit between the x coordinates, it is possible to

calculate all coordinates at once assuming we do it in parallel having N − 1 Adders.

Hence, the latency of the hardware design could be reduced to one addition. Table 3.1

summarizes the latency and work-load of the Eunit.

| Parameter | Operation | Software | Hardware |
|---|---|---|---|
| Latency | additions | 3(N − 1) | 1 |
| Latency | multiplications | 3(N − 1) | 0 |
| Work | additions | 3(N − 1) | 3(N − 1) |
| Work | multiplications | 3(N − 1) | 0 |
| Work | shifts | 0 | 3(N − 1) |

Table 3.1: Comparison of Eunit work-load and latency between software and hardware implementations.
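The replacement of a multiplication by ht = 2^-20 with a shift can be illustrated with integer fixed-point arithmetic. The number format below (32 fractional bits) and the helper names are assumptions made for the example; the thesis text at this point does not specify the word length.

```python
FRAC_BITS = 32   # hypothetical format: real value = raw / 2**FRAC_BITS
SHIFT = 20       # h_t = 2**-20

def to_fixed(value):
    """Convert a float to its raw fixed-point integer."""
    return int(round(value * (1 << FRAC_BITS)))

def to_float(raw):
    """Convert a raw fixed-point integer back to a float."""
    return raw / (1 << FRAC_BITS)

def euler_shift(state_raw, deriv_raw):
    """One Euler update: state + h_t * deriv, with h_t applied as a right shift."""
    return state_raw + (deriv_raw >> SHIFT)

xi_raw = to_fixed(0.125)
v_raw = to_fixed(-0.75)

xi_next = to_float(euler_shift(xi_raw, v_raw))
xi_reference = 0.125 + (2.0 ** -20) * (-0.75)
```

In this format the shift discards the 20 least significant bits of the derivative, so the result matches the multiplication to within one unit of the last fractional bit; the update itself needs only an adder.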

3.3.2 Gunit

The Gunit computes two separate variables, the g vector and the outer hair cell

pressure derivative P ′ohc.

The g variable has no physical meaning; it is the right-hand-side vector b of the system Ax = b that the next unit solves for the boundary condition problem. The vector g is defined by the following equation:

$$g(x, T + h_t) = -K(x) \times \left(\frac{\xi_{bm}(x, T + h_t)}{c(x)} + r(x)\,v_{bm}(x, T + h_t) + \gamma\,P_{ohc}(x, T + h_t)\right) \tag{3.3.3}$$

where,

$$g(x_0) = \text{input}, \qquad g(x_{448}) = 0, \qquad K(x) = \frac{2\rho\beta}{m(x)A} \longrightarrow \frac{2^{-6}}{m(x)}$$

The original parameters in the calculation are defined in table 3.2.

| Parameter | Value/Definition | Units | Description |
|---|---|---|---|
| ℓ | 3.5 | cm | Cochlear length |
| ρ | 1 | gr/cm³ | Density of perilymph |
| β | 0.15 | cm | Width of the basilar membrane |
| γ | 0.5 | | Outer hair cell gain |
| A | 25 | cm² | Cross-sectional area of the cochlea scalae |
| m | 1.267 · 10⁻⁶ e^(1.5x) | gr/cm² | Basilar membrane mass per unit area |
| c | 7.8 · 10⁻⁵ e^(1.5x) | gr/cm²sec² | Basilar membrane stiffness per unit area |
| r | 0.25 e^(−0.6x) | gr/cm²sec | Basilar membrane resistance per unit area |

Table 3.2: List of original model parameters

The mass m(x), resistance r(x) and elasticity c(x) vectors are fixed. Those numbers

are computed beforehand and are stored in tables.

By rewriting equation 3.3.3 as,

$$g(x, T + h_t) = \left(-\frac{K(x)}{c(x)}\right)\xi_{bm}(x, T + h_t) - K(x)\,r(x)\,v_{bm}(x, T + h_t) - \gamma K(x)\,P_{ohc}(x, T + h_t)$$

and computing the equations coefficients beforehand, the computation of the g vari-

able for each section requires three multiplications and two additions. In the best


case, when having 3(N − 1) multipliers in parallel we can obtain a latency of one

multiplication and two additions.

The computation of the outer hair cell pressure derivative P ′ohc is given by the

following equation:

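$$P'_{ohc}(x, T + h_t) = K_{ohc}(x) \times \left(v_{bm}(x, T + h_t) - w_1(x)\,\xi_{bm}(x, T + h_t)\right) - w_0 \times P_{ohc}(x, T + h_t) \tag{3.3.4}$$

where,

$$K_{ohc}(x) = -r(x) \times w_0, \qquad w_0 = 2\pi F_{ohc} = 2\pi \times 1500 = 9424.778 \longrightarrow 2^{12}, \qquad w_1(x) = \frac{1}{m(x)\,c(x)\,w_0}$$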

The value of the parameter w0 was also rounded to a power of 2, converting the multiplication of Pohc into a shift operation. The computation of P′ohc may take place anywhere between the Gunit and the MEunit, as the value of P′ohc is only needed by the Eunit or the MEunit.

Since we have already calculated the value of ξbm(x, T + ht)/(m(x)c(x)) for the g vector calculation, there is no need to recalculate it; we simply store and reuse it. In this case the multiplication of w1 by ξbm turns into a shift operation by w0^-1.

The calculation of the P ′ohc requires one multiplication and two additions per

section. The latency of the P ′ohc is not taken into account as it can be done between

the Gunit and the MEunit.

It is important to mention that the calculations of all the units except the Eunit are repeated several times, as seen in figure 3.3. The number of time iterations was determined by simulations and will be discussed later.

The work-load and latency of the Gunit is summarized in table 3.3.



| Operation | Software | Hardware |
|---|---|---|
| multiplications | 6(N − 1) | 4(N − 1) |
| shifts | 0 | 2(N − 1) |

Table 3.3: Comparison of Gunit work-load and latency between software and hardware implementations.

3.3.3 Punit

The Punit is the unit which solves the boundary condition problem described in Eq. 2.2.7,

$$\frac{\partial^2 P}{\partial x^2} - QP = QG \tag{3.3.5}$$

with the boundary conditions:

$$\begin{aligned} P(x,t) &= S(t), & x &= 0 \\ P(x,t) &= 0, & x &= \ell \end{aligned}$$

The solution of the boundary condition problem is the major calculation in the algorithm. The method chosen for its solution in software was the LU-decomposition method described in Appendix B.1. The LU-decomposition method is solved serially. It requires 2 × (N + 1) sequential steps, which makes it unsuitable for hardware implementation. The Punit is a bottleneck, as shown in figure 3.4.

Figure 3.4: Illustration of the Punit as a bottleneck (blocks: input, Eunit, Gunit, Punit, Dunit, MEunit, Output; the units other than the Punit are replicated across sections)

The algorithm used in the software solution provides an exact solution. We choose to consider an iterative method that will be solved in parallel. Iterative methods work by continually refining an initial approximate solution so that it becomes closer and closer to the correct solution. In some cases, iterative algorithms require substantially less time and/or fewer processors than do their exact algorithm counterparts.

We have chosen the Jacobi Relaxation method [24] in order to solve the boundary

condition problem. The Jacobi Relaxation method is an iterative method which

enables us to solve the equation in a parallel way. In the following subsection we

explain the Jacobi method.

Jacobi Relaxation

Consider the N × N system of equations A~x = ~b, where we assume that A = (aij) is invertible (so that ~x has a unique solution), and that the diagonal entries of A are nonzero. Rewriting the ith equation and solving for xi, we find that:

$$x_i = \frac{-1}{a_{ii}}\left(\sum_{j \ne i} a_{ij}\,x_j - b_i\right) \tag{3.3.6}$$

for 0 ≤ i ≤ N. Given an approximate solution ~x(t) to the system of equations, one natural way to update the solution would be to reformulate equation 3.3.6 as:

$$x_i(t + 1) = \frac{-1}{a_{ii}}\left(\sum_{j \ne i} a_{ij}\,x_j(t) - b_i\right) \tag{3.3.7}$$

Updating the solution for ~x by equation 3.3.7 is known as Jacobi iteration or Jacobi

relaxation, and can produce solutions that are close to optimal in a reasonable number

of iterations provided that the matrix A satisfies certain properties.

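Let us define D as

$$D = \begin{pmatrix}
a_{11} & 0 & & \\
0 & a_{22} & 0 & \\
 & \ddots & \ddots & \ddots \\
 & & 0 & a_{NN}
\end{pmatrix}$$

and M as

$$M = D^{-1}(D - A)$$

If we rewrite equation 3.3.7 in vector form:

$$\vec{x}(t + 1) = -D^{-1}\left((A - D)\,\vec{x}(t) - \vec{b}\right) = M\vec{x}(t) + D^{-1}\vec{b} \tag{3.3.8}$$

and let

$$\vec{\varepsilon}(t) = \vec{x}(t) - \vec{x} \tag{3.3.9}$$

denote the vector amount by which ~x(t) differs from the exact solution ~x, then substituting Eq. 3.3.8 in Eq. 3.3.9 yields,

$$\vec{\varepsilon}(t + 1) = \vec{x}(t + 1) - \vec{x} = M\vec{\varepsilon}(t) \tag{3.3.10}$$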

Jacobi relaxation converges to the correct solution for ~x provided that M^t converges to zero as t → ∞, where M = D^-1(D − A) and D is the diagonal matrix containing the diagonal entries of A. (Equivalently, the algorithm converges to the correct solution provided that all of the eigenvalues of M have magnitude less than one.) Thus, ~ε(t) = M^t ~ε(0) and ~ε(t) → 0 if M^t → 0 as t → ∞. The rate of convergence depends on how close the eigenvalues of M are to 1 in absolute value.
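The convergence condition can be checked numerically on a small system. The sketch below is a generic illustration of Jacobi relaxation using NumPy; the 3 × 3 matrix, the right-hand side and the function names are arbitrary choices made for the example and are not taken from the cochlear model.

```python
import numpy as np

def jacobi_iteration_matrix(A):
    """M = D^{-1} (D - A), with D the diagonal part of A."""
    D = np.diag(np.diag(A))
    return np.linalg.solve(D, D - A)

def spectral_radius(M):
    """Largest eigenvalue magnitude of M."""
    return np.max(np.abs(np.linalg.eigvals(M)))

def jacobi(A, b, x0, n_iter):
    """Run n_iter Jacobi sweeps: x <- (b - (A - D) x) / diag(A)."""
    d = np.diag(A)
    off_diagonal = A - np.diag(d)
    x = x0.astype(float)
    for _ in range(n_iter):
        x = (b - off_diagonal @ x) / d
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])

rho = spectral_radius(jacobi_iteration_matrix(A))
x_jacobi = jacobi(A, b, np.zeros(3), n_iter=50)
x_exact = np.linalg.solve(A, b)
```

For this matrix the spectral radius is well below one, so the error shrinks by a roughly constant factor in every sweep. Each entry of the new iterate depends only on the previous iterate, so all entries can be updated at the same time.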

Applying Jacobi Relaxation in the Punit

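The boundary condition problem is represented as a linear system AP = B (chapter two and Appendix B.1) where,

$$P = \begin{pmatrix} P(0) \\ P(1) \\ \vdots \\ P(N-1) \\ P(N) \end{pmatrix}, \qquad
B = \begin{pmatrix} S(T) \\ 0 \\ \vdots \\ 0 \\ 0 \end{pmatrix} + h_x^2 \begin{pmatrix} 0 \\ G_1 Q_1 \\ \vdots \\ G_{N-1} Q_{N-1} \\ 0 \end{pmatrix}$$

$$A = \begin{pmatrix}
1 & 0 & & & \\
1 & -(2 + h_x^2 Q_1) & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1 & -(2 + h_x^2 Q_{N-1}) & 1 \\
 & & & 0 & 1
\end{pmatrix}$$

Dividing the system by $h_x^2$ yields:

$$\begin{pmatrix}
1 & 0 & & & \\
u & m_{11} & u & & \\
 & \ddots & \ddots & \ddots & \\
 & & u & m_{N-1,N-1} & u \\
 & & & 0 & 1
\end{pmatrix}
\begin{pmatrix} p(0) \\ p(1) \\ \vdots \\ p(N-1) \\ p(N) \end{pmatrix}
=
\begin{pmatrix} g(0) \\ g(1) \\ \vdots \\ g(N-1) \\ g(N) \end{pmatrix} \tag{3.3.11}$$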

where the g vector was defined and calculated in the previous unit (Gunit) and u and mii are defined as:

$$\begin{aligned}
u &= 1/h_x^2 = 1/(2^{-7})^2 = 1/2^{-14} = 2^{14} \\
m_{ii} &= -(2/h_x^2 + K(i)) = -(2^{15} + K(i)), \qquad i = 1, 2, \cdots, N-1
\end{aligned} \tag{3.3.12}$$

p is the basilar membrane pressure vector we want to obtain. Applying the Jacobi Relaxation method described in equation 3.3.7, with diagonal entries mii and off-diagonal entries u, we get:

$$\begin{aligned}
p^{n+1}(i) &= \frac{1}{m_{ii}} \times \left(g(i) - u\left(p^{n}(i-1) + p^{n}(i+1)\right)\right), \qquad i = 1, 2, \cdots, N-1 \\
p(0) &= g(0) = \text{input} \\
p(N) &= g(N) = 0
\end{aligned} \tag{3.3.13}$$

where $n$ is the iteration number. The first approximation for the vector $p$, $p^{0}(i)$, is the last pressure vector $p$ that was computed. The multiplication by $u$ in equation 3.3.13 is a shift operation, since $u$ equals $2^{14}$.

Using the Jacobi Relaxation method to solve the boundary equation, we can implement the Punit in parallel, computing the basilar membrane pressure at coordinate $i$ within a few iterations. We compute all of the coordinates $i = 0, \cdots, N$ at the same time, which keeps the computational latency minimal and equal to that of a Punit computation for one section. The parallel architecture is illustrated in figure 3.5.
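As a rough illustration of the Punit iteration, the sketch below runs Jacobi sweeps on the tridiagonal system with $u = 2^{14}$ and $m_{ii} = -(2^{15} + K(i))$. The section count, the constant `K` and the right-hand side `g` are hypothetical test data, and the function names are not taken from the thesis code.

```python
def jacobi_punit(g, K, p_prev, iterations, u=2.0 ** 14):
    """Run Jacobi sweeps for the Punit, starting from the previous pressure vector."""
    n = len(g) - 1                      # sections are indexed 0..N
    p = list(p_prev)
    p[0], p[n] = g[0], g[n]             # boundary rows: p(0) = input, p(N) = g(N)
    for _ in range(iterations):
        nxt = p[:]                      # every section reads the old vector (parallel update)
        for i in range(1, n):
            m_ii = -(2.0 * u + K[i])    # m_ii = -(2^15 + K(i))
            nxt[i] = (g[i] - u * (p[i - 1] + p[i + 1])) / m_ii
        p = nxt
    return p


def max_residual(p, g, K, u=2.0 ** 14):
    """Largest |u p(i-1) + m_ii p(i) + u p(i+1) - g(i)| over the interior rows."""
    n = len(g) - 1
    return max(abs(u * (p[i - 1] + p[i + 1]) - (2.0 * u + K[i]) * p[i] - g[i])
               for i in range(1, n))


# Hypothetical test data: N = 8, constant K, unit input at p(0), g(N) = 0.
N = 8
K = [2.0 ** 13] * (N + 1)
g = [1.0] + [0.0] * N
p5 = jacobi_punit(g, K, p_prev=[0.0] * (N + 1), iterations=5)
p200 = jacobi_punit(g, K, p_prev=[0.0] * (N + 1), iterations=200)
```

Starting from a zero vector, five sweeps leave a visible residual in this toy case. This is why the first approximation is taken from the last computed pressure vector rather than from zero.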

The number of iterations will influence the computational precision of the pressure

vector. Applying more iterations will certainly increase precision. On the other hand,

applying more iterations will increase the latency of the Punit.

The LU-decomposition method introduced for the software solution requires $2N+1$ multiplications and $2N$ additions. Its computational latency is the same, since the method works sequentially. The Jacobi Relaxation method requires one multiplication and two additions per section. As we have $N-1$ coordinates after the elimination of the first and last coordinates, the number of multiplications is $N-1$ and the number of additions is $2N-2$ per iteration. If we need about $P_{iter}$ iterations to reach the solution of the Punit, then the total work of the Punit is $P_{iter}(N-1)$ multiplications and $P_{iter}(2N-2)$ additions. Although the hardware solution requires more work than the software solution, its latency is far shorter. The latency of each iteration in the Punit is one multiplication and two additions. If we need $P_{iter}$ iterations to compute the Punit, the total latency is $P_{iter}$ multiplications and $2 \times P_{iter}$ additions. The work-load and latency of the Punit are summarized in table 3.4.

Figure 3.5: Illustration of the parallel Punit. (Block labels: input, Eunit (Euler), and per-section Gunit, Punit, Dunit and MEunit blocks leading to the output.)

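| Parameter | Operation | Software | Hardware |
|---|---|---|---|
| Latency | additions | $2N$ | $P_{iter} \times 2$ |
| Latency | multiplications | $2N + 1$ | $P_{iter} \times 1$ |
| Work | additions | $2N$ | $P_{iter}(2(N-1))$ |
| Work | multiplications | $2N + 1$ | $P_{iter}(N-1)$ |
| Work | shifts | 0 | $P_{iter}(2(N-1))$ |

Table 3.4: Comparison of Punit work-load and latency between software and hardware implementations.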

Punit convergence

As described in the Jacobi Relaxation subsection, the method converges to the correct solution for $\vec{x}$ provided that $M^{t}$ converges to zero as $t \to \infty$. Figure 3.6 demonstrates the convergence of $M^{t}$ to zero as $t$ increases for the matrix $A$ specified in equation 3.3.11.

Figure 3.6: Jacobi matrix convergence. (Plot of the matrix energy in dB versus the number of iterations.)

3.3.4 Dunit

The Dunit calculates the membrane acceleration. This unit follows the Punit as seen

in figure 3.3. The acceleration is calculated from the initial value problem,

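$$
\begin{aligned}
v'_{bm} &= \frac{1}{m}\left[P + P_{ohc} - r\,v_{bm} - s\,\xi_{bm}\right] \\
v'_{bm} &= \frac{1}{m}\left[P + g/K\right]
\end{aligned}
\tag{3.3.14}
$$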

where $P$ is the pressure vector taken from the Punit. All other variables and parameters have been calculated before. Equation 3.3.14 is used for the software solution. It requires $2(N-1)$ multiplications and $N-1$ additions to compute the acceleration vector $v'_{bm}$ for all the coordinates (we exclude the first and last coordinates). The latency of the software solution is one addition and two multiplications per section, which is significant.

Using the equations developed in the Punit, we derive a new expression for the vector $g/K$. Rewriting Eq. 3.3.13 and substituting Eq. 3.3.12 yields:

$$
g(i) = \frac{1}{h_x^2}\bigl(p(i-1) - 2p(i) + p(i+1)\bigr) - K(i)\,p(i)
\tag{3.3.15}
$$

We already know that

$$
K(i) = \frac{2^{-6}}{m(i)}
$$

Dividing the vector $g$ by the vector $K$ and substituting $K(i)$ yields:

$$
g(i)/K(i) = \frac{2^{6}\,m(i)}{h_x^2}\bigl(p(i-1) - 2p(i) + p(i+1)\bigr) - p(i)
\tag{3.3.16}
$$

Substituting $g/K$ from equation 3.3.16 into the Dunit equation 3.3.14 gives:

$$
v'_{bm} = \frac{1}{m(i)}\left[p(i) + \frac{2^{6}\,m(i)}{h_x^2}\bigl(p(i-1) - 2p(i) + p(i+1)\bigr) - p(i)\right]
\tag{3.3.17}
$$

Substituting $1/h_x^2 = 2^{14}$ yields:

$$
v'_{bm} = 2^{20}\left[p(i-1) - 2p(i) + p(i+1)\right]
\tag{3.3.18}
$$
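The simplification can be checked numerically. The following sketch compares the software form $\frac{1}{m}[p + g/K]$ with the shift-only form $2^{20}[p(i-1) - 2p(i) + p(i+1)]$, assuming $K(i) = 2^{-6}/m(i)$ and $1/h_x^2 = 2^{14}$ as in the text. The pressure and mass values are hypothetical.

```python
import math

U = 2.0 ** 14   # u = 1 / h_x^2


def accel_software(p, g, K, m, i):
    """Software form of the acceleration: (1 / m(i)) * [p(i) + g(i) / K(i)]."""
    return (p[i] + g[i] / K[i]) / m[i]


def accel_hardware(p, i):
    """Hardware form: 2^20 * [p(i-1) - 2 p(i) + p(i+1)], a scaled second difference."""
    return math.ldexp(p[i - 1] - 2.0 * p[i] + p[i + 1], 20)


# Hypothetical pressures and masses; K and g follow the definitions in the text.
p = [1.0, 0.6, 0.3, 0.2, 0.0]
m = [0.5, 0.25, 0.125, 0.5, 0.5]
K = [2.0 ** -6 / m_i for m_i in m]
g = [0.0] * len(p)
for i in range(1, len(p) - 1):
    g[i] = U * (p[i - 1] - 2.0 * p[i] + p[i + 1]) - K[i] * p[i]

pairs = [(accel_software(p, g, K, m, i), accel_hardware(p, i))
         for i in range(1, len(p) - 1)]
```

Both forms agree up to floating-point rounding, while the second needs neither `g`, `K` nor `m`.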

The new expression for the basilar membrane acceleration is much more suitable for the hardware design. Using equation 3.3.18 in the hardware model requires only two additions per section, thus $2(N-1)$ additions. The Dunit latency is therefore two additions, which is better than that of the software solution. We summarize the work-load and latency of the Dunit in table 3.5.

Table 3.5: Comparison of Dunit work-load and latency between software and hardware implementations.

3.3.5 MEunit

The MEunit is the last unit in the algorithm sequence. It computes the displacement,

velocity and outer hair cell pressure along the basilar membrane using the Modified

Euler method. The Modified Euler equations are:

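$$
\begin{aligned}
\xi_{bm}(T + h_t) &= \xi_{bm}(T) + \frac{h_t}{2}\left[v_{bm}(T) + v_{bm}(T + h_t)\right], \\
v_{bm}(T + h_t) &= v_{bm}(T) + \frac{h_t}{2}\left[v'_{bm}(T) + v'_{bm}(T + h_t)\right], \\
P_{ohc}(T + h_t) &= P_{ohc}(T) + \frac{h_t}{2}\left[P'_{ohc}(T) + P'_{ohc}(T + h_t)\right]
\end{aligned}
\tag{3.3.19}
$$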

In the hardware model, $h_t/2 = 2^{-20}/2 = 2^{-21}$, so the multiplication becomes a shift operation. The MEunit can compute all three variables in parallel. Only six additions are necessary per coordinate, so the unit requires $6(N-1)$ additions in total. The latency is only two additions, assuming we compute all sections in parallel. The latency could be reduced further if the computations started earlier, since the only variable that limits this computation is the acceleration $v'_{bm}(T + h_t)$, which is computed in the previous unit (Dunit). Therefore, the latency could be reduced to a single addition. We summarize the work-load and latency of the MEunit in table 3.6.
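A single Modified Euler update with $h_t/2 = 2^{-21}$ can be sketched as follows. The function name and the sample values are hypothetical, and `math.ldexp` stands in for the hardware shift.

```python
import math


def modified_euler_step(x, dx_now, dx_next, shift=21):
    """Return x(T + h_t) = x(T) + (h_t / 2) [x'(T) + x'(T + h_t)], with h_t / 2 = 2^-shift."""
    # ldexp scales by a power of two, the floating-point analogue of a right shift.
    return x + math.ldexp(dx_now + dx_next, -shift)


# Hypothetical values for one section: displacement update, then velocity update.
xi_next = modified_euler_step(0.0, 1.0, 3.0)         # uses v(T) and v(T + h_t)
v_next = modified_euler_step(1.0, 2.0 ** 21, 0.0)    # uses v'(T) and v'(T + h_t)
```

Each update costs two additions and one shift, which matches the six additions per coordinate counted above for the three variables.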

Table 3.6: Comparison of MEunit work-load and latency between software and hardware implementations.

3.4 Computational Analysis

Up to this point, a basic hardware model for the cochlea has been introduced. A new method for solving the boundary condition equation was proposed in order to fit a parallel architecture. We have also converted most of the parameters to powers of 2 and fixed the values of $T_{iter}$ and $P_{iter}$. The flow diagram of the algorithm as described is illustrated in figure 3.7.

We analyze the hardware model in two categories. The first category is the total work load of the algorithm as implemented in hardware versus the original software implementation. The second category is the critical-path latency, which indicates the minimum computational time needed.

Starting with the first category, the work-load of an algorithm affects the power consumption, silicon area, complexity and processing time. Clearly, we would like to reduce the work-load of the algorithm as much as we can. The hardware model consists of two iterative methods. As seen in figure 3.7, the Punit is solved in $P_{iter}$ iterations and the whole algorithm flow is computed using $T_{iter}$ iterations. There is a tradeoff between the number of computational iterations and output precision: fewer iterations reduce the work-load but harm performance. The performance analysis (chapter four) has shown that the basic architecture for the hardware model is obtained when the numbers of iterations are configured to $T_{iter} = 3$ and $P_{iter} = 5$ for each time step of $h_t$ = 1 µs. The Euler unit (Eunit) is computed only once at the beginning of every time point and the other units are repeated $T_{iter}$ times for each time point. The Punit is solved by the Jacobi method with $P_{iter}$ iterations each time.

Figure 3.7: Block diagram of the hardware model. (Design flow: input block, Eunit, Gunit, Punit with P iterations, Dunit, MEunit, output block, with an outer loop of T iterations.)

In table 3.7 and table 3.8 we compare the work-load of the software and hardware

models. The number of additions, multiplications and shifts are displayed per section

and then multiplied by the parameter N (number of sections) and by Titer and/or

Piter iterations which represent the number of times the unit is calculated for each

time point.

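The work-load in the software model

| Unit | Add. (per section) | Mult. (per section) | Shifts (per section) | Additions (total) | Mult. (total) | Shifts (total) |
|---|---|---|---|---|---|---|
| Eunit | 3 | 3 | 0 | $3(N-1)$ | $3(N-1)$ | 0 |
| Gunit | 4 | 6 | 0 | $6(4(N-1))$ | $6(6(N-1))$ | 0 |
| Punit | 2 | 2 | 0 | $6(2N)$ | $6(2N+1)$ | 0 |
| Dunit | 1 | 2 | 0 | $6(N-1)$ | $6(2(N-1))$ | 0 |
| MEunit | 6 | 3 | 0 | $5(6(N-1))$ | $5(3(N-1))$ | 0 |
| Total | 16 | 16 | 0 | $75N - 63$ | $78N - 60$ | 0 |

Table 3.7: An analysis of the work-load of the software model.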

The Eunit in the following tables is not multiplied by $T_{iter}$ because it is computed only once for each time point. In addition, the number of MEunit time iterations is reduced by one, as the first iteration includes the Eunit computation. It is impossible to know the exact total work-load of the software model, since its number of time iterations is not deterministic. Hence, we averaged the number of time iterations of the software algorithm and set it to 6.

The work-load in the hardware model for a $P_{iter} \times T_{iter}$ architecture ($P = P_{iter}$, $T = T_{iter}$)

| Unit | Add. (per section) | Mult. (per section) | Shifts (per section) | Additions (total) | Mult. (total) | Shifts (total) |
|---|---|---|---|---|---|---|
| Eunit | 3 | 0 | 3 | $3(N-1)$ | 0 | $3(N-1)$ |
| Gunit | 4 | 4 | 2 | $T(4(N-1))$ | $T(4(N-1))$ | $T(2(N-1))$ |
| Punit | $2 \times P$ | $1 \times P$ | $2 \times P$ | $T(2P(N-1))$ | $T(P(N-1))$ | $T(2P(N-1))$ |
| Dunit | 2 | 0 | 2 | $T(2(N-1))$ | 0 | $T(2(N-1))$ |
| MEunit | 6 | 0 | 3 | $(T-1)(6(N-1))$ | 0 | $(T-1)(3(N-1))$ |
| Total | $15 + 2P$ | $4 + P$ | $10 + 2P$ | | | |

| Operation | Value |
|---|---|
| Additions | $(2P_{iter}T_{iter} + 12T_{iter} - 3)(N-1)$ |
| Multiplications | $(T_{iter}P_{iter} + 4T_{iter})(N-1)$ |
| Shifts | $(2P_{iter}T_{iter} + 7T_{iter})(N-1)$ |

Table 3.8: An analysis of the work-load of the hardware model.

The hardware model work-load is calculated for a Piter × Titer architecture, which

means, Titer time iterations and Piter Jacobi iterations when solving the Punit. In the

basic architecture discussed in chapter four we configure Piter and Titer to be 5 and

3 respectively. The hardware model work-load is calculated for this architecture in

table 3.9.

The work-load in the hardware model for the 5×3 architecture

| Unit | Add. (per section) | Mult. (per section) | Shifts (per section) | Additions (total) | Mult. (total) | Shifts (total) |
|---|---|---|---|---|---|---|
| Eunit | 3 | 0 | 3 | $3(N-1)$ | 0 | $3(N-1)$ |
| Gunit | 4 | 4 | 2 | $3(4(N-1))$ | $3(4(N-1))$ | $3(2(N-1))$ |
| Punit | $2 \times 5$ | $1 \times 5$ | $2 \times 5$ | $3(10(N-1))$ | $3(5(N-1))$ | $3(10(N-1))$ |
| Dunit | 2 | 0 | 2 | $3(2(N-1))$ | 0 | $3(2(N-1))$ |
| MEunit | 6 | 0 | 3 | $2(6(N-1))$ | 0 | $2(3(N-1))$ |
| Total | 25 | 9 | 20 | $63(N-1)$ | $27(N-1)$ | $51(N-1)$ |

Table 3.9: An analysis of the work-load of the hardware model for the 5×3 configuration.
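The closed-form totals of the work-load tables can be evaluated directly. The sketch below is only a calculator for those formulas. The function names are illustrative, and $N = 448$ is used as an example value because the evaluation chapter quotes 448 sections.

```python
def hardware_workload(p_iter, t_iter, n):
    """Operation counts for a P_iter x T_iter hardware architecture with N = n."""
    sections = n - 1
    return {
        "additions": (2 * p_iter * t_iter + 12 * t_iter - 3) * sections,
        "multiplications": (t_iter * p_iter + 4 * t_iter) * sections,
        "shifts": (2 * p_iter * t_iter + 7 * t_iter) * sections,
    }


def software_workload(n):
    """Software totals, using the averaged six time iterations."""
    return {"additions": 75 * n - 63, "multiplications": 78 * n - 60, "shifts": 0}


hw = hardware_workload(p_iter=5, t_iter=3, n=448)   # 63, 27 and 51 operations per section
sw = software_workload(n=448)
```

For the 5×3 configuration the formulas reduce to $63(N-1)$ additions, $27(N-1)$ multiplications and $51(N-1)$ shifts.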

The comparison of the basic-architecture hardware model with the software model shows that the hardware model requires fewer operations to compute 1 µs of speech. The most significant change is in the number of multiplications: for 1 µs of speech, the software requires about $78N$ multiplications and the hardware about $27N$. Multiplications "cost" more than the other operations, so their reduction was important. Most of the multiplications were converted to shift operations.

The second category in the computational analysis deals with the critical path. The critical path is the longest path through the hardware model and indicates the minimum latency of the computational processing.

The operations in the critical path must be completed within the time frame of $h_t$ (1 µs) when a real-time application is required. The software solution runs on a single processor, which executes only a few operations at a time. Today, advanced processors have different instruction queues for different operations, such as additions and multiplications for floating-point and integer numbers [18]. Although each queue has its own ALU, the dependencies between the instructions are kept. We can therefore assume that the software work is done serially on one general-purpose processor, making the critical path the same as the total work load.

In the hardware model we assume a computational block (ALU) for each section

enabling a parallel processing. The critical path is displayed in Table 3.10. We do

not consider here a pipeline architecture improvement. A pipeline architecture can

reduce the critical path by a factor of Titer. The time-iteration processing can be

implemented with a linear array topology. We discuss this issue later.

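The critical path in the hardware model for a $P_{iter} \times T_{iter}$ architecture ($P = P_{iter}$, $T = T_{iter}$)

| Unit | Additions (per time iteration) | Multiplications (per time iteration) | Additions (total latency) | Multiplications (total latency) |
|---|---|---|---|---|
| Eunit | 1 | 0 | 1 | 0 |
| Gunit | 2 | 1 | $T \times 2$ | $T \times 1$ |
| Punit | $2 \times P$ | $1 \times P$ | $T \times 2 \times P$ | $T \times P$ |
| Dunit | 2 | 0 | $T \times 2$ | 0 |
| MEunit | 1 | 0 | $(T-1) \times 1$ | 0 |
| Total | $2P + 6$ | $P + 1$ | $2TP + 5T$ | $TP + T$ |

Table 3.10: An analysis of the critical path of the hardware model for a $P_{iter} \times T_{iter}$ architecture.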

If we implement the "5x3" architecture, where $P_{iter} = 5$ and $T_{iter} = 3$ for $h_t$ = 1 µs, we must execute 18 multiplications and 45 additions in 1 µs. We assume that shift operations are done with no delay. In the next chapter we discuss the timing analysis.
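The critical-path totals $2TP + 5T$ additions and $TP + T$ multiplications can be evaluated for any configuration with a small helper. The function name is illustrative.

```python
def critical_path(p_iter, t_iter):
    """Total critical-path latency as (additions, multiplications)."""
    additions = 2 * t_iter * p_iter + 5 * t_iter
    multiplications = t_iter * p_iter + t_iter
    return additions, multiplications


adds_5x3, mults_5x3 = critical_path(p_iter=5, t_iter=3)
adds_10x3, mults_10x3 = critical_path(p_iter=10, t_iter=3)
```

For the 5×3 configuration this reproduces the 45 additions and 18 multiplications quoted in the text. A 10×3 configuration would need 75 additions and 33 multiplications in the same time frame.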

3.5 Pipeline Architecture

Since the cochlea model algorithm is solved numerically by an iterative method, a pipeline architecture may suit this problem. In the basic hardware model proposed, we chose a 5X3 architecture, which represents 5 iterations in the Punit and 3 time iterations. In the following pipeline architecture, we divide the three time-iterations into three stages, as shown in figure 3.8.

Figure 3.8: A pipeline architecture for the hardware model with the 5X3 combination. (Block labels: input, Eunit, three stages each containing Gunit, Punit with 5 iterations, Dunit and MEunit, and the output; the diagram carries the annotation "Every 0.1 micro second".)

In the pipeline architecture we start computing each stage using the best available estimate of the variables from the previous stage. The inputs of the variables to the Eunit are taken after the MEunit of the first stage, that is, after only one iteration. The second stage for the same time point starts with the outputs of the first stage and the updated variables from the third stage. Each time point still passes through three stages, which are the three time-iterations, but the stages do not start with the most up-to-date variables, as would be expected in the basic hardware model. The first approximation of the Punit for the basilar membrane pressure vector is again taken from the previous stage.

In a real-time application the whole sequence would have to be computed in 1 µs, since this is the core data rate; with the pipeline, only one of the three stages needs to be computed in that time. The pipeline architecture thus reduces the amount of work per time slot by a factor of $T_{iter}$. This enables a lower clock frequency, which reduces complexity.

The advantages of the pipeline architecture are in the shortening of the computa-

tional latency and the usage of the units at the same time. The major disadvantage

of this architecture is the usage of more silicon area since we need to implement all

of the units and we cannot compress them.

The pipeline architecture was implemented in the C code and verified. It showed

good results which are presented in the next chapter.

3.6 Delta Architecture

The basic software and hardware model numerical solutions solve the algorithm on a two-dimensional discrete grid. As the algorithm proceeds, it holds the absolute values of the variables. It was interesting to check whether converting the algorithm to work with the differences of the variables might improve its performance in the hardware model. Implementing this delta architecture might also save computational operations. The following equations for the units were implemented.

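The notation $d\langle\text{variable}\rangle$ represents the delta, or difference, of a variable. The difference equations for the Eunit are:

$$
\begin{aligned}
d\xi(T + h_t) &= 2^{-20} \times v(T), \\
dv(T + h_t) &= 2^{-20} \times v'(T), \\
dP_{ohc}(T + h_t) &= 2^{-20} \times P'_{ohc}(T)
\end{aligned}
\tag{3.6.1}
$$

Gunit:

$$
dg = -K\left[d\xi/c + r\,dv + 0.5\,dP_{ohc}\right]
\tag{3.6.2}
$$

Punit:

$$
dp(i) = K\left[dg - u\,\bigl(dp(i-1) + dp(i+1)\bigr)\right]
\tag{3.6.3}
$$

Dunit:

$$
\begin{aligned}
dP'_{ohc} &= K_{ohc}\left[dv - w_1\,d\xi\right] - w_0\,dP_{ohc}, \\
dv' &= 2^{20}\left[dp(i-1) - 2\,dp(i) + dp(i+1)\right]
\end{aligned}
\tag{3.6.4}
$$

MEunit:

$$
\begin{aligned}
d\xi &= 2^{-21}\left[v(T) + v(T + h_t)\right] = 2^{-21}\left[2v(T) + dv(T + h_t)\right], \\
dv &= 2^{-21}\left[v'(T) + v'(T + h_t)\right] = 2^{-21}\left[2v'(T) + dv'(T + h_t)\right], \\
dP_{ohc} &= 2^{-21}\left[P'_{ohc}(T) + P'_{ohc}(T + h_t)\right] = 2^{-21}\left[2P'_{ohc}(T) + dP'_{ohc}(T + h_t)\right]
\end{aligned}
\tag{3.6.5}
$$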

Only at the end of the time step, after the $T_{iter}$-th iteration, do the following equations need to be computed (not in a pipeline architecture):

$$
\begin{aligned}
v(T + h_t) &= v(T) + dv(T + h_t), \\
v'(T + h_t) &= v'(T) + dv'(T + h_t), \\
P'_{ohc}(T + h_t) &= P'_{ohc}(T) + dP'_{ohc}(T + h_t)
\end{aligned}
\tag{3.6.6}
$$
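The relation between the delta form and the absolute Modified Euler form can be checked with a small sketch. It assumes that the delta of a rate is its change over one time step, $dx'(T + h_t) = x'(T + h_t) - x'(T)$, so that $x'(T) + x'(T + h_t) = 2x'(T) + dx'(T + h_t)$. The function names and numeric values are hypothetical.

```python
import math


def delta_me(rate_now, d_rate, shift=21):
    """Delta MEunit form: d x = 2^-21 [2 x'(T) + d x'(T + h_t)]."""
    return math.ldexp(2.0 * rate_now + d_rate, -shift)


def absolute_me(x, rate_now, rate_next, shift=21):
    """Absolute Modified Euler form: x(T + h_t) = x(T) + 2^-21 [x'(T) + x'(T + h_t)]."""
    return x + math.ldexp(rate_now + rate_next, -shift)


# Hypothetical values for one section.
xi, v_now, v_next = 0.25, 3.0, 5.0
d_xi = delta_me(v_now, d_rate=v_next - v_now)
xi_next = absolute_me(xi, v_now, v_next)
```

Adding the delta to the previous value gives the same result as the absolute update, so only the differences need to be carried between iterations.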

Compared to the basic hardware model, we save 3 additions per section in the

Eunit and save 3 additions per section in the MEunit each time-iteration. But we

need to add 3 additions per section as stated in equation 3.6.6 at the end of a time

step. So, the delta architecture does not reduce the algorithm latency and work-load

dramatically.

We have implemented the delta architecture in the C code and verified its perfor-

mance. Histograms for the variables and differential variable were plotted and a fixed

point representation for the variables was defined. The results of the simulations and

fixed-point representation are discussed in the next chapter.

3.7 Summary

In this chapter we introduced the hardware model for the cochlea. The software model was written for simulation and validation of the cochlea model when the research began. It was not written with hardware implementation in mind and is therefore not suitable for hardware design. The major problem of the software algorithm, the sequential solution method of the Punit, is solved in the hardware model. The numbers of iterations for both numerical methods used in the algorithm were fixed, as we expect constant throughput. The parameters had to be converted to powers of 2. In addition to the basic hardware algorithm, we introduced two further modifications to the model: a pipeline architecture and a differential-variables (delta) architecture. All options were simulated and evaluated. We discuss the results in the next chapter.

Chapter 4

Evaluation of the Hardware Algorithm

In the previous chapter we introduced the basic hardware model for the cochlea. We also suggested several architectural modifications, such as pipelining and differential representation. In this chapter we present the simulation results and analyze them. We also evaluate the representation of the variables for the hardware design.

The hardware model solution was implemented in C++ and run on a PC. It was then implemented on an FPGA simulator in order to verify and validate its performance.

The output of the simulated model is represented by a two-dimensional matrix. The elements of the matrix represent the cochlear partition velocities ($v_{bm}$) for every time point. The matrix rows represent the longitudinal axis of the cochlear partition and the matrix columns represent the time axis. The time axis is sampled at a rate of 44.1 kHz. The longitudinal resolution equals the number of sections, which is 448. This representation is conceptually similar to a spectrogram of the input signal. An example showing the outputs of the cochlear model for three different input signals is illustrated in figure 4.1. The three input signals are (1) a chirp from 500 Hz to 8 kHz, (2) a combination of several frequencies and (3) the word "Mitz" (defined in table 4.1).

Figure 4.1: An example of the hardware model outputs. The input signals (amplitude versus time samples) are displayed on the left side and the output matrices (sections versus time samples) are displayed on the right side. Upper graphs: chirp; middle: combination of several frequencies; lower: the word "Mitz".

The output matrices shown in figure 4.1 represent the energy of every section over time. Each section represents a different frequency. The first sections represent the higher frequencies and the last sections represent the lower frequencies. For the chirp output we can see a logarithmic energy line that moves over time toward the lower-numbered sections, that is, toward higher frequencies. For the second output, a combination of several frequencies, we can see horizontal energy lines along the time axis at different sections. The model amplifies the lower sections more than the higher sections, which accounts for the difference in the colors of the energy lines. Finally, for the word "Mitz", we can see that it is composed of two parts: the "Mi", which holds low and high frequencies, and the "Tz", which holds only high frequencies.

In order to verify and validate the various implementations of the hardware model, we compare the output matrix to the output matrix of the original software model, which we use as our reference. The comparison criterion is the mean square error between the points of the matrices divided by the mean energy of the reference matrix. A relative error in each cell would be $\left(\frac{A_{ij} - Ref_{ij}}{Ref_{ij}}\right)^2$, but since a large fraction of the $Ref_{ij}$ entries are zero, we chose the relative error to be

$$
\text{Relative Error} = \frac{\frac{1}{N_T \cdot N_x}\sum_i \sum_j \left(A_{ij} - Ref_{ij}\right)^2}{\frac{1}{N_T \cdot N_x}\sum_i \sum_j \left(Ref_{ij}\right)^2}
\tag{4.0.1}
$$

where $N_T \cdot N_x$ is the number of elements in $A$. $A$ and $Ref$ represent the matrix of the hardware model and the software reference matrix, respectively.
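A minimal sketch of this error measure follows, with matrices stored as lists of rows. The two small matrices are hypothetical and only show that zero-valued reference cells cause no division problem.

```python
def relative_error(a, ref):
    """Sum of squared differences divided by the energy of the reference matrix."""
    num = sum((x - r) ** 2
              for row_a, row_r in zip(a, ref)
              for x, r in zip(row_a, row_r))
    den = sum(r ** 2 for row_r in ref for r in row_r)
    return num / den   # the 1 / (N_T * N_x) factors cancel


ref = [[0.0, 1.0], [2.0, 0.0]]   # hypothetical reference matrix with zero cells
a = [[0.0, 1.1], [1.9, 0.0]]     # hypothetical hardware-model output
err = relative_error(a, ref)     # about 0.004, i.e. 0.4%
```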

In order to evaluate the hardware model we ran the simulation on different input

signals. We used basic synthetic signals described in table 4.1. For a more reliable

representation we used a list of Hebrew words called HAB.

The HAB list is a Hebrew adaptation of the AB list. The AB list is a set of meaningful monosyllabic consonant-vowel-consonant (CVC) words. The list was designed so that the different phonemes in English are equally distributed throughout the entire list [3]. The AB list is commonly used in hearing tests, as it reduces the effect of word frequency and/or word familiarity on test scores. The HAB list was designed for native Hebrew speakers, and it consists of 15 lists of 10 monosyllabic words such as "shen" and "kir". Each list consists of ten phonetically balanced CVC words. The term "phonetic balance" indicates that the speech material has a phonemic composition equivalent to that of everyday speech. We used one of the 15 lists, shown in table 4.2. The HAB list was recorded by a single female speaker with a sampling rate of 44.1 kHz. This list is commonly used in hearing tests for clinical evaluation in Israel, and particularly in the Communication Disorder group at the Sheba Medical Center.

| Signal | Length [s] | Sample rate [kHz] | Description |
|---|---|---|---|
| Sin | 0.1 | 50 | Combination of several frequencies: 250 Hz, 500 Hz, 750 Hz, 1, 2, 4, 6, 8 kHz. |
| Sinc | 0.01 | 50 | Center at 5 ms, with BW of 22.5 kHz. |
| Chirp | 0.1 | 50 | Linear frequency from 0.5 to 8 kHz. |
| Click | 0.01 | 50 | A 0.1 ms click at 2 ms. |
| Mi | 0.1 | 44.1 | First part of the word Mitz. |
| Tz | 0.1 | 44.1 | Second part of the word Mitz. |

Table 4.1: The synthetic input signals.

| Signal | Length [s] | Sample rate [kHz] |
|---|---|---|
| Buz | 0.6 | 44.1 |
| Chug | 0.5 | 44.1 |
| Dov | 0.5 | 44.1 |
| Eich | 0.5 | 44.1 |
| Kir | 0.45 | 44.1 |
| La | 0.4 | 44.1 |
| Mitz | 0.6 | 44.1 |
| Pas | 0.5 | 44.1 |
| Shen | 0.6 | 44.1 |
| Tof | 0.5 | 44.1 |

Table 4.2: The Hebrew words input signals.

4.1 Punit Integration

The replacement of the Punit solution method had to be evaluated for convergence and adequate performance. The analytical solution was replaced by a numerical method, Jacobi Relaxation, applied in parallel. In the software model we estimated the truncation error of the basilar membrane velocity at every time-iteration and kept iterating until an adequate error was reached. In contrast to the software model, the hardware model throughput must be constant, which implies a fixed number of computational iterations. In order to verify the integration of the Punit and determine the optimal numbers of both $P_{iter}$ and $T_{iter}$ iterations, various combinations of time and Punit iterations were tested. The outputs were compared to the original software model.

Figure 4.2: The relative error as a function of t iterations when p iterations are constant. (Curves for p = 1, 2, 3, 4, 5 and 50; relative error in percent on a logarithmic axis versus t iterations.)

Figure 4.2 shows the relative output error of the hardware model versus the software model as a function of the number of time iterations, for various numbers of $P_{iter}$ iterations. Figure 4.3 shows the relative output error as a function of the number of Punit iterations, for various numbers of time iterations. The new system with the Jacobi-based Punit clearly converges after a few iterations.

Figure 4.3: The relative error as a function of p iterations when t iterations are constant. (Curves for t = 1, 2 and 3; relative error in percent on a logarithmic axis versus p iterations.)

Several combinations between the number of time and Punit iterations can be set.

4.2 Results for Different Configurations

We ran the simulations on the synthetic signals and the recorded Hebrew words for our various hardware model implementations. We write an $M \times N$ configuration to denote $M$ Punit iterations ($P_{iter}$) and $N$ time-iterations ($T_{iter}$): the left-hand number gives the Punit iterations and the right-hand number gives the time-iterations. The relative MSE was computed for each input signal. Figure 4.4 and Figure 4.5 show the relative errors of the hardware model with different configurations for the Hebrew words and the synthetic signals, respectively.

Figure 4.4: Relative error for different time and Punit iterations configurations for the Hab1 word list. (Curves for the 5x3, 5x3pipe, 10x3, 3x3, 5x2, 10x2 and "10,5" configurations over the words buz, chug, dov, eich, kir, la, mitz, pas, shen and tof.)

We chose the 5X3 configuration as the best since its relative error is about 1%. It is also possible to run with a 3X3 configuration, but then the relative error is higher than one percent for some of the tested words. Applying a pipeline architecture to the 5X3 configuration shifts the 5X3 relative-error curve upward by about 0.5%. The best performance is observed for the 10X3 configuration in both graphs, indicating the significance of the Punit iterations. The configuration "10, 5" is composed of two time-iterations, where the Punit performs 10 iterations in the first time-iteration and 5 iterations in the second. We find this implementation unattractive because it is not modular: the work-load differs between time-iterations, which makes it harder to implement and unsuitable for pipelining.

Figure 4.5: Relative error for different time and Punit iterations configurations for synthetic signals. (Curves for the 5x3, 10x3, 3x3, 5x2, 10x2 and "10,5" configurations over the signals chirp, sin, mi, tz, click and sinc.)

The recorded Hebrew words were also tested with additive white Gaussian noise. We created noisy words with 15 dB SNR. The results of the simulations are plotted in figure 4.6 for a 5X3 architecture configuration. The "reg" line represents the relative error for the clean words, the "snr15" line represents the relative error for the words with 15 dB noise applied, and the third line, "snr15,pipe", shows the effect of adding a pipeline architecture. As seen from the graph, unexpectedly, the relative errors for the tested noisy words were lower than for the clean ones. Since the algorithm is numeric and we deal with small errors, we ascribe this phenomenon to numerical computation imprecision. Adding noise to a quiet signal can sometimes help a numerical algorithm converge faster. We can also see from the graph that the implementation of a pipeline architecture increases the relative error by about 0.1%.

Figure 4.6: Relative error for different configurations when noise is applied. reg: a 5X3 configuration using clean words; snr15: a 5X3 configuration using noisy words; snr15,pipe: a 5X3 pipeline configuration using noisy words. The confidence interval was set to 99%; the number of simulations was 5.

Another aspect of our research was to investigate the influence of a reduction in the input data rate on the model. The input signals are sampled at 44.1 kHz, a typical audio sample rate, and linear interpolation raises the sample rate to about 1 MHz. This extremely high sample rate is dictated by the parameter $h_t$ in the model. Since the model's algorithm is solved by a numerical method, it must use tiny steps in order to converge and not diverge. The tuning curves of the filters implemented in the algorithm may be very steep and change by tenths of dB between two consecutive sections. This high rate might be the algorithm's biggest disadvantage, as the work-load is influenced directly by the high core data rate. Moreover, as we seek to reach a real-time application, the high rate dictates a long processing time for the computational algorithm.
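The interpolation step can be sketched as a generic linear resampler. The function below is an illustration, not the thesis code; the target rate of $2^{20}$ Hz is an assumption based on $h_t = 2^{-20}$ s, and the three-sample input is hypothetical.

```python
def upsample_linear(samples, rate_in, rate_out):
    """Linearly interpolate samples from rate_in to rate_out (both in Hz)."""
    if len(samples) < 2:
        return list(samples)
    duration = (len(samples) - 1) / rate_in       # time of the last input sample
    n_out = int(duration * rate_out) + 1
    out = []
    for k in range(n_out):
        pos = k * rate_in / rate_out              # position in input-sample units
        i = min(int(pos), len(samples) - 2)       # left neighbour, clamped at the end
        frac = pos - i
        out.append(samples[i] + frac * (samples[i + 1] - samples[i]))
    return out


coarse = upsample_linear([0.0, 1.0, 2.0], rate_in=1, rate_out=2)
fine = upsample_linear([0.0, 1.0, 0.0], rate_in=44_100, rate_out=2 ** 20)
```

Each 44.1 kHz sample interval is covered by roughly 24 interpolated points, which shows how directly the core rate multiplies the work-load.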

Figure 4.7: The relative error as a function of the time step parameter for the Hebrew words inputs. (Curves for ht=2^-17, ht=2^-18 and ht=2^-20 over the words buz, chug, dov, eich, kir, la, mitz, pas, shen and tof.)

Figure 4.7 demonstrates how the sample rate of the signal after interpolation influences the relative error. It is possible to lower the core sample rate, but the relative error will climb. The hardware model simulations began to suffer from convergence issues when we increased $h_t$.

4.3 Determining the Variables' Representation

The cochlea computational model simulated in software naturally uses floating-point arithmetic. In our C++ program all the variables and parameters are represented in double-precision format. In the double-precision format, 64 bits are partitioned into three parts, $S$, $E$ and $f$:

| 1 bit $S$ | 11 bits $E$ | 52 bits $f$ |

The value of the floating-point number is:

$$
F = (-1)^{S} \cdot 1.f \cdot 2^{E - 1023}
$$

Compared to a fixed-point representation of the same word length, the range of representable floating-point numbers is larger, but the precision is smaller.
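The three fields can be inspected with a short sketch. The helper names are illustrative, and the reconstruction formula covers normalized numbers only (not zero, subnormals, infinities or NaN).

```python
import struct


def split_double(x):
    """Return the (S, E, f) bit fields of a double-precision number."""
    bits = int.from_bytes(struct.pack(">d", x), "big")
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return sign, exponent, fraction


def join_double(sign, exponent, fraction):
    """Evaluate F = (-1)^S * 1.f * 2^(E - 1023) for normalized numbers."""
    return (-1.0) ** sign * (1.0 + fraction / 2.0 ** 52) * 2.0 ** (exponent - 1023)


fields_one = split_double(1.0)     # sign 0, biased exponent 1023, zero fraction
fields_neg = split_double(-2.5)    # -2.5 = -(1.25) * 2^1
```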

Our goal is to determine the necessary word-lengths for the transformation of the

floating point version of the model into a version suitable for a hardware implemen-

tation as shown in figure 4.8.

The main problem when converting floating-point arithmetic to reduced floating-point or fixed-point arithmetic is the determination of the necessary numerical precision, that is, the word-length of the internal variable and parameter representations. Therefore, the hardware model was recoded in C++ using a self-developed scalable data type. This data type takes the internal word-length as a parameter and saves the values in exactly the same format as they would be saved in a register on an ASIC or FPGA, so the numerical effects of imprecise arithmetic can be simulated. To solve the problem of large word-length, the dynamic range of the signals had to be limited by lower and upper bounds. This implied a change in the dynamic behavior between the reduced floating-point or fixed-point version and the floating-point version. The effects of these changes on the behavior of the whole model were investigated.

Figure 4.8: The representation of the variables and parameters flow chart. (Flow: floating point, limiter for upper and lower bound, quantizer, then reduced floating point or fixed point.)
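The limiter and quantizer can be sketched as below. This is not the scalable data type of the thesis. It assumes that magnitudes below the lower bound are flushed to zero, that magnitudes above the upper bound saturate, and that the quantizer rounds the fraction of the mantissa to a given number of bits; the bounds in the examples are hypothetical.

```python
import math


def limit(x, lower, upper):
    """Saturate |x| at upper and flush |x| < lower to zero (assumed limiter behavior)."""
    mag = abs(x)
    if mag < lower:
        return 0.0
    return math.copysign(min(mag, upper), x)


def quantize_mantissa(x, bits):
    """Round x so that the fraction f of its 1.f mantissa keeps only `bits` bits."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                 # x = m * 2^e with 0.5 <= |m| < 1
    scale = 2.0 ** (bits + 1)            # m = 1.f / 2, hence one extra bit
    return math.ldexp(round(m * scale) / scale, e)


def reduced_float(x, bits, lower, upper):
    """Limiter followed by quantizer."""
    return quantize_mantissa(limit(x, lower, upper), bits)


small = reduced_float(1e-9, bits=8, lower=1e-5, upper=1e5)          # below the lower bound
rounded = reduced_float(1.0 + 2 ** -10, bits=8, lower=1e-5, upper=1e5)
clipped = reduced_float(-1e9, bits=8, lower=1e-5, upper=1e5)        # saturated, then rounded
```

Running the model with such a wrapper around every stored variable is one way to observe how a shorter mantissa affects the output.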

In the research work it was discovered that the dynamic range of the variables may even reach about 1000 dB. Figure 4.9 shows the histogram of the basilar membrane velocity $v_{bm}$ for all the sections along the basilar membrane, obtained from the calculation of the input signals.

The extremely large dynamic range of the variables is caused by numerical imprecision in the arithmetic. It was also found that most of the counts in the variables' histograms fall within a range of 160 to 200 dB. We can clearly see in Figure 4.9 that at around −100 dB there is a "knee shape" where fewer occurrences happen. This form of the graph lets us determine the lower and upper bounds shown in Table 4.3 for all of the variables.

Figure 4.9: Histogram of the basilar membrane velocity. (Number of occurrences versus $V_{bm}$ in dB.)

Though it is not noticeable in the variables' histograms, we also discovered that the dynamic range is not uniform across the sections. The sections close to the apex, which represent low frequencies, have a reduced dynamic range, whereas the sections close to the base have a large dynamic range. Figure 4.10 displays the histograms of the acceleration for three different ranges of sections. We can also see that the three histograms are located at different positions. We disregarded this fact since we seek a generic representation.

The only method to determine the optimal word-length and to validate the correct function of the model was to simulate various implementations with different word-lengths in the hardware model and observe the influence of the word-length on the performance of the application.

[Figure 4.10: Histograms of the acceleration for three different ranges of sections. Plot title: Acceleration Histogram for different sections. Horizontal axis: Acceleration [dB], from −100 to 150. Vertical axis: count, up to 2 × 10^4. Curves: sections 100−110, 250−260 and 400−410.]

Now that the variables' dynamic range was limited to no more than 200 dB, we simulated the representation of the mantissa with different bit resolutions. The exponent, which represents the dynamic range of 200 dB, can be implemented with only 5 bits, since

$$200\,\mathrm{dB} = 20\log_{10}(10^{10}) \approx 20\log_{10}(2^{32})$$

$$10^{10} \approx 2^{32}$$

$$32 = 2^{5}$$
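The arithmetic behind this estimate can be reproduced with a short sketch. The function names below are illustrative and do not come from the hardware model code. The sketch converts a dynamic range in dB into the number of powers of two it spans, and counts the bits needed to encode a given number of exponent values. Note that 200 dB corresponds to about 33.2 powers of two, which the text approximates by 32.

```python
import math

def octaves_for_db(dynamic_range_db):
    """Number of powers of two spanned by an amplitude range given in dB."""
    ratio = 10 ** (dynamic_range_db / 20)  # amplitude ratio, 200 dB -> 1e10
    return math.log2(ratio)

def exponent_bits(num_exponent_values):
    """Bits needed to encode that many distinct exponent values."""
    return math.ceil(math.log2(num_exponent_values))

print(round(octaves_for_db(200), 1))  # 33.2
print(exponent_bits(32))              # 5
```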

The mantissa was simulated for different numbers of bits, starting from 52 and going down to 22 bits. As seen from figure 4.11, a significant change in the output relative error is noticeable when the mantissa resolution is changed from 25 to 24 bits. (The constant multiplicands were represented by 35 bits in that figure.)

    Variables' lower and upper bounds

    Variable   Minimum     Maximum   Lower bound   Upper bound   dB
    vbm        1.4e-45     7.38e2    1.5e-7        1.5e3         200
    v'bm       1.87e-53    1.8e7     4.0e-3        4e7           200
    ξbm        1.75e-62    3.47e1    7.0e-9        7.0e1         200
    Pohc       2.17e-60    2.14e7    2.0e-6        2.0e3         180
    P'ohc      3.69e-55    4.43e8    2.0e-3        2.0e7         200
    g          1.0e-57     6.0e5     1.0e-4        1.0e6         200
    Pbm        4.41e-57    7.55e1    1.5e-8        1.5e2         200

Table 4.3: Lower and upper bounds for the cochlea hardware model variables.

The hardware model variables representation was changed to a reduced floating-

point version where the Exponent is represented by 5 bits and the Mantissa is rep-

resented by 25 bits. This word-length definition was verified in the hardware model

FPGA simulator (Chapter 5).
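The limiter and quantizer stages can be illustrated with a minimal sketch. It is not the self-developed C++ data type of the hardware model. It assumes a hidden leading one (so `mantissa_bits` counts fraction bits only), round-to-nearest, flushing of magnitudes below the lower bound to zero, and saturation at the upper bound. The default bounds are those of vbm in Table 4.3, and the function name is hypothetical.

```python
import math

def quantize_reduced_float(x, mantissa_bits=25, lower=1.5e-7, upper=1.5e3):
    """Limit |x| to [lower, upper] and round its mantissa to mantissa_bits.

    Assumptions (not taken from the thesis code): values below `lower`
    are flushed to zero, values above `upper` saturate, and rounding is
    to nearest.
    """
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    if mag < lower:          # lower-bound limiter
        return 0.0
    mag = min(mag, upper)    # upper-bound limiter
    m, e = math.frexp(mag)   # mag = m * 2**e with 0.5 <= m < 1
    # 1.f = 2*m, so keeping mantissa_bits fraction bits of 1.f means
    # keeping mantissa_bits + 1 bits of m.
    scale = 2 ** (mantissa_bits + 1)
    m_q = round(m * scale) / scale
    return sign * math.ldexp(m_q, e)

print(quantize_reduced_float(1e-9))   # 0.0 (below the lower bound)
print(quantize_reduced_float(-1e6))   # -1500.0 (saturated)
```

With round-to-nearest, the relative error of a value inside the bounds is at most 2^-(mantissa_bits + 1), which makes the sensitivity to a single mantissa bit easy to explore.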

A fixed-point word-length was also investigated. It is assumed that if the cochlea hardware model variables are bounded to 200 dB, then a 32-bit fixed-point representation would be enough. The fixed-point version was implemented on the hardware model with the delta architecture. The dynamic range was first limited by lower and upper bounds and was thereby reduced to the range of 140 to 160 dB, as seen in Table 4.4. The word-length for fixed-point was determined for each variable by simulating various implementations with different word-lengths until meeting our very tight restriction of not exceeding a 1% relative error. Table 4.4 lists all the model's variables. For each variable the lower and upper bounds and the required resolution are set. It is clear from Table 4.4 that the maximum number of bits is 30.

[Figure 4.11: The relative error for different quantization of the model variables for synthesized signals and Hab words. The constants are quantized to 35 bits. reg is the reference for double precision; we use the 5x3 configuration. First panel ("quantizing the parameters"): relative error from 0% to 1.6% for the signals chirp, sin, click and mitz, with series reg, 24, 25, 30, 35 and 40 bits. Second panel: relative error on a logarithmic axis from 0.1% to 100% for the words buz, chug, dov, eich, kir, la, mitz, pas, shen and tof, with series reg, 35, 25, 24 and 22 bits.]

    Fixed-point representation

    Variable   Lower bound   Upper bound   dB    Resolution   No. of bits
    d input    2^-17         2^-3          80    2^-18        21
    d ξbm      2^-34         2^-10         140   2^-36        26
    d vbm      2^-20         2^6           140   2^-22        28
    d Pohc     2^-24         2^4           160   2^-25        29
    d g        2^-10         2^16          140   2^-12        28
    d Pbm      2^-24         2^2           140   2^-26        28
    d P'ohc    2^-10         2^19          160   2^-12        31
    d v'bm     2^-10         2^20          160   2^-11        31
    vbm        2^-17         2^11          160   2^-19        30
    v'bm       2^-4          2^25          160   2^-6         31
    P'ohc      2^-4          2^24          160   2^-6         30

Table 4.4: Fixed-point representation for the hardware model with the delta and 5x3 architecture. The constant parameters are represented by 25-bit fixed-point.
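A fixed-point format of this kind can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: the helper names are hypothetical, rounding is to nearest, and the magnitude saturates just below the upper bound. For several rows of Table 4.4 the listed bit count equals the difference between the upper-bound exponent and the resolution exponent (for example vbm: 11 − (−19) = 30); this is an observation about the table, not a rule stated in the text.

```python
def fixed_point_bits(upper_exp, resolution_exp):
    """Bits needed to span from 2**resolution_exp up to 2**upper_exp."""
    return upper_exp - resolution_exp

def quantize_fixed(x, upper_exp, resolution_exp):
    """Round x to a multiple of 2**resolution_exp and saturate the magnitude."""
    step = 2.0 ** resolution_exp
    limit = 2.0 ** upper_exp - step       # largest representable magnitude
    q = round(x / step) * step            # round to the nearest step
    return max(-limit, min(limit, q))     # saturate instead of wrapping

print(fixed_point_bits(11, -19))          # 30, the vbm row
print(quantize_fixed(5000.0, 11, -19))    # saturates just below 2**11
```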


4.4 Timing Analysis

In order to analyze the timing of the system, we begin by introducing its basic building blocks. Figure 4.12 demonstrates the flow of the system. The processor, which is aimed at hearing-aid devices, may also carry out speech enhancement before a Vocoder in cellular communication. The input data is processed in parallel; such a processor is called a SIMD (single instruction, multiple data) processor [18]. The reconstruction algorithm gathers the data processed in parallel into one output signal.

[Figure 4.12: A basic flow of the system. Blocks: A/D, SIMD Processor, Reconstruction, D/A, Vocoder.]

The basic architecture of the processor is shown in Figure 4.13. The processor is composed of N ALUs, where N is the number of sections, currently N = 448. The N ALUs run in parallel, as the algorithm was fully parallelized. The ALUs perform two operations: addition and multiplication by 24-bit constants. The computations can be done in 32-bit fixed-point arithmetic or, alternatively, in a reduced floating-point representation consisting of a 25-bit mantissa and a 5-bit exponent. The constant values are stored in the memory block located on the right side of figure 4.13. The memory contains 7 constant values of 24 bits for each of the N sections. On the left side is the register file, which holds the variables' data. Under fixed-point representation, we keep 32-bit values of the variables at the past and current time points. For each of the N sections we use 7 variables. The controller of the processor manages the processing algorithm.

[Figure 4.13: The processor architecture. Blocks: register file (32x2x7xN), ALUs, constant parameter table (24x7xN), controller.]

The processor core data frequency is 1/ht, independent of the input sample rate, which usually ranges from 8 to 44.1 kHz. The core samples are obtained by interpolating the input signal samples, since a higher data resolution is needed. In our simulations we use ht = 1 µs, which leads to a core data rate of about 1 MHz. Simulations showed that ht could be increased up to 7.6 µs, which results in a core data frequency of 131 kHz. For the hardware design to comply with real-time application, the workload summarized in Table 3.8 must be executed within the time frame of ht. If processing a time point takes longer than the core sample period, a stack of pending samples grows until the processor overflows. Since the algorithm's stages are causal, we derived the critical path of the algorithm (shown in Table 3.10) for ht as a function of Titer and Piter.

    Total latency of critical path
    additions          2TP + 5T
    multiplications    TP + T

Table 4.5: Number of operations for the critical path (T = Titer, P = Piter).

This number of operations (additions and multiplications by a constant value) must be processed within a time frame of ht. In this analysis we do not take a pipeline implementation into account. We could plan a pipeline architecture built from Titer identical stages. With Titer stages we can accept a new input sample after one stage; thus we increase the processing rate by a factor of Titer or, equivalently, divide the critical path by Titer. A pipeline architecture, which is also called a linear array topology, certainly makes it easier to reach real-time application but introduces an area problem: the logic is duplicated and the silicon area increases. More silicon means higher manufacturing cost. Notice that the total processing delay of a sample does not change.

We minimize the computational latency by implementing fast addition and high-speed binary multiplication.

4.4.1 Fast Addition

The simplest adder is a ripple-carry adder, but it must wait for the carry to propagate through every stage. The carry look-ahead adder is the most commonly used scheme for accelerating carry propagation: it generates all incoming carries in parallel and avoids the need to wait until the correct carry propagates from the full-adder (FA) stage where it was generated. It provides a logarithmic speed-up.

The number of gates along the critical (longest) path (in other words, the number of circuit levels) determines the execution time of the algorithm.

In full-custom VLSI technology the exact number of gates has a very limited effect on the implementation cost. Instead, regularity of the design and length of interconnections are considerably more important, since they affect both the silicon area used by the adder and the design time. In our case, as seen in Figure 4.13, the basic ALU units are duplicated, making the design modular, and the number of interconnects is small since we only have connections between adjacent ALUs. The two factors (i.e., implementation cost and speed) do not necessarily achieve their minimum value in the same design, so a tradeoff between them might have to be found. There are many techniques to implement a carry-propagate adder. Some of the popular ones are the carry look-ahead adder, the conditional-sum adder, the Manchester adder and the carry-skip adder, which has recently become popular [23]. In VLSI technology the carry-skip adder is comparable in speed to the carry look-ahead technique while it requires less chip area and consumes less power.

In multiplication, when three or more operands are to be added simultaneously, we use carry-save addition. In carry-save addition we let the carry propagate only in the last step, while in all the other steps we generate a partial sum and a sequence of carries separately. Thus, a carry-save adder (CSA) accepts three n-bit operands and generates two n-bit results: an n-bit partial sum and an n-bit carry. A second CSA accepts these two bit sequences and another input operand, and generates a new partial sum and carry. A CSA is therefore capable of reducing the number of operands to be added from 3 to 2 without any carry propagation. A better way to organize the CSAs and reduce the operation time is in the form of a tree, commonly called a Wallace tree [14]. In this tree the number of operands is reduced by a factor of 2/3 at each level. Consequently,

$$\text{Number of Levels} \approx \frac{\log(k/2)}{\log(3/2)}$$

where k is the number of operands to be summed. In our case, we will need 5 levels of CSAs and one more CPA to sum 13 operands.
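The carry-save reduction can be illustrated in software with a minimal sketch. It models the arithmetic only, on non-negative Python integers of unbounded width, and the function names are illustrative. Each level groups the operands in threes, replaces every group by a partial sum and a carry word, and passes leftover operands through unchanged; the final two words would go to the CPA.

```python
def carry_save_add(a, b, c):
    """3-to-2 reduction: returns (partial_sum, carry) with a+b+c == sum+carry."""
    partial_sum = a ^ b ^ c                       # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1    # majority bits, shifted left
    return partial_sum, carry

def wallace_reduce(operands):
    """Reduce the operands to two words, counting the CSA levels used."""
    ops = list(operands)
    levels = 0
    while len(ops) > 2:
        full = len(ops) - len(ops) % 3            # operands that form triples
        reduced = []
        for i in range(0, full, 3):
            reduced.extend(carry_save_add(ops[i], ops[i + 1], ops[i + 2]))
        reduced.extend(ops[full:])                # leftovers pass through
        ops = reduced
        levels += 1
    return ops, levels

pair, levels = wallace_reduce(range(1, 14))       # 13 operands: 1..13
print(levels, sum(pair))                          # 5 91
```

For 13 operands the operand count goes 13, 9, 6, 4, 3, 2, which gives the 5 levels quoted above.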

Carry-save addition can be very useful in our design since we can represent the variables and store them in carry-save format. The final CPA (carry-propagate adder) stage, which is also the slowest, could then be omitted. In Figure 4.14 we uncoiled the hardware algorithm for one time-iteration. The basic building blocks of adders and multipliers are illustrated.

4.4.2 High Speed Multiplication

There are two ways to speed up multiplication: reduce the number of partial products, or accelerate their accumulation [23]. All multiplication methods share the same basic procedure, the addition of a number of partial products. The simple methods are easy to implement, but the more complex methods are needed to obtain the fastest possible speed.

The simplest method of adding a series of partial products is based upon an adder-accumulator. This is relatively slow, because adding N partial products requires N clock cycles. A faster version of the basic iterative multiplier adds more than one operand per clock cycle by having multiple adders and partial-product generators connected in series.

When a number of partial products are to be added, the adders need not be connected in series but can instead be connected to maximize parallelism. This requires no more hardware than a linear array, but does have more complex interconnections. The time required to add N partial products is now proportional to log N, so this can be much faster for larger values of N [8]. On the down side, the extra complexity in the interconnection of the adders may contribute to additional size and delay. Probably the single most important advance in improving the speed of multipliers, pioneered by Wallace [14], is the use of carry-save adders to add three or more numbers in a redundant, carry-propagate-free manner. By applying the basic three-input adder recursively, any number of partial products can be added and reduced to 2 numbers without a carry-propagate adder. A single carry-propagate addition is needed only in the final step, to reduce the 2 numbers to a single final product.

All of the different methods of implementing integer multipliers reduce to two basic steps: create a group of partial products (PP), then add them up to produce the final product. There are a number of different methods for producing partial products. The simplest partial product generator produces N partial products, where N is the length of the input operands. A recoding scheme introduced by Booth [5], and also explained in Appendix C, reduces the number of partial products by a factor of two. Since the amount of hardware and the delay depend on the number of partial products to be added, this may reduce the hardware cost and improve performance. In our circuit implementation, we have to be able to generate the following partial products: 0, X, −X, 2X and −2X, where X is the multiplicand. We can obtain these easily by including circuits for negating and for shifting left by one bit position. In our case, the multipliers are 24-bit constants, which can be recoded by Booth's algorithm to create only 12 partial products.
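The recoding can be sketched in software as follows. This is a generic radix-4 (modified) Booth recoding written for illustration, not the circuit of Appendix C; the function names are hypothetical, and the constant is interpreted as a two's complement number of `bits` bits. Each pair of multiplier bits, together with the top bit of the previous pair, selects one digit from {−2, −1, 0, 1, 2}, so a 24-bit constant yields 12 partial products.

```python
def booth_radix4_digits(multiplier, bits=24):
    """Recode a two's complement multiplier into bits/2 digits in {-2..2}."""
    if bits % 2:
        raise ValueError("bits must be even for radix-4 recoding")
    u = multiplier & ((1 << bits) - 1)    # raw bit pattern
    digits = []
    prev = 0                              # bit to the right of the current pair
    for i in range(0, bits, 2):
        b0 = (u >> i) & 1
        b1 = (u >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits                         # least significant digit first

def booth_multiply(x, multiplier, bits=24):
    """Multiply x by the constant via its Booth partial products."""
    digits = booth_radix4_digits(multiplier, bits)
    # Each partial product is 0, +-X or +-2X, shifted by 2 bits per digit.
    partial_products = [(d * x) << (2 * i) for i, d in enumerate(digits)]
    return sum(partial_products), len(partial_products)

product, count = booth_multiply(1000, 0x2AAAAA)
print(count, product == 1000 * 0x2AAAAA)  # 12 True
```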

Our multiplication operation is translated into the summation of 12 partial products of 32 bits, which requires 5 levels of carry-save adders (CSAs) and one carry-propagate adder (CPA) at the final step in case a single output is wanted. Using a Wallace tree or other forms of binary trees requires at least 4 CSAs in parallel at the first level. Each of the 4 CSAs has 3 inputs, for a total of 12 inputs. This way, 12 partial products of 32 bits can be added in parallel. The basic block of our ALU consists of 32 × (4 CSAs) and a 32-bit CPA.

The delay of a carry-save adder, which is marked as "CMPR32" in the TSMC 0.18 µm process standard cell library [1], is about 0.35 ns, depending on the load capacitance. The delay is calculated in Appendix D, Equation D.0.2. We assume that the total delay of a multiplier includes 5 levels of carry-save adders and, in the cases where a single output is wanted, one carry-propagate adder. Since a CPA can be implemented with logarithmic delay, as in the carry look-ahead adder, we bound its number of delay levels by log2(32) = 5. We bound the total delay of our multiplier by the following equation:

$$T_{mult} = 10 \cdot T_{add} \qquad (4.4.1)$$

The numbers of additions and multiplications needed in the critical (longest) path of the hardware model were given in Table 4.5. Using the ratio given in Equation 4.4.1, we compute the total number of additions in the critical path:

$$\text{Total Additions} = T_{iter}(2P_{iter} + 5) + 10 \cdot T_{iter}(P_{iter} + 1) = T_{iter}(12P_{iter} + 15)$$

This should be done in a time period of ht. Thus, the delay for one addition must not exceed Tadd, which is a function of ht, Titer and Piter:

$$T_{add} = \frac{h_t}{T_{iter}(12P_{iter} + 15)} \qquad (4.4.2)$$

and the frequency of the addition fadd is set by:

$$f_{add} = 1/T_{add}$$

In the case of a pipeline architecture we multiply Tadd by Titer.

In Table 4.6 we can see Tadd and fadd for different configurations of the hardware model as a function of ht, Titer and Piter.

    Tadd and fadd for different hardware model configurations

    Configuration   N     ht [Sec]   Titer   Piter   Tadd [ns]   fadd [MHz]   avg. rel. err. [%]
    1               448   2^-20      3       5       4.24        235.93       0.5
    2               448   2^-20      3       3       6.23        160.43       1.2
    3               448   2^-20      2       3       9.35        106.95       1.4
    4               448   2^-18      3       5       16.95       58.98        1.8
    5               448   2^-17      3       5       33.91       29.49        6.4
    6               448   2^-17      2       3       74.80       13.37        12

Table 4.6: Tadd and fadd for different hardware model configurations.
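The entries of Table 4.6 follow directly from Equation 4.4.2. The sketch below recomputes them; the function names are illustrative, and the cost of 10 additions per multiplication is the bound of Equation 4.4.1.

```python
def critical_path_additions(t_iter, p_iter, mult_cost=10):
    """Critical path of Table 4.5 expressed in equivalent additions."""
    additions = t_iter * (2 * p_iter + 5)
    multiplications = t_iter * (p_iter + 1)
    return additions + mult_cost * multiplications   # = t_iter*(12*p_iter+15)

def t_add_ns(ht_seconds, t_iter, p_iter):
    """Maximum delay of one addition in nanoseconds (Equation 4.4.2)."""
    return ht_seconds * 1e9 / critical_path_additions(t_iter, p_iter)

def f_add_mhz(ht_seconds, t_iter, p_iter):
    """Addition frequency in MHz, f_add = 1 / T_add."""
    return 1e3 / t_add_ns(ht_seconds, t_iter, p_iter)

# Configurations 1 and 6 of Table 4.6
print(round(t_add_ns(2**-20, 3, 5), 2), round(f_add_mhz(2**-20, 3, 5), 2))  # 4.24 235.93
print(round(t_add_ns(2**-17, 2, 3), 2), round(f_add_mhz(2**-17, 2, 3), 2))  # 74.8 13.37
```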

To conclude, in the worst case the delay of one addition must not exceed Tadd = 4.24 ns. This is certainly feasible, as we saw that the delay of a 3-to-2 adder ("CMPR32") or an FA is ∼0.35 ns.

Our analysis is based on the data of a 0.18 µm process library. Today, semiconductor companies such as Intel and IBM are moving to smaller, faster and less power-consuming technologies such as 0.13 µm and 0.09 µm. We can assume that the performance, timing and power consumption of our hardware design would improve by about 20 to 30% with the new technology.

[Figure 4.14: ASIC design uncoiled. The diagram shows the registers of section i (disp, veloc, acce, ohcp, ohcp_d, g and pressure, each with a Past(n) and a Next(n+1) copy, stored with a 25-bit mantissa and a 5-bit exponent), replicated for 448 sections, and the Eunit, Gunit, Punit, Dunit and MEunit built from adders and constant multipliers. New time step: 1 microsec. A note in the diagram reads: "The additions can be made earlier, except for the prev. acce from the stage before."]

4.5 Power Consumption Analysis

The issue of power consumption is most important when we talk about portable hearing aids or mobile communication, since the energy resources of portable devices are limited. In this section we analyze the power consumption of our processor.

Power dissipation depends on the power-supply voltage, frequency of operation, internal capacitance and output load. We used the TSMC 0.18 µm process standard cell library to calculate the power consumption of a 3-to-2 counter called "CMPR32", which is like an FA. The standard cell library is designed to dissipate only AC power. The power dissipation is primarily a function of the switching frequency of the design's internal nets. These nets include the inputs and outputs of each cell and the capacitive load associated with the output of each cell. The power dissipated by each cell according to the TSMC 0.18 µm process [1] is:

$$P_{avg} = \sum_{n=1}^{x} (E_{in} \cdot f_{in}) + \sum_{n=1}^{y} \left(C_{on} \cdot V_{dd}^{2} \cdot \tfrac{1}{2} f_{on}\right) + E_{os} \cdot f_{o1} \qquad (4.5.1)$$

where,

• Pavg = average power (µW).

• x = number of input pins.

• Ein = energy associated with the nth input pin (µW/MHz).

• fin = frequency at which the nth input pin changes state during the normal operation of the design (MHz).

• y = number of output pins.

• Con = external capacitive loading on the nth output pin, including the capacitance of each input pin connected to the output driver, plus the route wire capacitance, actual or estimated (pF).

• Vdd = operating voltage.

• fon = frequency at which the nth output pin changes state during the normal operation of the design (MHz).

• Eos = energy associated with the output pin for sequential cells only (µW/MHz).

In order to calculate the power consumption of the cell "CMPR32", we used equation 4.5.1 and the data from [1]. We assumed that, statistically, fin and fon are the same and equal to half of fadd.

$$P_{avg} = (0.1126 + 0.1450 + 0.06) f_{in} + (0.005 \cdot 1.8^{2} \cdot \tfrac{1}{2} \cdot f_{on}) = 0.167 \cdot f_{add}(\mathrm{MHz}) \ [\mu W] \qquad (4.5.2)$$

The average power consumption of "CMPR32" is 0.167 [µW/MHz].

We examined the issue of power consumption using two approaches. Our first approach is from the implementation point of view. In the previous section we found the basic components of our ALU and its frequency, fadd (Equation 4.4.2). The approximation of the combinatorial logic power consumption is derived from the average number of adders working at frequency fin = fadd, times the operand's number of bits (32) and the number of sections N. As discussed in the previous section, the first level of the multiplier consists of 4 CSAs and the fifth level consists of 1 CSA. We chose the average number of adders working at the same time for 1 bit to be 2.

$$P_{avg} = 0.167 \cdot f_{in}(\mathrm{MHz}) \cdot N \cdot 32_{bit} \cdot 2_{csa} \ [\mu W] \qquad (4.5.3)$$

As seen from Equation 4.5.3, Pavg is a function of N and fin, which in turn is a function of ht, Titer and Piter. We calculated the power consumption of the processor for different configurations using Equation 4.5.3, as seen in Table 4.7. The different configurations give a power consumption ranging from 1.13 down to 0.06 Watts.

    Pavg for different hardware model configurations

    Configuration   N     ht [Sec]   Titer   Piter   fadd [MHz]   Pavg [W]   avg. rel. err. [%]
    1               448   2^-20      3       5       235.93       1.13       0.5
    2               448   2^-20      3       3       160.43       0.77       1.2
    3               448   2^-20      2       3       106.95       0.51       1.4
    4               448   2^-18      3       5       58.98        0.28       1.8
    5               448   2^-17      3       5       29.49        0.14       6.4
    6               448   2^-17      2       3       13.37        0.06       12

Table 4.7: Power consumption for different hardware model configurations.

[Figure 4.15: Estimated power consumption vs. relative error. Horizontal axis: average relative error (%), from 0 to 14. Vertical axis: estimated power consumption (W), from 0 to 1.4. Series: first approach, second approach, and a fitted pattern labelled y = 0.6639 x^(−0.9071).]

We must take into account that the relative error grows when fewer iterations are applied. Figure 4.15 demonstrates the relationship between the estimated power consumption and the relative error of the hardware model simulation, depending on its configuration.
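Equation 4.5.3 can be checked against Table 4.7 with a few lines of code. The sketch below takes the 0.167 µW/MHz coefficient of Equation 4.5.2 as given; the function and parameter names are illustrative.

```python
UW_PER_MHZ = 0.167  # average power of one "CMPR32" cell, from Equation 4.5.2

def power_first_approach_watts(f_add_mhz, n_sections=448, word_bits=32,
                               avg_adders_per_bit=2):
    """Equation 4.5.3, converted from microwatts to watts."""
    microwatts = (UW_PER_MHZ * f_add_mhz * n_sections
                  * word_bits * avg_adders_per_bit)
    return microwatts * 1e-6

# f_add values of configurations 1, 4 and 6 in Table 4.7
for f_add in (235.93, 58.98, 13.37):
    print(round(power_first_approach_watts(f_add), 2))  # 1.13, 0.28, 0.06
```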

Our second approach to approximating the processor's power consumption is to look at the total workload which needs to be processed in a time period of ht by 32 full adders. We know from Table 3.8 that the total number of additions and multiplications is:

    Additions          (2 Piter Titer + 12 Titer − 3)(N − 1)
    Multiplications    (Titer Piter + 4 Titer)(N − 1)

We assume the multiplications are done serially by a multiply-accumulator (MAC) and that it takes 12 steps to sum 12 partial products with 32 FAs. Using the assumption

$$T_{mult} = 12 \cdot T_{add} \qquad (4.5.4)$$

we compute the theoretical frequency and the power consumption (with T = Titer and P = Piter):

$$f_{theo} = [(12T + 2PT - 3)N + 12N(4T + TP)]/h_t \ [\mathrm{MHz}]$$

$$P_{avg} = 32 \cdot 0.167 \cdot f_{theo}(\mathrm{MHz}) \ [\mu W] \qquad (4.5.5)$$

The second approach yields almost the same average power consumption for all the different hardware model configurations, as seen in figure 4.15.
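The second estimate can be evaluated in the same way. This sketch implements the expression for ftheo and Equation 4.5.5 as written above, with ht given in microseconds so that the operation rate comes out in MHz; the names are illustrative, and the numerical result is not a value quoted in the thesis.

```python
def f_theo_mhz(ht_us, t_iter, p_iter, n_sections=448, mult_cost=12):
    """Equivalent additions per microsecond (MHz) for one time step."""
    additions = (12 * t_iter + 2 * p_iter * t_iter - 3) * n_sections
    multiplications = (4 * t_iter + t_iter * p_iter) * n_sections
    return (additions + mult_cost * multiplications) / ht_us

def power_second_approach_watts(ht_us, t_iter, p_iter, n_sections=448):
    """Equation 4.5.5, converted from microwatts to watts."""
    return 32 * 0.167 * f_theo_mhz(ht_us, t_iter, p_iter, n_sections) * 1e-6

ht_us = 2**-20 * 1e6   # configuration 1: ht = 2^-20 s, Titer = 3, Piter = 5
print(power_second_approach_watts(ht_us, 3, 5))
```

For configuration 1 the result is of the same order as the 1.13 W obtained with the first approach.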

Chapter 5

FPGA Design and Simulation

In this chapter we present an implementation of the hardware model algorithm in a field-programmable gate array (FPGA) [2] simulator, since an ASIC design was impossible to carry out within the university framework.

FPGA technology enjoys the following advantages over analog and digital VLSI (ASICs): shorter design time, faster fabrication time, greater robustness to power supply, temperature and transistor mismatch variations, wider dynamic range and higher signal-to-noise ratios, better stability, reusability of the chips for different applications, and a simpler interface.

5.1 The Design

The design architecture chosen for the implementation of the hardware model algorithm was dictated by the FPGA size. Since the number of FPGA cells is limited, it was impossible to create a fully parallelized architecture implementing 448 ALUs in parallel. The 448 sections were divided into 14 segments of 32 sections each. Our implementation included 32 ALUs in parallel, which could compute only 32 sections at a time. The ALUs are able to multiply, add and shift. In each clock cycle one of these operations is performed. There are no dead cycles. It was important for us that the design be generic and that parameters could be changed easily. The design works in a SIMD (single instruction, multiple data) topology. The ALUs operate according to a configured instruction code. The FPGA design based on parallel ALUs is shown in figure 5.1 [6].

[Figure 5.1: A scheme of the ALU architecture implemented in the FPGA. Blocks: REG bank A and REG bank B, each (16x448/ALU_NUM) x (ALU_NUM x Sample Width); ALU (ALU_NUM x Sample Width); PROGRAM; CLK; G edge manipulation; P shift.]

We hold the 16 parameters and variables for each of the 448 sections in two Register Banks. The contents of the Register Banks are displayed in Table 5.1.

    Address   Left Memory Bank                 Right Memory Bank
    0         Tmp2                             Tmp1
    1         Prev veloc                       Veloc
    2         Prev disp                        Disp
    3         Prev acce                        Acce
    4         Prev ohcp                        Ohcp
    5         Prev ohcp derv                   Ohcp derv
    6         P                                P
    7         −K/2 (const. vector)             G
    8         −RK (const. vector)              ht (constant)
    9         −K/C (const. vector)             2^20
    10        −1/dx^2 (constant)               ht/2 (constant)
    11        −1/(dx^2 + K) (const. vector)    −2^1
    12        −w0
    13        −Kohc w1 (const. vector)
    14        Kohc
    15        0 (const)

Table 5.1: The contents of the FPGA Register Banks.

The number of clock cycles needed to compute one time sample is given by the following expression:

$$\text{Clock Cycles} = \frac{448}{ALU\_NUM}\,\bigl(PreE + E + T_{iter}P_{iter} \times P + T_{iter}(G + D) + (T_{iter} - 1)ME\bigr) \qquad (5.1.1)$$

where Piter and Titer are the Punit and time iterations, respectively, ALU_NUM is the number of ALUs placed in parallel, and E, G, P, D, ME and PreE are the numbers of clock cycles (instructions) of each unit in the algorithm. The number of clock cycles (CCs) depends on the number of instructions for every unit in the algorithm. The ALU commits one instruction every clock cycle. The complete instruction code and the number of instructions are displayed in Appendix E. We summarize the number of clock cycles per unit in Table 5.2.

    Number of clock cycles per unit

    Unit         Clock Cycles
    Eunit        6
    Gunit        5
    Punit        4
    Dunit        9
    MEunit       9
    Pre-Eunit    5

Table 5.2: Number of instructions per unit in the FPGA design.

Using Equation 5.1.1 and Table 5.2, the number of clock cycles needed for our basic configuration, where ALU_NUM = 32, Titer = 3 and Piter = 5, is 1834 CCs. The period of the clock cycle was determined by the longest instruction (operation). If ALU_NUM were 448, it would take 131 CCs to compute a time point. In this case, a real-time application would need to run at a frequency of 131 MHz when ht = 1 µs.
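Equation 5.1.1 is easy to evaluate for other configurations. The sketch below uses the per-unit instruction counts of Table 5.2; the function name and the dictionary layout are illustrative, and ALU_NUM is assumed to divide 448.

```python
# Clock cycles (instructions) per unit, from Table 5.2
UNIT_CYCLES = {"E": 6, "G": 5, "P": 4, "D": 9, "ME": 9, "PreE": 5}

def clock_cycles(alu_num, t_iter, p_iter, n_sections=448, units=UNIT_CYCLES):
    """Clock cycles needed to compute one time sample (Equation 5.1.1)."""
    per_segment = (units["PreE"] + units["E"]
                   + t_iter * p_iter * units["P"]
                   + t_iter * (units["G"] + units["D"])
                   + (t_iter - 1) * units["ME"])
    return (n_sections // alu_num) * per_segment

print(clock_cycles(32, 3, 5))    # 1834
print(clock_cycles(448, 3, 5))   # 131
```

With one time point per microsecond, 131 clock cycles per time point correspond to the 131 MHz quoted above.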

Figure 5.2 shows a waveform of the controller state machine which controls the ALUs. The main states of the state machine are the basic units of the hardware algorithm: Eunit, Gunit, Punit, Dunit, MEunit and PREEunit. Each state is composed of several instructions, as explained in Appendix E. The instruction being committed is determined by the InstructionNum counter. RepetitionNum represents the number of the segment being processed, between 0 and 448/ALU_NUM. The parameters time-iteration (Titer) and punit-iteration (Piter) are also displayed.

[Figure 5.2: A waveform of the state machine of the FPGA design.]

The reduced floating-point definition determined in the hardware model was used for the design of the FPGA. The parameters and variables are represented by a 25-bit mantissa and a 6-bit exponent.

5.2 Simulation Results

The cochlea implementation was verified by comparing the results of the VHDL simulator with the results of the hardware model in C using the reduced floating-point representation. The input signals tested in the simulations were a sine at 4 kHz and a chirp from 500 Hz up to 8 kHz. The input signals, prepared beforehand, were applied at a rate of 1 mega-sample per second.

We calculated the relative error between the VHDL and C implementations using equation 4.0.1. Table 5.3 summarizes the relative errors for the tested signals.

    signal    relative error [%]
    chirp     0.0551
    sin4k     0.0462

Table 5.3: C vs. VHDL implementations relative errors.

Figure 5.3 demonstrates the energy of the output signal per section for the chirp input signal. The energy is calculated by:

$$\mathrm{Energy}(x) = \sum_{n=0}^{T} v_{bm}^{2}(x, n) \qquad (5.2.1)$$

where vbm is the output signal matrix, x represents the section number and T represents the signal's duration (number of samples).

[Figure 5.3: The energy of the output for different sections for the chirp input. Plot title: Signal Energy vs. section. Horizontal axis: section number, from 0 to 450. Vertical axis: Energy [dB], from −150 to 100. Green, solid: C implementation; blue, dotted: VHDL implementation.]

So close is the fit of the FPGA result to the hardware model written in C that one can scarcely tell that two lines are plotted rather than one. At about section 350 we can see a rapid drop in the energy values of the C implementation. This change occurs because the upper and lower limiters implemented in the C code were not inserted into the FPGA design. Thus, at the higher sections, which represent the lower frequencies, the amplitudes are low and for many time points the C hardware model zeros the velocity values for these sections.
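The per-section energy of Equation 5.2.1 can be computed as in the following sketch. The output matrix is assumed to be stored as one list of velocity samples per section, and the conversion to dB uses 10·log10 with an arbitrary small floor to avoid taking the logarithm of zero; both choices, like the function names, are assumptions made for illustration, since the reference level of Figure 5.3 is not stated.

```python
import math

def section_energy(v_bm):
    """Energy(x) = sum over n of v_bm[x][n]**2, one value per section."""
    return [sum(v * v for v in samples) for samples in v_bm]

def energy_db(energy, floor=1e-300):
    """Energy in dB; the floor only guards against log10(0)."""
    return 10.0 * math.log10(max(energy, floor))

# Two toy sections with three time samples each
v_bm = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
energies = section_energy(v_bm)
print(energies)                 # [14.0, 0.0]
print(energy_db(100.0))         # 20.0
```

A section whose samples are all zeroed by the limiter, like the second toy section, falls to the floor value in dB, which appears as a sharp drop in a plot of this kind.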

The implementation of the hardware model definition in the FPGA simulator was verified against the C implementation. We saw that the plotted lines of the VHDL and C implementations fit together. The slight differences between the two implementations are caused by the floating-point rounding scheme and the lower-bound limiter applied in the C code.

In order to check the feasibility of the implementation in an FPGA we synthesized the VHDL code. The FPGA is composed of basic logic cells called LUTs (look-up tables) and FFs (flip-flops). Recent FPGAs also include dedicated blocks such as multipliers (18x18 bits), MACs (multiply-accumulators), memories and embedded CPUs (ARM in Altera and PowerPC in Xilinx). An estimate of the hardware resources for an adder and a multiplier, displayed in Table 5.4, was derived by synthesizing these components.

    Component            Logic                      Note
    Adder                1316 LCs                   LC = LUT + FF; FF not in use.
    Multiplier (25x25)   4 MULT(18x18), 590 LCs     using special-purpose multiplier.

Table 5.4: Amount of logic needed for an adder and multiplier in FPGA.

In order to implement an architecture where ALU_NUM = 448, 448 multipliers of 25x25 bits are required. In two of the most recent FPGAs, the Xilinx Virtex2Pro (XC2VP100) and the Altera Stratix2 (EP2S180), there are 111 and 96 multipliers, respectively. For the algorithm to run in real time we must either have 4−5 FPGAs in parallel (448/111 ≈ 4.0 and 448/96 ≈ 4.7) or multiply the FPGA's basic frequency, which is 131 MHz, by 4 or 5. The floating-point multiplier model in the Xilinx Virtex2Pro family may work at 100 MHz, but since we need a 25x25-bit multiplier, two levels of multipliers are required, so it works at a rate of only 50 MHz. It seems that a real-time implementation of the hardware algorithm in one FPGA is not feasible for now; nevertheless, it may become feasible by using 4 to 5 FPGAs and improving the computation rate, mainly by improving the FPGA design. One idea for improving the design is the use of a pipeline.

5.3 Summary

A VHDL implementation of the hardware model has been designed using an instruction queue that drives an array of ALUs. The design was simulated and verified in an FPGA simulator. The simulation results were compared to the hardware model in C using the reduced floating point representation. The VHDL code was synthesized in order to check the feasibility of running it in real time. For the moment, an FPGA cochlea model can be implemented, but not in real time. A real-time application could be made possible by using more FPGAs and by improving the design.

Chapter 6

Discussion

The processing of the cochlear representation introduces massive computations into the speech enhancement system. In this study, we focused on designing the main block (the processor), which accounts for most of the computational load. Although other blocks, such as the interpolation and reconstruction blocks, participate in the processing sequence, their contribution to the time delay and power consumption is negligible.

In this research, the solution of the one-dimensional cochlear model [4] was modified to best fit a hardware implementation. A parallel solution for the algorithm was introduced and evaluated. Because the hardware solution consists of two iterative numerical methods, several configurations were possible that obtained a relative error of less than 1% compared to the original algorithm on a set of tested stimuli.

We have narrowed down the number of configurable parameters to only four: $h_t$, $N$, $T_{iter}$ and $P_{iter}$. The computational resolution is defined by $h_t$ and $N$, which represent the sampling period of the processor's core data and the membrane's resolution (number of sections), respectively. The numbers of iterations of the numerical methods are represented by $T_{iter}$ and $P_{iter}$. The convergence and truncation errors of the numerical methods depend on the number of iterations: when more iterations are applied, the relative output error decreases, but the latency and workload increase. The parameters $h_t$, $N$ and $T_{iter}$ also apply to the software solution of the algorithm. The parameters $h_t$ and $T_{iter}$ were fixed in the hardware solution. The timing performance (processing latency), power consumption (total workload) and functional correctness (relative output error) are determined by the configuration.

Each configuration was evaluated using three criteria: the error relative to the original solution, the clock frequency and the power consumption. These three criteria were the most important for our design considerations. We estimated that a clock frequency of up to 250 MHz and a power consumption between 0.06 and 1.13 W would be needed to achieve reasonable functional performance in real time.

We used TSMC’s 0.18µm process standard cell library databook as a reference.

This technology is not the most up to date, and there are newer technologies, such as

0.09µm, which would improve the timing, power and area size by about 20 to 30%.

Many other criteria were not considered, for example the complexity, size and layout of the chip. A modular design is preferable because it is less complex to implement and is usually more area efficient. In our architecture, the $N$ ALU blocks are laid out in parallel, which makes the design modular and easier to lay out. The total area was not computed since it is technology dependent. Silicon interconnect plays a major role in design architectures nowadays, as transistors get smaller and faster. It is very hard to forecast wire congestion problems, but our parallel solution certainly minimizes them, since communication takes place only between adjacent ALUs.

The system could be designed as a system on a chip (SoC) or could be partitioned into a couple of independent components. The IO interface is an important issue in chip design. A chip can be pad-limited or core-limited. When a chip is pad-limited, the given silicon area cannot contain all of the IO interface (pads). Therefore, integrating the reconstruction algorithm and the cochlear model algorithm on one chip is preferable. The reconstruction algorithm uses the cochlear model output, performing multiply and add operations on the matrix columns at an output rate of no more than 50 kHz.

The total tolerable delay in speech applications is about 25 ms. Our bottleneck is the high core data rate of the processor, set by $h_t$, which is about 1 µs. Under real-time performance, the processor time delay would be $h_t$, leaving $(0.025 - h_t)$ s for the rest of the system's operations.

Different criteria exist for evaluating various speech applications such as hearing aids, cellular communication and voice recognition. The common evaluation tests applied today are hearing tests; the Mean Opinion Score (MOS) [38], a hearing test in which a listener grades speech quality on a scale of 1 to 5; and Perceptual Evaluation of Speech Quality (PESQ) [39], software that imitates MOS testing for telecommunications. In our work, we set the most stringent restriction: no more than 1% relative error was allowed between the hardware and the original output matrices. In this case, we relied on the correctness of the original software solution.
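
To make the 1% criterion concrete, the following Python sketch computes one possible relative error between two output matrices. It assumes a Frobenius-norm ratio, which may differ from the exact error definition used earlier in this work; the function name `relative_error` and the sample matrices are hypothetical.

```python
import math


def relative_error(reference, candidate):
    """Relative error ||candidate - reference|| / ||reference||.

    Both arguments are matrices given as lists of equally sized rows.
    The Frobenius norm is an assumption made for this sketch.
    """
    if len(reference) != len(candidate):
        raise ValueError("matrices must have the same number of rows")
    diff_energy = 0.0
    ref_energy = 0.0
    for ref_row, cand_row in zip(reference, candidate):
        if len(ref_row) != len(cand_row):
            raise ValueError("rows must have the same length")
        for ref_value, cand_value in zip(ref_row, cand_row):
            diff_energy += (cand_value - ref_value) ** 2
            ref_energy += ref_value ** 2
    if ref_energy == 0.0:
        raise ValueError("reference matrix must not be all zeros")
    return math.sqrt(diff_energy / ref_energy)


if __name__ == "__main__":
    # Hypothetical data: a candidate that deviates by 0.5% everywhere.
    original = [[1.0, 2.0], [3.0, 4.0]]
    hardware = [[value * 1.005 for value in row] for row in original]
    error = relative_error(original, hardware)
    print(f"relative error: {error:.4%}, accepted: {error <= 0.01}")
```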


The reconstructed speech signals for the original and hardware models were evaluated using hearing tests. We have not yet evaluated these output signals using the MOS or PESQ testing methods. The test stimuli were composed of synthetic signals and a set of recorded Hebrew words. Although recognizing monosyllabic words in hearing tests is harder than recognizing meaningful sentences, the addition of sentences to the hearing test should be considered.

Many questions remain open regarding the correct evaluation criteria for the original speech enhancement system, including the comparison between the hardware and the software models. We suggest that future work compare the different hardware configurations using other testing criteria. For different applications with different evaluation criteria, it may turn out that fewer iterations, a reduced number of sections and a lower data rate are sufficient. Thus, the matrix relative error may not be the only criterion for deciding which configuration is best for our processor. Moreover, different configurations can be used for different applications. Each configuration results in different timing performance and power consumption.

The hardware model was designed for a field programmable gate array (FPGA) chip, using an ALU-oriented architecture (Chapter 5). The design was synthesized, but real-time performance was not achieved under the FPGA limitations. Other hardware implementations were also considered. We turned to ASIC design (Chapter 4), which imposes fewer design limitations and allows a more diversified design.

This research was the first step towards the implementation of a newly proposed speech enhancement algorithm. We have discovered the drawbacks of the algorithm with respect to its hardware implementation, where timing performance and power efficiency are design considerations. The large number of partitions, $N$, and the high computational rate, $1/h_t$, introduce massive computations. These drawbacks should be examined and taken into account in the development of the next version of the algorithm. Future research should consider a new mathematical approach, using a bank of (FIR) filters instead of the current iterative numerical solution. This could make the algorithm more attractive for hardware implementation. Future studies should also examine the algorithm (original and hardware) with a broader database and evaluate the reconstructed speech signals using different evaluation techniques.

Appendix A

List of Symbols and parameters

Symbol        Definition                                         Units
P             Pressure across the cochlear partition             kg/(m·sec^2)
P_t(x,t)      Pressure in scala tympani                          kg/(m·sec^2)
P_v(x,t)      Pressure in scala vestibuli                        kg/(m·sec^2)
U_t(x,t)      Scala tympani fluid velocity for unit area         m/sec
U_v(x,t)      Scala vestibuli fluid velocity for unit area       m/sec
ρ             Perilymph density                                  kg/m^3
ξ_bm(x,t)     Basilar membrane vertical displacement             m
A(x)          Scalae cross section area                          m^2
β(x)          Basilar membrane width                             m
P_ohc         OHC pressure contribution                          kg/(m·sec^2)
P_bm          Pressure obtained by basilar membrane              kg/(m·sec^2)
m(x)          Basilar membrane mass per unit area                kg/m^2
r(x)          Basilar membrane damping per unit area             kg/(m^2·sec)
s(x)          Basilar membrane stiffness per unit area           kg/(m^2·sec^2)
ψ             Basolateral membrane voltage drop                  volt
ψ_0           Equivalent electrochemical gradient                volt
G_a           OHC apical membrane conductance                    1/ohm
C_a           OHC apical membrane capacitance                    amp·sec/volt
G_b           OHC basolateral membrane conductance               1/ohm
C_b           OHC basolateral membrane capacitance               amp·sec/volt
Δℓ_ohc        OHC elongation                                     m
F_ohc         OHC force applied to the basilar membrane          kg/(m·sec^2)
K_ohc         OHC stiffness                                      kg/sec^2
γ             Relative density of healthy OHCs per unit area     1/m^2
K_0           OHC load stiffness                                 kg/(m^2·sec^2)
ω_cf          Characteristic angular frequency                   rad/sec
Z             Impedance                                          kg/(m^2·sec)

Table A.1: List of symbols

Appendix B

Mathematical Methods

The solution of the algorithm is performed in two steps. The first step is the solution

of a boundary condition problem in the spatial domain using the finite difference

method and the second step is the solution of an initial condition problem in the time

domain using Euler and Modified Euler methods.

B.1 The Finite Difference Method

We use the finite difference method to solve the second-order differential equation with the boundary conditions introduced in equation 2.2.7,

\[
\frac{\partial^2 P}{\partial x^2} - QP = QG \tag{B.1.1}
\]

and the boundary conditions:

\[
P(x,t) = S(t) \quad \text{at } x = 0, \qquad P(x,t) = 0 \quad \text{at } x = \ell
\]

The basilar membrane is partitioned uniformly and the net points are described as:

\[
x_i = i h_x, \qquad h_x = \frac{\ell}{N}, \qquad i = 0, 1, \cdots, N \tag{B.1.2}
\]

The natural three-point approximation to the second derivative in $x$ is:

\[
\frac{\partial^2 P(x,t)}{\partial x^2} \approx \frac{P(x+h_x,t) - 2P(x,t) + P(x-h_x,t)}{h_x^2} \tag{B.1.3}
\]

Defining $p_i$ as $P_{x_i}$ and substituting the approximation of equation B.1.3 into equation B.1.1 gives:

\[
p_{i-1} - (2 + h_x^2 Q_i)\,p_i + p_{i+1} = h_x^2 Q_i G_i \tag{B.1.4}
\]

where $Q_i = Q(x_i)$ and $G_i = G(x_i)$. The equations for the exterior points are:

\[
p_0 = S(T), \qquad p_N = 0 \tag{B.1.5}
\]

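Equation B.1.4 and equation B.1.5 are combined and displayed as a linear system:

\[
Ap = B \tag{B.1.6}
\]

where,

\[
p = \begin{bmatrix} p_0 \\ p_1 \\ \vdots \\ p_{N-1} \\ p_N \end{bmatrix}, \qquad
B = \begin{bmatrix} S(T) \\ 0 \\ \vdots \\ 0 \\ 0 \end{bmatrix}
+ h_x^2 \begin{bmatrix} 0 \\ G_1 Q_1 \\ \vdots \\ G_{N-1} Q_{N-1} \\ 0 \end{bmatrix} \tag{B.1.7}
\]

\[
A = \begin{bmatrix}
1 & 0 & & & \\
1 & -(2 + h_x^2 Q_1) & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1 & -(2 + h_x^2 Q_{N-1}) & 1 \\
 & & & 0 & 1
\end{bmatrix} \tag{B.1.8}
\]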

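Matrix $A$ is time independent; it is a constant matrix. As $Q_i > 0$ for $i = 1, \cdots, N-1$, matrix $A$ has a dominant diagonal and it is a regular matrix [31]. A unique solution to the linear system therefore exists. Matrix $A$ is a tridiagonal matrix and is factored into two bidiagonal matrices ([15] page 55):

\[
A = \begin{bmatrix}
\alpha_0 & & & & \\
1 & \alpha_1 & & & \\
 & \ddots & \ddots & & \\
 & & 1 & \alpha_{N-1} & \\
 & & & 1 & \alpha_N
\end{bmatrix}
\begin{bmatrix}
1 & \gamma_0 & & & \\
 & 1 & \gamma_1 & & \\
 & & \ddots & \ddots & \\
 & & & 1 & \gamma_{N-1} \\
 & & & & 1
\end{bmatrix} \tag{B.1.9}
\]

where:

\[
\begin{aligned}
\alpha_0 &= -\left(1 + \frac{h_x^2 Q_0}{2}\right) \\
\alpha_i &= -(2 + h_x^2 Q_i) - \gamma_{i-1}, \qquad i = 1, 2, \cdots, N-1 \\
\gamma_i &= \frac{1}{\alpha_i}, \qquad i = 0, 1, 2, \cdots, N-1 \\
\alpha_N &= 1 - \gamma_{N-1}
\end{aligned} \tag{B.1.10}
\]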

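Now, the solution of the linear system is done in two steps. An intermediate vector $n$ is defined and obtained from the following system:

\[
\begin{bmatrix}
\alpha_0 & & & & \\
1 & \alpha_1 & & & \\
 & \ddots & \ddots & & \\
 & & 1 & \alpha_{N-1} & \\
 & & & 1 & \alpha_N
\end{bmatrix}
\begin{bmatrix} n_0 \\ n_1 \\ \vdots \\ n_{N-1} \\ n_N \end{bmatrix}
=
\begin{bmatrix} B_0 \\ B_1 \\ \vdots \\ B_{N-1} \\ B_N \end{bmatrix} \tag{B.1.11}
\]

The vector $n$ is obtained by the recursion formula:

\[
n_0 = \frac{B_0}{\alpha_0}, \qquad n_i = \frac{B_i - n_{i-1}}{\alpha_i}, \quad i = 1, 2, \cdots, N \tag{B.1.12}
\]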

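Finally, in order to obtain the basilar membrane pressure $p$, the next system is solved:

\[
\begin{bmatrix}
1 & \gamma_0 & & & \\
 & 1 & \gamma_1 & & \\
 & & \ddots & \ddots & \\
 & & & 1 & \gamma_{N-1} \\
 & & & & 1
\end{bmatrix}
\begin{bmatrix} p_0 \\ p_1 \\ \vdots \\ p_{N-1} \\ p_N \end{bmatrix}
=
\begin{bmatrix} n_0 \\ n_1 \\ \vdots \\ n_{N-1} \\ n_N \end{bmatrix} \tag{B.1.13}
\]

The vector $p$ is obtained by the recursion formula:

\[
p_N = n_N, \qquad p_i = n_i - \gamma_i\, p_{i+1}, \quad i = N-1, N-2, \cdots, 1, 0 \tag{B.1.14}
\]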

The solution of the linear system computed by the factorization of $A$ is a good analytical solution for the boundary condition problem. As seen, both the intermediate vector $n$ and the desired vector $p$ are computed recursively. Thus, it takes $2N+2$ sequential steps to complete the computation of the pressure vector $p$. Moreover, each solution of the system requires $2N+1$ multiplications and $2N$ additions.
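
The factor-and-substitute procedure of equations B.1.9 to B.1.14 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: the function names `solve_tridiagonal` and `cochlea_pressure` are hypothetical, and the recursions are written in their general form for arbitrary sub- and super-diagonals, so that the boundary rows of the matrix in equation B.1.8 are handled by the same loop. For the interior rows, where both off-diagonal entries equal 1, the general form reduces to the recursions B.1.10, B.1.12 and B.1.14.

```python
def solve_tridiagonal(sub, diag, sup, rhs):
    """Solve A p = rhs for a tridiagonal A by bidiagonal factorization.

    sub[i]  = A[i][i-1] (sub[0] is unused)
    diag[i] = A[i][i]
    sup[i]  = A[i][i+1] (sup[-1] is unused)
    """
    size = len(diag)
    alpha = [0.0] * size   # diagonal of the lower bidiagonal factor
    gamma = [0.0] * size   # super-diagonal of the upper factor
    n = [0.0] * size       # intermediate vector

    # Forward sweep: factorization and intermediate vector in one pass.
    alpha[0] = diag[0]
    n[0] = rhs[0] / alpha[0]
    for i in range(1, size):
        gamma[i - 1] = sup[i - 1] / alpha[i - 1]
        alpha[i] = diag[i] - sub[i] * gamma[i - 1]
        n[i] = (rhs[i] - sub[i] * n[i - 1]) / alpha[i]

    # Backward sweep: recover the pressure vector.
    p = [0.0] * size
    p[-1] = n[-1]
    for i in range(size - 2, -1, -1):
        p[i] = n[i] - gamma[i] * p[i + 1]
    return p


def cochlea_pressure(Q, G, S, length):
    """Assemble the system of B.1.7 and B.1.8 and solve it.

    Q and G hold the N+1 samples Q_i and G_i, S is the boundary value
    p_0, and length is the membrane length.
    """
    N = len(Q) - 1
    hx = length / N
    sub = [0.0] + [1.0] * (N - 1) + [0.0]
    sup = [0.0] + [1.0] * (N - 1) + [0.0]
    diag = [1.0] + [-(2.0 + hx * hx * Q[i]) for i in range(1, N)] + [1.0]
    rhs = [S] + [hx * hx * Q[i] * G[i] for i in range(1, N)] + [0.0]
    return solve_tridiagonal(sub, diag, sup, rhs)
```

Both sweeps are inherently sequential, since each entry depends on its neighbor, which is the property that motivates a different, parallel scheme in the hardware model.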

B.2 Initial condition problem numerical solution

The cochlear model variables $v_{bm}$, $\xi_{bm}$ and $P_{ohc}$ must be approximated at the beginning of the computation of each time point. The variables $P$, $v_{bm}$, $\xi_{bm}$ and $P_{ohc}$ along the cochlear partition are available at all time points $t \le T$. The cochlear model variables $v'_{bm}$ and $P'_{ohc}$ are also computed and are known for all sections at the time point $t = T$. The numerical methods used to approximate these model variables are the Euler and Modified Euler methods. These methods are simple and computationally efficient. As the magnitudes of the model variables undergo significant changes during the computation process, it takes several iterations for the algorithm to converge. In the first iteration we use the Euler method to roughly approximate the variables $v_{bm}$, $\xi_{bm}$ and $P_{ohc}$ for the new time point. Then, from the second iteration on, we use the Modified Euler method to approximate these variables.

B.2.1 Euler Method

We define the initial value problem:

\[
y'(t) = f(t, y(t)), \qquad y(t_0) = y_0 \tag{B.2.1}
\]

where $f(t, y)$ is a given function, $t_0$ is a given initial time and $y_0$ is a given initial value for $y$. The unknown in the problem is the function $y(t)$.

The Euler method is very simple. It uses the first derivative to approximate the value at the next time step, as seen from the equation:

\[
y(t_{n+1}) \approx y_n + f(t_n, y_n)\,h_t \tag{B.2.2}
\]

The parameter $h_t$ is called the time step size. The value of $h_t$ may be changed by the convergence unit (Cunit), which evaluates the approximation error. The truncation error of the Euler method is $O(h_t^2)$. The computation of the Euler equation requires only one multiplication and one addition.

B.2.2 Modified Euler Method

The Modified Euler method is classified as a Runge-Kutta method of order two. It is also called the trapezoidal method.

By the fundamental theorem of calculus and the differential equation, the exact solution of the initial value problem stated in equation B.2.1 obeys:

\[
\begin{aligned}
y(t_{n+1}) &= y(t_n) + \int_{t_n}^{t_{n+1}} y'(t)\,dt \\
           &= y(t_n) + \int_{t_n}^{t_{n+1}} f(t, y(t))\,dt
\end{aligned} \tag{B.2.3}
\]

The algorithm for computing $y_{n+1}$ will be of the form:

\[
y(t_{n+1}) = y(t_n) + \text{approximate value for } \int_{t_n}^{t_{n+1}} f(t, y(t))\,dt
\]

In Euler's method, we approximate $f(t, y(t))$ for $t_n \le t \le t_{n+1}$ by the constant $f(t_n, y_n)$. Thus,

\[
\text{Euler's approximate value for } \int_{t_n}^{t_{n+1}} f(t, y(t))\,dt = \int_{t_n}^{t_{n+1}} f(t_n, y_n)\,dt = f(t_n, y_n)\,h_t
\]

The area of the complicated region $0 \le y \le f(t, \varphi(t))$, $t_n \le t \le t_{n+1}$, which is the area under the parabola in figure B.1, is approximated by the area of the rectangle $0 \le y \le f(t_n, y_n)$, $t_n \le t \le t_{n+1}$ (the shaded rectangle in the right half of figure B.1).

Figure B.1: Euler and Modified Euler approximation method.

The Modified Euler method gets a better approximation by using the trapezoid in the left half of figure B.1 rather than the rectangle in the right half. The area of the trapezoid is the length $h_t$ of the base multiplied by the average, $\frac{1}{2}[f(t_n, \varphi(t_n)) + f(t_{n+1}, \varphi(t_{n+1}))]$, of the heights of the two sides. Thus, the Modified Euler solution is:

\[
y(t_{n+1}) \approx y(t_n) + \frac{h_t}{2}\left[f(t_n, y(t_n)) + f(t_{n+1}, y(t_{n+1}))\right] \tag{B.2.4}
\]

The truncation error of the Modified Euler method is $O(h_t^3)$. Equation B.2.4 is an implicit equation since $y(t_{n+1})$ appears on both sides. In order to solve equation B.2.4 we define an iterative series $\{w_j\}$:

\[
\begin{aligned}
w_0 &= y(t_n) + h_t f(t_n, y(t_n)) \\
w_j &= y(t_n) + \frac{h_t}{2}\left[f(t_n, y(t_n)) + f(t_{n+1}, w_{j-1})\right], \qquad j = 1, 2, \cdots
\end{aligned} \tag{B.2.5}
\]

The first element in the series is obtained by the Euler method. The other elements in the series are obtained by the Modified Euler method. The iteration defined in equation B.2.5 is repeated until convergence is obtained. The Modified Euler method requires two additions and one multiplication.
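
The Euler predictor of equation B.2.2 and the iterated Modified Euler corrector of equation B.2.5 can be sketched in Python as follows. This is a minimal sketch, assuming a scalar problem and a fixed number of corrector iterations instead of a convergence test; the function names and the default of three iterations are illustrative choices, not values taken from the hardware model.

```python
def euler_step(f, t, y, ht):
    """One Euler step (equation B.2.2)."""
    return y + ht * f(t, y)


def modified_euler_step(f, t, y, ht, iterations=3):
    """One Modified Euler step using the iterative series of B.2.5.

    w_0 is the Euler prediction; each further element re-evaluates the
    slope at the end of the interval and averages it with the slope at
    the start of the interval.
    """
    slope_start = f(t, y)
    w = y + ht * slope_start                      # w_0
    for _ in range(iterations):                   # w_1, w_2, ...
        w = y + 0.5 * ht * (slope_start + f(t + ht, w))
    return w


def integrate(step, f, y0, ht, steps):
    """Advance y' = f(t, y) from t = 0 with a fixed step size."""
    t, y = 0.0, y0
    for _ in range(steps):
        y = step(f, t, y, ht)
        t += ht
    return y


if __name__ == "__main__":
    # Test problem y' = -y, y(0) = 1, integrated up to t = 1.
    decay = lambda t, y: -y
    print("Euler:         ", integrate(euler_step, decay, 1.0, 0.01, 100))
    print("Modified Euler:", integrate(modified_euler_step, decay, 1.0, 0.01, 100))
```

For this test problem the Modified Euler result lies much closer to the exact value $e^{-1}$ than the Euler result, which reflects the difference between the two truncation errors.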

Appendix C

Booth Recoding

This technique is used to reduce the number of partial products that must be added together in a multiplier. Various forms of Booth recoding have been proposed. Consider one variation (called modified Booth, radix-4 Booth or Booth-2 recoding) that examines groups of 3 adjacent bits of the multiplier operand. Assume that the number of bits, $n$, in the multiplier operand $Y$ is an even number. The algorithm uses two's complement representation of signed numbers. The significands are positive numbers, so they can be written in two's complement format by prepending a 0 MSB. (If we started with an even number of significand bits, then we have to prepend two 0 bits to keep the number of bits even.) Then, since the bit $y_{n-1} = 0$, we can write:

\[
Y = \sum_{k=0}^{n-1} y_k 2^k \tag{C.0.1}
\]

Each term in the summation can be written as:

\[
y_k 2^k = \left(2y_k - \tfrac{1}{2}\,2y_k\right)2^k = y_k 2^{k+1} - 2y_k 2^{k-1} \tag{C.0.2}
\]

We use the above expansion for the odd-$k$ terms in the summation for $Y$ and obtain:

\[
\begin{aligned}
Y = {} & (y_{n-1}2^{n} - 2y_{n-1}2^{n-2}) + y_{n-2}2^{n-2} + (y_{n-3}2^{n-2} - 2y_{n-3}2^{n-4}) + y_{n-4}2^{n-4} + \cdots \\
       & \cdots + (y_3 2^4 - 2y_3 2^2) + y_2 2^2 + (y_1 2^2 - 2y_1 2^0) + y_0 2^0
\end{aligned} \tag{C.0.3}
\]

We define $y_{-1} \equiv 0$ and collect together the terms having the same power of two:

\[
\begin{aligned}
Y = {} & y_{n-1}2^{n} + (-2y_{n-1} + y_{n-2} + y_{n-3})2^{n-2} + (-2y_{n-3} + y_{n-4} + y_{n-5})2^{n-4} + \cdots \\
       & \cdots + (-2y_3 + y_2 + y_1)2^2 + (-2y_1 + y_0 + y_{-1})2^0
\end{aligned} \tag{C.0.4}
\]

The first term drops out since $y_{n-1} = 0$. We define a new set of coefficients $z_k$ for even $k$:

\[
z_k \equiv -2y_{k+1} + y_k + y_{k-1}, \qquad k = 0, 2, 4, \cdots, n-2 \tag{C.0.5}
\]

We can write a new expression for $Y$ and the product, $P = XY$, in terms of $z_k$ as follows:

\[
Y = \sum_{\substack{k=0 \\ k\ \mathrm{even}}}^{n-2} z_k 2^k, \qquad
P = XY = \sum_{\substack{k=0 \\ k\ \mathrm{even}}}^{n-2} (z_k X)\,2^k \tag{C.0.6}
\]

Thus, we have to generate and sum only $n/2$ partial products $z_k X$, which is approximately half of the original number of partial products $y_k X$. This represents a considerable saving in the required hardware. In our circuit implementation, we have to be able to generate the following partial products: $0$, $X$, $-X$, $2X$ and $-2X$. We can obtain these easily by including circuits for negating and for shifting left by one bit position.
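
The recoding of equations C.0.5 and C.0.6 can be checked numerically with the following Python sketch. It works on Python integers rather than on a hardware bit vector, and the function names `booth_radix4_digits` and `booth_multiply` are hypothetical.

```python
def booth_radix4_digits(y, n):
    """Return the digits z_k of equation C.0.5 for k = 0, 2, ..., n-2.

    y is a non-negative integer written with n bits, where n is even and
    the most significant bit y_{n-1} is 0.
    """
    if n % 2 != 0:
        raise ValueError("n must be even")
    if y < 0 or (y >> (n - 1)) != 0:
        raise ValueError("y must be non-negative with bit n-1 equal to 0")

    def bit(k):
        # y_{-1} is defined as 0.
        return 0 if k < 0 else (y >> k) & 1

    return [-2 * bit(k + 1) + bit(k) + bit(k - 1) for k in range(0, n, 2)]


def booth_multiply(x, y, n):
    """Compute x * y by summing the n/2 partial products of equation C.0.6."""
    digits = booth_radix4_digits(y, n)
    # Each partial product z_k * x is one of 0, x, -x, 2x, -2x,
    # shifted left by k bit positions.
    return sum((z * x) << k for z, k in zip(digits, range(0, n, 2)))


if __name__ == "__main__":
    print(booth_radix4_digits(0b0110, 4))   # digits for Y = 6
    print(booth_multiply(13, 0b0110, 4))    # 13 * 6
```

Every digit $z_k$ falls in the set $\{-2, -1, 0, 1, 2\}$, so each partial product needs at most a negation and a one-bit left shift.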

Appendix D

Delay Calculation

The propagation delay through a cell is the sum of the intrinsic delay, the load-dependent delay, and the input-slew-dependent delay. The typical delay calculation (at 1.8 V, 25°C) through standard cells according to the TSMC 0.18µm process [1] is:

\[
t_{TPD} = t_{typical} = t_{intrinsic} + (K_{load} \cdot C_{load}) \tag{D.0.1}
\]

where,

• $t_{intrinsic}$ = delay through the cell when there is no output load (ns).

• $K_{load}$ = load delay multiplier (ns/pF).

• $C_{load}$ = total output load capacitance (pF).

In order to calculate the propagation delay through the cell CMPR32, which is a 3-to-2 counter or full adder (FA), we use equation D.0.1 and the data from [1]. We used the $C_{load}$ of the cell BUFX2.

\[
t_{TPD} = 0.34 + (4.5 \cdot 0.003) = 0.35\ \text{ns} \tag{D.0.2}
\]
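
The calculation of equation D.0.2 can be reproduced with a short Python sketch. The function name `propagation_delay_ns` is hypothetical; the three numeric values are the ones quoted above for the CMPR32 cell loaded by BUFX2.

```python
def propagation_delay_ns(t_intrinsic_ns, k_load_ns_per_pf, c_load_pf):
    """Typical cell delay of equation D.0.1, in nanoseconds."""
    return t_intrinsic_ns + k_load_ns_per_pf * c_load_pf


if __name__ == "__main__":
    # CMPR32 driving the input capacitance of BUFX2.
    t_tpd = propagation_delay_ns(0.34, 4.5, 0.003)
    print(f"t_TPD = {t_tpd:.4f} ns, rounded to {t_tpd:.2f} ns")
```

The load-dependent term adds only 0.0135 ns here, so the intrinsic delay dominates the result.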


Appendix E

FPGA Instruction Code

The instruction code for the FPGA controller:

Unit     CC   Instruction
Eunit    6    Tmp1 = mult(prev_veloc, ht)
              Disp = add(prev_disp, tmp1)
              Tmp1 = mult(prev_acce, ht)
              Veloc = add(prev_veloc, tmp1)
              Tmp1 = mult(prev_ohcp_derv, ht)
              Ohcp = add(prev_ohcp, tmp1)
Gunit    5    Tmp1 = mult(k/2, Ohcp)
              Tmp2 = mult(RK, Veloc)
              Tmp1 = add(tmp2, tmp1)
              Tmp2 = mult(−K/C, Disp)
              G = add(tmp2, tmp1)
              (in the last instruction: edge manipulation = ON, g(0) = input, g(447) = 0)
Punit    4    Tmp1 = add(P(i−1), P(i+1)), shift = ON
              Tmp2 = mult(−1/dx2, tmp1)
              Tmp1 = add(tmp2, G)
              P = mult(1/(2/dx2 + K), tmp1)
              (in the last instruction: edge manipulation = ON, p(0) = input, p(447) = 0)
D        9    Tmp1 = add(P(i−1), P(i+1)), shift = ON
              Tmp2 = mult(P, −2)
              Tmp2 = add(tmp2, tmp1)
              Acce = mult(tmp2, 220)
              (in the last instruction: edge manipulation = ON, Acce(0) = 0, Acce(447) = 0)
              Tmp1 = mult(−w0, Ohcp)
              Tmp2 = mult(−Kohcw1, Disp)
              Tmp1 = add(tmp2, tmp1)
              Tmp2 = mult(Kohc, Veloc)
              Ohcp_derv = add(tmp2, tmp1)
ME       9    Tmp2 = add(prev_veloc, Veloc)
              Tmp1 = mult(tmp2, ht/2)
              Disp = add(prev_disp, tmp1)
              Tmp2 = add(prev_acce, acce)
              Tmp1 = mult(tmp2, ht/2)
              Veloc = add(prev_veloc, tmp1)
              Tmp2 = add(prev_ohcp_derv, ohcp_derv)
              Tmp1 = mult(tmp2, ht/2)
              Ohcp = add(prev_ohcp, tmp1)
Pre_E    5    Prev_disp = add(0, Disp)
              Prev_veloc = add(0, Veloc)
              Prev_acce = add(0, Acce)
              Prev_ohcp = add(0, Ohcp)
              Prev_ohcp_derv = add(0, Ohcp_derv)

CC = Clock Cycles.

Table E.1: The Instruction code for FPGA Controller.

Bibliography

[1] TSMC 0.18µm Process 1.8-Volt SAGE-X Standard Cell Databook.

[2] Virtex-2 Pro Platform FPGAs: Complete Data Sheet.

[3] A. Boothroyd. Statistical theory of the discrimination score. J. Acoust. Soc. Am., 43:362–367, 1968.

[4] A. Cohen and M. Furst. Integration of outer hair cell activity in one-dimensional cochlear model. J. Acoust. Soc. Am., 115:2185–2192, May 2004.

[5] A. D. Booth. A signed binary multiplication technique. Quart. Journ. Mech. and Applied Math., 4:236–240, 1951.

[6] Ronen Akerman. Implementation of a one dimensional cochlear model in FPGA. Technical report, Elect. Eng. Dep., Tel-Aviv Univ., 2004.

[7] Oren Bahat. Efficient implementation and complexity analysis of one dimensional cochlear model. Technical report, Elect. Eng. Dep., Tel-Aviv Univ., 2003.

[8] Gary W. Bewick. Fast Multiplication: Algorithms and Implementation. PhD thesis, Elect. Eng. Dep., Stanford Univ., California, February 1994.

[9] C. D. Geisler. From Sound to Synapse: Physiology of the Mammalian Ear. Oxford University Press, New York, 1998.

[10] C. D. Summerfield and R. F. Lyon. ASIC implementation of the Lyon cochlea model. IEEE International Conference on Acoustics, Speech and Signal Processing, pages 673–676, 1992.

[11] Azaria Cohen. Cochlear Model For Normal and Damaged Ears. PhD thesis, Elect. Eng. Dep., Tel-Aviv Univ., Tel-Aviv, Israel, June 2004.

[12] B. Cooper. The Beethoven Compendium: A Guide to Beethoven's Life and Music. Thames & Hudson, New York, 1991.

[13] C. R. Steele and L. A. Taber. Three-dimensional model calculations for guinea pig cochlea. J. Acoust. Soc. Am., 69:1107–1111, 1981.

[14] C. S. Wallace. A suggestion for a fast multiplier. IEEE Trans. on Electronic Computers, EC-13:14–17, Feb. 1964.

[15] E. Isaacson and H. B. Keller. Analysis of Numerical Methods. Dover, New York, 1993.

[16] G. von Békésy. Experiments in Hearing. McGraw-Hill, New York, 1960.

[17] G. Zweig, R. Lipes, and J. R. Pierce. The cochlear compromise. J. Acoust. Soc. Am., 59:975–982, 1976.

[18] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, third edition, 2003.

[19] H. L. F. Helmholtz. On the Sensations of Tone. Dover, New York, 1954 (the original German edition was published in 1862).

[20] J. C. Bor and C. Y. Wu. Analog electronic cochlea design using multiplexing switched-capacitor circuits. IEEE Transactions on Neural Networks, 7:155–166, 1996.

[21] J. J. Zwislocki. Theory of the acoustical action of the cochlea. J. Acoust. Soc. Am., 22:778–784, 1950.

[22] J. O. Pickles. An Introduction to the Physiology of Hearing. 1982.

[23] I. Koren. Computer Arithmetic Algorithms. Prentice-Hall, New Jersey, 1993.

[24] F. Thomson Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, San Mateo, 1992.

[25] M. Furst and J. L. Goldstein. A cochlear nonlinear transmission line model compatible with combination tone psychophysics. J. Acoust. Soc. Am., 72:717–726, 1982.

[26] M. A. Viergever. Mechanics of the Inner Ear: A Mathematical Approach. Delft University of Technology, The Netherlands, 1980.

[27] M. Brucke, W. Nebel, A. Schwarz, B. Mertsching, M. Hansen, and B. Kollmeier. Digital VLSI-implementation of a psychoacoustically and physiologically motivated speech preprocessor. Proceedings of the NATO Advanced Study Institute on Computational Hearing, pages 157–162, 1998.

[28] M. P. Leong, C. T. Jin, and P. H. W. Leong. Parameterized module generator for an FPGA-based electronic cochlea.

[29] O. F. Ranke. Theory of operation of the cochlea: A contribution to the hydrodynamics of the cochlea. J. Acoust. Soc. Am., 22:772–777, 1950.

[30] R. F. Lyon and C. Mead. An analog electronic cochlea. IEEE Trans. Acoust., Speech, Signal Processing, 36:1119–1134, July 1988.

[31] R. L. Burden and J. D. Faires. Numerical Analysis, Fifth Edition. PWS Publishing, Boston, 1993.

[32] R. L. Wegel and C. E. Lane. The auditory masking of one pure tone by another and its probable relation to the dynamics of the inner ear. Physical Review, 23:266–285, 1924.

[33] R. V. Harrison and I. M. Hunter-Duvar. An anatomical tour of the cochlea. In: Physiology of the Ear, edited by A. F. Jahn and Santos-Sacchi. Raven, New York, 1988.

[34] S. C. Lim, A. R. Temple, S. Jones, and R. Meddis. VHDL-based design of biologically inspired pitch detection system. Proceedings of the IEEE International Conference on Neural Networks, 2:922–927, 1997.

[35] S. Kochkin. MarkeTrak VI: Hearing aid industry market tracking survey 1984-2000. www.knowleselectronics.com/market/presentations.asp, 2003.

[36] L. Watts, D. A. Kerns, R. F. Lyon, and C. A. Mead. Improved implementation of the silicon cochlea. IEEE Journal of Solid State Circuits, 27:692–700, May 1992.

[37] Vered Weisz. Robust cochlear based representation of speech signals: Comparison between healthy and damaged cochlea. Master's thesis, Elect. Eng. Dep., Tel-Aviv Univ., Tel-Aviv, Israel, October 2004.

[38] Methods for subjective determination of transmission quality. ITU-T Rec. P.800, August 1996.

[39] Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. ITU-T Rec. P.862, February 2001.
